evaluation.plot {UCS} | R Documentation |
An implementation of evaluation graphs for the empirical
evaluation of association measures in terms of precision and recall,
as described in Evert (2004, Ch. 5). Graphs of precision, recall
and local precision for n-best lists, as well as precision-by-recall
graphs, are all provided by a single function, evaluation.plot.
evaluation.plot(ds, keys, tp=ds$b.TP,
                x.min=0, x.max=100, y.min=0, y.max=100,
                x.axis=c("n.best", "proportion", "recall"),
                y.axis=c("precision", "local.precision", "recall"),
                n.first=ucs.par("n.first"), n.step=ucs.par("n.step"),
                cut=NULL, window=400,
                show.baseline=TRUE, show.nbest=NULL, show.npair=NULL,
                conf=FALSE, conf.am=NULL, conf.am2=NULL,
                test=FALSE, test.am1=NULL, test.am2=NULL,
                test.step=ucs.par("test.step"), test.relevant=0,
                usercode=NULL, file=NULL, aspect=1,
                plot.width=6, plot.height=6,
                cex=ucs.par("cex"), lex=ucs.par("lex"),
                bw=FALSE, legend=NULL, bottom.legend=FALSE,
                title=NULL, ...)
ds: a UCS data set object, read in from a data set file with
the read.ds.gz function. ds must contain rankings for the
association measures listed in the keys parameter (use
add.ranks to add such rankings to a data set object).
keys: a character vector naming up to 10 association measures to
be evaluated. Each name may be abbreviated to a prefix that must
be unique within the measures annotated in ds.
Use the ds.find.am function to obtain a list of measures annotated
in the data set, and see the ucsam manpage in UCS/Perl for detailed
information about the association measures supported by the UCS
system (with the shell command ucsdoc ucsam).
tp: a logical vector indicating true positives, parallel to the
rows of the data set ds. If tp is not specified, the data set must
contain a variable named b.TP, which is used instead.
x.min, x.max: the limits of the x-axis in the plot, used to
“zoom in” to an interesting region. The interpretation of
the values depends on the x.axis parameter below. For
x.axis="n.best" (the default case), x.min and x.max refer to
n-best lists. Otherwise, they refer to percentages ranging from
0 to 100. By default, the full data set is shown.

y.min, y.max: the limits of the y-axis in the plot, used to
“zoom in” to an interesting region. The values are always
interpreted as percentages, ranging from 0 to 100. By default,
y.max is fitted to the evaluation graphs (unless
y.axis="recall", where y.max is always set to 100).
x.axis: selects the variable shown on the x-axis. Available choices
are the n-best list size n ("n.best", the default), the same as a
proportion of the full data set ("proportion"), and the recall as a
percentage ("recall"). The latter produces precision-by-recall
graphs (unless y.axis="recall" is selected at the same time, which
is rarely meaningful).
y.axis: selects the variable shown on the y-axis. Available choices
are the precision ("precision", the default), an estimate for local
precision ("local.precision", see details below), and the recall
("recall"). All three variables are shown as percentages ranging
from 0 to 100.
n.first: the smallest n-best list to be evaluated. Shorter
n-best lists typically lead to highly unstable evaluation graphs.
The standard setting is 100, but a higher value may be necessary for
random sample evaluation (see details below). If n.first is not
specified, the default supplied by ucs.par is used.

n.step: the step width for n-best lists in the evaluation
graphs. Initially, precision and recall are computed for all n-best
lists, but only every n.step-th one is plotted, yielding graphs
that look less jagged and reducing the size of generated PostScript
files (see the file parameter below). If n.step is not specified,
the default supplied by ucs.par is used.
cut: for each association measure, pretend that the data set
consists only of the cut highest-ranked candidates according
to this measure. This trick can be used to perform an evaluation
of n-best lists without having to annotate the full data set: the
candidates from all relevant n-best lists are combined into a single
data set file and cut is set to n.

window: the number of candidates to consider when estimating local
precision (default: 400), i.e. with the option
y.axis="local". Values below 400 or above 1000 are rarely
useful. See below for details.
show.baseline: if TRUE, show the baseline precision as a dotted
horizontal line with a label (this is the default). Not available
when y.axis="recall".

show.nbest: an integer vector of n-best lists that will be
indicated as thin vertical lines in the plot. When
x.axis="recall", the n-best lists are shown as diagonal lines.

show.npair: when x.axis="proportion", the total number of
candidates in ds is shown in the x-axis label. Set
show.npair=NULL to suppress this, or set it to an integer
value in order to lie about the number of candidates (rarely
useful).
conf: if TRUE, confidence intervals are shown as coloured
or shaded regions around one or two precision graphs. In this case,
the parameter conf.am must also be specified. Alternatively,
conf can be set to a number indicating the significance level
to be used for the confidence intervals (default: 0.05,
corresponding to 95% confidence). See below for details. Note
that conf is only available when y.axis="precision".

conf.am: the name of the association measure for which confidence
intervals are displayed (may be abbreviated to a prefix that is
unique within keys).

conf.am2: an optional second association measure, for which
confidence intervals will also be shown.
test: if TRUE, significance tests are carried out for the
differences between the evaluation results of two association
measures, given as test.am1 and test.am2 below.
Alternatively, test can be set to a number indicating the
significance level to be used for the tests (default: 0.05).
n-best lists where the result difference is significant are
indicated by arrows between the respective evaluation graphs (when
x.axis="recall") or by coloured triangles (otherwise). See
details below. Note that test is not available when
y.axis="local".
test.am1: the first association measure for significance tests
(may be abbreviated to a prefix that is unique within keys).
Usually, this is the measure that achieves better performance (but
tests are always two-sided).

test.am2: the second association measure for significance tests
(may be abbreviated to a prefix that is unique within keys).
test.step: the step width for n-best lists where significance
tests are carried out, as a multiple of n.step. The standard
setting is 10, since the significance tests are based on the
computationally expensive fisher.test function and since the
triangles or arrows shown in the plot are fairly large.
If test.step is not specified, the default supplied by
ucs.par is used.

test.relevant: a positive number: estimated precision differences
of at least this many percentage points are considered
“relevant” and are marked by dark triangles or arrows in the plot.
See below for details.
usercode: a callback function that is invoked when the plot has
been completed, but before the legend box is drawn. This feature is
mainly used to add something to a plot that is written to a
PostScript file. The usercode function is invoked with the
parameters region=c(x.min,x.max,y.min,y.max) and pr, a
list of precision/recall tables (as returned by
precision.recall) for each of the measures in keys.

file: a character string giving the name of a PostScript file.
If specified, the evaluation plot will be saved to file
rather than displayed on screen. See evaluation.file
for a function that combines both operations.
aspect: a positive number specifying the desired aspect ratio of
the plot region (only available for PostScript files). In the
default case x.axis="n.best", aspect refers to the absolute
size of the plot region. Otherwise, it specifies the size ratio
between percentage points on the x-axis and the y-axis. Setting
aspect modifies the height of the plot (plot.height).

plot.width, plot.height: the width and height of a plot that is
written to a PostScript file, measured in inches.
plot.height may be overridden by the aspect parameter,
even if it is set explicitly.
cex: a character expansion factor for labels, annotations, and
symbols in the plot (see par for details). If cex is not
specified, the default supplied by ucs.par is used.

lex: added to the line widths of evaluation graphs and some
decorations (note that this is not an expansion factor). If
lex is not specified, the default supplied by ucs.par is used.

bw: if TRUE, the evaluation plot is drawn in black and
white, which is mostly used in conjunction with file to
produce figures for articles (defaults to FALSE). See below
for details.
legend: a vector of character strings or expressions, used as
labels in the legend of the plot (e.g. to show mathematical symbols
instead of the names of association measures). Use
legend=NULL to suppress the display of a legend box.

bottom.legend: if TRUE, draw the legend box in the bottom right
corner of the plot (the default is the top right corner).

title: a character vector or expression to be used as the main
title of the plot (optional).

...: any other arguments are set as local graphics parameters
(using par) before the evaluation plot is drawn.
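As a sketch of typical usage (assuming the UCS/R functions have been loaded; the data set file name and the measure names t.score and MI are hypothetical placeholders for measures actually annotated in the data set):

```r
## read a UCS data set file (file name is hypothetical)
ds <- read.ds.gz("candidates.ds.gz")
ds <- add.ranks(ds)   # add rankings for the annotated association measures
ds.find.am(ds)        # list the measures available in ds

## precision graphs for two measures, zoomed in on the first 2000 candidates
evaluation.plot(ds, c("t.score", "MI"), x.max=2000)
```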
When y.axis="local.precision", the evaluation graphs show
local precision, i.e. an estimate for the density of true
positives around the n-th rank according to the respective association
measure. Local precision is smoothed using a kernel density estimate
with a Gaussian kernel (from the density function), based on
a symmetric window covering approximately window candidates
(default: 400). Consequently, the resulting values do not have a
clear-cut interpretation and should not be used to evaluate the
performance of association measures. They are rather a means of
exploratory data analysis, helping to visualise the relation between
association scores and the true positives in a data set (see Evert,
2004, Sec. 5.2 for an example).
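A sketch of such an exploratory plot (the measure names are placeholders for measures annotated in ds):

```r
## local precision with a wider smoothing window of ca. 800 candidates
evaluation.plot(ds, c("t.score", "MI"),
                y.axis="local.precision", window=800)
```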
In order to generalise evaluation results beyond the specific data set on which they were obtained, it is necessary to compute confidence intervals for the observed precision values and to test whether the observed result differences are significant. See (Evert, 2004, Sec. 5.3) for the methods used and the interpretation of their results.
Confidence intervals are computed by setting conf=TRUE
and selecting an association measure with the conf.am
parameter. The confidence intervals are displayed as a
coloured or shaded region around the precision graph of this measure
(confidence intervals are not available for graphs of recall or local
precision). The default confidence level of 95% will rarely need to
be changed. Optionally, a second confidence region can be displayed
for a measure selected with the conf.am2 parameter.
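For instance (measure names are placeholders):

```r
## 95% confidence region around the precision graph of t.score,
## with a second region for MI
evaluation.plot(ds, c("t.score", "MI"),
                conf=TRUE, conf.am="t.score", conf.am2="MI")

## the same with a 99% confidence level (significance level 0.01)
evaluation.plot(ds, c("t.score", "MI"), conf=0.01, conf.am="t.score")
```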
Significance tests for the result differences are activated by
setting test=TRUE (not available for graphs of local
precision). The evaluation results of two association measures
(specified with test.am1 and test.am2) are compared for
selected n-best lists, and significant differences are marked by
coloured triangles or arrows (when x.axis="recall"). The
default significance level of 0.05 will rarely need to be
changed. Use the test.step parameter to control the spacing of
the triangles or arrows.
A significant difference indicates that measure A is truly better than
measure B, rather than better just by coincidence in a single evaluation
experiment. Formally, this “true performance” can be defined
as the average precision of a measure, obtained by averaging over many
similar evaluation experiments. Thus, a significant difference
means that the average precision of A is higher than that of B, but it
does not indicate how great the difference is. A tiny difference
(say, of half a percentage point) is hardly relevant for an
application, even if there is significant evidence for it. If the
test.relevant parameter is set, the evaluation.plot
function attempts to estimate whether there is significant evidence
for a relevant difference (of at least as many percentage points as given
by the value of test.relevant), and marks such cases by darker
triangles or arrows. This feature should be considered experimental
and used with caution, as the computation involves many approximations
and guesses (exact statistical inference for the difference in true
precision not being available).
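The tests described above might be invoked as follows (measure names are placeholders):

```r
## mark n-best lists where the results of t.score and MI differ
## significantly; with test.relevant=5, darker marks indicate evidence
## for a difference of at least 5 percentage points
evaluation.plot(ds, c("t.score", "MI"),
                test=TRUE, test.am1="t.score", test.am2="MI",
                test.relevant=5)
```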
It goes without saying that confidence regions and significance tests do not allow evaluation results to be generalised to a different extraction task (i.e. another type of cooccurrences or another definition of true positives), or even to the same task under different conditions (such as a source corpus from a different domain, register, time, or a corpus of different size). The unpredictability of the performance of association measures for different extraction tasks or under different conditions has been confirmed by various evaluation studies.
Generally, evaluation plots can be drawn in two modes: colour
(bw=FALSE, the default) or black and white (bw=TRUE).
The styles of evaluation graphs are controlled by
the respective settings in ucs.par, while the appearance
of various other elements is hard-coded in the evaluation.plot
function. In particular, confidence regions are either filled with a
light background colour (colour mode) or shaded with diagonal lines
(B/W mode). The triangles or arrows used to mark significant
differences are yellow or red (indicating relevance) in colour mode,
and light grey or dark grey (indicating relevance) in B/W mode. B/W
mode is mainly used to produce PostScript files to be included as
figures in articles, but can also be displayed on-screen for testing
purposes.
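A sketch of a publication figure in B/W mode (the file name, measure names, and legend labels are placeholders):

```r
## B/W evaluation plot written to a PostScript file
evaluation.plot(ds, c("t.score", "MI"),
                file="evaluation.eps", bw=TRUE,
                plot.width=6, plot.height=4,
                legend=c("t-score", "mutual information"))
```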
The evaluation.plot function supports evaluation based on
random samples, or RSE for short (Evert, 2004, Sec. 5.4). Missing
values (NA) in the tp vector (or the b.TP variable in
ds) are interpreted as unannotated candidates. In
this case, precision, recall and local precision are computed as
maximum-likelihood estimates based on the annotated candidates.
Confidence intervals and significance tests, which should not be
absent from any RSE, are adjusted accordingly. A confidence interval
for the baseline precision is automatically shown (by thin dotted
lines) when RSE is detected. Note that n-best lists (as shown on the
x-axis) still refer to the full data set, not just to the number of
annotated candidates.
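A sketch of an RSE run (the measure names are placeholders; b.TP is assumed to contain NA for all unannotated candidates):

```r
## random sample evaluation: NA entries in b.TP mark unannotated
## candidates; a larger n.first improves stability, and confidence
## intervals should always be shown
evaluation.plot(ds, c("t.score", "MI"),
                n.first=1000,
                conf=TRUE, conf.am="t.score")
```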
The following functions are provided for compatibility with earlier
versions of UCS/R: precision.plot, recall.plot, and
recall.precision.plot. They are simple front-ends to
evaluation.plot with the implicit parameter settings
y.axis="recall" (for recall.plot) and
y.axis="precision", x.axis="recall" (for recall.precision.plot).
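For illustration, the following pairs of calls are equivalent (measure names are placeholders):

```r
## compatibility front-end and its evaluation.plot equivalent
recall.plot(ds, c("t.score", "MI"))
evaluation.plot(ds, c("t.score", "MI"), y.axis="recall")

## precision-by-recall graphs
recall.precision.plot(ds, c("t.score", "MI"))
evaluation.plot(ds, c("t.score", "MI"),
                y.axis="precision", x.axis="recall")
```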
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart.
ucs.par, evaluation.file, read.ds.gz, and precision.recall.
The R script ‘tutorial.R’ in the ‘script/’ directory
provides a gentle introduction to the wide range of possibilities
offered by the evaluation.plot function.