evaluation.plot {UCS}R Documentation

Evaluation Graphs for Association Measures (plots)


Description

An implementation of evaluation graphs for the empirical evaluation of association measures in terms of precision and recall, as described in (Evert, 2004, Ch. 5). Graphs of precision, recall and local precision for n-best lists, as well as precision-by-recall graphs, are all provided by a single function evaluation.plot.


Usage

evaluation.plot(ds, keys, tp=ds$b.TP,
                x.min=0, x.max=100, y.min=0, y.max=100,
                x.axis=c("n.best", "proportion", "recall"),
                y.axis=c("precision", "local.precision", "recall"),
                n.first=ucs.par("n.first"), n.step=ucs.par("n.step"),
                cut=NULL, window=400,
                show.baseline=TRUE, show.nbest=NULL, show.npair=NULL,
                conf=FALSE, conf.am=NULL, conf.am2=NULL,
                test=FALSE, test.am1=NULL, test.am2=NULL,
                test.step=ucs.par("test.step"), test.relevant=0,
                file=NULL, aspect=1, plot.width=6, plot.height=6,
                cex=ucs.par("cex"), lex=ucs.par("lex"), bw=FALSE,
                legend=NULL, bottom.legend=FALSE,
                title=NULL, ...) 


Arguments

ds a UCS data set object, read in from a data set file with the read.ds.gz function. ds must contain rankings for the association measures listed in the keys parameter (use add.ranks to add such rankings to a data set object).
keys a character vector naming up to 10 association measures to be evaluated. Each name may be abbreviated to a prefix that must be unique within the measures annotated in ds. Use the ds.find.am function to obtain a list of measures annotated in the data set, and see the ucsam manpage in UCS/Perl for detailed information about the association measures supported by the UCS system (with the shell command ucsdoc ucsam).
tp a logical vector indicating true positives, parallel to the rows of the data set ds. If tp is not specified, the data set must contain a variable named b.TP which is used instead.
x.min, x.max the limits of the x-axis in the plot, used to “zoom in” to an interesting region. The interpretation of the values depends on the x.axis parameter below. For x.axis="n.best" (the default case), x.min and x.max refer to n-best lists. Otherwise, they refer to percentages ranging from 0 to 100. By default, the full data set is shown.
y.min, y.max the limits of the y-axis in the plot, used to “zoom in” to an interesting region. The values are always interpreted as percentages, ranging from 0 to 100. By default, y.max is fitted to the evaluation graphs (unless y.axis="recall", where y.max is always set to 100).
x.axis selects the variable shown on the x-axis. Available choices are the n-best list size n ("n.best", the default), the same as a proportion of the full data set ("proportion"), and the recall as a percentage ("recall"). The latter produces precision-by-recall graphs (unless you are silly enough to specify y.axis="recall" at the same time, that is).
y.axis selects the variable shown on the y-axis. Available choices are the precision ("precision", the default), an estimate for local precision ("local.precision", see details below), and the recall ("recall"). All three variables are shown as percentages ranging from 0 to 100.
n.first the smallest n-best list to be evaluated. Shorter n-best lists typically lead to highly unstable evaluation graphs. The standard setting is 100, but a higher value may be necessary for random sample evaluation (see details below). If n.first is not specified, the default supplied by ucs.par is used.
n.step the step width for n-best lists in the evaluation graphs. Initially, precision and recall are computed for all n-best lists, but only every n.step-th one is plotted, yielding graphs that look less jagged and reducing the size of generated PostScript files (see the file parameter below). If n.step is not specified, the default supplied by ucs.par is used.
cut for each association measure, pretend that the data set consists only of the cut highest-ranked candidates according to this measure. This trick can be used to perform an evaluation of n-best lists without having to annotate the full data set. The candidates from all relevant n-best lists are combined into a single data set file and cut is set to n.
window number of candidates to consider when estimating local precision (default: 400), i.e. with the option y.axis="local.precision". Values below 400 or above 1000 are rarely useful. See below for details.
show.baseline if TRUE, show baseline precision as dotted horizontal line with label (this is the default). Not available when y.axis="recall".
show.nbest integer vector of n-best lists that will be indicated as thin vertical lines in the plot. When x.axis="recall", the n-best lists are shown as diagonal lines.
show.npair when x.axis="proportion", the total number of candidates in ds is shown in the x-axis label. Set show.npair=NULL to suppress this, or set it to an integer value in order to lie about the number of candidates (rarely useful).
conf if TRUE, confidence intervals are shown as coloured or shaded regions around one or two precision graphs. In this case, the parameter conf.am must also be specified. Alternatively, conf can be set to a number indicating the significance level to be used for the confidence intervals (default: 0.05, corresponding to 95% confidence). See below for details. Note that conf is only available when y.axis="precision".
conf.am name of the association measure for which confidence intervals are displayed (may be abbreviated to a prefix that is unique within keys)
conf.am2 optional second association measure, for which confidence intervals will also be shown
test if TRUE, significance tests are carried out for the differences between the evaluation results of two association measures, given as test.am1 and test.am2 below. Alternatively, test can be set to a number indicating the significance level to be used for the tests (default: 0.05). n-best lists where the result difference is significant are indicated by arrows between the respective evaluation graphs (when x.axis="recall") or by coloured triangles (otherwise). See details below. Note that test is not available when y.axis="local.precision".
test.am1 the first association measure for significance tests (may be abbreviated to a prefix that is unique within keys). Usually, this is the measure that achieves better performance (but tests are always two-sided).
test.am2 the second association measure for significance tests (may be abbreviated to a prefix that is unique within keys)
test.step the step width for n-best lists where significance tests are carried out, as a multiple of n.step. The standard setting is 10 since the significance tests are based on the computationally expensive fisher.test function and since the triangles or arrows shown in the plot are fairly large. If test.step is not specified, the default supplied by ucs.par is used.
test.relevant a positive number, indicating the estimated precision differences that are considered “relevant” and that are marked by dark triangles or arrows in the plot. See below for details.
usercode a callback function that is invoked when the plot has been completed, but before the legend box is drawn. This feature is mainly used to add something to a plot that is written to a PostScript file. The usercode function is invoked with parameters region=c(x.min,x.max,y.min,y.max) and pr, a list of precision/recall tables (as returned by precision.recall) for each of the measures in keys.
file a character string giving the name of a PostScript file. If specified, the evaluation plot will be saved to file rather than displayed on screen. See evaluation.file for a function that combines both operations.
aspect a positive number specifying the desired aspect of the plot region (only available for PostScript files). In the default case x.axis="n.best", aspect refers to the absolute size of the plot region. Otherwise, it specifies the size ratio between percentage points on the x-axis and the y-axis. Setting aspect modifies the height of the plot (plot.height).
plot.width, plot.height the width and height of a plot that is written to a PostScript file, measured in inches. plot.height may be overridden by the aspect parameter, even if it is set explicitly.
cex character expansion factor for labels, annotations, and symbols in the plot (see par for details). If cex is not specified, the default supplied by ucs.par is used.
lex added to the line widths of evaluation graphs and some decorations (note that this is not an expansion factor). If lex is not specified, the default supplied by ucs.par is used.
bw if TRUE, the evaluation plot is drawn in black and white, which is mostly used in conjunction with file to produce figures for articles (defaults to FALSE). See below for details.
legend a vector of character strings or expressions, used as labels in the legend of the plot (e.g. to show mathematical symbols instead of the names of association measures). Use legend=NULL to suppress the display of a legend box.
bottom.legend if TRUE, draw legend box in bottom right corner of plot (default is top right corner).
title a character vector or expression to be used as the main title of the plot (optional)
... any other arguments are set as local graphics parameters (using par) before the evaluation plot is drawn
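
The quantities plotted for n-best lists can be illustrated with a minimal base R sketch. It does not use the UCS package: the helper nbest.precision.recall and the toy data are hypothetical and only mirror the roles of the tp, n.first and n.step parameters described above.

```r
# Hypothetical helper (not part of UCS): precision and recall of n-best lists.
# rank: ranking of the candidates according to one association measure (1 = best)
# tp:   logical vector of true positives, parallel to rank
nbest.precision.recall <- function(rank, tp, n.first = 100, n.step = 1) {
  tp.sorted <- tp[order(rank)]        # candidates in ranking order
  hits <- cumsum(tp.sorted)           # number of TPs among the n best candidates
  n <- seq(n.first, length(tp), by = n.step)
  data.frame(n = n,
             precision = 100 * hits[n] / n,        # percentage of TPs in the n-best list
             recall    = 100 * hits[n] / sum(tp))  # percentage of all TPs that were found
}

# Toy data (assumption): 1000 candidates, TPs more frequent at the top ranks
set.seed(42)
rank <- 1:1000
tp <- runif(1000) < seq(0.6, 0.05, length.out = 1000)
pr <- nbest.precision.recall(rank, tp, n.first = 100, n.step = 50)
head(pr)
```

Plotting pr$precision against pr$n corresponds to the default axes, while plotting it against pr$recall corresponds to a precision-by-recall graph.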


Details

When y.axis="local.precision", the evaluation graphs show local precision, i.e. an estimate for the density of true positives around the n-th rank according to the respective association measure. Local precision is smoothed using a kernel density estimate with a Gaussian kernel (from the density function), based on a symmetric window covering approximately window candidates (default: 400). Consequently, the resulting values do not have a clear-cut interpretation and should not be used to evaluate the performance of association measures. They are rather a means of exploratory data analysis, helping to visualise the relation between association scores and the true positives in a data set (see Evert, 2004, Sec. 5.2 for an example).
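
The idea of local precision can be sketched with the density function in base R. This is not the code used by evaluation.plot: the helper local.precision is hypothetical, the mapping from window to the kernel bandwidth (window / 4) is an assumption, and dividing by the density of all ranks is one possible way to reduce edge effects.

```r
# Hypothetical helper (not part of UCS): smoothed proportion of TPs around each rank.
local.precision <- function(rank, tp, window = 400, n.points = 512) {
  N <- length(rank)
  bw <- window / 4   # assumption: +/- 2 standard deviations span the window
  # Gaussian kernel density of the ranks of true positives ...
  d.tp  <- density(rank[tp], bw = bw, kernel = "gaussian",
                   from = 1, to = N, n = n.points)
  # ... and of all candidates, evaluated on the same grid
  d.all <- density(rank, bw = bw, kernel = "gaussian",
                   from = 1, to = N, n = n.points)
  data.frame(n = d.tp$x,
             local.precision = 100 * (sum(tp) * d.tp$y) / (N * d.all$y))
}

# Toy data (assumption): TPs become rarer towards the lower ranks
set.seed(42)
rank <- 1:1000
tp <- runif(1000) < seq(0.6, 0.05, length.out = 1000)
lp <- local.precision(rank, tp, window = 400)
plot(lp$n, lp$local.precision, type = "l",
     xlab = "n-best list", ylab = "local precision (%)")
```

A larger window gives a smoother but less local curve, which is why the resulting values are better suited to exploration than to evaluation.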

In order to generalise evaluation results beyond the specific data set on which they were obtained, it is necessary to compute confidence intervals for the observed precision values and to test whether the observed result differences are significant. See (Evert, 2004, Sec. 5.3) for the methods used and the interpretation of their results.

Confidence intervals are computed by setting conf=TRUE and selecting an association measure with the conf.am parameter. The confidence intervals are displayed as a coloured or shaded region around the precision graph of this measure (confidence intervals are not available for graphs of recall or local precision). The default confidence level of 95% will rarely need to be changed. Optionally, a second confidence region can be displayed for a measure selected with the conf.am2 parameter.
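
As a rough illustration of what such a confidence region contains, the sketch below computes exact binomial intervals for the precision of several n-best lists with binom.test. Whether evaluation.plot uses exactly this interval is an assumption; the method actually applied is described in Evert (2004, Sec. 5.3). The toy data are hypothetical.

```r
# Toy data (assumption): candidates already sorted by one association measure
set.seed(1)
tp <- runif(2000) < seq(0.5, 0.05, length.out = 2000)

n <- seq(100, 2000, by = 100)   # n-best lists to evaluate
k <- cumsum(tp)[n]              # number of TPs in each n-best list

# 95% confidence interval for the precision of each n-best list
ci <- t(sapply(seq_along(n), function(i) {
  binom.test(k[i], n[i], conf.level = 0.95)$conf.int
}))

res <- data.frame(n = n,
                  precision = 100 * k / n,
                  lower = 100 * ci[, 1],
                  upper = 100 * ci[, 2])
head(res)
```

The intervals become narrower for larger n, since more candidates contribute to each precision estimate.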

Significance tests for the result differences are activated by setting test=TRUE (not available for graphs of local precision). The evaluation results of two association measures (specified with test.am1 and test.am2) are compared for selected n-best lists, and significant differences are marked by coloured triangles or arrows (when x.axis="recall"). The default significance level of 0.05 will rarely need to be changed. Use the test.step parameter to control the spacing of the triangles or arrows.

A significant difference indicates that measure A is truly better than measure B, rather than appearing better merely by coincidence in a single evaluation experiment. Formally, this “true performance” can be defined as the average precision of a measure, obtained by averaging over many similar evaluation experiments. Thus, a significant difference means that the average precision of A is higher than that of B, but it does not indicate how great the difference is. A tiny difference (say, of half a percentage point) is hardly relevant for an application, even if there is significant evidence for it. If the test.relevant parameter is set, the evaluation.plot function attempts to estimate whether there is significant evidence for a relevant difference (of at least as many percentage points as given by the value of test.relevant), and marks such cases by darker triangles or arrows. This feature should be considered experimental and used with caution, as the computation involves many approximations and guesses (exact statistical inference for the difference in true precision not being available).
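
The following sketch shows how a two-sided comparison of two n-best lists might be set up with fisher.test. The construction of the contingency table (true and false positives among the candidates that occur in only one of the two lists) is an assumption made for illustration, as are the two simulated measures; it is not necessarily the table that evaluation.plot builds.

```r
# Toy data (assumption): two hypothetical measures A and B,
# where A separates true positives better than B
set.seed(7)
N <- 2000
tp <- runif(N) < 0.2
score.A <- tp * 1.0 + rnorm(N)
score.B <- tp * 0.3 + rnorm(N)

n <- 500                                            # size of the n-best lists
best.A <- order(score.A, decreasing = TRUE)[1:n]
best.B <- order(score.B, decreasing = TRUE)[1:n]

# Candidates in both lists cannot account for a difference in precision,
# so only the candidates unique to each list are compared (assumption)
only.A <- setdiff(best.A, best.B)
only.B <- setdiff(best.B, best.A)

tab <- rbind(A = c(TP = sum(tp[only.A]), FP = sum(!tp[only.A])),
             B = c(TP = sum(tp[only.B]), FP = sum(!tp[only.B])))

result <- fisher.test(tab)   # two-sided by default
result$p.value < 0.05        # TRUE if the difference is significant at the 0.05 level
```

A small p-value only says that the difference is unlikely to be a coincidence; the size of the difference has to be judged separately, which is the purpose of test.relevant.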

It goes without saying that confidence regions and significance tests do not allow evaluation results to be generalised to a different extraction task (i.e. another type of cooccurrences or another definition of true positives), or even to the same task under different conditions (such as a source corpus from a different domain, register, time, or a corpus of different size). The unpredictability of the performance of association measures for different extraction tasks or under different conditions has been confirmed by various evaluation studies.

Generally, evaluation plots can be drawn in two modes: colour (bw=FALSE, the default) or black and white (bw=TRUE). The styles of evaluation graphs are controlled by the respective settings in ucs.par, while the appearance of various other elements is hard-coded in the evaluation.plot function. In particular, confidence regions are either filled with a light background colour (colour mode) or shaded with diagonal lines (B/W mode). The triangles or arrows used to mark significant differences are yellow or red (indicating relevance) in colour mode, and light grey or dark grey (indicating relevance) in B/W mode. B/W mode is mainly used to produce PostScript files to be included as figures in articles, but can also be displayed on-screen for testing purposes.

The evaluation.plot function supports evaluation based on random samples, or RSE for short (Evert, 2004, Sec. 5.4). Missing values (NA) in the tp vector (or the b.TP variable in ds) are interpreted as unannotated candidates. In this case, precision, recall and local precision are computed as maximum-likelihood estimates based on the annotated candidates. Confidence intervals and significance tests, which should not be absent from any RSE, are adjusted accordingly. A confidence interval for the baseline precision is automatically shown (by thin dotted lines) when RSE is detected. Note that n-best lists (as shown on the x-axis) still refer to the full data set, not just to the number of annotated candidates.
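
The estimation of precision from a partially annotated data set can be sketched as follows. The toy data and variable names are hypothetical, and the estimate shown (proportion of true positives among the annotated candidates of each n-best list) is a plain reading of the description above rather than the code used by evaluation.plot.

```r
# Toy data (assumption): 5000 candidates sorted by one association measure,
# of which only a random sample of 1000 has been annotated
set.seed(3)
N <- 5000
tp <- runif(N) < seq(0.4, 0.05, length.out = N)
tp[sample(N, 4000)] <- NA            # NA marks unannotated candidates

n <- seq(500, N, by = 500)           # n-best lists refer to the full data set
annotated <- cumsum(!is.na(tp))[n]   # annotated candidates in each n-best list
hits <- cumsum(!is.na(tp) & tp)[n]   # annotated true positives in each n-best list

rse <- data.frame(n = n,
                  annotated = annotated,
                  precision.est = 100 * hits / annotated)
rse
```

Because each estimate rests on far fewer candidates than n, the sampling error is larger than for a fully annotated data set, which is why confidence intervals matter even more for RSE.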


The following functions are provided for compatibility with earlier versions of UCS/R: precision.plot, recall.plot, and recall.precision.plot. They are simple front-ends to evaluation.plot. The latter two use the implicit parameter settings y.axis="recall" (for recall.plot) and y.axis="precision", x.axis="recall" (for recall.precision.plot).


References

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart.

See Also

ucs.par, evaluation.file, read.ds.gz, and precision.recall. The R script ‘tutorial.R’ in the ‘script/’ directory provides a gentle introduction to the wide range of possibilities offered by the evaluation.plot function.

[Package UCS version 0.5 Index]