evaluation.plot {UCS} | R Documentation |
An implementation of evaluation graphs for the empirical
evaluation of association measures in terms of precision and recall,
as described in Evert (2004, Ch. 5). Graphs of precision, recall
and local precision for n-best lists, as well as precision-by-recall
graphs, are all provided by a single function, evaluation.plot.
evaluation.plot(ds, keys, tp=ds$b.TP,
                x.min=0, x.max=100, y.min=0, y.max=100,
                x.axis=c("n.best", "proportion", "recall"),
                y.axis=c("precision", "local.precision", "recall"),
                n.first=ucs.par("n.first"), n.step=ucs.par("n.step"),
                cut=NULL, window=400,
                show.baseline=TRUE, show.nbest=NULL, show.npair=NULL,
                conf=FALSE, conf.am=NULL, conf.am2=NULL,
                test=FALSE, test.am1=NULL, test.am2=NULL,
                test.step=ucs.par("test.step"), test.relevant=0,
                usercode=NULL, file=NULL, aspect=1,
                plot.width=6, plot.height=6,
                cex=ucs.par("cex"), lex=ucs.par("lex"),
                bw=FALSE, legend=NULL, bottom.legend=FALSE,
                title=NULL, ...)
ds: a UCS data set object, read in from a data set file with
the read.ds.gz function. ds must contain rankings for the
association measures listed in the keys parameter (use
add.ranks to add such rankings to a data set object).
keys: a character vector naming up to 10 association measures to
be evaluated. Each name may be abbreviated to a prefix that must
be unique within the measures annotated in ds.
Use the ds.find.am function to obtain a list of measures annotated
in the data set, and see the ucsam manpage in UCS/Perl for detailed
information about the association measures supported by the UCS
system (with the shell command ucsdoc ucsam).
tp: a logical vector indicating true positives, parallel to the
rows of the data set ds. If tp is not specified, the data set must
contain a variable named b.TP, which is used instead.
x.min, x.max: the limits of the x-axis in the plot, used to
“zoom in” to an interesting region. The interpretation of
the values depends on the x.axis parameter below. For
x.axis="n.best" (the default case), x.min and x.max refer to
n-best lists. Otherwise, they refer to percentages ranging from
0 to 100. By default, the full data set is shown.

y.min, y.max: the limits of the y-axis in the plot, used to
“zoom in” to an interesting region. The values are always
interpreted as percentages, ranging from 0 to 100. By default,
y.max is fitted to the evaluation graphs (unless
y.axis="recall", where y.max is always set to 100).
x.axis: selects the variable shown on the x-axis. Available choices
are the n-best list size n ("n.best", the default), the same as a
proportion of the full data set ("proportion"), and the recall as a
percentage ("recall"). The latter produces precision-by-recall
graphs (unless y.axis="recall" is selected at the same time, which
is rarely meaningful).
y.axis: selects the variable shown on the y-axis. Available choices
are the precision ("precision", the default), an estimate for local
precision ("local.precision", see details below), and the recall
("recall"). All three variables are shown as percentages ranging
from 0 to 100.
n.first: the smallest n-best list to be evaluated. Shorter
n-best lists typically lead to highly unstable evaluation graphs.
The standard setting is 100, but a higher value may be necessary for
random sample evaluation (see details below). If n.first is not
specified, the default supplied by ucs.par is used.

n.step: the step width for n-best lists in the evaluation
graphs. Initially, precision and recall are computed for all n-best
lists, but only every n.step-th one is plotted, yielding graphs
that look less jagged and reducing the size of generated PostScript
files (see the file parameter below). If n.step is not specified,
the default supplied by ucs.par is used.
cut: for each association measure, pretend that the data set
consists only of the cut highest-ranked candidates according
to this measure. This trick can be used to perform an evaluation
of n-best lists without having to annotate the full data set: the
candidates from all relevant n-best lists are combined into a single
data set file and cut is set to n.

window: the number of candidates to consider when estimating local
precision (default: 400), i.e. with the option
y.axis="local". Values below 400 or above 1000 are rarely
useful. See below for details.
show.baseline: if TRUE, show the baseline precision as a dotted
horizontal line with a label (this is the default). Not available
when y.axis="recall".

show.nbest: an integer vector of n-best lists that will be
indicated as thin vertical lines in the plot. When
x.axis="recall", the n-best lists are shown as diagonal lines.

show.npair: when x.axis="proportion", the total number of
candidates in ds is shown in the x-axis label. Set
show.npair=NULL to suppress this, or set it to an integer
value in order to lie about the number of candidates (rarely
useful).
conf: if TRUE, confidence intervals are shown as coloured
or shaded regions around one or two precision graphs. In this case,
the parameter conf.am must also be specified. Alternatively,
conf can be set to a number indicating the significance level
to be used for the confidence intervals (default: 0.05,
corresponding to 95% confidence). See below for details. Note
that conf is only available when y.axis="precision".

conf.am: the name of the association measure for which confidence
intervals are displayed (may be abbreviated to a prefix that is
unique within keys).

conf.am2: an optional second association measure, for which
confidence intervals will also be shown.
test: if TRUE, significance tests are carried out for the
differences between the evaluation results of two association
measures, given as test.am1 and test.am2 below.
Alternatively, test can be set to a number indicating the
significance level to be used for the tests (default: 0.05).
n-best lists where the result difference is significant are
indicated by arrows between the respective evaluation graphs (when
x.axis="recall") or by coloured triangles (otherwise). See
details below. Note that test is not available when
y.axis="local".
test.am1: the first association measure for significance tests
(may be abbreviated to a prefix that is unique within keys).
Usually, this is the measure that achieves better performance (but
tests are always two-sided).

test.am2: the second association measure for significance tests
(may be abbreviated to a prefix that is unique within keys).
test.step: the step width for n-best lists where significance
tests are carried out, as a multiple of n.step. The standard
setting is 10, since the significance tests are based on the
computationally expensive fisher.test function and since the
triangles or arrows shown in the plot are fairly large.
If test.step is not specified, the default supplied by
ucs.par is used.

test.relevant: a positive number: estimated precision differences
of at least this many percentage points are considered
“relevant” and are marked by dark triangles or arrows in the plot.
See below for details.
usercode: a callback function that is invoked when the plot has
been completed, but before the legend box is drawn. This feature is
mainly used to add something to a plot that is written to a
PostScript file. The usercode function is invoked with the
parameters region=c(x.min,x.max,y.min,y.max) and pr, a
list of precision/recall tables (as returned by
precision.recall) for each of the measures in keys.

file: a character string giving the name of a PostScript file.
If specified, the evaluation plot will be saved to file
rather than displayed on screen. See evaluation.file
for a function that combines both operations.
aspect: a positive number specifying the desired aspect ratio of
the plot region (only available for PostScript files). In the
default case x.axis="n.best", aspect refers to the absolute
size of the plot region. Otherwise, it specifies the size ratio
between percentage points on the x-axis and the y-axis. Setting
aspect modifies the height of the plot (plot.height).

plot.width, plot.height: the width and height of a plot that is
written to a PostScript file, measured in inches.
plot.height may be overridden by the aspect parameter,
even if it is set explicitly.
cex: a character expansion factor for labels, annotations, and
symbols in the plot (see par for details). If cex is not
specified, the default supplied by ucs.par is used.

lex: added to the line widths of evaluation graphs and some
decorations (note that this is not an expansion factor). If
lex is not specified, the default supplied by ucs.par is used.

bw: if TRUE, the evaluation plot is drawn in black and
white, which is mostly used in conjunction with file to
produce figures for articles (defaults to FALSE). See below
for details.
legend: a vector of character strings or expressions, used as
labels in the legend of the plot (e.g. to show mathematical symbols
instead of the names of association measures). Use
legend=NULL to suppress the display of a legend box.

bottom.legend: if TRUE, draw the legend box in the bottom right
corner of the plot (the default is the top right corner).

title: a character vector or expression to be used as the main
title of the plot (optional).

...: any other arguments are set as local graphics parameters
(using par) before the evaluation plot is drawn.
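As a sketch of typical usage (assuming the UCS/R functions have been loaded; the data set file name and the measure names t.score and MI are hypothetical placeholders for measures actually annotated in the data set):

```r
## read a UCS data set file (file name is hypothetical)
ds <- read.ds.gz("candidates.ds.gz")
ds <- add.ranks(ds)   # add rankings for the annotated association measures
ds.find.am(ds)        # list the measures available in ds

## precision graphs for two measures, zoomed in on the first 2000 candidates
evaluation.plot(ds, c("t.score", "MI"), x.max=2000)
```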
When y.axis="local.precision", the evaluation graphs show
local precision, i.e. an estimate for the density of true
positives around the n-th rank according to the respective association
measure. Local precision is smoothed using a kernel density estimate
with a Gaussian kernel (from the density function), based on
a symmetric window covering approximately window candidates
(default: 400). Consequently, the resulting values do not have a
clear-cut interpretation and should not be used to evaluate the
performance of association measures. They are rather a means of
exploratory data analysis, helping to visualise the relation between
association scores and the true positives in a data set (see Evert,
2004, Sec. 5.2 for an example).
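A sketch of such an exploratory plot (the measure names are placeholders for measures annotated in ds):

```r
## local precision with a wider smoothing window of ca. 800 candidates
evaluation.plot(ds, c("t.score", "MI"),
                y.axis="local.precision", window=800)
```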
In order to generalise evaluation results beyond the specific data set on which they were obtained, it is necessary to compute confidence intervals for the observed precision values and to test whether the observed result differences are significant. See (Evert, 2004, Sec. 5.3) for the methods used and the interpretation of their results.
Confidence intervals are computed by setting conf=TRUE
and selecting an association measure with the conf.am
parameter. The confidence intervals are displayed as a
coloured or shaded region around the precision graph of this measure
(confidence intervals are not available for graphs of recall or local
precision). The default confidence level of 95% will rarely need to
be changed. Optionally, a second confidence region can be displayed
for a measure selected with the conf.am2 parameter.
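For instance (measure names are placeholders):

```r
## 95% confidence region around the precision graph of t.score,
## with a second region for MI
evaluation.plot(ds, c("t.score", "MI"),
                conf=TRUE, conf.am="t.score", conf.am2="MI")

## the same with a 99% confidence level (significance level 0.01)
evaluation.plot(ds, c("t.score", "MI"), conf=0.01, conf.am="t.score")
```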
Significance tests for the result differences are activated by
setting test=TRUE (not available for graphs of local
precision). The evaluation results of two association measures
(specified with test.am1 and test.am2) are compared for
selected n-best lists, and significant differences are marked by
coloured triangles or arrows (when x.axis="recall"). The
default significance level of 0.05 will rarely need to be
changed. Use the test.step parameter to control the spacing of
the triangles or arrows.
A significant difference indicates that measure A is truly better than
measure B, rather than better just by coincidence in a single evaluation
experiment. Formally, this “true performance” can be defined
as the average precision of a measure, obtained by averaging over many
similar evaluation experiments. Thus, a significant difference
means that the average precision of A is higher than that of B, but it
does not indicate how great the difference is. A tiny difference
(say, of half a percentage point) is hardly relevant for an
application, even if there is significant evidence for it. If the
test.relevant parameter is set, the evaluation.plot
function attempts to estimate whether there is significant evidence
for a relevant difference (of at least as many percentage points as given
by the value of test.relevant), and marks such cases by darker
triangles or arrows. This feature should be considered experimental
and used with caution, as the computation involves many approximations
and guesses (exact statistical inference for the difference in true
precision not being available).
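The tests described above might be invoked as follows (measure names are placeholders):

```r
## mark n-best lists where the results of t.score and MI differ
## significantly; with test.relevant=5, darker marks indicate evidence
## for a difference of at least 5 percentage points
evaluation.plot(ds, c("t.score", "MI"),
                test=TRUE, test.am1="t.score", test.am2="MI",
                test.relevant=5)
```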
It goes without saying that confidence regions and significance tests do not allow evaluation results to be generalised to a different extraction task (i.e. another type of cooccurrences or another definition of true positives), or even to the same task under different conditions (such as a source corpus from a different domain, register, time, or a corpus of different size). The unpredictability of the performance of association measures for different extraction tasks or under different conditions has been confirmed by various evaluation studies.
Generally, evaluation plots can be drawn in two modes: colour
(bw=FALSE, the default) or black and white (bw=TRUE).
The styles of evaluation graphs are controlled by
the respective settings in ucs.par, while the appearance
of various other elements is hard-coded in the evaluation.plot
function. In particular, confidence regions are either filled with a
light background colour (colour mode) or shaded with diagonal lines
(B/W mode). The triangles or arrows used to mark significant
differences are yellow or red (indicating relevance) in colour mode,
and light grey or dark grey (indicating relevance) in B/W mode. B/W
mode is mainly used to produce PostScript files to be included as
figures in articles, but can also be displayed on-screen for testing
purposes.
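A sketch of a publication figure in B/W mode (the file name, measure names, and legend labels are placeholders):

```r
## B/W evaluation plot written to a PostScript file
evaluation.plot(ds, c("t.score", "MI"),
                file="evaluation.eps", bw=TRUE,
                plot.width=6, plot.height=4,
                legend=c("t-score", "mutual information"))
```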
The evaluation.plot function supports evaluation based on
random samples, or RSE for short (Evert, 2004, Sec. 5.4). Missing
values (NA) in the tp vector (or the b.TP variable in
ds) are interpreted as unannotated candidates. In
this case, precision, recall and local precision are computed as
maximum-likelihood estimates based on the annotated candidates.
Confidence intervals and significance tests, which should not be
absent from any RSE, are adjusted accordingly. A confidence interval
for the baseline precision is automatically shown (by thin dotted
lines) when RSE is detected. Note that n-best lists (as shown on the
x-axis) still refer to the full data set, not just to the number of
annotated candidates.
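A sketch of an RSE run (the measure names are placeholders; b.TP is assumed to contain NA for all unannotated candidates):

```r
## random sample evaluation: NA entries in b.TP mark unannotated
## candidates; a larger n.first improves stability, and confidence
## intervals should always be shown
evaluation.plot(ds, c("t.score", "MI"),
                n.first=1000,
                conf=TRUE, conf.am="t.score")
```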
The following functions are provided for compatibility with earlier
versions of UCS/R: precision.plot, recall.plot, and
recall.precision.plot. They are simple front-ends to
evaluation.plot with the implicit parameter settings
y.axis="recall" (for recall.plot) and
y.axis="precision", x.axis="recall" (for recall.precision.plot).
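For illustration, the following pairs of calls are equivalent (measure names are placeholders):

```r
## compatibility front-end and its evaluation.plot equivalent
recall.plot(ds, c("t.score", "MI"))
evaluation.plot(ds, c("t.score", "MI"), y.axis="recall")

## precision-by-recall graphs
recall.precision.plot(ds, c("t.score", "MI"))
evaluation.plot(ds, c("t.score", "MI"),
                y.axis="precision", x.axis="recall")
```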
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart.
ucs.par, evaluation.file, read.ds.gz, and precision.recall.
The R script ‘tutorial.R’ in the ‘script/’ directory
provides a gentle introduction to the wide range of possibilities
offered by the evaluation.plot function.