evaluation.plot {UCS} | R Documentation |

An implementation of evaluation graphs for the empirical
evaluation of association measures in terms of precision and recall,
as described in Evert (2004, Ch. 5). Graphs of precision, recall
and local precision for n-best lists, as well as precision-by-recall
graphs, are all provided by the single function `evaluation.plot`.

evaluation.plot(ds, keys, tp=ds$b.TP,
                x.min=0, x.max=100, y.min=0, y.max=100,
                x.axis=c("n.best", "proportion", "recall"),
                y.axis=c("precision", "local.precision", "recall"),
                n.first=ucs.par("n.first"), n.step=ucs.par("n.step"),
                cut=NULL, window=400,
                show.baseline=TRUE, show.nbest=NULL, show.npair=NULL,
                conf=FALSE, conf.am=NULL, conf.am2=NULL,
                test=FALSE, test.am1=NULL, test.am2=NULL,
                test.step=ucs.par("test.step"), test.relevant=0,
                usercode=NULL, file=NULL, aspect=1,
                plot.width=6, plot.height=6,
                cex=ucs.par("cex"), lex=ucs.par("lex"),
                bw=FALSE, legend=NULL, bottom.legend=FALSE,
                title=NULL, ...)

`ds` |
a UCS data set object, read in from a data set file with
the `read.ds.gz` function. `ds` must contain
rankings for the association measures listed in the `keys`
parameter (use `add.ranks` to add such rankings to a
data set object). |

`keys` |
a character vector naming up to 10 association measures to
be evaluated. Each name may be abbreviated to a prefix that must
be unique within the measures annotated in `ds` .
Use the `ds.find.am` function to obtain a
list of measures annotated in the data set, and see the `ucsam`
manpage in UCS/Perl for detailed information about the association
measures supported by the UCS system (with the shell command
`ucsdoc ucsam` ). |

`tp` |
a logical vector indicating true positives, parallel to the
rows of the data set `ds` . If `tp` is not specified, the
data set must contain a variable named `b.TP` which is used
instead. |

`x.min, x.max` |
the limits of the x-axis in the plot, used to
“zoom in” to an interesting region. The interpretation of
the values depends on the `x.axis` parameter below. For
`x.axis="n.best"` (the default case), `x.min` and
`x.max` refer to n-best lists. Otherwise, they refer to
percentages ranging from 0 to 100. By default, the full data set is
shown. |

`y.min, y.max` |
the limits of the y-axis in the plot, used to
“zoom in” to an interesting region. The values are always
interpreted as percentages, ranging from 0 to 100. By default,
`y.max` is fitted to the evaluation graphs (unless
`y.axis="recall"` , where `y.max` is always set to 100). |

`x.axis` |
select variable shown on x-axis. Available choices are
the n-best list size n (`"n.best"` , the default), the
same as a proportion of the full data set (`"proportion"` ), and
the recall as a percentage (`"recall"` ). The latter produces
precision-by-recall graphs. Unless you are silly enough to specify
`y.axis="recall"` at the same time, that is. |

`y.axis` |
select variable shown on y-axis. Available choices are
the precision (`"precision"` , the default), an estimate for
local precision (`"local.precision"` , see details below), and
the recall (`"recall"` ). All three variables are shown as
percentages ranging from 0 to 100. |

`n.first` |
the smallest n-best list to be evaluated. Shorter
n-best lists typically lead to highly unstable evaluation graphs.
The standard setting is 100, but a higher value may be necessary for
random sample evaluation (see details below). If `n.first` is
not specified, the default supplied by `ucs.par` is
used. |

`n.step` |
the step width for n-best lists in the evaluation
graphs. Initially, precision and recall are computed for all n-best
lists, but only every `n.step` -th one is plotted, yielding
graphs that look less jagged and reducing the size of generated
PostScript files (see the `file` parameter below). If
`n.step` is not specified, the default supplied by
`ucs.par` is used. |

`cut` |
for each association measure, pretend that the data set
consists only of the `cut` highest-ranked candidates according
to this measure. This trick can be used to perform an evaluation
of n-best lists without having to annotate the full data set. The
candidates from all relevant n-best lists are combined into a single
data set file and `cut` is set to n. |

`window` |
number of candidates to consider when estimating local
precision (default: 400), i.e. with the option
`y.axis="local"` . Values below 400 or above 1000 are rarely
useful. See below for details. |

`show.baseline` |
if `TRUE` , show baseline precision as dotted
horizontal line with label (this is the default). Not available
when `y.axis="recall"` . |

`show.nbest` |
integer vector of n-best lists that will be
indicated as thin vertical lines in the plot. When
`x.axis="recall"` , the n-best lists are shown as diagonal
lines. |

`show.npair` |
when `x.axis="proportion"` , the total number of
candidates in `ds` is shown in the x-axis label. Set
`show.npair=NULL` to suppress this, or set it to an integer
value in order to lie about the number of candidates (rarely
useful). |

`conf` |
if `TRUE` , confidence intervals are shown as coloured
or shaded regions around one or two precision graphs. In this case,
the parameter `conf.am` must also be specified. Alternatively,
`conf` can be set to a number indicating the significance level
to be used for the confidence intervals (default: 0.05,
corresponding to 95% confidence). See below for details. Note
that `conf` is only available when `y.axis="precision"` . |

`conf.am` |
name of the association measure for which confidence
intervals are displayed (may be abbreviated to a prefix that is
unique within `keys` ) |

`conf.am2` |
optional second association measure, for which confidence intervals will also be shown |

`test` |
if `TRUE` , significance tests are carried out for the
differences between the evaluation results of two association
measures, given as `test.am1` and `test.am2` below.
Alternatively, `test` can be set to a number indicating the
significance level to be used for the tests (default: 0.05).
n-best lists where the result difference is significant are
indicated by arrows between the respective evaluation graphs (when
`x.axis="recall"` ) or by coloured triangles (otherwise). See
details below. Note that `test` is not available when
`y.axis="local"` . |

`test.am1` |
the first association measure for significance tests
(may be abbreviated to a prefix that is unique within `keys` ).
Usually, this is the measure that achieves better performance (but
tests are always two-sided). |

`test.am2` |
the second association measure for significance tests
(may be abbreviated to a prefix that is unique within `keys` ) |

`test.step` |
the step width for n-best lists where significance
tests are carried out, as a multiple of `n.step` . The standard
setting is 10 since the significance tests are based on the
computationally expensive `fisher.test` function and
since the triangles or arrows shown in the plot are fairly large.
If `test.step` is not specified, the default supplied by
`ucs.par` is used. |

`test.relevant` |
a positive number, indicating the estimated precision differences that are considered “relevant” and that are marked by dark triangles or arrows in the plot. See below for details. |

`usercode` |
a callback function that is invoked when the plot has
been completed, but before the legend box is drawn. This feature is
mainly used to add something to a plot that is written to a
PostScript file. The `usercode` function is invoked with
parameters `region=c(x.min,x.max,y.min,y.max)` and `pr` , a
list of precision/recall tables (as returned by
`precision.recall` ) for each of the measures in
`keys` . |

`file` |
a character string giving the name of a PostScript file.
If specified, the evaluation plot will be saved to `file`
rather than displayed on screen. See `evaluation.file`
for a function that combines both operations. |

`aspect` |
a positive number specifying the desired aspect of the
plot region (only available for PostScript files). In the default
case `x.axis="n.best"` , `aspect` refers to the absolute
size of the plot region. Otherwise, it specifies the size ratio
between percentage points on the x-axis and the y-axis. Setting
`aspect` modifies the height of the plot (`plot.height` ). |

`plot.width, plot.height` |
the width and height of a plot that is
written to a PostScript file, measured in inches.
`plot.height` may be overridden by the `aspect` parameter,
even if it is set explicitly. |

`cex` |
character expansion factor for labels, annotations, and
symbols in the plot (see `par` for details). If `cex` is
not specified, the default supplied by `ucs.par` is
used. |

`lex` |
added to the line widths of evaluation graphs and some
decorations (note that this is not an expansion factor). If
`lex` is not specified, the default supplied by
`ucs.par` is used. |

`bw` |
if `TRUE` , the evaluation plot is drawn in black and
white, which is mostly used in conjunction with `file` to
produce figures for articles (defaults to `FALSE` ). See below
for details. |

`legend` |
a vector of character strings or expressions, used as
labels in the legend of the plot (e.g. to show mathematical symbols
instead of the names of association measures). Use
`legend=NULL` to suppress the display of a legend box. |

`bottom.legend` |
if `TRUE` , draw legend box in bottom right
corner of plot (default is top right corner). |

`title` |
a character vector or expression to be used as the main title of the plot (optional) |

`...` |
any other arguments are set as local graphics parameters
(using `par` ) before the evaluation plot is drawn |

When `y.axis="local.precision"` , the evaluation graphs show
**local precision**, i.e. an estimate for the density of true
positives around the n-th rank according to the respective association
measure. Local precision is smoothed using a kernel density estimate
with a Gaussian kernel (from the `density` function), based on
a symmetric window covering approximately `window` candidates
(default: 400). Consequently, the resulting values do not have a
clear-cut interpretation and should not be used to evaluate the
performance of association measures. They are rather a means of
exploratory data analysis, helping to visualise the relation between
association scores and the true positives in a data set (see Evert,
2004, Sec. 5.2 for an example).
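As a sketch of such an exploratory plot (the data set file and
measure names are illustrative, not taken from this manpage, and the
UCS/R functions are assumed to be loaded):

```r
## Sketch: visualise local precision for two measures
## (file and measure names are hypothetical examples)
ds <- read.ds.gz("glosses.ds.gz")          # annotated UCS data set
ds <- add.ranks(ds, c("t.score", "MI"))    # add rankings for the measures
evaluation.plot(ds, c("t.score", "MI"),
                y.axis="local.precision",  # smoothed density of TPs
                window=600)                # wider smoothing window
```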

In order to generalise evaluation results beyond the specific data set on which they were obtained, it is necessary to compute confidence intervals for the observed precision values and to test whether the observed result differences are significant. See (Evert, 2004, Sec. 5.3) for the methods used and the interpretation of their results.

**Confidence intervals** are computed by setting `conf=TRUE`
and selecting an association measure with the `conf.am`
parameter. The confidence intervals are displayed as a
coloured or shaded region around the precision graph of this measure
(confidence intervals are not available for graphs of recall or local
precision). The default confidence level of 95% will rarely need to
be changed. Optionally, a second confidence region can be displayed
for a measure selected with the `conf.am2` parameter.
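For illustration (again with hypothetical file and measure names),
confidence regions for one or two measures could be requested as
follows:

```r
## Sketch: precision graphs with 95% confidence regions
## (file and measure names are hypothetical examples)
ds <- read.ds.gz("glosses.ds.gz")
evaluation.plot(ds, c("t.score", "MI"),
                conf=TRUE,          # default significance level 0.05
                conf.am="t.score",  # region around this measure's graph
                conf.am2="MI")      # optional second region
```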

**Significance tests** for the result differences are activated by
setting `test=TRUE` (not available for graphs of local
precision). The evaluation results of two association measures
(specified with `test.am1` and `test.am2` ) are compared for
selected n-best lists, and significant differences are marked by
coloured triangles or arrows (when `x.axis="recall"` ). The
default significance level of *0.05* will rarely need to be
changed. Use the `test.step` parameter to control the spacing of
the triangles or arrows.
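A sketch of such a comparison (hypothetical file and measure names;
the UCS/R functions are assumed to be loaded):

```r
## Sketch: mark n-best lists where two measures differ significantly
## (file and measure names are hypothetical examples)
ds <- read.ds.gz("glosses.ds.gz")
evaluation.plot(ds, c("t.score", "MI"),
                test=0.01,          # stricter significance level
                test.am1="t.score", # usually the better-performing measure
                test.am2="MI",
                test.step=5)        # test every 5th plotted n-best list
```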

A significant difference indicates that measure A is truly better than
measure B, rather than just by coincidence in a single evaluation
experiment. Formally, this “true performance” can be defined
as the average precision of a measure, obtained by averaging over many
similar evaluation experiments. Thus, a significant difference
means that the average precision of A is higher than that of B, but it
does not indicate how great the difference is. A tiny difference
(say, of half a percentage point) is hardly **relevant** for an
application, even if there is significant evidence for it. If the
`test.relevant` parameter is set, the `evaluation.plot`
function attempts to estimate whether there is significant evidence
for a relevant difference (of at least as many percentage points as
given by the value of `test.relevant` ), and marks such cases by
darker triangles or arrows. This feature should be considered
experimental and used with caution, as the computation involves many
approximations and guesses (exact statistical inference for the
difference in true precision not being available).

It goes without saying that confidence regions and significance tests do not allow evaluation results to be generalised to a different extraction task (i.e. another type of cooccurrences or another definition of true positives), or even to the same task under different conditions (such as a source corpus from a different domain, register, time, or a corpus of different size). The unpredictability of the performance of association measures for different extraction tasks or under different conditions has been confirmed by various evaluation studies.

Generally, evaluation plots can be drawn in two modes: **colour**
(`bw=FALSE` , the default) or **black and white**
(`bw=TRUE` ). The styles of evaluation graphs are controlled by
the respective settings in `ucs.par` , while the appearance
of various other elements is hard-coded in the `evaluation.plot`
function. In particular, confidence regions are either filled with a
light background colour (colour mode) or shaded with diagonal lines
(B/W mode). The triangles or arrows used to mark significant
differences are yellow or red (indicating relevance) in colour mode,
and light grey or dark grey (indicating relevance) in B/W mode. B/W
mode is mainly used to produce PostScript files to be included as
figures in articles, but can also be displayed on-screen for testing
purposes.
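A typical use of B/W mode together with the `file` parameter might
look like this (output file name is a hypothetical example):

```r
## Sketch: black-and-white precision-by-recall plot written to a
## PostScript file for inclusion in an article
## (file and measure names are hypothetical examples)
ds <- read.ds.gz("glosses.ds.gz")
evaluation.plot(ds, c("t.score", "MI"),
                x.axis="recall",    # precision-by-recall graphs
                bw=TRUE,            # black-and-white styles
                file="eval-graphs.ps",
                plot.width=6, aspect=1)
```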

The `evaluation.plot` function supports **evaluation based on
random samples**, or RSE for short (Evert, 2004, Sec. 5.4). Missing
values (`NA` ) in the `tp` vector (or the `b.TP`
variable in `ds` ) are interpreted as unannotated candidates. In
this case, precision, recall and local precision are computed as
maximum-likelihood estimates based on the annotated candidates.
Confidence intervals and significance tests, which should not be
absent from any RSE, are adjusted accordingly. A confidence interval
for the baseline precision is automatically shown (by thin dotted
lines) when RSE is detected. Note that n-best lists (as shown on the
x-axis) still refer to the full data set, not just to the number of
annotated candidates.

The following functions are provided for compatibility with earlier
versions of UCS/R: `precision.plot` , `recall.plot` , and
`recall.precision.plot` . They are simple front-ends to
`evaluation.plot` with the implicit parameter settings
`y.axis="recall"` and `y.axis="precision", x.axis="recall"`
for the latter two.
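In other words, the legacy front-ends correspond to the following
`evaluation.plot` calls (a sketch; the exact argument lists of the
front-ends are assumed to match those of `evaluation.plot`):

```r
## Sketch: legacy front-ends and their evaluation.plot equivalents
## (data set file and measure names are hypothetical examples)
ds <- read.ds.gz("glosses.ds.gz")
keys <- c("t.score", "MI")
recall.plot(ds, keys)            # = evaluation.plot(ds, keys, y.axis="recall")
recall.precision.plot(ds, keys)  # = evaluation.plot(ds, keys,
                                 #     y.axis="precision", x.axis="recall")
```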

Evert, Stefan (2004). *The Statistics of Word Cooccurrences: Word
Pairs and Collocations.* PhD Thesis, IMS, University of Stuttgart.

`ucs.par` , `evaluation.file` , `read.ds.gz` , and
`precision.recall` .
The **R** script ‘tutorial.R’ in the ‘script/’ directory
provides a gentle introduction to the wide range of possibilities
offered by the `evaluation.plot` function.

[Package *UCS* version 0.5 Index]