ucsam - Association measures in UCS/Perl

The statistical analysis of cooccurrence data is usually based on **association measures**,
mathematical formulae that compute an **association score** from the joint and marginal frequencies of a pair type (which are called a **frequency signature** in UCS).
This score is a single floating-point number indicating the amount of statistical association between the components of the pair type.
Association measures can often be written conveniently in terms of a **contingency table** of observed frequencies and the corresponding expected frequencies under the null hypothesis that there is no association.

For instance,
the word pair *black box* occurs 123 times in the *British National Corpus* (BNC),
so its joint frequency is *f = 123*.
The adjective *black* has a total of 13,168 occurrences,
and the noun *box* has 1,810 occurrences,
giving marginal frequencies of *f1 = 13,168* and *f2 = 1,810*.
From these data,
the **MI** measure computes an association score of *1.4*,
while the **log.likelihood** measure computes a score of *567.72*.
Both scores indicate a clear positive association,
but they cannot be compared directly: each measure has its own scale.
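The step from a frequency signature to a contingency table can be sketched in plain Perl. This is a minimal illustration that does not use the UCS modules; the function name `contingency_table` is hypothetical, and since the sample size *N* is not stated above, the value used in the example is made up.

```perl
use strict;
use warnings;

# Build observed and expected contingency tables from a frequency signature.
# $f = joint frequency, $f1 / $f2 = marginal frequencies, $N = sample size.
sub contingency_table {
    my ($f, $f1, $f2, $N) = @_;
    my %O = (
        O11 => $f,          O12 => $f1 - $f,
        O21 => $f2 - $f,    O22 => $N - $f1 - $f2 + $f,
    );
    # Row and column sums of the observed table
    my ($R1, $R2) = ($f1, $N - $f1);
    my ($C1, $C2) = ($f2, $N - $f2);
    # Expected frequencies under the null hypothesis of independence
    my %E = (
        E11 => $R1 * $C1 / $N,  E12 => $R1 * $C2 / $N,
        E21 => $R2 * $C1 / $N,  E22 => $R2 * $C2 / $N,
    );
    return (\%O, \%E);
}

# Frequencies of "black box" from the text; N = 5,000,000 is an assumed value.
my ($O, $E) = contingency_table(123, 13_168, 1_810, 5_000_000);
printf "O11 = %d, E11 = %.3f\n", $O->{O11}, $E->{E11};
```

Both tables sum to *N*, which is a quick sanity check for any frequency signature.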

A more detailed explanation of contingency tables and association scores as well as a comprehensive inventory of association measures with equations given in terms of observed and expected frequencies can be found on-line at *http://www.collocations.de/AM/*.
Also see the ucsfile manpage to find out how frequency signatures,
contingency tables and association scores are represented in UCS **data set** files.

**UCS/Perl** supports more than 40 different association measures and variants.
In order to keep them manageable,
the measures are organised in several **packages**: a core set of widely-used "standard" measures is complemented by add-on packages for advanced users.
Each package is implemented by a separate Perl module.
Consult the module's manpage for a full listing of measures in the package and detailed descriptions.
Listings of add-on packages,
association measures,
and some additional information can also be printed with the **ucs-list-am** program (see the ucs-list-am manpage).

Currently, there are two add-on packages in addition to the standard measures.

**UCS::AM** (the "standard" measures) -
This core set contains all well-known association measures such as **MI**, **t-score**, and **log-likelihood** (see the listing in the Section "SOME ASSOCIATION MEASURES" below). These measures are also made available by various other tools (e.g. the NSP toolkit, see *http://www.d.umn.edu/~tpederse/nsp.html*) and they have often been used in applications as well as for scientific research. The **UCS::AM** package also includes several other "simple" measures that are inexpensive to compute and numerically unproblematic.

Association measures in the core set can be thought of as the "built-in" measures of UCS/Perl (although the add-on packages are also part of the distribution). They are automatically supported by tools such as **ucs-add**, while the other packages have to be loaded explicitly (see below).

See the UCS::AM manpage for details.

**UCS::AM::HTest** (measures based on hypothesis tests) -
Many association measures are based on asymptotic statistical hypothesis tests. The test statistic is used as an association score and can be interpreted (i.e. translated into a **p-value**) with the help of its known limiting distribution. The **UCS::AM::HTest** package provides p-values for all such association measures as well as the "original" two-tailed versions of some tests (the core set includes only one-tailed versions).

See the UCS::AM::HTest manpage for details.

**UCS::AM::Parametric** (parametric measures) -
This package implements a new approach in which the equation of a parametric association measure is not completely fixed in advance. One or more parameters can be adjusted to obtain a version of the measure that is optimised for a particular task or data set. Control over the parameters is only available through the programming interface. For command-line use, special versions of these measures are provided with a pre-set parameter value, which is indicated by the name of the measure.

See the UCS::AM::Parametric manpage for details.

In UCS/Perl scripts, both the standard measures and the add-on packages have to be loaded with **use** statements (e.g. `use UCS::AM;` for the core set).
Association measures are implemented as **UCS::Expression** objects (see the UCS::Expression manpage).
The **UCS** module maintains a registry of loaded measures with additional information and an evaluation function (see Section "ASSOCIATION MEASURE REGISTRY" in the UCS manpage).
When one of the packages above is loaded,
its measures are automatically added to this registry.
Association scores can be computed more efficiently for in-memory data sets,
using the **add** method in the **UCS::DS::Memory** module (see the UCS::DS::Memory manpage).

In the **ucs-add** program, the standard measures are pre-defined, and extension packages can be loaded with the `-x` option.
Only the last part of the package name has to be specified here (e.g. `HTest` for the **UCS::AM::HTest** package).
It is case-insensitive and may be abbreviated to a unique prefix (so both `-x htest` and `-x ht` work as well).
See the ucs-add manpage for more information on how to compute association scores with the **ucs-add** program.

This section briefly lists the most well-known association measures available in UCS/Perl,
all of which are defined in the "standard" package **UCS::AM**.
See the on-line resource at *http://www.collocations.de/AM/* for full equations and the UCS::AM manpage for details.

**MI** (Mutual Information) -
The mutual information (MI) measure is a maximum-likelihood estimate for the (logarithmic) *strength of the statistical association* between the components of a pair type. It was introduced into the field of computational lexicography by Church & Hanks (1990), who derived it from the information-theoretic notion of *point-wise mutual information*. Positive values indicate positive association while negative values indicate dissociation (where the components have a tendency *not* to occur together).

Note that unlike the original version of Church & Hanks (1990), the UCS implementation computes a base 10 logarithm.

**t.score** (t-score) -
The MI measure is prone to overestimate association strength, especially for low-frequency cooccurrences. Church *et al.* (1991) use a version of Student's *t* test (whose test statistic is called a *t-score*) to ensure that the association detected by MI is supported by a *significant* amount of evidence. Although their application of Student's test is highly questionable, the combination of MI and t.score has become a *de facto* standard in British computational lexicography.

**chi.squared**, **chi.squared.corr** (chi-squared test) -
Pearson's chi-squared test is the standard test for statistical independence in a *2 x 2* contingency table, and is much more appropriate as a measure of the *significance of association* than t.score. Despite its central role in mathematical statistics, it has not been very widely used on cooccurrence data. In particular, t.score was found to be much more useful for the extraction of collocations from text corpora (cf. Evert & Krenn, 2001).

The "textbook" form of Pearson's chi-squared test is a two-tailed version that does not distinguish between positive and negative association. The chi.squared measure implemented in UCS/Perl has been converted to a one-sided test with the help of a heuristic decision rule. Since contingency tables often contain cells with small values, Yates' continuity correction should be applied to the test statistic (chi.squared.corr).

**log.likelihood** (likelihood ratio test) -
Dunning (1993) showed that the disappointing performance of chi.squared in collocation extraction tasks is due to a drastic overestimation of the significance of low-frequency cooccurrences (because of a poor approximation to its limiting distribution). He suggested using a likelihood ratio test instead, whose natural logarithm has the same limiting distribution as chi.squared. Under the name *log-likelihood*, this association measure has become a generally accepted standard in the field of computational linguistics.

Like the chi-squared test, the likelihood ratio test is two-sided, and the log.likelihood measure has been converted to a one-sided test with the same heuristic decision rule. Both chi.squared and log.likelihood return the value of their test statistic, which has to be interpreted in terms of the known limiting distribution. More meaningful **p-values** for both measures are available in the UCS::AM::HTest package.

**Fisher.pv** (Fisher's exact test) -
Although log.likelihood achieves a much better approximation to its limiting distribution than chi.squared (or chi.squared.corr), it is still an asymptotic test and provides only an approximate p-value. Pedersen (1996) argued in favour of Fisher's exact test for the independence of rows and columns in a contingency table, in order to remove the remaining inaccuracy of the log-likelihood ratio. A drawback of Fisher's test is that it is numerically expensive and that naive implementations can easily become unstable.

The Fisher.pv measure implements a one-sided test. It returns an exact **p-value**, which can be compared directly with the p-values of chi.squared and log.likelihood.

**Dice** (Dice coefficient) -
The Dice coefficient is a measure from the field of information retrieval, which has been used by Smadja (1993) and others for collocation extraction. Like MI, it is a maximum-likelihood estimate of *association strength*, but its definition of "strength" differs greatly from point-wise mutual information. It suffers from the same overestimation problem as MI, although this is mitigated by its different approach to association strength.
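For orientation, some of the measures listed above can be sketched in plain Perl from a frequency signature (*f*, *f1*, *f2*, *N*). The function names are hypothetical, and the formulas follow the commonly published definitions rather than the UCS::AM source code, which may treat special cases (such as zero frequencies) differently. The sign rule in `log_likelihood` (negative when the observed joint frequency is below its expected value) is an assumption about the heuristic decision rule mentioned above.

```perl
use strict;
use warnings;

sub log10 { return log($_[0]) / log(10) }

# Expected joint frequency under independence
sub expected { my ($f, $f1, $f2, $N) = @_; return $f1 * $f2 / $N }

# MI: base 10 logarithm of observed / expected (requires f > 0)
sub mi      { my @s = @_; return log10($s[0] / expected(@s)) }

# t-score: (observed - expected) / sqrt(observed) (requires f > 0)
sub t_score { my @s = @_; return ($s[0] - expected(@s)) / sqrt($s[0]) }

# Dice coefficient: 2f / (f1 + f2), independent of N
sub dice    { my ($f, $f1, $f2) = @_; return 2 * $f / ($f1 + $f2) }

# Log-likelihood statistic, 2 * sum of O * ln(O / E) over the four cells
sub log_likelihood {
    my ($f, $f1, $f2, $N) = @_;
    my @O = ($f, $f1 - $f, $f2 - $f, $N - $f1 - $f2 + $f);
    my @E = ($f1 * $f2 / $N,          $f1 * ($N - $f2) / $N,
             ($N - $f1) * $f2 / $N,   ($N - $f1) * ($N - $f2) / $N);
    my $g2 = 0;
    for my $i (0 .. 3) {
        # A cell with O = 0 contributes nothing (limit of x * ln x at 0)
        $g2 += $O[$i] * log($O[$i] / $E[$i]) if $O[$i] > 0;
    }
    $g2 *= 2;
    # Assumed sign rule: negative score when the pair is rarer than expected
    return $O[0] < $E[0] ? -$g2 : $g2;
}

# Illustrative signature; the sample size N = 1000 is made up.
my @sig = (50, 100, 100, 1000);
printf "MI = %.3f  t = %.3f  Dice = %.3f  G2 = %.2f\n",
    mi(@sig), t_score(@sig), dice(@sig), log_likelihood(@sig);
```

When the observed joint frequency equals its expected value, MI, t-score and the log-likelihood statistic are all zero, which matches the reading of scores close to zero as a sign of independence.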

Church, K. W. and Hanks, P. (1990). Word association norms, mutual information, and lexicography. *Computational Linguistics* **16**(1), 22-29.

Church, K. W.; Gale, W.; Hanks, P.; Hindle, D. (1991). Using statistics in lexical analysis. In: *Lexical Acquisition: Using On-line Resources to Build a Lexicon*, Lawrence Erlbaum, pages 115-164.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. *Computational Linguistics* **19**(1), 61-74.

Evert, S. and Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In: *Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics*, Toulouse, France, pages 188-195.

Pedersen, T. (1996). Fishing for exactness. In: *Proceedings of the South-Central SAS Users Group Conference*, Austin, TX.

Smadja, F. (1993). Retrieving collocations from text: Xtract. *Computational Linguistics* **19**(1), 143-177.

UCS/Perl uses some conventions for the names of association measures and the computed association scores, which are described in this section. It is important to be aware of such conventions, especially when they deviate from those used by other software packages.

The **names of association measures** are taken from the on-line inventory at *http://www.collocations.de/AM/*.
Hyphen characters (`-`) are replaced by periods (`.`) to conform with the UCS standards (see the ucsfile manpage).
Capitalisation is preserved (`MI` and `Fisher.pv`, but `log.likelihood`) and subscripts are included in the name, separated by a period (`chi.squared.corr`, where `corr` is a subscript in the original name).

Association scores are always arranged so that **higher scores** indicate stronger (positive) association,
applying a transformation to the original values if necessary.
In the one-sided versions of two-sided tests (e.g. `chi.squared` and `log.likelihood`), negative scores indicate negative association (while positive scores indicate positive association).
Scores close to zero are a sign of statistical independence.
Some other measures such as `MI` also have this property, but many do not (e.g. `Fisher.pv` or `Dice`).

"Explicit" logarithms in the equation of an association measure are usually taken to the **base 10** (e.g. in the `MI` measure).
This is not the case when the association score is not interpreted as a logarithm (e.g. the `log.likelihood` measure, which is a test statistic approximating a known limiting distribution) and the natural logarithm is required for correct interpretation.
The use of base 10 logarithms is always pointed out in the documentation (see the UCS::AM manpage).
The logarithm of infinity is represented by a large floating-point value returned by the **inf** function (from the UCS::Expression::Func module).
Comparison with `+inf()` and `-inf()` can be used to detect a positive or negative infinite value.
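Other software often reports logarithmic scores such as point-wise mutual information in bits (base 2), so a base 10 score may need to be rescaled before comparison. The helper below is a minimal sketch in plain Perl; `convert_log_base` is a hypothetical name and not part of UCS.

```perl
use strict;
use warnings;

# Rescale a base 10 logarithmic score to another base (default: base 2).
# Uses the identity log_b(x) = log_10(x) * ln(10) / ln(b).
sub convert_log_base {
    my ($score10, $base) = @_;
    $base = 2 unless defined $base;
    return $score10 * log(10) / log($base);
}

# The base 10 MI score 1.4 from the "black box" example, expressed in bits
printf "%.3f\n", convert_log_base(1.4);
```

The rescaling multiplies every score by the same positive constant, so the ranking of pair types is unaffected.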

The scores of association measures with the extension `.pv` represent a p-value (from an exact test or the approximate p-value of an asymptotic test).
Unlike most other scores, p-values can be compared directly between different measures.
They are represented as **negative base 10 logarithms**, so the association score 3.0 corresponds to a p-value of 0.001 = 1e-3 (`+inf()` stands for zero probability, usually the result of an underflow error).
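The conversion between a `.pv` score and the underlying p-value is simple arithmetic and can be sketched as follows. Both function names are hypothetical; the sketch does not handle a p-value of zero, which UCS represents with `+inf()` as described above.

```perl
use strict;
use warnings;

# Negative base 10 logarithm of a p-value (p must be greater than 0)
sub pv_to_score { my ($p) = @_; return -log($p) / log(10) }

# Inverse transformation: recover the p-value from the score
sub score_to_pv { my ($score) = @_; return 10 ** (-$score) }

printf "score 3.0 -> p = %g\n", score_to_pv(3.0);
printf "p = 0.05  -> score = %.3f\n", pv_to_score(0.05);
```

Because smaller p-values indicate stronger evidence, the negative logarithm keeps the convention that higher scores mean stronger association.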

Copyright (C) 2004 by Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.