NAME

ucsam - Association measures in UCS/Perl

INTRODUCTION

The statistical analysis of cooccurrence data is usually based on association measures, mathematical formulae that compute an association score from the joint and marginal frequencies of a pair type (which are collectively called a frequency signature in UCS). This score is a single floating-point number indicating the amount of statistical association between the components of the pair type. Association measures can often be written conveniently in terms of a contingency table of observed frequencies and the corresponding expected frequencies under the null hypothesis that there is no association.

For instance, the word pair black box occurs 123 times in the British National Corpus (BNC), so its joint frequency is f = 123. The adjective black has a total of 13,168 occurrences, and the noun box has 1,810 occurrences, giving marginal frequencies of f1 = 13,168 and f2 = 1,810. From these data, the MI measure computes an association score of 1.4, while the log.likelihood measure computes a score of 567.72. Both scores indicate a clear positive association, but they cannot be compared directly: each measure has its own scale.
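
To make the relationship between frequency signatures, contingency tables and association scores concrete, the following Perl sketch derives the observed and expected frequencies from a frequency signature (f, f1, f2, N) and computes MI and log-likelihood scores from them. This is an illustration only, not the UCS implementation, and the sample size N used here is a made-up placeholder rather than the actual value for the BNC data above.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # frequency signature of a pair type: joint frequency f, marginal
    # frequencies f1 and f2, sample size N (N is a hypothetical placeholder)
    my ($f, $f1, $f2, $N) = (123, 13_168, 1_810, 4_000_000);

    # contingency table of observed frequencies (O11, O12, O21, O22)
    my @O = ($f, $f1 - $f, $f2 - $f, $N - $f1 - $f2 + $f);

    # expected frequencies under the null hypothesis of independence
    my @E = (
        $f1 * $f2 / $N,          $f1 * ($N - $f2) / $N,
        ($N - $f1) * $f2 / $N,   ($N - $f1) * ($N - $f2) / $N,
    );

    # MI: base 10 logarithm of the ratio O11 / E11
    my $MI = log($O[0] / $E[0]) / log(10);

    # log-likelihood: 2 * sum of O * ln(O / E) over all four cells,
    # using the convention 0 * ln(0) = 0
    my $G2 = 0;
    $G2 += 2 * $O[$_] * log($O[$_] / $E[$_]) for grep { $O[$_] > 0 } 0 .. 3;

    printf "MI = %.2f   log-likelihood = %.2f\n", $MI, $G2;

Note that this sketch computes the plain two-sided log-likelihood statistic; the UCS log.likelihood measure additionally applies the one-sided sign convention described in the Section "SOME ASSOCIATION MEASURES" below.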

A more detailed explanation of contingency tables and association scores as well as a comprehensive inventory of association measures with equations given in terms of observed and expected frequencies can be found on-line at http://www.collocations.de/AM/. Also see the ucsfile manpage to find out how frequency signatures, contingency tables and association scores are represented in UCS data set files.

UCS/Perl supports more than 40 different association measures and variants. In order to keep them manageable, the measures are organised in several packages: a core set of widely-used "standard" measures is complemented by add-on packages for advanced users. Each package is implemented by a separate Perl module. Consult the module's manpage for a full listing of measures in the package and detailed descriptions. Listings of add-on packages, association measures, and some additional information can also be printed with the ucs-list-am program (see the ucs-list-am manpage).

Currently, there are two add-on packages in addition to the standard measures.

UCS::AM (the "standard" measures)

This core set contains all well-known association measures such as MI, t-score, and log-likelihood (see the listing in the Section "SOME ASSOCIATION MEASURES" below). These measures are also made available by various other tools (e.g. the NSP toolkit, see http://www.d.umn.edu/~tpederse/nsp.html) and they have often been used in applications as well as for scientific research. The UCS::AM package also includes several other "simple" measures that are inexpensive to compute and numerically unproblematic.

Association measures in the core set can be thought of as the "built-in" measures of UCS/Perl (although the add-on packages are also part of the distribution). They are automatically supported by tools such as ucs-add, while the other packages have to be loaded explicitly (see below).

See the UCS::AM manpage for details.

UCS::AM::HTest (measures based on hypothesis tests)

Many association measures are based on asymptotic statistical hypothesis tests. The test statistic is used as an association score and can be interpreted (i.e. translated into a p-value) with the help of its known limiting distribution. The UCS::AM::HTest package provides p-values for all such association measures as well as the "original" two-tailed versions of some tests (the core set includes only one-tailed versions).

See the UCS::AM::HTest manpage for details.

UCS::AM::Parametric (parametric measures)

This package implements a new approach in which the equation of a parametric association measure is not completely fixed in advance: one or more parameters can be adjusted to obtain a version of the measure that is optimised for a particular task or data set. Control over the parameters is only available through the programming interface. For command-line use, special versions of these measures are provided with a pre-set parameter value, which is indicated by the name of the measure.

See the UCS::AM::Parametric manpage for details.

In UCS/Perl scripts both the standard measures and the add-on packages have to be loaded with use statements (e.g. use UCS::AM; for the core set). Association measures are implemented as UCS::Expression objects (see the UCS::Expression manpage). The UCS module maintains a registry of loaded measures with additional information and an evaluation function (see Section "ASSOCIATION MEASURE REGISTRY" in the UCS manpage). When one of the packages above is loaded, its measures are automatically added to this registry. Association scores can be computed more efficiently for in-memory data sets, using the add method in the UCS::DS::Memory module (see the UCS::DS::Memory manpage).
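
As a minimal illustration, the script below loads the standard measures together with both add-on packages and lists the measures that are now registered. The registry accessor UCS::AM_Keys used here follows the registry interface described in the UCS manpage; verify its exact name and calling convention there before relying on it.

    #!/usr/bin/perl
    use strict;
    use warnings;

    use UCS;                  # base module with the association measure registry
    use UCS::AM;              # core set of "standard" measures
    use UCS::AM::HTest;       # add-on: p-values and two-tailed test versions
    use UCS::AM::Parametric;  # add-on: parametric measures with pre-set parameters

    # print the names of all registered association measures
    # (UCS::AM_Keys is assumed from the registry section of the UCS manpage)
    print "$_\n" foreach UCS::AM_Keys();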

In the ucs-add program, the standard measures are pre-defined, and extension packages can be loaded with the -x option. Only the last part of the package name has to be specified here (e.g. HTest for the UCS::AM::HTest package). It is case-insensitive and may be abbreviated to a unique prefix (so both -x htest and -x ht work as well). See the ucs-add manpage for more information on how to compute association scores with the ucs-add program.

SOME ASSOCIATION MEASURES

This section briefly lists the most well-known association measures available in UCS/Perl, all of which are defined in the "standard" package UCS::AM. See the on-line resource at http://www.collocations.de/AM/ for the full equations and the UCS::AM manpage for details.

MI (Mutual Information)

The mutual information (MI) measure is a maximum-likelihood estimate of the (logarithmic) strength of the statistical association between the components of a pair type. It was introduced into the field of computational lexicography by Church & Hanks (1990), who derived it from the information-theoretic notion of point-wise mutual information. Positive values indicate positive association, while negative values indicate dissociation (where the components have a tendency not to occur together).

Note that unlike the original version of Church & Hanks (1990), the UCS implementation computes a base 10 logarithm.

t.score (t-score)

The MI measure is prone to overestimate association strength, especially for low-frequency cooccurrences. Church et al. (1991) use a version of Student's t test (whose test statistic is called a t-score) to ensure that the association detected by MI is supported by a significant amount of evidence. Although their application of Student's test is highly questionable, the combination of MI and t.score has become a de facto standard in British computational lexicography.
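
In terms of observed and expected frequencies, the t-score is commonly written as (O11 - E11) / sqrt(O11). A minimal sketch of this textbook form (not the UCS source code):

    # t-score from the observed and expected cooccurrence frequency
    sub t_score {
        my ($O11, $E11) = @_;
        return ($O11 - $E11) / sqrt($O11);
    }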

chi.squared, chi.squared.corr (chi-squared test)

Pearson's chi-squared test is the standard test for statistical independence in a 2 x 2 contingency table, and is much more appropriate as a measure of the significance of association than t.score. Despite its central role in mathematical statistics, it has not been very widely used on cooccurrence data. In particular, t.score was found to be much more useful for the extraction of collocations from text corpora (cf. Evert & Krenn, 2001).

The "textbook" form of Pearson's chi-squared test is a two-tailed version that does not distinguish between positive and negative association. The chi.squared measure implemented in UCS/Perl has been converted to a one-sided test with the help of a heuristic decision rule. Since contingency tables often contain cells with small values, Yates' continuity correction should be applied to the test statistic (chi.squared.corr).

log.likelihood (likelihood ratio test)

Dunning (1993) showed that the disappointing performance of chi.squared in collocation extraction tasks is due to a drastic overestimation of the significance of low-frequency cooccurrences (caused by a poor approximation to its limiting distribution). He suggested using a likelihood ratio test instead, whose test statistic (twice the negative natural logarithm of the likelihood ratio) has the same limiting distribution as chi.squared. Under the name log-likelihood, this association measure has become a generally accepted standard in the field of computational linguistics.

Like the chi-squared test, the likelihood ratio test is two-sided, and the log.likelihood measure has been converted to a one-sided test with the same heuristic decision rule. Both chi.squared and log.likelihood return the value of their test statistic, which has to be interpreted in terms of the known limiting distribution. More meaningful p-values for both measures are available in the UCS::AM::HTest package.

Fisher.pv (Fisher's exact test)

Although log.likelihood achieves a much better approximation to its limiting distribution than chi.squared (or chi.squared.corr), it is still an asymptotic test and provides only an approximate p-value. Pedersen (1996) argued in favour of Fisher's exact test for the independence of rows and columns in a contingency table, in order to remove the remaining inaccuracy of the log-likelihood measure. A drawback of Fisher's test is that it is numerically expensive and that naive implementations can easily become unstable.

The Fisher.pv measure implements a one-sided test. It returns an exact p-value, which can be compared directly with the p-values of chi.squared and log.likelihood.

Dice (Dice coefficient)

The Dice coefficient is a measure from the field of information retrieval, which has been used by Smadja (1993) and others for collocation extraction. Like MI, it is a maximum-likelihood estimate of association strength, but its definition of "strength" differs greatly from point-wise mutual information. It suffers from the same overestimation problem as MI, although this is mitigated by its different approach to association strength.
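
Expressed directly in terms of the frequency signature, the Dice coefficient is simply 2 f / (f1 + f2); a one-line sketch:

    # Dice coefficient from the frequency signature (f, f1, f2)
    sub dice {
        my ($f, $f1, $f2) = @_;
        return 2 * $f / ($f1 + $f2);
    }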

References

Church, K. W. and Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22-29.

Church, K. W.; Gale, W.; Hanks, P.; Hindle, D. (1991). Using statistics in lexical analysis. In: Lexical Acquisition: Using On-line Resources to Build a Lexicon, Lawrence Erlbaum, pages 115-164.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61-74.

Evert, S. and Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pages 188-195.

Pedersen, T. (1996). Fishing for exactness. In: Proceedings of the South-Central SAS Users Group Conference, Austin, TX.

Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics 19(1), 143-177.

UCS CONVENTIONS

UCS/Perl uses some conventions for the names of association measures and the computed association scores, which are described in this section. It is important to be aware of such conventions, especially when they deviate from those used by other software packages.

The names of association measures are taken from the on-line inventory at http://www.collocations.de/AM/. Hyphen characters (-) are replaced by periods (.) to conform with the UCS standards (see the ucsfile manpage). Capitalisation is preserved (MI and Fisher.pv, but log.likelihood) and subscripts are included in the name, separated by a period (chi.squared.corr, where corr is a subscript in the original name).

Association scores are always arranged so that higher scores indicate stronger (positive) association, applying a transformation to the original values if necessary. In the one-sided versions of two-sided tests (e.g. chi.squared and log.likelihood), negative scores indicate negative association (while positive scores indicate positive association). Scores close to zero are a sign of statistical independence. Some other measures such as MI also have this property, but many do not (e.g. Fisher.pv or Dice).

"Explicit" logarithms in the equation of an association measure are usually taken to the base 10 (e.g. in the MI measure). This is not the case when the association score is not interpreted as a logarithm (e.g. the log.likelihoood, which is a test statistic approximating a known limiting distribution) and the natural logarithm is required for correct interpretation. The use of base 10 logarithms is always pointed out in the documentation (see the UCS::AM manpage). The logarithm of infinity if represented by a large floating-point value returned by the inf function (from the UCS::Expression::Func module). Comparison with +inf() and -inf() can be used to detect a positive or negative infinite value.

The scores of association measures with the extension .pv represent a p-value (from an exact test or the approximate p-value of an asymptotic test). Unlike most other scores, p-values can be compared directly between different measures. They are represented as negative base 10 logarithms, so the association score 3.0 corresponds to a p-value of 0.001 = 1e-3 (+inf() stands for zero probability, usually the result of an underflow error).
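
As a quick illustration of the .pv convention, converting back and forth between a score and the corresponding p-value is a matter of base 10 logarithms (plain Perl, no UCS code involved):

    # a .pv score is the negative base 10 logarithm of the p-value
    my $score   = 3.0;
    my $p_value = 10 ** (-$score);            # 0.001
    my $back    = -log($p_value) / log(10);   # 3.0 again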

COPYRIGHT

Copyright (C) 2004 by Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.
