UCS::AM - Built-in association measures

    use UCS;
    use UCS::AM;

    @builtin_AMs = UCS::AM_Keys();

    # random
    # frequency
    # z.score
    # z.score.corr
    # t.score
    # chi.squared
    # chi.squared.corr
    # log.likelihood
    # Poisson.Stirling
    # Poisson.pv
    # Fisher.pv
    # MI
    # MI2
    # MI3
    # relative.risk
    # odds.ratio
    # odds.ratio.disc
    # Dice
    # gmean
    # MS
    # Jaccard
    # average.MI
    # local.MI

This module contains definitions for a wide range of **association measures**. When the **UCS::AM** module is imported, the built-in measures are registered with the **UCS** core library (see the UCS manpage for details on how to access registered association measures).

The following section gives a full listing of the built-in association measures from the **UCS::AM** module with short explanations. Please refer to *http://www.collocations.de/AM/* for the full equations and references. Further association measures can be imported from **add-on packages** (see the section on Add-On Packages below).

Note that some association measures produce infinite values (*+inf* or *-inf*). The logarithm of infinity is represented by the return value of the built-in **inf** function (see the UCS::Expression::Func manpage). The association scores of measures with the suffix **.pv** can be interpreted as probabilities (the likelihood of the observed data or the p-value of a statistical hypothesis test). Such probabilities are given as **negative base 10 logarithms**, ranging from 0 to *+inf*. Measures with the suffix **.tt** (for *two-tailed*) are derived from two-sided statistical hypothesis tests. One-sided versions of these tests are provided under the same name, but without the suffix.
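The **.pv** convention can be illustrated with a small standalone Python sketch. It is not part of UCS, and the function names `pv_score` and `pv_from_score` are hypothetical; the sketch only shows how a probability maps to a negative base 10 logarithm and back.

```python
import math

def pv_score(p):
    """Map a probability or p-value in [0, 1] to a negative base-10 logarithm."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if p == 0.0:
        return math.inf  # -log10(0) is +inf
    return -math.log10(p)

def pv_from_score(score):
    """Recover the probability from a .pv-style score."""
    return 10.0 ** (-score)
```

Under this convention a p-value of 0.001 corresponds to a score of 3, and larger scores indicate stronger evidence against the null hypothesis.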

**random**-
Random numbers between 0 and 1 as association scores simulate random selection of pair types and are used to break ties when sorting a data set.

**frequency**-
Cooccurrence frequency of the pair type. This association measure is used to sort data sets by frequency, but requires some systematic method for breaking ties.
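One way to combine the two measures above is to sort by frequency and use a random score as the secondary key. The following standalone Python sketch assumes pair types are given as `(pair, frequency)` tuples; the function name `sort_by_frequency` and the `seed` parameter are hypothetical and do not come from UCS.

```python
import random

def sort_by_frequency(pairs, seed=None):
    """Sort (pair, frequency) tuples by descending frequency, breaking ties at random."""
    rng = random.Random(seed)
    # Attach one random score per pair type, playing the role of the 'random' measure.
    scored = [(freq, rng.random(), pair) for pair, freq in pairs]
    # Primary key: frequency (descending); secondary key: the random score.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(pair, freq) for freq, _, pair in scored]
```

Fixing the seed makes the tie-breaking reproducible between runs.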

**z.score**-
A z-score for the observed cooccurrence frequency O11 compared to the expected frequency E11. The value represents a standardised normal approximation of the binomial sampling distribution of O11 under the point null hypothesis of independence.

**z.score.corr**-
A z-score for O11 compared to E11 with Yates' continuity correction applied.
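As a rough illustration of the two z-score measures, the standalone Python sketch below assumes the usual 2x2 contingency table notation (O11, O12, O21, O22, with E11 computed as row sum times column sum divided by the sample size). The continuity correction is written in one common form, shrinking the absolute difference between O11 and E11 by 0.5 and clamping it at zero; the exact handling in the module may differ, and all function names are hypothetical.

```python
import math

def expected_o11(o11, o12, o21, o22):
    """Expected frequency E11 under independence: R1 * C1 / N."""
    n = o11 + o12 + o21 + o22
    return (o11 + o12) * (o11 + o21) / n

def z_score(o11, e11):
    """Plain z-score: (O11 - E11) / sqrt(E11)."""
    return (o11 - e11) / math.sqrt(e11)

def z_score_corr(o11, e11):
    """z-score with a Yates-style correction (assumed form, see lead-in)."""
    diff = o11 - e11
    # Shrink |O11 - E11| by 0.5 without letting it change sign.
    corrected = math.copysign(max(abs(diff) - 0.5, 0.0), diff)
    return corrected / math.sqrt(e11)
```

The corrected score is never larger in absolute value than the uncorrected one, and the difference matters most when E11 is small.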

**t.score**-
Church et al. (1991) use Student's t-test to compare the observed cooccurrence frequency O11 to the null expectation E11 estimated from the sample (which is a random variate as well), applying several approximations to simplify the **t.score** equation. The computed value is a t-score with degrees of freedom roughly equal to the sample size *N*. This application of the t-test is highly questionable, though, and produces extremely conservative results.

**chi.squared**-
One-sided version of Pearson's chi-squared test for the independence of rows and columns in a 2x2 contingency table. Positive scores indicate positive association (*O11 > E11*), and negative scores indicate negative association (*O11 < E11*). The distinction between positive and negative association is unreliable for small absolute values of the test statistic. Under the null hypothesis, the one-sided **chi.squared** statistic approximates a normal distribution (as the signed root of a chi-squared distribution with one degree of freedom).

**chi.squared.corr**-
One-sided version of Pearson's chi-squared test for the independence of rows and columns in a 2x2 contingency table, with Yates' continuity correction applied.
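The relationship between Pearson's statistic and its signed root can be sketched in standalone Python as follows. The code uses the textbook formula for a 2x2 table and assumes that all row and column sums are positive. The clamping of the Yates-corrected term at zero and the function names are assumptions of this sketch, and the sketch does not establish whether the module reports the signed statistic or its signed root.

```python
import math

def chi_squared(o11, o12, o21, o22, yates=False):
    """Pearson's X^2 for a 2x2 table, optionally with Yates' continuity correction."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    det = abs(o11 * o22 - o12 * o21)
    if yates:
        det = max(det - n / 2, 0.0)  # clamping at zero is an assumption of this sketch
    return n * det ** 2 / (r1 * r2 * c1 * c2)

def signed_root(o11, o12, o21, o22, yates=False):
    """Signed square root of X^2: positive if O11 > E11, negative if O11 < E11."""
    x2 = chi_squared(o11, o12, o21, o22, yates=yates)
    # O11 - E11 has the same sign as O11*O22 - O12*O21.
    sign = -1.0 if o11 * o22 < o12 * o21 else 1.0
    return sign * math.sqrt(x2)
```

For a table that matches the independence expectation exactly, the uncorrected statistic is zero, so the sign carries no information there.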

**log.likelihood**-
One-sided version of the log-likelihood statistic suggested by Dunning (1993), a likelihood ratio test for independence of rows and columns in a 2x2 contingency table (Dunning introduced the measure as a test for homogeneity of the table columns, i.e. equal success probabilities of two independent binomial distributions). Positive scores indicate positive association (*O11 > E11*), and negative scores indicate negative association (*O11 < E11*). The distinction between positive and negative association is unreliable for small absolute values of the test statistic. Under the null hypothesis, the one-sided **log.likelihood** statistic approximates a normal distribution (as the signed root of a chi-squared distribution with one degree of freedom).

**Poisson.Stirling**-
Approximation of the likelihood of the observed cooccurrence frequency *O11* under the point null hypothesis of independence (so that the expected frequency is *E11*). The measure is derived from **Poisson.likelihood** (in the UCS::AM::HTest module) using Stirling's formula, resulting in a simple expression that can easily be evaluated. This measure was proposed by Quasthoff and Wolff (2002) and has been re-scaled to base 10 logarithms to allow a direct comparison with **Poisson.likelihood**.

**Poisson.pv**-
Significance (one-sided p-value) of an exact Poisson test for the observed cooccurrence frequency O11 compared to the expected frequency E11 under the point null hypothesis of independence. This test is based on a Poisson approximation of the correct binomial sampling distribution of O11. It is numerically and analytically much easier to handle than the binomial test.
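A one-sided Poisson p-value of this kind can be sketched in standalone Python by summing the upper tail of a Poisson distribution with mean E11, starting at O11, and reporting the result as a negative base 10 logarithm. This is an illustration only, not the module's implementation; the function names, the stopping rule and the `max_terms` limit are assumptions, and E11 must be positive.

```python
import math

def poisson_upper_tail(k, lam, max_terms=100000):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed term by term."""
    # First term e^-lam * lam^k / k!, computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    for _ in range(max_terms):
        total += term
        i += 1
        term *= lam / i  # next probability: P(X = i) = P(X = i - 1) * lam / i
        # Stop once the terms are decreasing and no longer change the sum.
        if i > lam and term <= total * 1e-17:
            break
    return min(total, 1.0)

def poisson_pv(o11, e11):
    """Negative base-10 logarithm of the one-sided Poisson p-value."""
    p = poisson_upper_tail(o11, e11)
    return math.inf if p == 0.0 else -math.log10(p)
```

Summing the tail directly, rather than computing one minus the lower tail, keeps very small p-values from being lost to rounding.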

**Fisher.pv**-
Significance (one-sided p-value) of Fisher's exact test for independence of rows and columns in a 2x2 contingency table with fixed marginals. This test is widely accepted as the most appropriate independence test for contingency tables (cf. Yates 1984). Its use as an association measure was suggested by Pedersen (1996).
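For small tables, the one-sided p-value of Fisher's exact test can be computed directly from the hypergeometric distribution with fixed marginals. The standalone Python sketch below does this with exact integer arithmetic and returns the negative base 10 logarithm; it is not the module's implementation, the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math

def fisher_tail_counts(o11, o12, o21, o22):
    """Return (numerator, denominator) of P(X >= O11) with all marginals fixed."""
    r1 = o11 + o12
    c1, c2 = o11 + o21, o12 + o22
    n = r1 + o21 + o22
    # Count the tables with the same marginals and a top-left cell of at least O11.
    numer = sum(math.comb(c1, k) * math.comb(c2, r1 - k)
                for k in range(o11, min(r1, c1) + 1))
    denom = math.comb(n, r1)
    return numer, denom

def fisher_pv(o11, o12, o21, o22):
    """Negative base-10 logarithm of the one-sided Fisher p-value."""
    numer, denom = fisher_tail_counts(o11, o12, o21, o22)
    # Taking logarithms of the exact integers avoids underflow for tiny p-values.
    return math.log10(denom) - math.log10(numer)
```

The binomial coefficients grow quickly with the sample size, so this direct approach is meant for illustration rather than for large corpora.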

**MI**-
Maximum-likelihood estimate of the base 10 logarithm of the *mu*-value, which is identical to pointwise mutual information between the events describing occurrences of a pair's components. Note that mutual information is measured in *decimal units* rather than the customary *bits*. The theoretical range is from *-inf* to *+inf*, but the actual range for a given data set is restricted depending on the sample size *N*.

**MI2**-
A heuristic variant of **MI** where the numerator is squared in order to discount low-frequency pairs. This measure also has some theoretical justification, being the square of the **gmean** measure.

**MI3**-
Another heuristic variant of **MI** where the numerator is cubed, which boosts the discounting effect considerably.

**relative.risk**-
Maximum-likelihood estimate of the logarithmic relative risk coefficient of association strength (base 10 logarithm). Ranges from *-inf* to *+inf*.

**odds.ratio**-
Maximum-likelihood estimate of the logarithmic odds ratio as a coefficient of association strength (base 10 logarithm). Ranges from *-inf* to *+inf*.

**odds.ratio.disc**-
A "discounted" version of **odds.ratio**, adding 0.5 to each factor in the equation. This modification of the odds ratio is commonly used to avoid infinite values, but does not seem to have a theoretical foundation.

**Dice**-
Maximum-likelihood estimate of the Dice coefficient of association strength. Ranges from 0 to 1.
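The logarithmic measures above (**MI**, **MI2**, **MI3**, **odds.ratio** and **odds.ratio.disc**) can be sketched in standalone Python as shown below. The formulas follow the descriptions given here (numerator raised to a power for the MI variants, 0.5 added to each factor for the discounted odds ratio). The helper `safe_log10_ratio`, the other function names and the way infinite values are returned are assumptions of this sketch, not the module's code.

```python
import math

def safe_log10_ratio(num, den):
    """log10(num / den), returning -inf, +inf or NaN for the degenerate cases."""
    if num == 0 and den == 0:
        return math.nan
    if den == 0:
        return math.inf
    if num == 0:
        return -math.inf
    return math.log10(num / den)

def mi(o11, e11, power=1):
    """MI (power=1), MI2 (power=2) or MI3 (power=3): log10(O11^power / E11)."""
    return safe_log10_ratio(o11 ** power, e11)

def odds_ratio(o11, o12, o21, o22):
    """Logarithmic odds ratio; infinite when a cell in the ratio is zero."""
    return safe_log10_ratio(o11 * o22, o12 * o21)

def odds_ratio_disc(o11, o12, o21, o22):
    """Discounted odds ratio: 0.5 is added to every cell, so the value stays finite."""
    return math.log10(((o11 + 0.5) * (o22 + 0.5)) / ((o12 + 0.5) * (o21 + 0.5)))
```

A table with an empty off-diagonal cell shows the effect of discounting: the plain odds ratio is infinite, while the discounted version remains finite.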

**Jaccard**-
Maximum-likelihood estimate of the Jaccard coefficient of association strength, which is equivalent to **Dice** (i.e., there is a strictly monotonic mapping between the two association scores). Ranges from 0 to 1.

**MS**-
Maximum-likelihood estimate of the *minimum sensitivity* coefficient suggested by Pedersen and Bruce (1996). Ranges from 0 to 1.

**gmean**-
Maximum-likelihood estimate of the *geometric mean* coefficient of association strength. Ranges from 0 to 1.

**average.MI**-
Maximum-likelihood estimate of the average mutual information between the indicator variables X and Y marking instances of a pair type's components. This implementation uses base 10 logarithms and multiplies the mutual information value with the sample size *N* in order to obtain readable values. Interestingly, **average.MI** is identical to Dunning's log-likelihood measure (**log.likelihood** and its variants) except for a scaling factor.

**local.MI**-
Contribution of a given pair type to the (maximum-likelihood estimate of the) average mutual information of *all* cooccurrences. Formally, this is the mutual information between the random variables U and V, which represent the component types of a pair token in the random sample.
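Two of the relationships mentioned above can be checked numerically with a standalone Python sketch: the monotonic mapping between **Dice** and **Jaccard**, and the scaling factor between **average.MI** and the log-likelihood statistic. The sketch uses the textbook definitions of these quantities on a 2x2 table, works with the unsigned log-likelihood statistic, and introduces hypothetical function names; it is not taken from the module.

```python
import math

def dice(o11, o12, o21, o22):
    """Dice coefficient: 2 * O11 / (R1 + C1)."""
    return 2 * o11 / ((o11 + o12) + (o11 + o21))

def jaccard(o11, o12, o21, o22):
    """Jaccard coefficient: O11 / (O11 + O12 + O21)."""
    return o11 / (o11 + o12 + o21)

def sum_o_log_o_over_e(o11, o12, o21, o22, log):
    """Sum over all four cells of O_ij * log(O_ij / E_ij)."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    cells = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            observed = cells[i][j]
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += observed * log(observed / expected)
    return total

def average_mi(o11, o12, o21, o22):
    """Average mutual information times N, in base-10 logarithms."""
    return sum_o_log_o_over_e(o11, o12, o21, o22, math.log10)

def g_squared(o11, o12, o21, o22):
    """Unsigned log-likelihood ratio statistic, using natural logarithms."""
    return 2.0 * sum_o_log_o_over_e(o11, o12, o21, o22, math.log)
```

With these definitions, Jaccard equals Dice / (2 - Dice), and the unsigned log-likelihood statistic equals 2 ln(10) times the average mutual information score.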

The **UCS::AM** module provides a basic set of useful and well-known association measures. Except for **Poisson.pv** and **Fisher.pv**, all measures have simple equations that can be computed efficiently. Further and more specialised association measures can be imported from add-on packages. Currently, the following packages are available:

**UCS::AM::HTest**-
variants of hypothesis tests, likelihood measures

**UCS::AM::Parametric**-
parametric association measures

These packages are implemented as Perl modules and can simply be loaded with the **use** operator. Alternatively, the **UCS::Load_AM_Package** function provides a convenient interface that takes only the last part of the package name. This name is case-insensitive and may be abbreviated to a unique prefix. For instance, the **UCS::AM::HTest** package can be loaded with the specification `'ht'`. The empty string `''` loads **UCS::AM**, and `'ALL'` imports all available AM packages. (See the UCS manpage for details.)

Copyright 2003 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.