NAME

UCS::AM - Built-in association measures

SYNOPSIS

  use UCS;
  use UCS::AM;

  @builtin_AMs = UCS::AM_Keys();

  # random
  # frequency
  # z.score
  # z.score.corr
  # t.score
  # chi.squared
  # chi.squared.corr
  # log.likelihood
  # Poisson.Stirling
  # Poisson.pv
  # Fisher.pv
  # MI
  # MI2
  # MI3
  # relative.risk
  # odds.ratio
  # odds.ratio.disc
  # Dice
  # gmean
  # MS
  # Jaccard
  # average.MI
  # local.MI

DESCRIPTION

This module contains definitions for a wide range of association measures. When the UCS::AM module is imported, the built-in measures are registered with the UCS core library (see UCS for details on how to access registered association measures).

The following section gives a full listing of the built-in association measures from the UCS::AM module with short explanations. Please refer to http://www.collocations.de/AM/ for the full equations and references. Further association measures can be imported from add-on packages (see the section on Add-On Packages below).

Note that some association measures produce infinite values (+inf or -inf). The logarithm of infinity is represented by the return value of the built-in inf function (see the UCS::Expression::Func manpage). The association scores of measures with the suffix .pv can be interpreted as probabilities (the likelihood of the observed data or the p-value of a statistical hypothesis test). Such probabilities are given as negative base 10 logarithms, ranging from 0 to +inf. Measures with the suffix .tt (for two-tailed) are derived from two-sided statistical hypothesis tests. One-sided versions of these tests are provided under the same name, but without the suffix.
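
As a minimal sketch of the .pv convention described above (this is not code from the UCS distribution; the helper name pv_score and the $INF stand-in for infinity are assumptions made for this illustration), a p-value can be mapped to a score as follows:

```perl
use strict;
use warnings;

# Stand-in for "infinity", chosen for this sketch only.
my $INF = 9**9**9;

# Map a p-value (0 <= p <= 1) to its negative base 10 logarithm.
sub pv_score {
    my ($p) = @_;
    return $INF if $p <= 0;            # log(0) is undefined; treat the score as +inf
    # Perl's log() is the natural logarithm; "+ 0" turns -0 into 0 for p = 1.
    return -log($p) / log(10) + 0;
}

printf "%.2f\n", pv_score(0.001);
```

A p-value of 0.001 maps to a score of about 3, and smaller p-values (stronger evidence against independence) map to larger scores.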

BUILT-IN ASSOCIATION MEASURES

random

Random numbers between 0 and 1 as association scores simulate random selection of pair types and are used to break ties when sorting a data set.

frequency

Cooccurrence frequency of the pair type. This association measure is used to sort data sets by frequency, but requires some systematic method for breaking ties.

z.score

A z-score for the observed cooccurrence frequency O11 compared to the expected frequency E11. The value represents a standardised normal approximation of the binomial sampling distribution of O11 under the point null hypothesis of independence.
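
The computation can be sketched as follows. The sketch assumes the usual 2x2 notation (observed cell counts O11, O12, O21, O22, row sum R1, column sum C1, sample size N), the expected frequency E11 = R1 * C1 / N, and a variance of O11 approximated by E11. The function names are hypothetical; the exact equation used by UCS is given at the URL above.

```perl
use strict;
use warnings;

# Expected frequency of the pair under independence: E11 = R1 * C1 / N.
sub expected_11 {
    my ($o11, $o12, $o21, $o22) = @_;
    my $n  = $o11 + $o12 + $o21 + $o22;
    my $r1 = $o11 + $o12;              # row sum
    my $c1 = $o11 + $o21;              # column sum
    return $r1 * $c1 / $n;
}

# z-score sketch: (O11 - E11) / sqrt(E11); assumes E11 > 0.
sub z_score {
    my @o   = @_;
    my $e11 = expected_11(@o);
    return ($o[0] - $e11) / sqrt($e11);
}

printf "E11 = %.2f, z = %.3f\n",
    expected_11(10, 90, 190, 9710), z_score(10, 90, 190, 9710);
```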

z.score.corr

A z-score for O11 compared to E11 with Yates' continuity correction applied.

t.score

Church et al. (1991) use Student's t-test to compare the observed cooccurrence frequency O11 to the null expectation E11 estimated from the sample (which is a random variate as well), applying several approximations to simplify the t.score equation. The computed value is a t-score with degrees of freedom roughly equal to the sample size N. This application of the t-test is highly questionable, though, and produces extremely conservative results.
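
A sketch of the simplified equation, assuming the commonly cited form (O11 - E11) / sqrt(O11), where the sample variance is approximated by O11. Both this form and the function name t_score are assumptions of the illustration, not taken from the UCS source; the sketch also requires O11 > 0.

```perl
use strict;
use warnings;

# t-score sketch: (O11 - E11) / sqrt(O11), with E11 = R1 * C1 / N.
sub t_score {
    my ($o11, $o12, $o21, $o22) = @_;
    my $n   = $o11 + $o12 + $o21 + $o22;
    my $e11 = ($o11 + $o12) * ($o11 + $o21) / $n;
    return ($o11 - $e11) / sqrt($o11);    # assumes O11 > 0
}

printf "t = %.3f\n", t_score(10, 90, 190, 9710);
```

Compared with a z-score that divides by sqrt(E11), the denominator sqrt(O11) is larger whenever O11 > E11, which gives smaller scores for positively associated pairs.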

chi.squared

One-sided version of Pearson's chi-squared test for the independence of rows and columns in a 2x2 contingency table. Positive scores indicate positive association (O11 > E11), and negative scores indicate negative association (O11 < E11). The distinction between positive and negative association is unreliable for small absolute values of the test statistic. Under the null hypothesis, the one-sided chi.squared statistic approximates a normal distribution (as the signed root of a chi-squared distribution with one degree of freedom).
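
The signed statistic can be sketched as follows, using the standard closed form of Pearson's statistic for a 2x2 table, X2 = N * (O11*O22 - O12*O21)^2 / (R1*R2*C1*C2). The sign is taken from O11*O22 - O12*O21, which equals N * (O11 - E11). The function name is hypothetical, and the sketch assumes that all row and column sums are non-zero.

```perl
use strict;
use warnings;

# One-sided (signed) Pearson chi-squared statistic for a 2x2 table.
sub chi_squared {
    my ($o11, $o12, $o21, $o22) = @_;
    my ($r1, $r2) = ($o11 + $o12, $o21 + $o22);    # row sums
    my ($c1, $c2) = ($o11 + $o21, $o12 + $o22);    # column sums
    my $n   = $r1 + $r2;
    my $det = $o11 * $o22 - $o12 * $o21;           # same sign as O11 - E11
    my $x2  = $n * $det**2 / ($r1 * $r2 * $c1 * $c2);
    return $det >= 0 ? $x2 : -$x2;
}

printf "%.2f\n", chi_squared(10, 90, 190, 9710);   # positive association
printf "%.2f\n", chi_squared(1, 99, 199, 9701);    # negative association
```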

chi.squared.corr

One-sided version of Pearson's chi-squared test for the independence of rows and columns in a 2x2 contingency table, with Yates' continuity correction applied.

log.likelihood

One-sided version of the log-likelihood statistic suggested by Dunning (1993), a likelihood ratio test for independence of rows and columns in a 2x2 contingency table (Dunning introduced the measure as a test for homogeneity of the table columns, i.e. equal success probabilities of two independent binomial distributions). Positive scores indicate positive association (O11 > E11), and negative scores indicate negative association (O11 < E11). The distinction between positive and negative association is unreliable for small absolute values of the test statistic. Under the null hypothesis, the one-sided log.likelihood statistic approximates a normal distribution (as the signed root of a chi-squared distribution with one degree of freedom).
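
A sketch of the signed statistic, assuming the standard likelihood ratio form G2 = 2 * sum of Oij * ln(Oij / Eij) over the four cells, with natural logarithms and the convention 0 * ln(0) = 0. The function name and these conventions are assumptions of the illustration rather than the UCS source.

```perl
use strict;
use warnings;

# One-sided (signed) log-likelihood statistic for a 2x2 table.
sub log_likelihood {
    my ($o11, $o12, $o21, $o22) = @_;
    my @r = ($o11 + $o12, $o21 + $o22);        # row sums R1, R2
    my @c = ($o11 + $o21, $o12 + $o22);        # column sums C1, C2
    my $n = $r[0] + $r[1];
    my @o = ([$o11, $o12], [$o21, $o22]);
    my $g2 = 0;
    for my $i (0, 1) {
        for my $j (0, 1) {
            my $obs = $o[$i][$j];
            next if $obs == 0;                 # convention: 0 * ln(0) = 0
            my $exp = $r[$i] * $c[$j] / $n;    # expected frequency Eij
            $g2 += 2 * $obs * log($obs / $exp);
        }
    }
    my $e11 = $r[0] * $c[0] / $n;
    return $o11 >= $e11 ? $g2 : -$g2;          # sign marks the direction
}

printf "%.2f\n", log_likelihood(10, 90, 190, 9710);
```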

Poisson.Stirling

Approximation of the likelihood of the observed cooccurrence frequency O11 under the point null hypothesis of independence (so that the expected frequency is E11). The measure is derived from Poisson.likelihood (in the UCS::AM::HTest module) using Stirling's formula, resulting in a simple expression that can easily be evaluated. This measure was proposed by Quasthoff and Wolff (2002) and has been re-scaled to base 10 logarithms to allow a direct comparison with Poisson.likelihood.

Poisson.pv

Significance (one-sided p-value) of an exact Poisson test for the observed cooccurrence frequency O11 compared to the expected frequency E11 under the point null hypothesis of independence. This test is based on a Poisson approximation of the correct binomial sampling distribution of O11. It is numerically and analytically much easier to handle than the binomial test.
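
The p-value of such a test is the upper tail probability P(X >= O11) of a Poisson distribution with mean E11. The following sketch sums the tail directly and reports it as a negative base 10 logarithm. The function name, the stopping rule, and the $INF stand-in are assumptions of this illustration; UCS may compute the value differently.

```perl
use strict;
use warnings;

my $INF = 9**9**9;    # stand-in for infinity, for this sketch only

# -log10 of P(X >= O11) for X ~ Poisson(E11); assumes E11 > 0.
sub poisson_pv {
    my ($o11, $e11) = @_;
    return 0 if $o11 <= 0;                  # P(X >= 0) = 1, so the score is 0
    # Log of the first tail term: exp(-E11) * E11^O11 / O11!
    my $log_term = -$e11 + $o11 * log($e11);
    $log_term -= log($_) for 1 .. $o11;
    my $term = exp($log_term);
    my ($sum, $k) = (0, $o11);
    # Add terms until they no longer change the sum (and k is past the mean).
    while ($k <= $e11 || $term > 1e-17 * $sum) {
        $sum += $term;
        $k++;
        $term *= $e11 / $k;
    }
    return $sum > 0 ? -log($sum) / log(10) : $INF;
}

printf "%.3f\n", poisson_pv(10, 2);
```

Summing the upper tail directly, rather than subtracting the lower tail from 1, keeps some precision for the very small p-values that are typical of strongly associated pairs.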

Fisher.pv

Significance (one-sided p-value) of Fisher's exact test for independence of rows and columns in a 2x2 contingency table with fixed marginals. This test is widely accepted as the most appropriate independence test for contingency tables (cf. Yates 1984). Its use as an association measure was suggested by Pedersen (1996).
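
With fixed marginals, O11 follows a hypergeometric distribution under independence, and the one-sided p-value is the probability of observing O11 or more cooccurrences. A minimal sketch with hypothetical function names, using a table of log-factorials to avoid overflow (it is not the UCS implementation and returns infinity when the sum underflows):

```perl
use strict;
use warnings;

my @LOG_FACT = (0);                          # $LOG_FACT[n] holds log(n!)

sub log_fact {
    my ($n) = @_;
    while (@LOG_FACT <= $n) {
        push @LOG_FACT, $LOG_FACT[-1] + log(scalar @LOG_FACT);
    }
    return $LOG_FACT[$n];
}

sub log_choose {
    my ($n, $k) = @_;
    return log_fact($n) - log_fact($k) - log_fact($n - $k);
}

# -log10 of the one-sided p-value of Fisher's exact test.
sub fisher_pv {
    my ($o11, $o12, $o21, $o22) = @_;
    my ($r1, $c1, $c2) = ($o11 + $o12, $o11 + $o21, $o12 + $o22);
    my $n    = $c1 + $c2;
    my $kmax = $r1 < $c1 ? $r1 : $c1;        # largest possible value of O11
    my $p    = 0;
    for my $k ($o11 .. $kmax) {
        # Hypergeometric probability of exactly k cooccurrences.
        $p += exp(log_choose($c1, $k) + log_choose($c2, $r1 - $k)
                  - log_choose($n, $r1));
    }
    return $p > 0 ? -log($p) / log(10) : 9**9**9;
}

printf "%.3f\n", fisher_pv(10, 90, 190, 9710);
```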

MI

Maximum-likelihood estimate of the base 10 logarithm of the mu-value, which is identical to pointwise mutual information between the events describing occurrences of a pair's components. Note that mutual information is measured in decimal units rather than the customary bits. The theoretical range is from -inf to +inf, but the actual range for a given data set is restricted depending on the sample size N.

MI2

A heuristic variant of MI where the numerator is squared in order to discount low-frequency pairs. This measure also has some theoretical justification, being (up to the logarithm and a factor of N) the square of the gmean measure.

MI3

Another heuristic variant of MI where the numerator is cubed, which boosts the discounting effect considerably.
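
The three variants differ only in the exponent of the numerator, which the following sketch makes explicit. It assumes the form log10(O11^k / E11) with k = 1, 2, 3 and E11 = R1 * C1 / N; the function name mi_k is hypothetical and the sketch requires O11 > 0.

```perl
use strict;
use warnings;

# log10(O11^k / E11): k = 1 gives MI, k = 2 gives MI2, k = 3 gives MI3.
sub mi_k {
    my ($k, $o11, $o12, $o21, $o22) = @_;
    my $n   = $o11 + $o12 + $o21 + $o22;
    my $e11 = ($o11 + $o12) * ($o11 + $o21) / $n;
    return ($k * log($o11) - log($e11)) / log(10);    # assumes O11 > 0
}

my @table = (10, 90, 190, 9710);
printf "MI%s = %.3f\n", ($_ == 1 ? '' : $_), mi_k($_, @table) for 1 .. 3;
```

For this table E11 = 2, so the scores are log10(5), log10(50) and log10(500): each additional power of O11 adds log10(O11) to the score, which favours high-frequency pairs.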

relative.risk

Maximum-likelihood estimate of the logarithmic relative risk coefficient of association strength (base 10 logarithm). Ranges from -inf to +inf.

odds.ratio

Maximum-likelihood estimate of the logarithmic odds ratio as a coefficient of association strength (base 10 logarithm). Ranges from -inf to +inf.

odds.ratio.disc

A "discounted" version of odds.ratio, adding 0.5 to each factor in the equation. This modification of the odds ratio is commonly used to avoid infinite values, but does not seem to have a theoretical foundation.
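
The effect of the discounting can be sketched as follows, assuming the usual odds ratio (O11 * O22) / (O12 * O21). The function names and the $INF stand-in are hypothetical, and the undefined 0/0 case is deliberately not treated.

```perl
use strict;
use warnings;

my $INF = 9**9**9;    # stand-in for infinity, for this sketch only

# log10 of the plain odds ratio; infinite if a cell count is zero.
sub odds_ratio {
    my ($o11, $o12, $o21, $o22) = @_;
    my ($num, $den) = ($o11 * $o22, $o12 * $o21);
    return $INF  if $den == 0;              # sketch: the 0/0 case is not treated
    return -$INF if $num == 0;
    return log($num / $den) / log(10);
}

# Discounted version: add 0.5 to each factor, so the result is always finite.
sub odds_ratio_disc {
    my @o = map { $_ + 0.5 } @_;
    return log(($o[0] * $o[3]) / ($o[1] * $o[2])) / log(10);
}

printf "%s vs. %.3f\n", odds_ratio(5, 0, 95, 9900), odds_ratio_disc(5, 0, 95, 9900);
```

For tables without zero cells and with reasonably large counts, the two versions give similar scores.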

Dice

Maximum-likelihood estimate of the Dice coefficient of association strength. Ranges from 0 to 1.

Jaccard

Maximum-likelihood estimate of the Jaccard coefficient of association strength, which is equivalent to Dice (i.e., there is a strictly monotonic mapping between the two association scores). Ranges from 0 to 1.

MS

Maximum-likelihood estimate of the minimum sensitivity coefficient suggested by Pedersen and Bruce (1996). Ranges from 0 to 1.

gmean

Maximum-likelihood estimate of the geometric mean coefficient of association strength. Ranges from 0 to 1.
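
The four coefficients above (Dice, Jaccard, MS and gmean) can be sketched together, assuming their standard definitions in terms of the cell counts and the marginals R1 and C1. The function name and the returned hash layout are hypothetical; R1 and C1 must be non-zero.

```perl
use strict;
use warnings;

# Returns a hash of coefficient name => score for a 2x2 table.
sub coefficients {
    my ($o11, $o12, $o21, $o22) = @_;
    my ($r1, $c1) = ($o11 + $o12, $o11 + $o21);
    my ($p_row, $p_col) = ($o11 / $r1, $o11 / $c1);   # O11/R1 and O11/C1
    return (
        Dice    => 2 * $o11 / ($r1 + $c1),
        Jaccard => $o11 / ($o11 + $o12 + $o21),
        MS      => ($p_row < $p_col ? $p_row : $p_col),   # minimum of the two
        gmean   => $o11 / sqrt($r1 * $c1),                # their geometric mean
    );
}

my %score = coefficients(10, 90, 190, 9710);
printf "%-8s %.4f\n", $_, $score{$_} for sort keys %score;
```

Under these definitions, the monotonic mapping between Dice and Jaccard mentioned above is Jaccard = Dice / (2 - Dice).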

average.MI

Maximum-likelihood estimate of the average mutual information between the indicator variables X and Y marking instances of a pair type's components. This implementation uses base 10 logarithms and multiplies the mutual information value by the sample size N in order to obtain readable values. Interestingly, average.MI is identical to Dunning's log-likelihood measure (log.likelihood and its variants) except for a scaling factor.

local.MI

Contribution of a given pair type to the (maximum-likelihood estimate of the) average mutual information of all cooccurrences. Formally, this is the mutual information between the random variables U and V, which represent the component types of a pair token in the random sample.
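
A sketch of both mutual information measures. It assumes that average.MI is the sum of Oij * log10(Oij / Eij) over the four cells (the mutual information multiplied by N, as described above) and that local.MI is O11 * log10(O11 / E11). The scaling of local.MI and the function names are assumptions of this illustration.

```perl
use strict;
use warnings;

# Sum of Oij * log10(Oij / Eij) over all four cells of the table.
sub average_mi {
    my ($o11, $o12, $o21, $o22) = @_;
    my @r = ($o11 + $o12, $o21 + $o22);
    my @c = ($o11 + $o21, $o12 + $o22);
    my $n = $r[0] + $r[1];
    my @o = ([$o11, $o12], [$o21, $o22]);
    my $sum = 0;
    for my $i (0, 1) {
        for my $j (0, 1) {
            next if $o[$i][$j] == 0;          # convention: 0 * log(0) = 0
            my $exp = $r[$i] * $c[$j] / $n;
            $sum += $o[$i][$j] * log($o[$i][$j] / $exp) / log(10);
        }
    }
    return $sum;
}

# O11 * log10(O11 / E11); assumes O11 > 0.
sub local_mi {
    my ($o11, $o12, $o21, $o22) = @_;
    my $n   = $o11 + $o12 + $o21 + $o22;
    my $e11 = ($o11 + $o12) * ($o11 + $o21) / $n;
    return $o11 * log($o11 / $e11) / log(10);
}

printf "average.MI = %.3f, local.MI = %.3f\n",
    average_mi(10, 90, 190, 9710), local_mi(10, 90, 190, 9710);
```

Under these assumptions, multiplying average_mi by 2 * ln(10) yields the unsigned likelihood ratio statistic G2, which is the scaling relation to log.likelihood noted above.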

ADD-ON PACKAGES

The UCS::AM module provides a basic set of useful and well-known association measures. Except for Poisson.pv and Fisher.pv, all measures have simple equations that can be computed efficiently. Further and more specialised association measures can be imported from add-on packages. Currently, the following packages are available:

  UCS::AM::HTest         variants of hypothesis tests, likelihood measures
  UCS::AM::Parametric    parametric association measures

These packages are implemented as Perl modules and can simply be loaded with the use operator. Alternatively, the UCS::Load_AM_Package function provides a convenient interface: only the last part of the package name has to be specified, and this name is case-insensitive and may be abbreviated to a unique prefix. For instance, the UCS::AM::HTest package can be loaded with the specification 'ht'. The empty string '' loads UCS::AM, and 'ALL' imports all available AM packages. (See the UCS manpage for details.)
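
The name matching described above can be illustrated with a stand-alone sketch. The function resolve_am_package is hypothetical and is not part of UCS; it only mimics the documented rules for the two add-on packages listed above, and it assumes that 'ALL' must be written in upper case and includes UCS::AM itself.

```perl
use strict;
use warnings;

# Add-on packages named in the listing above.
my @ADD_ONS = ('UCS::AM::HTest', 'UCS::AM::Parametric');

# Map a package specification to the full module name(s) it selects.
sub resolve_am_package {
    my ($spec) = @_;
    return ('UCS::AM')           if $spec eq '';
    return ('UCS::AM', @ADD_ONS) if $spec eq 'ALL';    # assumption: exact match
    # Case-insensitive prefix match on the last part of the package name.
    my @hits = grep { /^UCS::AM::\Q$spec\E/i } @ADD_ONS;
    die "no unique match for '$spec'\n" unless @hits == 1;
    return @hits;
}

print join(', ', resolve_am_package('ht')), "\n";
```

In this sketch the specifications 'ht', 'HTest' and 'H' all select UCS::AM::HTest, while a specification that matches no package or more than one is rejected.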

COPYRIGHT

Copyright 2003 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.
