Footnotes

¹ Formally, the variables X_ij can be defined as sums over indicator variables:

[Formula image not reproduced: contingency table as sum over indicator variables]

² Intuitively, the particular arrangement of the pair tokens in the sample cannot provide any meaningful information, since it is presupposed to be random. In particular, all reorderings of the sample must be equally likely. It is therefore sufficient to consider the total (co)occurrence frequencies. In fact, a sufficient statistic is already given by three of the variables in a contingency table (e.g. the joint frequency X₁₁ and the first row and column totals X_R₁ and X_C₁) because the four cells must add up to the sample size: X₁₁ + X₁₂ + X₂₁ + X₂₂ = N.
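
The redundancy described in this footnote can be sketched in a few lines of Python. The function name full_table and its argument names are hypothetical; the example table (joint frequency 100, first row and column totals 1,000, N = 1,000,000) is the one used in a later footnote.

```python
def full_table(o11, r1, c1, n):
    """Recover all four cells from the joint frequency, the first row
    and column totals, and the sample size."""
    o12 = r1 - o11              # rest of the first row
    o21 = c1 - o11              # rest of the first column
    o22 = n - o11 - o12 - o21   # the four cells must add up to n
    return o11, o12, o21, o22


print(full_table(100, 1000, 1000, 1_000_000))
# (100, 900, 900, 998100)
```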

³ Formally, the sampling distribution is the joint probability distribution of the random variables X₁₁, ..., X₂₂.

⁴ The name maximum-likelihood estimates derives from the fact that the estimated values maximise the probability (or likelihood) of the observed contingency table among the set of all possible parameter values.

⁵ Precisely speaking, the point null hypothesis consists of three conditions: π₁ = p₁, π₂ = p₂, and π = p₁ p₂. Although p₁ and p₂ are random variables, they are treated as constants in the statistical model, which are set to the values computed from the observed data. Hypothesis tests based on the point null hypothesis thus effectively ignore the sampling error of p₁ and p₂.
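
As a minimal sketch of how the point null hypothesis is fixed from the data, the following Python snippet computes p₁ and p₂ from the observed margins and derives π = p₁ p₂. It assumes the usual estimates p₁ = R₁/N and p₂ = C₁/N; the function name point_null is hypothetical, and the expected frequency E₁₁ = N p₁ p₂ is added for illustration only.

```python
from fractions import Fraction


def point_null(r1, c1, n):
    """Parameter values of the point null hypothesis for observed margins."""
    p1 = Fraction(r1, n)   # assumed estimate for pi_1
    p2 = Fraction(c1, n)   # assumed estimate for pi_2
    pi = p1 * p2           # independence: pi = p1 * p2
    e11 = n * pi           # expected joint frequency under the null
    return p1, p2, pi, e11


p1, p2, pi, e11 = point_null(1000, 1000, 1_000_000)
print(p1, p2, pi, e11)
# 1/1000 1/1000 1/1000000 1
```

Exact fractions are used so that the values are not blurred by floating-point rounding.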

¹ This likelihood is the probability of an outcome where X₁₁ equals the observed value O₁₁, while the values of X₁₂, X₂₁, and X₂₂ are unspecified. Note that the observed marginal frequencies still have some effect through their influence on the point null hypothesis.

¹ Try the command phyper(99, 1000, 999000, 1000, lower=F), which computes the Fisher score for a contingency table with O₁₁ = 100, R₁ = C₁ = 1,000, and N = 1,000,000. At least on versions up to R-1.9.0 running under Linux/i386, the result is a negative p-value (P < 0)!
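
For comparison, the same tail probability can be computed without floating-point cancellation by summing exact integer binomial coefficients. The Python sketch below is not the algorithm used by R; the function name hypergeom_upper_tail is hypothetical, and it assumes that the Fisher score is the hypergeometric probability P(X₁₁ ≥ O₁₁) with fixed margins, which is what phyper(O₁₁ - 1, ..., lower=F) computes.

```python
from math import comb


def hypergeom_upper_tail(o11, r1, c1, n):
    """P(X11 >= o11) for a 2x2 table with fixed margins r1, c1 and size n."""
    start = max(o11, 0)
    stop = min(r1, c1)
    # Exact integer count of tables with X11 = k, summed over the upper tail.
    favourable = sum(
        comb(r1, k) * comb(n - r1, c1 - k) for k in range(start, stop + 1)
    )
    # Integer true division is correctly rounded, so the result cannot
    # become negative through rounding errors.
    return favourable / comb(n, c1)


p = hypergeom_upper_tail(100, 1000, 1000, 1_000_000)
print(p)        # a tiny positive number
print(p > 0)    # True
```

The exact sum is slower than a library routine, but it is fast enough for a single table and makes a convenient sanity check.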

¹ In earlier days, this task involved enormous tomes of statistical tables where p-values for many known distributions were tabulated. Back then, without the help of desktop computers, it was impossible to carry out exact hypothesis tests except for the case of very small samples. Such practical considerations were an important reason for the concentration on asymptotic (rather than exact) hypothesis tests during the first half of the 20th century.

² For instance, common sense dictates that in those cases where a contingency table A is clearly less consistent with the null hypothesis than a table B, the test statistic should assume a greater value for A than for B. In many other cases, where the desired result of the comparison is not obvious, the definition of the test statistic is essentially an intuitive choice.

³ An equivalence proof for the three different versions of the chi-squared measure can be based on the fact that the identity (O₁₁ - E₁₁)² = (O₁₂ - E₁₂)² = (O₂₁ - E₂₁)² = (O₂₂ - E₂₂)² holds for any contingency table.
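
The identity can be checked numerically with exact rational arithmetic. The Python sketch below also compares three algebraically equivalent ways of writing Pearson's chi-squared statistic for a 2×2 table; whether these correspond exactly to the three versions meant here is an assumption, and all function names are hypothetical.

```python
from fractions import Fraction


def expected(o11, o12, o21, o22):
    """Expected frequencies E11, E12, E21, E22 under independence."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    return [Fraction(r * c, n) for r in (r1, r2) for c in (c1, c2)]


def squared_deviations(o11, o12, o21, o22):
    """(O_ij - E_ij)^2 for the four cells, as exact fractions."""
    obs = [o11, o12, o21, o22]
    return [(o - e) ** 2 for o, e in zip(obs, expected(*obs))]


def chi_squared_forms(o11, o12, o21, o22):
    """Three ways of writing the chi-squared statistic (margins must be > 0)."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    exp = expected(o11, o12, o21, o22)
    sq = squared_deviations(o11, o12, o21, o22)
    # (a) sum over all four cells
    cell_sum = sum(s / e for s, e in zip(sq, exp))
    # (b) closed form based on the cross-product difference
    closed = Fraction(n * (o11 * o22 - o12 * o21) ** 2, r1 * r2 * c1 * c2)
    # (c) form that only needs the deviation of the first cell
    single = n * sq[0] / (exp[0] * exp[3])
    return cell_sum, closed, single


print(len(set(squared_deviations(100, 900, 900, 998100))) == 1)   # True
a, b, c = chi_squared_forms(100, 900, 900, 998100)
print(a == b == c)                                                # True
```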

⁴ The number of degrees of freedom is given by the dimension of the parameter space minus the dimension of the null hypothesis (which is formally a subset of the parameter space). In the case of cooccurrence data, the former has dimension 3 (with free parameters π, π₁, π₂), while the latter has dimension 2 (with π₁, π₂ as free parameters, and π determined by H₀). Therefore, the limiting χ² distribution of the likelihood ratio statistic has one degree of freedom.

¹ In particular, odds-ratio does not make any distinction between contingency tables where either O₁₂ = 0 or O₂₁ = 0 (because they are assigned the same infinite score). After discounting, odds-ratio_disc assigns higher scores to tables where both non-diagonal cells are empty (O₁₂ = O₂₁ = 0) rather than just one, and it takes the cooccurrence frequency O₁₁ into account.
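
The effect of discounting can be illustrated with a short Python sketch. It assumes that odds-ratio is the logarithm of O₁₁O₂₂ / (O₁₂O₂₁) and that discounting adds a constant 0.5 to every cell; neither definition is spelled out in this footnote, and the function names are hypothetical.

```python
from math import inf, log


def odds_ratio(o11, o12, o21, o22):
    """Log odds ratio; infinite as soon as one off-diagonal cell is empty."""
    num, den = o11 * o22, o12 * o21
    if den == 0:
        return inf
    if num == 0:
        return -inf
    return log(num / den)


def odds_ratio_disc(o11, o12, o21, o22, d=0.5):
    """Discounted variant: add d (assumed to be 0.5) to every cell."""
    return log((o11 + d) * (o22 + d) / ((o12 + d) * (o21 + d)))


one_empty = (10, 0, 5, 985)     # only O12 = 0
both_empty = (10, 0, 0, 990)    # O12 = O21 = 0
more_pairs = (50, 0, 0, 950)    # both empty, higher O11

print(odds_ratio(*one_empty) == odds_ratio(*both_empty))             # True
print(odds_ratio_disc(*both_empty) > odds_ratio_disc(*one_empty))    # True
print(odds_ratio_disc(*more_pairs) > odds_ratio_disc(*both_empty))   # True
```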

² In the same paper, the authors also argue in favour of point estimates, which they interpret as descriptive rather than inferential measures. They state that descriptive statistics are more appropriate when it is feasible to analyse a population exhaustively (which they imply to be the case for the very large corpora that are available today). Interestingly, this argument is followed by an empirical evaluation of the MS measure (as an example of a descriptive measure) on a subset of the Wall Street Journal.

³ Let g be the Dice score for a given contingency table. Then the Jaccard score h for the same table is given by the equation h = g / (2 - g).
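
The relation between the two scores can be verified with exact fractions. The sketch below assumes the common definitions Dice = 2 O₁₁ / (R₁ + C₁) and Jaccard = O₁₁ / (O₁₁ + O₁₂ + O₂₁), which are not restated in this footnote; the function names are hypothetical.

```python
from fractions import Fraction


def dice(o11, o12, o21):
    """Dice coefficient, assuming 2*O11 / (R1 + C1)."""
    r1, c1 = o11 + o12, o11 + o21
    return Fraction(2 * o11, r1 + c1)


def jaccard(o11, o12, o21):
    """Jaccard coefficient, assuming O11 / (O11 + O12 + O21)."""
    return Fraction(o11, o11 + o12 + o21)


g = dice(100, 900, 900)
h = jaccard(100, 900, 900)
print(g, h)                # 1/10 1/19
print(h == g / (2 - g))    # True
```

Since g / (2 - g) increases with g on the interval from 0 to 1, both measures produce the same ranking of pair types.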

¹ Let g be the local-MI score for a given pair type (u,v), and let h be the score of the Poisson-Stirling measure. Then the following equality holds: h = g - O₁₁.
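
A quick numerical check of this equality, assuming the definitions local-MI = O₁₁ log(O₁₁ / E₁₁) and Poisson-Stirling = O₁₁ (log O₁₁ - log E₁₁ - 1) with natural logarithms in both measures. These definitions and the function names are assumptions, since they are not given in this footnote.

```python
from math import isclose, log


def local_mi(o11, e11):
    """local-MI, assuming O11 * log(O11 / E11) with the natural logarithm."""
    return o11 * log(o11 / e11)


def poisson_stirling(o11, e11):
    """Poisson-Stirling, assuming O11 * (log O11 - log E11 - 1)."""
    return o11 * (log(o11) - log(e11) - 1)


g = local_mi(100, 1.0)
h = poisson_stirling(100, 1.0)
print(isclose(h, g - 100))   # True
```

The equality only holds if both measures use the same logarithm base, and the constant offset is exactly O₁₁ only for the natural logarithm.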

¹ Let g be the gmean association score for a pair type (u,v). Then the score h of the MI² measure is given by h = log(g²) + log N, which is a monotonic transformation (for a fixed sample of size N).
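
The transformation can be checked numerically under the assumed definitions gmean = O₁₁ / √(R₁C₁) and MI² = log(O₁₁² / E₁₁) with E₁₁ = R₁C₁ / N. The definitions, the choice of the natural logarithm, and the function names are assumptions made for this sketch.

```python
from math import isclose, log, sqrt


def gmean(o11, r1, c1):
    """Geometric mean measure, assuming O11 / sqrt(R1 * C1)."""
    return o11 / sqrt(r1 * c1)


def mi_squared(o11, r1, c1, n):
    """MI^2, assuming log(O11^2 / E11) with E11 = R1 * C1 / N."""
    e11 = r1 * c1 / n
    return log(o11 ** 2 / e11)


n = 1_000_000
g = gmean(100, 1000, 1000)
h = mi_squared(100, 1000, 1000, n)
print(isclose(h, log(g ** 2) + log(n)))   # True
```

Because log N is a constant for a fixed sample, the two measures rank all pair types from that sample identically.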