Computational Approaches to Collocations

Latest News: UCS toolkit v0.6 now available from multiword.sf.net (2010-09-12)

Most algorithms for the extraction of collocations from machine-readable text rely on statistical association measures, which are applied to 2-by-2 contingency tables representing the cooccurrence frequencies of word pairs. These pages provide a repository for the large number of association measures that have been suggested in the literature, together with a short discussion of their mathematical background and key references. Equations for all measures are given in terms of observed and expected frequencies so that they can easily be implemented.
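As a rough illustration of what "in terms of observed and expected frequencies" can look like in practice, the following Python sketch computes the expected frequencies of a 2-by-2 contingency table under the independence assumption, then derives two widely known measures from them (pointwise mutual information and Pearson's chi-squared). The function names and the cell naming `o11` ... `o22` are assumptions made for this example; they are not taken from the UCS toolkit.

```python
import math


def expected_frequencies(o11, o12, o21, o22):
    """Expected cell frequencies of a 2-by-2 table under independence.

    o11 is the cooccurrence frequency of the word pair; the other cells
    count the remaining combinations (illustrative naming).
    """
    n = o11 + o12 + o21 + o22            # sample size
    r1, r2 = o11 + o12, o21 + o22        # row totals
    c1, c2 = o11 + o21, o12 + o22        # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)


def pointwise_mi(o11, o12, o21, o22):
    """Pointwise mutual information: log2 of observed over expected (cell 11)."""
    e11 = expected_frequencies(o11, o12, o21, o22)[0]
    return math.log2(o11 / e11)


def chi_squared(o11, o12, o21, o22):
    """Pearson's chi-squared statistic, summed over all four cells."""
    observed = (o11, o12, o21, o22)
    expected = expected_frequencies(o11, o12, o21, o22)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up frequencies for a single word pair, for illustration only.
table = (20, 80, 80, 820)
print(expected_frequencies(*table))
print(pointwise_mi(*table))
print(chi_squared(*table))
```

Both measures are written purely in terms of the observed and expected values, so adding another measure of this kind only requires one more small function over the same eight numbers.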

Download the UCS toolkit for the statistical analysis of cooccurrence data with association measures and their evaluation in a collocation extraction task.

Recent publications by Stefan Evert and Brigitte Krenn on computational approaches to collocations, available for download from this site. Also includes slides from the ESSLLI 2003 course Computational Approaches to Collocations.

For the evaluation of association measures and other collocation extraction methods, a list of true positives has to be created, which is usually achieved by manual identification of collocations among the candidate pairs. This page collects guidelines for this task that have been written for different evaluation experiments.
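One simple way such a list of true positives might be used is to rank the candidate pairs by an association score and measure the precision of the n highest-ranked candidates. The sketch below assumes this n-best evaluation scheme; the function name, the data layout, and the example word pairs and scores are all hypothetical and serve only as an illustration.

```python
def n_best_precision(scored_pairs, true_positives, n):
    """Proportion of true positives among the n highest-scoring candidates.

    scored_pairs: iterable of (pair, score) tuples (hypothetical layout)
    true_positives: set of pairs manually identified as collocations
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    top = [pair for pair, _score in ranked[:n]]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in true_positives)
    return hits / len(top)


# Invented candidates and scores, for illustration only.
candidates = [
    (("strong", "tea"), 7.2),
    (("of", "the"), 0.3),
    (("heavy", "rain"), 6.5),
    (("green", "car"), 1.1),
]
gold = {("strong", "tea"), ("heavy", "rain")}

print(n_best_precision(candidates, gold, 2))
print(n_best_precision(candidates, gold, 4))
```

Comparing this value across different association measures on the same candidate set gives a direct, if coarse, picture of which measure ranks the manually identified collocations higher.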

Conferences and Workshops

(under construction)

Selected references to publications on the theory of collocations, their applications, automatic extraction from corpus data, and statistical association measures. Includes citation details for all books and articles mentioned on this web page.

About the author of these pages.