The UCS toolkit.

The UCS toolkit is a collection of libraries and scripts for the statistical analysis of cooccurrence data. Each data set contains a list of word pairs together with their joint and marginal frequencies, and is stored in a tabular format in a plain text file (optionally compressed). With the UCS/Perl subsystem, data sets can be viewed, printed, manipulated in various ways, annotated with association scores from a wide range of built-in measures, ranked, and sorted. The UCS/R subsystem provides additional functionality for the graphical evaluation of association measures in a collocation extraction task (cf. Evert & Krenn, 2001).
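To illustrate what "annotating a data set with association scores" means, here is a minimal Python sketch. It is not part of UCS and does not use the UCS file format or API: the function names, the tuple layout, and the toy frequencies are all assumptions made up for this example. It computes two textbook association measures (pointwise MI and the t-score) from a pair's joint frequency f, its marginal frequencies f1 and f2, and the sample size N, and then ranks the pairs by score.

```python
import math

def expected_frequency(f1, f2, n):
    """Expected joint frequency under independence: E11 = f1 * f2 / N."""
    return f1 * f2 / n

def mi_score(f, f1, f2, n):
    """Pointwise mutual information: log2(O11 / E11). Requires f > 0."""
    return math.log2(f / expected_frequency(f1, f2, n))

def t_score(f, f1, f2, n):
    """t-score: (O11 - E11) / sqrt(O11). Requires f > 0."""
    return (f - expected_frequency(f1, f2, n)) / math.sqrt(f)

def rank_pairs(pairs, measure):
    """Sort (w1, w2, f, f1, f2, N) tuples by descending association score."""
    return sorted(pairs, key=lambda p: measure(*p[2:]), reverse=True)

# Toy data, invented for illustration only (hypothetical layout):
# (word 1, word 2, joint freq f, marginal f1, marginal f2, sample size N)
PAIRS = [
    ("of", "the", 8, 100, 80, 1000),
    ("black", "coffee", 16, 100, 80, 1000),
    ("green", "idea", 4, 100, 80, 1000),
]

if __name__ == "__main__":
    for w1, w2, f, f1, f2, n in rank_pairs(PAIRS, mi_score):
        print(f"{w1} {w2}\tMI={mi_score(f, f1, f2, n):.2f}"
              f"\tt={t_score(f, f1, f2, n):.2f}")
```

Both measures compare the observed joint frequency with the frequency expected if the two words were statistically independent; different measures can rank the same data set quite differently, which is why it is useful to evaluate them side by side.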

The full release of the sample code and data sets accompanying my PhD thesis (Evert 2004), announced in the text as UCS version 0.5, has been delayed and will probably never happen. Please use the latest version of the UCS toolkit (0.6 or newer) instead, preferably installed directly from the SVN repository.

If you would like to replicate a particular analysis, please contact me by e-mail in order to obtain scripts and data in their current state.

NB: Future releases of the UCS toolkit are expected to require Perl version 5.8.1 or newer (for Unicode support) and R version 2.10.1 or newer.

Footnote: The UCS toolkit has been designed for scientific research on the properties of statistical association measures and the relation between cooccurrences and collocations. In my terminology, such research involves a close look at the data and a thorough understanding of the theoretical and methodological background. Flexibility is more important than either frills or speed. Therefore, the UCS system is not intended as a number cruncher that extracts and processes cooccurrences from several hundred million words of text in a few minutes. Nor is it a black box that accepts text files from a word processor and produces a list of collocation candidates at the push of a button.

