ucsintro - A first introduction to UCS/Perl
UCS is a set of libraries and tools intended for the empirical study of cooccurrence statistics. Its major uses are to apply such statistics, called association measures, to cooccurrence data obtained from a corpus, and to evaluate the resulting association scores and rankings against (manually annotated) reference data.
The frequency data extracted from a given corpus for a given type of cooccurrences consists of a list of pair types with their frequency signatures (i.e. joint and marginal frequencies), and is referred to as a data set. See (Evert 2004) for a detailed explanation of these concepts, different types of cooccurrences, and correct methods for obtaining frequency data. Data sets, stored in a special .ds file format, are the fundamental objects of the UCS toolkit. Most UCS programs manipulate or display such data set files.
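As an illustration of what a frequency signature contains, the following is a minimal sketch in plain Perl that counts joint and marginal frequencies from a toy list of pair tokens. It does not use any UCS module and is not the UCS implementation; the function name frequency_signatures, the toy input, and the hash keys (l1, l2, f, f1, f2, N) are assumptions chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical toy input: each element is one pair token [word1, word2].
my @tokens = (
    [ "black", "box" ],
    [ "black", "box" ],
    [ "black", "cat" ],
    [ "red",   "box" ],
    [ "red",   "cat" ],
);

# Build one row per pair type with its frequency signature:
#   f  = joint frequency of the pair type
#   f1 = marginal frequency of the first component
#   f2 = marginal frequency of the second component
#   N  = total number of pair tokens (sample size)
sub frequency_signatures {
    my @pair_tokens = @_;
    my (%joint, %first, %second);
    for my $token (@pair_tokens) {
        my ($w1, $w2) = @$token;
        $joint{$w1}{$w2}++;
        $first{$w1}++;
        $second{$w2}++;
    }
    my $n = scalar @pair_tokens;
    my @rows;
    for my $w1 (sort keys %joint) {
        for my $w2 (sort keys %{ $joint{$w1} }) {
            push @rows, {
                l1 => $w1,
                l2 => $w2,
                f  => $joint{$w1}{$w2},
                f1 => $first{$w1},
                f2 => $second{$w2},
                N  => $n,
            };
        }
    }
    return @rows;
}

# Print the resulting toy "data set", one pair type per line.
my @data_set = frequency_signatures(@tokens);
for my $row (@data_set) {
    printf "%-6s %-4s f=%d f1=%d f2=%d N=%d\n",
        @{$row}{qw(l1 l2 f f1 f2 N)};
}
```

For real corpora, correct extraction of pair tokens is the hard part; see (Evert 2004) and the ucs-make-tables manpage rather than relying on this sketch.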
The UCS implementation relies heavily on the programming language Perl (http://www.perl.com/) and on the free statistical environment R (http://www.r-project.org/), which serves as a library of mathematical and statistical functions. The core of UCS is written in Perl (the UCS/Perl part), but there is also a small library of R functions for interactive work within R (the UCS/R part). UCS/Perl uses R as a back-end, making the most important statistical functions available through a Perl module.
UCS/Perl is mainly a collection of Perl modules that perform the following tasks:
Most UCS programs will be custom-built scripts that use the library of support functions provided by the UCS/Perl modules. Loading a data set, annotating it with association scores from one or more measures, and sorting it in various ways can be done with a few lines of Perl code. There are also some ready-made programs in UCS/Perl that perform such standard tasks, operating on data set files. A substantial part of the UCS/Perl functionality is thus accessible from the command line, at the cost of some additional overhead compared to a custom script (which operates on in-memory representations).
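The annotate-and-sort workflow can be sketched conceptually in plain Perl. The example below deliberately avoids the UCS modules, whose actual interfaces are documented in their own manpages; the in-memory rows, the function names mi_score and annotate_and_sort, and the choice of pointwise mutual information (log2 of observed over expected joint frequency) as the association measure are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical in-memory data set: one hash per pair type,
# holding the frequency signature (f, f1, f2, N).
my @data_set = (
    { l1 => "black", l2 => "box", f => 2, f1 => 3, f2 => 3, N => 5 },
    { l1 => "black", l2 => "cat", f => 1, f1 => 3, f2 => 2, N => 5 },
    { l1 => "red",   l2 => "box", f => 1, f1 => 2, f2 => 3, N => 5 },
    { l1 => "red",   l2 => "cat", f => 1, f1 => 2, f2 => 2, N => 5 },
);

# Pointwise mutual information: log2(observed / expected), where the
# expected joint frequency under independence is f1 * f2 / N.
sub mi_score {
    my ($row) = @_;
    my $expected = $row->{f1} * $row->{f2} / $row->{N};
    return log($row->{f} / $expected) / log(2);
}

# Annotate every row with its score (stored under the key "mi",
# modifying the row hashes in place) and return the rows sorted
# by decreasing score.
sub annotate_and_sort {
    my @rows = @_;
    $_->{mi} = mi_score($_) for @rows;
    my @sorted = sort { $b->{mi} <=> $a->{mi} } @rows;
    return @sorted;
}

# Print the ranking, best-scoring pair type first.
for my $row (annotate_and_sort(@data_set)) {
    printf "%-6s %-4s mi=%.3f\n", $row->{l1}, $row->{l2}, $row->{mi};
}
```

A real UCS/Perl script would instead load a .ds file and use the built-in association measures; this sketch only shows the shape of the computation.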
Below, you will find a list of the general documentation files, Perl modules, and programs that are included in the UCS/Perl distribution. Manpages for all modules and programs (as well as the general documentation) are easily accessible with the ucsdoc program, and can also be formatted for printing.
  ucsdoc ucsintro   # this introduction
  ucsdoc ucsfile    # description of the UCS data set file format (.ds)
  ucsdoc ucsexp     # UCS expressions and wildcards
  ucsdoc ucsam      # overview of built-in association measures
  use UCS;                    # core library
  use UCS::File;              # file access utilities
  use UCS::R;                 # interface to UCS/R
  use UCS::SFunc;             # special functions and statistical distributions
  use UCS::Expression;        # Perl code interspersed with UCS variables
  use UCS::Expression::Func;  # utility functions available in UCS expressions
  use UCS::AM;                # implementations of various association measures
  use UCS::AM::HTest;         # add-on package: variants of hypothesis tests
  use UCS::AM::Parametric;    # add-on package: parametric association measures
  use UCS::DS;                # data sets ...
  use UCS::DS::Stream;        # i/o streams for data set files
  use UCS::DS::Memory;        # in-memory representation of data sets
  use UCS::DS::Format;        # ASCII formatter (+ other formats)

See the respective manpages (ucsdoc ModuleName) for more information.
  ucsdoc            # front-end to perldoc
  ucs-config        # automatic configuration of UCS/Perl scripts
  ucs-tool          # find and run user-contributed UCS/Perl scripts
  ucs-list-am       # list built-in association measures & add-on packages
  ucs-make-tables   # compute frequency signatures from list of pair tokens
  ucs-merge         # merge parts of very large data set
  ucs-summarize     # print (statistical) summaries for selected variables
  ucs-select        # select rows and/or columns from a data set file
  ucs-add           # add variables to a data set file
  ucs-join          # combine rows and/or columns from two data sets
  ucs-sort          # sort data set file by specified attribute(s)
  ucs-info          # display information from header of data set file
  ucs-print         # format data set as ASCII table (for viewing and printing)

See the respective manpages (ucsdoc ProgramName) for more information.
UCS stands for Utilities for Cooccurrence Statistics.
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, University of Stuttgart, Germany.
On-line repository of association measures: http://www.collocations.de/
Copyright (C) 2004-2010 by Stefan Evert.
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.