UCS Quickstart: A Very Quick Guide

by Viktor Trón

Download

the UCS toolkit by Stefan Evert from the UCS download page.

UCS relies on Perl and R (a language for statistical computing). UCS/Perl uses R as a back end: important statistical functions provided by R are available through a Perl module.

UCS will carp about any further missing dependencies.

Install

UCS with

tar xzvf UCS-0.3.2.tar.gz
cd UCS
perl System/install.perl


answer the questions and rejoice.

Configure

UCS (assuming you use bash and are still in the UCS top-level directory) with

export UCS=`System/bin/ucs-config --base-dir`
export PATH=$PATH:$UCS/bin


Documentation

is embedded in the modules and programs in Perl POD format. Display it with

ucsdoc ProgramName|ModuleName

For GUI viewing (if the Perl module Tk::Pod was available at installation time) use:

ucsdoc -tk ProgramName|ModuleName

Tutorials

online:

UCS/Perl tutorial
UCS/R tutorial

Publications

related to UCS by Stefan Evert et al. are here.

Data set

format (with extension .ds)

The fundamental objects of the UCS toolkit are data sets: frequency data extracted from a given corpus for a given type of cooccurrence.

A data set file consists of a list of pair types (as opposed to pair tokens in a text) with their frequency signatures (i.e. joint and marginal frequencies); see Evert 2004. For more on the UCS data set file format (.ds), see [ucsfile]. Data set files are processed in gzipped form (.ds.gz). Examples are in DataSet/Distrib/.

Get info

about the data file with

ucs-info -v glaw.ds.gz

View

the data file through a pager with

zmore $UCS/DataSet/Distrib/dickens.ds.gz

or much more conveniently (with persistent column headers) with

ucs-print -i dickens.ds.gz

Format

the data file as an ASCII table with

ucs-print dickens.ds.gz

Select parts

of the data (and display/save them) with

ucs-select f FROM glaw.ds.gz TO ranks.ds.gz

'f' selects the variable named f; FROM and TO are keywords (not case-sensitive)

Create your own data set

from a set of pair tokens standing in any structural relation. Assuming that you have an extraction tool (YourExtractionTool) that prints one pair token per line to standard output, in the format ITEM1 TAB ITEM2 NEWLINE, you can construct your data set with

YourExtractionTool | ucs-make-tables -v

See [ucs-make-tables]
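
To make the input format and the notion of a frequency signature concrete, here is a minimal sketch in plain shell and awk. It does not use UCS and is not meant to show how ucs-make-tables works internally; the function name pair_signatures and the output column order (l1, l2, f, f1, f2, N) are assumptions made for this illustration.

```shell
# Hypothetical helper (not part of UCS): reads pair tokens
# (ITEM1 TAB ITEM2, one per line) from standard input and prints one line
# per pair type: l1, l2, f (joint), f1 and f2 (marginals), N (sample size).
pair_signatures() {
  awk -F'\t' -v OFS='\t' '
    NF == 2 {
      joint[$1 FS $2]++   # joint frequency of the pair type
      first[$1]++         # marginal frequency of the first item
      second[$2]++        # marginal frequency of the second item
      n++                 # total number of pair tokens
    }
    END {
      for (key in joint) {
        split(key, item, FS)
        print item[1], item[2], joint[key], first[item[1]], second[item[2]], n
      }
    }
  ' | sort
}

# Usage: four pair tokens that collapse into three pair types.
printf 'black\tcat\nblack\tcat\nblack\tdog\nwhite\tcat\n' | pair_signatures
```

Each output line describes a pair type rather than a pair token, which is the distinction the data set format relies on.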

An example script extracts adjective+noun (A+N) cooccurrences from a corpus indexed with the IMS Corpus Workbench (CWB). With the CWB/Perl modules and the demo corpus installed, one can re-create the Dickens data set with

$UCS/Perl/tools/ucs-adj-n-from-cwb.perl penn DICKENS | ucs-make-tables -v -f 3 my-dickens.ds.gz

Import data sets

e.g., from the Ngram Statistics Package (NSP). Assuming bigrams.cnt was created with NSP's count.pl tool, create the UCS data set from it with

$UCS/Perl/tools/nsp2ucs.perl -v bigrams.cnt bigrams.ds.gz

Get a statistical summary

(minimum, maximum, mean, variance and standard deviation of the variables) with

ucs-summarize -v

Sort

according to any variable, along the lines of:

ucs-sort -v dickens.ds.gz BY f+ -r INTO sorted.ds.gz

This sorts a gzipped .ds file on the variable named 'f' in ascending order (+; descending is the default) and breaks ties randomly (-r). The output is also a gzipped .ds file. See [ucs-sort].

Add association scores

, i.e., annotate a data set with your favourite association measure with

ucs-add -v am.t.score am.log.likelihood TO dickens.ds.gz INTO scores.ds.gz
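
As a rough illustration of what such scores measure, the sketch below computes a t-score and a log-likelihood ratio from a single frequency signature, using the textbook contingency-table formulas. It is independent of UCS: the function name assoc_scores and the output labels are hypothetical, and it is only an assumption that am.t.score and am.log.likelihood agree with these formulas in every detail.

```shell
# Hypothetical helper (not part of UCS): prints a t-score and a
# log-likelihood ratio (G2) for one frequency signature: f f1 f2 N.
# Assumes f > 0, as for any pair type that occurs in a data set.
assoc_scores() {
  awk -v f="$1" -v f1="$2" -v f2="$3" -v n="$4" '
    # One term of the G2 sum; 0 * log(0) is taken to be 0.
    function term(o, e) { return (o > 0) ? o * log(o / e) : 0 }
    BEGIN {
      # Observed contingency table
      o11 = f;      o12 = f1 - f
      o21 = f2 - f; o22 = n - f1 - f2 + f
      # Expected frequencies under independence
      e11 = f1 * f2 / n;       e12 = f1 * (n - f2) / n
      e21 = (n - f1) * f2 / n; e22 = (n - f1) * (n - f2) / n
      t  = (o11 - e11) / sqrt(o11)
      g2 = 2 * (term(o11, e11) + term(o12, e12) + term(o21, e21) + term(o22, e22))
      printf "t.score=%.4f log.likelihood=%.4f\n", t, g2
    }
  '
}

# Usage: a pair seen 30 times where independence would predict 2.
assoc_scores 30 100 200 10000
```

When the observed joint frequency equals the expected one (for example f=2 with the same marginals), both scores are zero.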

Add ranks

(based on association measures) to the dataset with

ucs-add -v 'r.%' TO scores.ds.gz INTO ranks.ds.gz

r.% is a wildcard, see [ucsexp]
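
The idea behind a rank variable can be sketched in plain shell: sort the pair types by score, highest first, and number them. This is an illustration only, not the UCS implementation; the function name add_ranks and the two-column input layout are assumptions, and tied scores simply receive consecutive ranks here, which may differ from what UCS does.

```shell
# Hypothetical helper (not part of UCS): reads "pair TAB score" lines and
# appends a rank column, with rank 1 for the highest score.
add_ranks() {
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr |
    awk -F'\t' -v OFS='\t' '{ print $0, NR }'
}

# Usage: three scored pair types.
printf 'black cat\t5.1\nwhite cat\t0.3\nblack dog\t12.7\n' | add_ranks
```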

Count

the number of pair types with cooccurrence frequency >= X with:

ucs-select -v --count FROM ranks.ds.gz WHERE '%f% >= 10'

%f% is a UCS expression, see [ucsexp]
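
For readers who want to see the logic of such a count outside UCS, here is a minimal awk sketch. It works on a simplified table and not on real .ds files: the layout (tab-separated columns with a single header line naming the variables) is an assumption for this example, and count_min_freq is a hypothetical name. See [ucsfile] for the actual file format.

```shell
# Hypothetical helper (not part of UCS): counts the rows of a tab-separated
# table whose column named "f" is >= the given threshold.
# The first input line is assumed to be a header with the column names.
count_min_freq() {
  awk -F'\t' -v min="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "f") col = i; next }
    col && $col + 0 >= min + 0 { count++ }
    END { print count + 0 }
  '
}

# Usage: how many pair types have a cooccurrence frequency of at least 10?
printf 'l1\tl2\tf\nblack\tcat\t12\nblack\tdog\t3\nwhite\tcat\t10\n' | count_min_freq 10
```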

UCS expressions

are snippets of Perl code with special syntax to access data set variables. They have the full power of Perl. For example, to retrieve all collocates of nouns ending in -ness:

ucs-select -v '*' 'r.%' FROM ranks.ds.gz WHERE '%l2% =~ /ness$/' | ucs-sort by l2 l1 | ucs-print -i

Check overlap

of two ds or adb (annotated database) files with

ucs-join -v fr-pnv.ds.gz pnv.adb.gz

Transfer annotation attributes

across files with

ucs-join -v fr-pnv.ds.gz WITH b.figur b.fvg FROM pnv.adb.gz INTO fr-annotated.ds.gz

Create new variables

(and add them) with

ucs-add -v -m 'b.TP := %b.figur% or %b.fvg%' TO fr-annotated.ds.gz INTO fr-annotated.ds.gz

Evaluate

recall of an association measure, for example by counting the true positives whose log-likelihood rank is <= 500:

ucs-select -v --count FROM fr-annotated.ds.gz WHERE '%b.TP% and %r.log.likelihood% <= 500'
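
The count above is only the numerator. As a sketch of how it turns into precision and recall, the following awk function takes a simplified two-column input (rank, then 1 for a true positive and 0 otherwise). The input layout and the name eval_top_n are assumptions for this illustration and are not part of UCS.

```shell
# Hypothetical helper (not part of UCS): reads "rank TAB is_tp" lines
# (is_tp is 1 or 0) and prints the number of true positives, precision and
# recall for the pair types with rank <= the given cutoff.
eval_top_n() {
  awk -F'\t' -v top="$1" '
    $2 == 1              { total_tp++ }  # all true positives in the data
    $1 <= top            { selected++ }  # pair types within the cutoff
    $1 <= top && $2 == 1 { hits++ }      # true positives within the cutoff
    END {
      if (selected == 0 || total_tp == 0) { print "undefined"; exit }
      printf "tp=%d precision=%.2f recall=%.2f\n", hits, hits / selected, hits / total_tp
    }
  '
}

# Usage: five ranked pair types, three of them true positives, cutoff 3.
printf '1\t1\n2\t0\n3\t1\n4\t1\n5\t0\n' | eval_top_n 3
```

Precision divides the hits by the number of selected pair types, while recall divides them by the total number of true positives in the whole data set.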