UCS Quickstart: A Very Quick Guide

Download

the UCS toolkit by Stefan Evert from the UCS download page.

UCS crucially relies on Perl and R (a language for statistical computing). UCS/Perl uses R as a backend: important statistical functions provided by R are available through a Perl module.

UCS will carp about any further missing dependencies.

Install

UCS with


tar xzvf UCS-0.3.2.tar.gz

cd UCS

perl System/install.perl

answer the questions and rejoice.

Configure

UCS (assuming bash and you are still in the UCS toplevel directory) with


export UCS=`System/bin/ucs-config --base-dir`

export PATH=$PATH:$UCS/bin

Documentation

is embedded in the modules and programs in Perl POD format. Display it with


   ucsdoc ProgramName|ModuleName

For GUI viewing (if the Perl module Tk::Pod was available at installation), use:


    ucsdoc -tk ProgramName|ModuleName

Tutorials

online:

UCS/Perl tutorial
UCS/R tutorial

Publications

related to UCS by Stefan Evert et al. are here.

Data set

format (with extension .ds)

Fundamental objects of the UCS toolkit are frequency data extracted from a given corpus for a given type of cooccurrences. Examples are

words or other morpho-syntactic units occurring in the same sentence or within a certain distance from each other
adjectives modifying nouns (as in the Dickens and GLAW data sets)
PPs that are P-objects or adjuncts of a verb (as in the FR-PNV data set)

A data set file consists of a list of pair types (as opposed to pair tokens in a text) with their frequency signatures (i.e. joint and marginal frequencies); see Evert 2004. For more on the UCS data set file format (.ds), see [ucsfile]. Data set files are processed in gzipped form (.ds.gz). Examples are in DataSet/Distrib/.
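To make the notion of a frequency signature concrete, here is a minimal sketch in plain shell and awk. It is not the UCS implementation. It assumes pair tokens arrive as ITEM1 TAB ITEM2 lines; the function name freq_signatures and the output column order (l1, l2, f, f1, f2, N) are chosen for illustration only.

```shell
# Sketch: derive frequency signatures from a stream of pair tokens.
# Input on stdin: one pair token per line, ITEM1 <TAB> ITEM2.
# Output: one line per pair type with the tab-separated columns
#         l1, l2, f, f1, f2, N (illustrative labels and order).
# freq_signatures is a hypothetical helper, not part of UCS.
freq_signatures() {
  awk -F '\t' '
    NF == 2 {
      pair[$1 FS $2]++   # joint frequency f of the pair type
      first[$1]++        # marginal frequency f1 of the first component
      second[$2]++       # marginal frequency f2 of the second component
      n++                # sample size N = number of pair tokens
    }
    END {
      for (key in pair) {
        split(key, item, FS)
        printf "%s\t%s\t%d\t%d\t%d\t%d\n",
               item[1], item[2], pair[key], first[item[1]], second[item[2]], n
      }
    }
  ' | LC_ALL=C sort   # awk iterates in arbitrary order, so sort for stable output
}

# Four pair tokens collapse into three pair types.
printf 'black\tcoffee\nblack\tcoffee\nblack\ttea\nstrong\ttea\n' | freq_signatures
```

The joint frequency counts how often the pair type occurs, the marginal frequencies count each component in its own position, and N is the total number of pair tokens.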

Get info

about the data file with


     ucs-info -v glaw.ds.gz

View

the data file through a pager with


     zmore $UCS/DataSet/Distrib/dickens.ds.gz

or much more conveniently (with persistent column headers) with


     ucs-print -i dickens.ds.gz

Format

the data file as an ASCII table with


     ucs-print dickens.ds.gz

Select parts

of the data (and display/save them) with

    
     ucs-select f FROM glaw.ds.gz TO ranks.ds.gz

'f' selects the variable named f; FROM and TO are keywords (not case-sensitive)

Create your own data set

from a set of pair tokens standing in any structural relation (examples above). Assuming that you have an extraction tool (YourExtractionTool) that prints the instances to standard output (in the format ITEM1 TAB ITEM2 NEWLINE, each line representing one pair token), you can construct your data set with


  YourExtractionTool | ucs-make-tables -v

See [ucs-make-tables]
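As a rough illustration of what such an extraction tool might look like, the sketch below prints every pair of adjacent words within a line in the ITEM1 TAB ITEM2 format. The function name adjacent_pairs and the file names in the closing comment are hypothetical, and whitespace tokenization is a simplifying assumption.

```shell
# Hypothetical stand-in for YourExtractionTool: emits every pair of adjacent
# whitespace-separated words within a line as ITEM1 <TAB> ITEM2,
# one pair token per output line.
adjacent_pairs() {
  awk '{ for (i = 1; i < NF; i++) printf "%s\t%s\n", $i, $(i + 1) }'
}

# Two input lines yield four pair tokens ("quick guide" occurs twice).
printf 'a very quick guide\nquick guide\n' | adjacent_pairs

# With UCS installed, the same stream could be piped on (file names are
# placeholders):
#   adjacent_pairs < corpus.txt | ucs-make-tables -v my-pairs.ds.gz
```

Repeated pairs are deliberately not merged here: the tool emits tokens, and counting them into pair types is left to the next stage of the pipeline.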

An example script extracts A+N cooccurrences from the IMS Corpus Workbench (CWB). With the CWB/Perl modules and the demo corpus installed, one can re-create the Dickens data set with

$UCS/Perl/tools/ucs-adj-n-from-cwb.perl penn DICKENS | ucs-make-tables -v -f 3 my-dickens.ds.gz

Import data sets

e.g., from the Ngram Statistics Package (NSP). Assuming bigrams.cnt was created with NSP's count.pl tool, create the UCS data set from it with


  $UCS/Perl/tools/nsp2ucs.perl -v bigrams.cnt bigrams.ds.gz

Get a statistical summary

(min, max, mean, variance, and standard deviation of the variables) with


     ucs-summarize -v

Sort

according to any variable along the lines of:

   
   ucs-sort -v dickens.ds.gz BY f+ -r INTO sorted.ds.gz

This sorts a gzipped ds file on the variable named 'f' in ascending order (+; descending is the default) and breaks ties randomly (-r). The output is also a gzipped ds file. See [ucs-sort].

Add association scores

, i.e., annotate a data set with your favourite association measure with


    ucs-add -v am.t.score am.log.likelihood TO dickens.ds.gz INTO scores.ds.gz
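For readers who want to see what an association score is computed from, the following sketch works out a t-score by hand with awk. It uses the common textbook form E = f1 * f2 / N and t = (f - E) / sqrt(f), which may differ in detail from what UCS computes. The function name t_score and the six-column input layout are assumptions made for illustration.

```shell
# Sketch: t-score from a frequency signature, in the common textbook form
#   E = f1 * f2 / N        (expected joint frequency under independence)
#   t = (f - E) / sqrt(f)
# Input on stdin: tab-separated l1, l2, f, f1, f2, N (assumed layout).
# Output: the input line plus the score, rounded to three decimals.
# t_score is a hypothetical helper, not a UCS program.
t_score() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    NF == 6 && $3 > 0 {
      expected = $4 * $5 / $6
      print $0, sprintf("%.3f", ($3 - expected) / sqrt($3))
    }'
}

# f = 30, f1 = 100, f2 = 60, N = 1000 gives E = 6 and t = 24 / sqrt(30).
printf 'strong\ttea\t30\t100\t60\t1000\n' | t_score
```

Lines with a joint frequency of zero are skipped, since the score is undefined there.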

Add ranks

(based on association measures) to the dataset with


    ucs-add -v 'r.%' TO scores.ds.gz INTO ranks.ds.gz

r.% is a wildcard, see [ucsexp]

Count

the number of pair types with cooccurrence frequency >= X with:


    ucs-select -v --count FROM ranks.ds.gz WHERE '%f% >= 10'

%f% is a UCS expression, see [ucsexp]

UCS expressions

are snippets of Perl code with special syntax to access data set variables. They have the full power of Perl. E.g., to retrieve all collocates of nouns ending in -ness:


       ucs-select -v '*' 'r.%' FROM ranks.ds.gz WHERE '%l2% =~ /ness$/' | ucs-sort by l2 l1 | ucs-print -i

Check overlap

of two ds or adb (annotated database) files with


ucs-join -v fr-pnv.ds.gz pnv.adb.gz

Transfer annotation attributes

across files with


	 ucs-join -v fr-pnv.ds.gz WITH b.figur b.fvg FROM pnv.adb.gz INTO fr-annotated.ds.gz

Create new variables

(and add them) with


       ucs-add -v -m 'b.TP := %b.figur% or %b.fvg%' \
                TO fr-annotated.ds.gz INTO fr-annotated.ds.gz

Evaluate

recall of an association measure, for example by counting the true positives among the 500 highest-ranked pairs according to log-likelihood (rank <= 500) with


ucs-select -v --count FROM fr-annotated.ds.gz \
                WHERE '%b.TP% and %r.log.likelihood% <= 500'
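A count of true positives only becomes a precision or recall figure once it is divided by the right total. The sketch below shows that arithmetic on a toy ranking. It assumes a two-column input of rank and true-positive flag, ranks without ties, and at least n ranked candidates; the function name nbest_eval is hypothetical and not a UCS program.

```shell
# Sketch: precision and recall of the n best candidates of a ranking.
# Input on stdin: tab-separated RANK, TP (1 = true positive, 0 = not);
# this layout is an assumption for illustration.
# Argument 1: n, the rank cutoff.
#   precision = true positives with rank <= n, divided by n
#   recall    = true positives with rank <= n, divided by all true positives
nbest_eval() {
  awk -F '\t' -v n="$1" '
    $2 == 1 { total++; if ($1 <= n) hits++ }
    END {
      if (total > 0 && n > 0)
        printf "precision=%.2f recall=%.2f\n", hits / n, hits / total
    }
  '
}

# True positives sit at ranks 1, 3 and 4; with n = 2 only rank 1 is a hit.
printf '1\t1\n2\t0\n3\t1\n4\t1\n5\t0\n' | nbest_eval 2
```

In the toy data one of the two best candidates is a true positive, and it is one of three true positives overall.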