GETTING STARTED WITH UCS/Perl

This introduction is intended to make you familiar with UCS/Perl, which is the core of the UCS system. The UCS/Perl libraries and tools allow you to create, manipulate, filter, sort, and print cooccurrence data sets. A typical application of such cooccurrence data is to serve as raw material for collocation identification. For this purpose, the pair types of a data set are ranked according to statistical association measures. UCS/Perl can be used both for the annotation of association scores and for the ranking process. A graphical evaluation against a gold standard of true collocations can then be performed in the UCS/R part.

If you only want to use the UCS/R evaluation functions, you can turn directly to the UCS/R tutorial script. Change to the System/R/ directory and follow the instructions in the README file there.

Preparing for the Tutorial

The remainder of this introduction is a walk-through of the UCS/Perl command-line tools. Most of their functionality (and some additional features) is also available through a programmer interface in the form of a set of Perl modules. If you want to write your own UCS/Perl programs, you will have to find your own way through the comprehensive documentation. The UCS/Perl command-line tools and several additional example scripts provide a good starting point for your own work. Note that you can easily configure your scripts (so that they have access to the UCS/Perl libraries) with the help of the ucs-config program.

This tutorial assumes that you have already configured the UCS system and installed the command-line utilities in your search path, as described in the main README file. In this case, you can skip the remainder of this section.

Otherwise, you will have to specify full paths to the tools in each of the examples below. For this purpose, it is convenient to define a shell variable $UCS pointing to the System directory of the UCS installation. Execute one of the following lines, depending on whether your shell is bash or tcsh (if you don't know, type echo $SHELL, or simply try both commands).

  export UCS=`ucs-config --base-dir`  # in sh or bash

  setenv UCS `ucs-config --base-dir`  # in tcsh

Having set this shell variable, you can just type $UCS/bin/ucs-add instead of ucs-add to invoke the ucs-add program in the examples below, and similarly for all other command-line programs.

Tutorial Introduction to UCS/Perl

You should now change to a scratch directory (e.g. in your home directory or in the /tmp directory) where we can put the data files created by the examples in the tutorial. These files can be deleted after you have stepped through the examples.

UCS/Perl comes with fairly comprehensive documentation embedded into the modules and programs in POD format. The ucsdoc program provides a convenient interface to this documentation. Simply type

  ucsdoc <ProgramName>

  ucsdoc <ModuleName>

to read the respective manual page. The starting point for all UCS/Perl documentation is the ucsintro document:

  ucsdoc ucsintro

If you have installed Perl/Tk and the Tk::Pod module, you can also view the manpages in a GUI window:

  ucsdoc -tk ucsintro

Of course, ucsdoc ucsdoc will tell you more about the ucsdoc program and its options. If you prefer paper documentation, you can print the entire UCS/Perl documentation, using one of the additional UCS/Perl scripts provided in the contrib/ directory. Such ``contributed'' scripts can easily be invoked with the ucs-tool program:

  ucs-tool print-documentation --collate UCS-Perl-Doc

This command will create a PostScript file UCS-Perl-Doc.ps in the current directory, which you may delete after printing. In case of any problems you should omit --collate, so that the individual manpages will be saved to separate files UCS-Perl-Doc-001.ps, UCS-Perl-Doc-002.ps, etc. (You can also convert documentation into LaTeX format with the --latex option.)

First of all, you need to understand the UCS data set file format. You should read the ucsfile manpage carefully now (ucsdoc ucsfile). The UCS distribution includes the following example data sets for your first experiments:

dickens.ds.gz: adjective + noun cooccurrences from a corpus of novels by Charles Dickens (3.4 million words)
fr-pnv.ds.gz: German PP+verb cooccurrences from the Frankfurter Rundschau corpus (40 million words)
glaw.ds.gz: German adjective+noun cooccurrences from a small corpus of freely available law texts (< 1 million words), with manual annotation of ``usual combinations''

You will find these data sets in the DataSet/Distrib/ directory. UCS data set files have the form of statistical tables, with rows corresponding to pair types and columns to variables. They are stored in a simple text format which is compatible with the R environment. Data set files are usually compressed with gzip to save space and carry the filename extension .ds.gz. Because direct viewing of data set files (e.g. with zmore) is inconvenient, UCS/Perl provides the ucs-info and ucs-print programs for inspecting them.
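To make the table layout concrete, here is a minimal Python sketch that reads a gzip-compressed table of this kind. It is not part of UCS. It assumes TAB-separated columns, a header row of variable names, and comment lines starting with #; these details and the sample rows are assumptions for illustration, and the ucsfile manpage remains the authoritative description of the format.

```python
import csv
import gzip
import io

def read_table(stream):
    """Parse a TAB-separated table with a header row into a list of dicts.

    Assumption: lines starting with '#' are comment lines and are skipped.
    """
    lines = (line for line in stream if not line.startswith("#"))
    return list(csv.DictReader(lines, delimiter="\t"))

# Made-up sample data, compressed in memory to mimic a .ds.gz file.
sample = (
    "# made-up example, not taken from a real data set\n"
    "l1\tl2\tf\tf1\tf2\tN\n"
    "old\tman\t12\t40\t30\t1000\n"
    "young\tlady\t8\t25\t20\t1000\n"
)
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as fh:
    rows = read_table(fh)

# Each row is one pair type; each key is one variable (all values are strings here).
for row in rows:
    print(row["l1"], row["l2"], row["f"])
```

For real work, ucs-info and ucs-print are the more convenient way to inspect a data set.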

ucs-info displays information from the header of a data set file. Try:

  ucs-info fr-pnv.ds.gz

  ucs-info glaw.ds.gz

Because these data sets are stored in the global data set directory (or, more precisely, in one of its subdirectories), it is sufficient to enter the name of the data set file without a full path. If no file with the specified name is found in the current directory, the UCS/Perl programs will automatically search the global data set directory for a matching filename. If the data set header does not show its size (i.e. the number of rows in the table) or you do not trust it, you can check the actual size of the data set with the -s option.

  ucs-info -v -s fr-pnv.ds.gz

(The -v option keeps you entertained while the data set is being read.) You can also display a list of all variables defined in the data set with the -l option.

  ucs-info -l fr-pnv.ds.gz

  ucs-info -l glaw.ds.gz

Compare these listings with the documentation in ucsfile. Also note how an explanatory comment is displayed with the user-defined variable n.accept in glaw.ds.gz.

ucs-print formats a data set file as an ASCII table suitable for viewing and printing. It is most useful with the -i option, which sends the formatted table to a pager for interactive viewing (you should install the Term::ReadKey module for optimal results).

  ucs-print -i dickens.ds.gz

  ucs-print -i glaw.ds.gz

You should now be able to page through the data set file by pressing SPACE (one page forward) and BACKSPACE (one page backward). The ucs-print utility has several other options. Like all other UCS/Perl programs, it will display a short usage reminder when called with the -h option:

  ucs-print -h

Enter ucsdoc ucs-print to see the full manual page.

The ucs-summarize program computes statistical summaries for numerical variables, e.g. for the cooccurrence frequency f:

  ucs-summarize -v f FROM dickens.ds.gz

or simply leave out the variable name(s) to compute summaries for all data set variables.

  ucs-summarize -v dickens.ds.gz

Again, check the manual page for additional options and detailed information.

Now that you are familiar with the data set file format, let us manipulate the data sets. The ucs-sort utility changes the order of the rows in a data set by sorting on one or more variables.

  ucs-sort -v dickens.ds.gz BY f- INTO sorted.ds.gz

This sorts the Dickens data set by cooccurrence frequency (decreasing) and creates a new data set file sorted.ds.gz in the current directory. The - character after the variable name f selects decreasing sort order. Without an explicit + or -, the sort order is automatically chosen. When you display the sorted data set, you will notice that there are many ties, i.e. pair types with the same cooccurrence frequency.

  ucs-print -i sorted.ds.gz

You can break such ties randomly with the -r option

  ucs-sort -v -r dickens.ds.gz BY f- INTO sorted.ds.gz
  ucs-print -i sorted.ds.gz

or alphabetically by specifying additional sort keys. In this example, we sort first on the noun, then the adjective:

  ucs-sort -v dickens.ds.gz BY f- l2+ l1+ INTO sorted.ds.gz
  ucs-print -i sorted.ds.gz
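The two ways of breaking ties shown above can be pictured with a small Python sketch. It is only an illustration with made-up rows, not the ucs-sort implementation: the first sort mimics BY f- l2+ l1+, the second mimics the -r option by shuffling before a stable sort.

```python
import random

# Made-up rows: l1 = adjective, l2 = noun, f = cooccurrence frequency.
rows = [
    {"l1": "old", "l2": "man", "f": 5},
    {"l1": "young", "l2": "lady", "f": 7},
    {"l1": "little", "l2": "man", "f": 5},
    {"l1": "dear", "l2": "boy", "f": 5},
]

# Like "BY f- l2+ l1+": decreasing f, ties broken by noun, then by adjective.
by_keys = sorted(rows, key=lambda r: (-r["f"], r["l2"], r["l1"]))

# Like "-r ... BY f-": decreasing f, ties broken in random order.
rng = random.Random(42)  # fixed seed only to make the sketch repeatable
shuffled = rows[:]
rng.shuffle(shuffled)
# Python's sort is stable, so tied rows keep their shuffled order.
by_random = sorted(shuffled, key=lambda r: -r["f"])

print([(r["f"], r["l2"], r["l1"]) for r in by_keys])
```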

When the INTO clause is omitted, the resulting data set is printed on STDOUT (in the data set file format). This feature often allows us to combine UCS/Perl programs into command pipes without having to save intermediate results into files. Here is a single-line version of the above commands:

  ucs-sort dickens.ds.gz BY f- l2+ l1+ | ucs-print -i

If you just got a SIGPIPE warning, don't worry. That is just because you quit the pager without going through the entire data set, so some of the data printed by ucs-sort was discarded.

The two most important tools are probably ucs-add and ucs-select. The ucs-add program allows you to annotate a data set with association scores, rankings, and other variables. Let us add association scores for two well-known association measures to the Dickens data set:

  ucs-add -v am.t.score am.log.likelihood TO dickens.ds.gz INTO scores.ds.gz
  ucs-print -i scores.ds.gz

By the way: if you don't like the uppercase keywords TO and INTO, you are also allowed to type them in lowercase (to, into) or mixed case (To, Into). The default versions are meant to give a better visual subdivision of the command line.
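For readers who want to see what such scores look like numerically, the following Python sketch computes the t-score and the log-likelihood statistic from a frequency signature (f, f1, f2, N), using the common textbook definitions based on observed and expected contingency tables. It is not part of UCS, and the exact definitions used by ucs-add (e.g. sign conventions or special cases) may differ; consult the UCS documentation for those. The example numbers are made up.

```python
import math

def contingency_tables(f, f1, f2, n):
    """Return observed and expected 2x2 tables as flat lists [11, 12, 21, 22]."""
    observed = [f, f1 - f, f2 - f, n - f1 - f2 + f]
    r1, r2 = f1, n - f1  # row totals
    c1, c2 = f2, n - f2  # column totals
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    return observed, expected

def t_score(f, f1, f2, n):
    """t-score: (O11 - E11) / sqrt(O11)."""
    observed, expected = contingency_tables(f, f1, f2, n)
    return (observed[0] - expected[0]) / math.sqrt(observed[0])

def log_likelihood(f, f1, f2, n):
    """Log-likelihood ratio statistic: 2 * sum of O * log(O / E) over all cells."""
    observed, expected = contingency_tables(f, f1, f2, n)
    # Cells with O = 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Made-up frequency signature: f = 12, f1 = 40, f2 = 30, N = 1000.
print(t_score(12, 40, 30, 1000))
print(log_likelihood(12, 40, 30, 1000))
```

Both measures compare the observed cooccurrence frequency with the frequency expected under independence, but they weight the evidence differently, which is why their rankings need not agree.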

The most ``significant'' cooccurrences are those with the highest association scores. We will now re-sort the data set to put these at the top:

  ucs-sort scores.ds.gz BY am.t.score | ucs-print -i
  ucs-sort scores.ds.gz BY am.log.likelihood | ucs-print -i

(The default sort order for association scores is descending, so we do not have to put an explicit - after the variable name.) Note how the two association measures disagree about which cooccurrences are most significant. The actual differences can be seen more clearly when we add ranks according to each of the association scores to the data set:

  ucs-add -v 'r.%' TO scores.ds.gz INTO ranks.ds.gz

In this example, we have used a UCS wildcard pattern ('r.%') to compute rankings for all available association scores without having to type each one explicitly. Have a look at the ucsexp manpage to learn more about such patterns. We can now directly compare the ranks assigned to each pair type:

  ucs-sort ranks.ds.gz BY am.t.score | ucs-print -i 'r.%' '*' FROM -

Note the use of wildcard patterns to display only some of the variables and to re-order the columns. The special filename - can be used to read from standard input (e.g. in a command pipe) when the FROM clause is mandatory. Read the ucs-add manpage to learn about the many other possibilities it offers.
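As a rough picture of what a ranking variable contains, the Python sketch below assigns ranks by decreasing score. It is an illustration only: the scores are made up, and the treatment of ties (tied scores share the best rank of their group) is an assumption, not a statement about how ucs-add resolves ties.

```python
def ranks_descending(scores):
    """Map each key to its rank (1 = highest score); tied scores share a rank."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranks = {}
    previous_score, previous_rank = None, 0
    for position, (key, score) in enumerate(ordered, start=1):
        if position > 1 and score == previous_score:
            ranks[key] = previous_rank  # same score as the row before: same rank
        else:
            ranks[key] = position
        previous_score, previous_rank = score, ranks[key]
    return ranks

# Made-up scores for four pair types under two association measures.
t_scores = {"old man": 3.0, "young lady": 2.5, "dear boy": 2.5, "little man": 1.0}
ll_scores = {"old man": 10.0, "young lady": 40.0, "dear boy": 5.0, "little man": 20.0}

r_t = ranks_descending(t_scores)
r_ll = ranks_descending(ll_scores)

# Side-by-side comparison of the two rankings.
for pair in t_scores:
    print(pair, r_t[pair], r_ll[pair])
```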

The ucs-select command is used to select rows and/or columns from a data set, or to count rows that satisfy a specified condition. If you are just interested in the rankings, you can select the two relevant variables and save them to a new data set file or display them directly with ucs-print.

  ucs-select 'r.%' FROM ranks.ds.gz | ucs-print -i

This actually has the same effect as

  ucs-print -i 'r.%' FROM ranks.ds.gz

As the next step, let us count the number of pair types with cooccurrence frequency >= 10. This condition is specified in the form of a UCS expression on the command line.

  ucs-select -v --count FROM ranks.ds.gz WHERE '%f% >= 10'

A UCS expression is simply a snippet of Perl code (which is compiled and executed on the fly) with a special syntax to access data set variables. In the example above, %f% is set to the respective value of the f variable as the expression is applied to each row of the data set. UCS expressions are one of the most important elements of UCS/Perl - study the ucsexp manpage carefully now.
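The idea of rewriting %name% references into row lookups and compiling the result on the fly can be mimicked in a few lines of Python. This is a loose analogy under stated assumptions, not the UCS implementation: Python stands in for Perl, so it only handles conditions whose syntax happens to be valid in both languages, and the helper name compile_condition and the sample rows are invented for the example.

```python
import re

# %name% references; names may contain letters, digits, '.', and '_'.
VARIABLE = re.compile(r"%([A-Za-z0-9_.]+)%")

def compile_condition(expression):
    """Turn a condition such as '%f% >= 10' into a Python function of a row.

    Each %name% reference is rewritten to the lookup row['name'], then the
    resulting source text is compiled once and evaluated for every row.
    """
    source = VARIABLE.sub(lambda m: "row[{!r}]".format(m.group(1)), expression)
    code = compile(source, "<condition>", "eval")
    return lambda row: eval(code, {}, {"row": row})

# Made-up rows of a data set.
rows = [
    {"l1": "old", "l2": "man", "f": 12},
    {"l1": "young", "l2": "lady", "f": 9},
    {"l1": "dear", "l2": "boy", "f": 10},
]

at_least_10 = compile_condition("%f% >= 10")
count = sum(1 for row in rows if at_least_10(row))
print(count)
```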

Another simple example counts the number of pair types which are among the 500 highest-scoring pairs according to both measures.

  ucs-select -v --count FROM ranks.ds.gz 
             WHERE 'max(%r.t.score%, %r.log.likelihood%) <= 500'

(Of course, this command has to be entered as a single line in the shell.) The built-in utility function max() is automatically available in UCS expressions (cf. the UCS::Expression::Func manpage). We can also save all rows that satisfy this condition to a new data set, selecting all columns with the % wildcard.

  ucs-select -v '%' FROM ranks.ds.gz INTO highest.ds.gz
             WHERE 'max(%r.t.score%, %r.log.likelihood%) <= 500'

  ucs-info -l highest.ds.gz
  ucs-print -i highest.ds.gz
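The max() condition used above keeps exactly those pair types that appear in both N-best lists, since the larger of the two ranks is at most N only when both ranks are. The following Python sketch checks this equivalence on made-up ranks (with N = 2 instead of 500); it is an illustration, not UCS code.

```python
def n_best(ranks, n):
    """Return the set of pair types whose rank is at most n."""
    return {pair for pair, rank in ranks.items() if rank <= n}

# Made-up ranks for four pair types under two measures.
r_t_score = {"old man": 1, "young lady": 2, "dear boy": 3, "little man": 4}
r_log_likelihood = {"old man": 3, "young lady": 1, "dear boy": 2, "little man": 4}

n = 2
in_both_lists = n_best(r_t_score, n) & n_best(r_log_likelihood, n)
by_max_rank = {
    pair for pair in r_t_score
    if max(r_t_score[pair], r_log_likelihood[pair]) <= n
}

print(sorted(in_both_lists))
print(in_both_lists == by_max_rank)
```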

We can now easily work with this new small data set, or re-sort and view it. Such small subsets extracted from a data set are also suitable for printing. Running ucs-print with the --postscript (or -ps) option creates a PostScript file that can be sent to an appropriate printer:

  ucs-print -v -ps -l -p 50 -o highest.ps highest.ds.gz

You can now preview the result with gv highest.ps. Check the ucs-print manpage for an explanation of the options used in the example above.

Thanks to the use of UCS expressions, ucs-select has the full power of Perl, with access to all built-in functions (perldoc perlfunc) and the complete standard library. For example, it is easy to retrieve all collocates of nouns ending in -ness.

  ucs-select -v '*' 'r.%' FROM ranks.ds.gz WHERE '%l2% =~ /ness$/'
             | ucs-sort by l2 l1 | ucs-print -i

It is often useful to store manual annotations (e.g. variables marking true collocations) in separate files. A data set without frequency information (i.e. without the frequency signature f, f1, f2, and N) is called an ``annotation database'' and conventionally has the extension .adb.gz. The UCS distribution includes an annotation database for German PP+verb pairs, which was kindly provided by Brigitte Krenn (ÖFAI, Vienna).

  ucs-info -l pnv.adb.gz
  ucs-print -i pnv.adb.gz

We can easily find out the number of pair types that were identified as collocations with the ucs-select command:

  ucs-select -v --count FROM pnv.adb.gz WHERE '%b.figur%'
  ucs-select -v --count FROM pnv.adb.gz WHERE '%b.fvg%'

In order to use these annotations with cooccurrence data extracted from a corpus, the annotation attributes have to be transferred to a data set file. This is achieved with the ucs-join program. Simply calling ucs-join with the two files as arguments will check the coverage of the annotation database:

  ucs-join -v fr-pnv.ds.gz pnv.adb.gz

We can now copy the b.figur and b.fvg attributes to the data set fr-pnv.ds.gz, and save the result into a new data set file.

  ucs-join -v fr-pnv.ds.gz WITH b.figur b.fvg FROM pnv.adb.gz 
              INTO fr-annotated.ds.gz

  ucs-info -l fr-annotated.ds.gz

If any of the pair types are not covered by the annotation database, they will be annotated with missing values (NA).
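Conceptually, this step behaves like a left join of the data set with the annotation database. The Python sketch below illustrates the idea on made-up rows; it is not the ucs-join implementation. It assumes that rows are matched on the pair type (l1, l2), and it uses None to stand for the missing value NA.

```python
def join_annotations(data_rows, annotation_rows, variables):
    """Copy the given variables from annotation rows to data rows.

    Rows are matched on (l1, l2); pair types without a match get None (NA).
    """
    index = {(a["l1"], a["l2"]): a for a in annotation_rows}
    joined = []
    for row in data_rows:
        match = index.get((row["l1"], row["l2"]))
        extra = {v: (match[v] if match is not None else None) for v in variables}
        joined.append({**row, **extra})
    return joined

# Made-up data set rows and annotation rows (values are illustrative only).
data = [
    {"l1": "in Kraft", "l2": "treten", "f": 20},
    {"l1": "auf Eis", "l2": "legen", "f": 7},
    {"l1": "im Haus", "l2": "bleiben", "f": 3},
]
annotations = [
    {"l1": "in Kraft", "l2": "treten", "b.figur": 0, "b.fvg": 1},
    {"l1": "auf Eis", "l2": "legen", "b.figur": 1, "b.fvg": 0},
]

joined = join_annotations(data, annotations, ["b.figur", "b.fvg"])
covered = sum(1 for row in joined if row["b.fvg"] is not None)
print(covered, "of", len(joined), "pair types are covered")
```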

Once we have added association scores and rankings to the data set, we can easily compute the precision and recall of N-best lists (i.e. the N highest-ranked pairs according to some association measure). Note how the -m option of ucs-add allows us to write back the modified data set to the same file:

  ucs-add -v -m am.log.likelihood r.log.likelihood 
                TO fr-annotated.ds.gz INTO fr-annotated.ds.gz

In the PP+verb annotation database, figurative expressions and support-verb constructions are marked separately. However, we want to accept both as true collocations, so the condition for true positives is %b.figur% or %b.fvg%. It would be convenient to have a single variable marking true positives. We can create such a variable, which we will call b.TP, by evaluating a user-defined UCS expression with the ucs-add program.

  ucs-add -v -m 'b.TP := %b.figur% or %b.fvg%' 
                TO fr-annotated.ds.gz INTO fr-annotated.ds.gz

Now it is easy to evaluate the N-best lists against all true positives:

  ucs-select -v --count FROM fr-annotated.ds.gz 
                WHERE '%b.TP% and %r.log.likelihood% <= 500'
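The count returned by this command is the number of true positives in the N-best list. Dividing it by N gives the precision, and dividing it by the total number of true positives in the data set gives the recall. The Python sketch below spells out this arithmetic on made-up rows; the function name is invented for the example, and missing annotations (None) are simply counted as negatives, which is an assumption.

```python
def n_best_precision_recall(rows, rank_var, tp_var, n):
    """Return (precision, recall) of the n highest-ranked rows."""
    n_best = [r for r in rows if r[rank_var] <= n]
    hits = sum(1 for r in n_best if r[tp_var])       # true positives in the list
    total_tp = sum(1 for r in rows if r[tp_var])     # true positives overall
    precision = hits / len(n_best) if n_best else 0.0
    recall = hits / total_tp if total_tp else 0.0
    return precision, recall

# Made-up rows: rank under log-likelihood and true-positive marker.
rows = [
    {"r.log.likelihood": 1, "b.TP": 1},
    {"r.log.likelihood": 2, "b.TP": 0},
    {"r.log.likelihood": 3, "b.TP": 1},
    {"r.log.likelihood": 4, "b.TP": None},  # not covered by the annotation
    {"r.log.likelihood": 5, "b.TP": 1},
]

precision, recall = n_best_precision_recall(rows, "r.log.likelihood", "b.TP", 2)
print(precision, recall)
```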

You can also create your own data sets for relational cooccurrences, with the help of the ucs-make-tables program. For relational cooccurrences, each pair token (= instance) represents a structural relation between words (or other morpho-syntactic units). Examples are adjectives modifying nouns (as in the Dickens and GLAW data sets) or PPs that are P-objects or adjuncts of a verb (as in the FR-PNV data set). Positional cooccurrences (words occurring in the same sentences or within a certain distance from each other) are more difficult to count properly and you will have to construct such data sets on your own.

ucs-make-tables takes its input - which is a stream of pair tokens - from an extraction tool that the user has to provide. Each line of this stream represents a pair token and has the format

  <l1> TAB <l2>

where <l1> is the type (= lexeme) of the first component of the pair token, and <l2> is the type of its second component. The extraction tool should print the token stream on standard output so that it can be connected to ucs-make-tables through a pipe:

  <YourExtractionTool> | ucs-make-tables -v <dataset.ds.gz>

Type ucsdoc ucs-make-tables to learn about the available command-line options.
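To illustrate what has to be computed from such a token stream, the following Python sketch derives a frequency signature (f, f1, f2, N) for every pair type from lines in the <l1> TAB <l2> format. It is a simplified picture under the assumption that marginal frequencies are counted over the same token stream, and it is not the ucs-make-tables implementation (which offers further options, such as frequency thresholds). The tokens are made up.

```python
from collections import Counter

def frequency_signatures(token_lines):
    """Map each pair type (l1, l2) to its frequency signature (f, f1, f2, N)."""
    pair_counts = Counter()    # f:  tokens of the pair type
    first_counts = Counter()   # f1: tokens with this first component
    second_counts = Counter()  # f2: tokens with this second component
    for line in token_lines:
        l1, l2 = line.rstrip("\n").split("\t")
        pair_counts[(l1, l2)] += 1
        first_counts[l1] += 1
        second_counts[l2] += 1
    n = sum(pair_counts.values())  # N: total number of pair tokens
    return {
        pair: (f, first_counts[pair[0]], second_counts[pair[1]], n)
        for pair, f in pair_counts.items()
    }

# Made-up pair token stream, one token per line.
tokens = ["old\tman\n", "old\tman\n", "old\tlady\n", "young\tman\n"]

signatures = frequency_signatures(tokens)
for (l1, l2), signature in sorted(signatures.items()):
    print(l1, l2, *signature)
```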

The UCS/Perl distribution includes example scripts in the System/Perl/contrib/ directory tree that extract cooccurrence data from a corpus encoded in the IMS Corpus Workbench (CWB). [ http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ ]

If you have installed the CWB, the CWB/Perl interface modules, and the demonstration corpus provided with the CWB (DICKENS), you can re-create the Dickens data set with the following commands:

  ucs-tool adj-n-from-cwb penn DICKENS 
        | ucs-make-tables -v -f 3 my-dickens.ds.gz

  ucs-info -l my-dickens.ds.gz

You can also import data sets from the Ngram Statistics Package (NSP) [ http://ngram.sourceforge.net/ ]. For instance, if you have a file named bigrams.cnt that was created with NSP's count.pl tool, the following command converts it into a UCS data set:

  ucs-tool nsp2ucs -v bigrams.cnt bigrams.ds.gz
  ucs-info -l bigrams.ds.gz

Note that there are usually no manual pages for such ``contributed'' scripts. Run them with the option -h for a short description of their purpose and usage information:

  ucs-tool adj-n-from-cwb -h
  ucs-tool segment-from-cwb -h
  ucs-tool nsp2ucs -h
  ucs-tool make-dummy-ds -h
  ucs-tool count-collocates -h
  ucs-tool dispersion-test -h

(When a manual page is available, it can be displayed with the --doc option, e.g. ucs-tool --doc nsp2ucs.) You can list all contributed scripts with

  ucs-tool --list

or all scripts that import data sets from external programs with

  ucs-tool --list --category=Import