This introduction is intended to make you familiar with UCS/Perl, which is the core of the UCS system. The UCS/Perl libraries and tools allow you to create, manipulate, filter, sort, and print cooccurrence data sets. A typical application of such cooccurrence data is to serve as raw material for collocation identification. For this purpose, the pair types of a data set are ranked according to statistical association measures. UCS/Perl can be used both for the annotation of association scores and for the ranking process. A graphical evaluation against a gold standard of true collocations can then be performed in the UCS/R part.
If you only want to use the UCS/R evaluation functions, you can turn directly to the UCS/R tutorial script. Change to the System/R/ directory and follow the instructions in the README file there.
The remainder of this section is a walk-through of the UCS/Perl command-line tools. Most of their functionality (and some additional stuff) is also available through a programmer interface in the form of a set of Perl modules. If you want to write your own UCS/Perl programs, you will have to find your own way through the comprehensive documentation. The UCS/Perl command-line tools and several additional example scripts provide a good starting point for your own work. Note that you can easily configure your scripts (so that they have access to the UCS/Perl libraries) with the help of the ucs-config program.
This tutorial assumes that you have already configured the UCS system and installed the command-line utilities in your search path, as described in the main README file. In this case, you can skip the remainder of this section.
Otherwise, you will have to specify full paths to the tools in each of the examples below. For this purpose, it is convenient to define a shell variable $UCS pointing to the System directory of the UCS installation. Execute one of the following lines, depending on whether your shell is bash or tcsh (if you don't know, type echo $SHELL, or simply try both commands).
export UCS=`ucs-config --base-dir` # in sh or bash
setenv UCS `ucs-config --base-dir` # in tcsh
Having set this shell variable, you can just type $UCS/bin/ucs-add instead of ucs-add to invoke the ucs-add program in the examples below, and similarly for all other command-line programs.
You should now change to a scratch directory (e.g. in your home directory or in the /tmp directory) where we can put the data files created by the examples in the tutorial. These files can be deleted after you have stepped through the examples.
UCS/Perl comes with fairly comprehensive documentation embedded into the modules and programs in POD format. The ucsdoc program provides a convenient interface to this documentation. Simply type
ucsdoc <ProgramName>
or
ucsdoc <ModuleName>
to read the respective manual page. The starting point for all UCS/Perl documentation is the ucsintro document:
ucsdoc ucsintro
When you have installed Perl/Tk and the Tk::Pod module, you can also view the manpages in a GUI window:
ucsdoc -tk ucsintro
Of course, ucsdoc ucsdoc will tell you more about the ucsdoc program and its options. If you prefer paper documentation, you can print the entire UCS/Perl documentation, using one of the additional UCS/Perl scripts provided in the contrib/ directory. Such ``contributed'' scripts can easily be invoked with the ucs-tool program:
ucs-tool print-documentation --collate UCS-Perl-Doc
This command will create a PostScript file UCS-Perl-Doc.ps in the current directory, which you may delete after printing. In case of any problems you should omit --collate, so that the individual manpages will be saved to separate files UCS-Perl-Doc-001.ps, UCS-Perl-Doc-002.ps, etc. (You can also convert documentation into LaTeX format with the --latex option.)
First of all, you need to understand the UCS data set file format. You should read the ucsfile manpage carefully now (ucsdoc ucsfile). The UCS distribution includes the following example data sets for your first experiments:
dickens.ds.gz
fr-pnv.ds.gz
glaw.ds.gz
You will find these data sets in the DataSet/Distrib/ directory. UCS data set files have the form of statistical tables, with rows corresponding to pair types and columns to variables. They are stored in a simple text format which is compatible with the R environment. Data set files are usually compressed with gzip to save space and carry the filename extension .ds.gz. Direct viewing of data set files (e.g. with zmore) is inconvenient. For this purpose, UCS/Perl provides the ucs-info and ucs-print programs.
ucs-info displays information from the header of a data set file. Try:
ucs-info fr-pnv.ds.gz
ucs-info glaw.ds.gz
Because these data sets are stored in the global data set directory (or, more precisely, in one of its subdirectories), it is sufficient to enter the name of the data set file without a full path. If no file with the specified name is found in the current directory, the UCS/Perl programs will automatically search the global data set directory for a matching filename. If the data set header does not show its size (i.e. the number of rows in the table) or you do not trust it, you can check the actual size of the data set with the -s option.
ucs-info -v -s fr-pnv.ds.gz
(The -v option keeps you entertained while the data set is being read.) You can also display a list of all variables defined in the data set with the -l option.
ucs-info -l fr-pnv.ds.gz
ucs-info -l glaw.ds.gz
Compare these listings with the documentation in ucsfile. Also note how an explanatory comment is displayed with the user-defined variable n.accept in glaw.ds.gz.
ucs-print formats a data set file as an ASCII table suitable for viewing and printing. It is most useful with the -i option, which sends the formatted table to a pager for interactive viewing (you should install the Term::ReadKey module for optimal results).
ucs-print -i dickens.ds.gz
ucs-print -i glaw.ds.gz
You should now be able to page through the data set file by pressing SPACE (one page forward) and BACKSPACE (one page backward). The ucs-print utility has several other options. Like all other UCS/Perl programs, it will display a short usage reminder when called with the -h option:
ucs-print -h
Enter ucsdoc ucs-print to see the full manual page.
The ucs-summarize program computes statistical summaries for numerical variables, e.g. for the cooccurrence frequency f:
ucs-summarize -v f FROM dickens.ds.gz
or simply leave out the variable name(s) to compute summaries for all data set variables.
ucs-summarize -v dickens.ds.gz
Again, check the manual page for additional options and detailed information.
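For readers who want to see what such a summary amounts to, here is a minimal Python sketch that is independent of UCS. The choice of statistics (minimum, median, mean, maximum) and the sample frequencies are assumptions made for illustration only; the ucs-summarize manpage documents what the tool actually reports.

```python
import statistics

def summarize(values):
    """Return a small summary of a list of numbers.

    The selection of statistics is an illustrative assumption,
    not a description of the ucs-summarize output.
    """
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }

# Hypothetical cooccurrence frequencies, not taken from any UCS data set
f = [1, 1, 2, 3, 8]
print(summarize(f))
```

Frequency data are typically very skewed, which is why the median and the mean of f can lie far apart.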
Now that you are familiar with the data set file format, let us manipulate the data sets. The ucs-sort utility changes the order of the rows in a data set by sorting on one or more variables.
ucs-sort -v dickens.ds.gz BY f- INTO sorted.ds.gz
This sorts the Dickens data set by cooccurrence frequency (decreasing) and creates a new data set file sorted.ds.gz in the current directory. The - character after the variable name f selects decreasing sort order. Without an explicit + or -, the sort order is automatically chosen. When you display the sorted data set, you will notice that there are many ties, i.e. pair types with the same cooccurrence frequency.
ucs-print -i sorted.ds.gz
You can break such ties randomly with the -r option
ucs-sort -v -r dickens.ds.gz BY f- INTO sorted.ds.gz
ucs-print -i sorted.ds.gz
or alphabetically by specifying additional sort keys. In this example, we sort first on the noun, then the adjective:
ucs-sort -v dickens.ds.gz BY f- l2+ l1+ INTO sorted.ds.gz
ucs-print -i sorted.ds.gz
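The effect of several sort keys can be pictured with a short Python sketch that is independent of UCS: rows are ordered by decreasing f, and ties are broken alphabetically by l2 (the noun) and then by l1 (the adjective). The rows are made-up examples, not taken from the Dickens data set.

```python
# Hypothetical rows (l1 = adjective, l2 = noun, f = cooccurrence frequency)
rows = [
    {"l1": "old", "l2": "man", "f": 7},
    {"l1": "young", "l2": "lady", "f": 7},
    {"l1": "old", "l2": "lady", "f": 7},
    {"l1": "great", "l2": "deal", "f": 12},
]

def sort_rows(rows):
    """Sort by f descending, then by l2 ascending, then by l1 ascending."""
    return sorted(rows, key=lambda r: (-r["f"], r["l2"], r["l1"]))

for r in sort_rows(rows):
    print(r["f"], r["l1"], r["l2"])
```

Because every tie in f is resolved by the later keys, the resulting order is fully determined, unlike the random tie-breaking shown before.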
When the INTO clause is omitted, the resulting data set is printed on STDOUT (in the data set file format). This feature often allows us to combine UCS/Perl programs into command pipes without having to save intermediate results into files. Here is a single-line version of the above commands:
ucs-sort dickens.ds.gz BY f- l2+ l1+ | ucs-print -i
If you just got a SIGPIPE warning, don't worry. It only means that you quit the pager without going through the entire data set, so some of the data printed by ucs-sort was discarded.
The two most important tools are probably ucs-add and ucs-select. The ucs-add program allows you to annotate a data set with association scores, rankings, and other variables. Let us add association scores for two well-known association measures to the Dickens data set:
ucs-add -v am.t.score am.log.likelihood TO dickens.ds.gz INTO scores.ds.gz
ucs-print -i scores.ds.gz
By the way: if you don't like the uppercase keywords TO and INTO, you are also allowed to type them in lowercase (to, into) or mixed case (To, Into). The default versions are meant to give a better visual subdivision of the command line.
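To make the two association measures more concrete, the following Python sketch computes a t-score and a log-likelihood statistic from a frequency signature (f, f1, f2, N). It uses the common textbook definitions based on observed and expected contingency tables; whether UCS implements exactly these variants (for instance with respect to sign handling) is not assumed here, so consult the UCS documentation for the authoritative definitions. The function names are hypothetical.

```python
import math

def contingency(f, f1, f2, N):
    """Observed and expected 2x2 tables for a frequency signature (f, f1, f2, N)."""
    observed = [[f, f1 - f], [f2 - f, N - f1 - f2 + f]]
    row_totals = [f1, N - f1]
    col_totals = [f2, N - f2]
    expected = [[r * c / N for c in col_totals] for r in row_totals]
    return observed, expected

def t_score(f, f1, f2, N):
    """Textbook t-score: (O11 - E11) / sqrt(O11); requires f > 0."""
    observed, expected = contingency(f, f1, f2, N)
    return (observed[0][0] - expected[0][0]) / math.sqrt(observed[0][0])

def log_likelihood(f, f1, f2, N):
    """Textbook log-likelihood statistic: 2 * sum of O * ln(O / E) over all cells."""
    observed, expected = contingency(f, f1, f2, N)
    total = 0.0
    for i in range(2):
        for j in range(2):
            o, e = observed[i][j], expected[i][j]
            if o > 0:  # a cell with O = 0 contributes nothing
                total += o * math.log(o / e)
    return 2 * total

# Hypothetical frequency signature, not taken from the Dickens data set
print(t_score(10, 100, 50, 10000), log_likelihood(10, 100, 50, 10000))
```

When the observed frequencies equal the expected ones, both scores are zero, which corresponds to the case of no association.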
The most ``significant'' cooccurrences are those with the highest association scores. We will now re-sort the data set to put these at the top:
ucs-sort scores.ds.gz BY am.t.score | ucs-print -i
ucs-sort scores.ds.gz BY am.log.likelihood | ucs-print -i
(The default sort order for association scores is descending, so we do not have to put an explicit - after the variable name.) Note how the two association measures disagree about which cooccurrences are most significant. The actual differences can be seen more clearly when we add ranks according to each of the association scores to the data set:
ucs-add -v 'r.%' TO scores.ds.gz INTO ranks.ds.gz
In this example, we have used a UCS wildcard pattern ('r.%') to compute rankings for all available association scores without having to type each one explicitly. Have a look at the ucsexp manpage to learn more about such patterns. We can now directly compare the ranks assigned to each pair type:
ucs-sort ranks.ds.gz BY am.t.score | ucs-print -i 'r.%' '*' FROM -
Note the use of wildcard patterns to display only some of the variables and to re-order the columns. The special filename - can be used to read from standard input (e.g. in a command pipe) when the FROM clause is mandatory. Read the ucs-add manpage to learn about the many other possibilities it offers.
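As a rough picture of what a ranking variable contains, this Python sketch turns a list of association scores into ranks, with rank 1 for the highest score. How UCS treats tied scores is not stated in this tutorial; the sketch assumes that tied scores share the smallest rank of their group, which is only one of several possible conventions. The function name and the scores are hypothetical.

```python
def add_ranks(scores):
    """Return 1-based ranks for a list of scores, highest score first.

    Assumption of this sketch: tied scores share the smallest rank
    of their group.
    """
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # remember only the first (best) position of each distinct score
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]

# Hypothetical association scores for four pair types
print(add_ranks([2.5, 9.1, 2.5, 4.0]))
```

Comparing the rank lists produced from two different score columns shows directly where the measures disagree.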
The ucs-select command is used to select rows and/or columns from a data set, or to count rows that satisfy a specified condition. If you are just interested in the rankings, you can select the two relevant variables and save them to a new data set file or display them directly with ucs-print.
ucs-select 'r.%' FROM ranks.ds.gz | ucs-print -i
This actually has the same effect as
ucs-print -i 'r.%' FROM ranks.ds.gz
As the next step, let us count the number of pair types with cooccurrence frequency >= 10. This condition is specified in the form of a UCS expression on the command line.
ucs-select -v --count FROM ranks.ds.gz WHERE '%f% >= 10'
A UCS expression is simply a snippet of Perl code (which is compiled and executed on the fly) with a special syntax to access data set variables. In the example above, %f% is set to the respective value of the f variable as the expression is applied to each row of the data set. UCS expressions are one of the most important elements of UCS/Perl - study the ucsexp manpage carefully now.
Another simple example counts the number of pair types which are among the 500 highest-scoring pairs according to both measures.
ucs-select -v --count FROM ranks.ds.gz WHERE 'max(%r.t.score%, %r.log.likelihood%) <= 500'
(Of course, this command has to be entered as a single line in the shell.) The built-in utility function max() is automatically available in UCS expressions (cf. the UCS::Expression::Func manpage).
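Conceptually, a WHERE condition is a predicate that is applied to every row, and --count reports how many rows satisfy it. The Python sketch below mirrors this idea with made-up rows; it is an analogy only, since UCS expressions are Perl code and are evaluated by UCS/Perl itself. The helper count_where is hypothetical.

```python
# Hypothetical rows with ranks according to two association measures
rows = [
    {"f": 12, "r.t.score": 3, "r.log.likelihood": 700},
    {"f": 4, "r.t.score": 120, "r.log.likelihood": 80},
    {"f": 25, "r.t.score": 1, "r.log.likelihood": 2},
    {"f": 10, "r.t.score": 501, "r.log.likelihood": 499},
]

def count_where(rows, condition):
    """Count the rows for which the predicate `condition` is true."""
    return sum(1 for row in rows if condition(row))

# Analogue of the condition on the cooccurrence frequency
print(count_where(rows, lambda r: r["f"] >= 10))

# The larger of two ranks is <= 500 exactly when both ranks are <= 500
print(count_where(rows, lambda r: max(r["r.t.score"], r["r.log.likelihood"]) <= 500))
```

The second predicate shows why max() is the right function here: requiring the worse of the two ranks to be within the top 500 is the same as requiring both of them to be.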
We can also save all rows that satisfy this condition to a new data set, selecting all columns with the % wildcard.
ucs-select -v '%' FROM ranks.ds.gz INTO highest.ds.gz WHERE 'max(%r.t.score%, %r.log.likelihood%) <= 500'
ucs-info -l highest.ds.gz
ucs-print -i highest.ds.gz
We can now easily work with this new small data set, or re-sort and view it. Such small subsets extracted from a data set are also suitable for printing. Running ucs-print with the --postscript (or -ps) option creates a PostScript file that can be sent to an appropriate printer:
ucs-print -v -ps -l -p 50 -o highest.ps highest.ds.gz
You can now preview the result with gv highest.ps. Check the ucs-print manpage for an explanation of the options used in the example above.
Thanks to the use of UCS expressions, ucs-select has the full power of Perl, with access to all built-in functions (perldoc perlfunc) and the complete standard library. It is easy, for example, to retrieve all collocates of nouns ending in -ness.
ucs-select -v '*' 'r.%' FROM ranks.ds.gz WHERE '%l2% =~ /ness$/' | ucs-sort by l2 l1 | ucs-print -i
It is often useful to store manual annotations (e.g. variables marking true collocations) in separate files. A data set without frequency information (i.e. without the frequency signature f, f1, f2, and N) is called an ``annotation database'' and conventionally has the extension .adb.gz. The UCS distribution includes an annotation database for German PP+verb pairs, which was kindly provided by Brigitte Krenn (ÖFAI, Vienna).
ucs-info -l pnv.adb.gz
ucs-print -i pnv.adb.gz
We can easily find out the number of pair types that were identified as collocations with the ucs-select command:
ucs-select -v --count FROM pnv.adb.gz WHERE '%b.figur%'
ucs-select -v --count FROM pnv.adb.gz WHERE '%b.fvg%'
In order to use these annotations with cooccurrence data extracted from a corpus, the annotation attributes have to be transferred to a data set file. This is achieved with the ucs-join program. Simply calling ucs-join with the two files as arguments will check the coverage of the annotation database:
ucs-join -v fr-pnv.ds.gz pnv.adb.gz
We can now copy the b.figur and b.fvg attributes to the data set fr-pnv.ds.gz, and save the result into a new data set file.
ucs-join -v fr-pnv.ds.gz WITH b.figur b.fvg FROM pnv.adb.gz INTO fr-annotated.ds.gz
ucs-info -l fr-annotated.ds.gz
If any of the pair types are not covered by the annotation database, they will be annotated with missing values (NA).
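The behaviour described here resembles a left join: every row of the data set is kept, and rows without a matching annotation receive a missing value. The Python sketch below illustrates the idea under the assumption that pair types are matched on their (l1, l2) labels; the function join, the sample pairs, and the use of None for NA are all hypothetical and not a description of how ucs-join is implemented.

```python
NA = None  # stand-in for a missing value in this sketch

def join(data_rows, annotations, fields):
    """Copy `fields` from `annotations` (keyed by (l1, l2)) into each data row.

    Rows whose pair type is not covered by the annotations get NA.
    """
    joined = []
    for row in data_rows:
        annotation = annotations.get((row["l1"], row["l2"]), {})
        new_row = dict(row)  # leave the input rows unchanged
        for field in fields:
            new_row[field] = annotation.get(field, NA)
        joined.append(new_row)
    return joined

# Made-up pair types and annotations for illustration
data = [
    {"l1": "in:Frage", "l2": "stellen", "f": 40},
    {"l1": "auf:Tisch", "l2": "legen", "f": 3},
]
annotations = {("in:Frage", "stellen"): {"b.figur": 0, "b.fvg": 1}}

for row in join(data, annotations, ["b.figur", "b.fvg"]):
    print(row)
```

Counting the rows that end up with NA gives the number of pair types not covered by the annotation database.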
Once we have added association scores and rankings to the data set, we can easily compute the precision and recall of N-best lists (i.e. the N highest-ranked pairs according to some association measure). Note how the -m option of ucs-add allows us to write back the modified data set to the same file:
ucs-add -v -m am.log.likelihood r.log.likelihood TO fr-annotated.ds.gz INTO fr-annotated.ds.gz
In the PP+verb annotation database, figurative expressions and support-verb constructions are marked separately. However, we want to accept both as true collocations, so the condition for true positives is %b.figur% or %b.fvg%. It would be convenient to have a single variable marking true positives. We can create such a variable, which we will call b.TP, by evaluating a user-defined UCS expression with the ucs-add program.
ucs-add -v -m 'b.TP := %b.figur% or %b.fvg%' TO fr-annotated.ds.gz INTO fr-annotated.ds.gz
Now it is easy to evaluate the N-best lists against all true positives:
ucs-select -v --count FROM fr-annotated.ds.gz WHERE '%b.TP% and %r.log.likelihood% <= 500'
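The count printed by this command is the number of true positives in the 500-best list. Dividing it by the size of the N-best list gives the precision, and dividing it by the total number of true positives in the data set gives the recall. The following Python sketch spells out this arithmetic on made-up rows; the function name is hypothetical, and the N-best list is taken to be all rows with rank <= N, which is an assumption that matters when ranks are tied.

```python
def n_best_precision_recall(rows, n, rank_key="r.log.likelihood", tp_key="b.TP"):
    """Precision and recall of the n-best list according to `rank_key`."""
    n_best = [row for row in rows if row[rank_key] <= n]
    tp_in_n_best = sum(1 for row in n_best if row[tp_key])
    total_tp = sum(1 for row in rows if row[tp_key])
    precision = tp_in_n_best / len(n_best) if n_best else 0.0
    recall = tp_in_n_best / total_tp if total_tp else 0.0
    return precision, recall

# Made-up data set: five pair types with ranks 1..5 and TP markers
rows = [
    {"r.log.likelihood": 1, "b.TP": 1},
    {"r.log.likelihood": 2, "b.TP": 0},
    {"r.log.likelihood": 3, "b.TP": 1},
    {"r.log.likelihood": 4, "b.TP": 1},
    {"r.log.likelihood": 5, "b.TP": 0},
]

print(n_best_precision_recall(rows, 2))
print(n_best_precision_recall(rows, 4))
```

Increasing N usually raises recall while precision tends to fall, which is the trade-off that an evaluation of N-best lists makes visible.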
You can also create your own data sets for relational cooccurrences, with the help of the ucs-make-tables program. For relational cooccurrences, each pair token (= instance) represents a structural relation between words (or other morpho-syntactic units). Examples are adjectives modifying nouns (as in the Dickens and GLAW data sets) or PPs that are P-objects or adjuncts of a verb (as in the FR-PNV data set). Positional cooccurrences (words occurring in the same sentences or within a certain distance from each other) are more difficult to count properly and you will have to construct such data sets on your own.
ucs-make-tables takes its input - which is a stream of pair tokens - from an extraction tool that the user has to provide. Each line of this stream represents a pair token and has the format
<l1> TAB <l2>
where <l1> is the type (= lexeme) of the first component of the pair token, and <l2> is the type of its second component. The extraction tool should print the token stream on standard output so that it can be connected to ucs-make-tables through a pipe:
<YourExtractionTool> | ucs-make-tables -v <dataset.ds.gz>
Type ucsdoc ucs-make-tables to learn about the available command-line options.
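To illustrate how a stream of pair tokens relates to the frequency signature (f, f1, f2, N), here is a minimal Python sketch that counts tab-separated pair tokens. It follows the usual definitions (f is the frequency of the pair type, f1 and f2 are the marginal frequencies of its first and second component, N is the total number of pair tokens) and is not a description of how ucs-make-tables works internally. The function name and the sample tokens are hypothetical.

```python
from collections import Counter

def frequency_signatures(lines):
    """Map each pair type (l1, l2) to its frequency signature (f, f1, f2, N)."""
    pair_counts = Counter()
    first_counts = Counter()
    second_counts = Counter()
    for line in lines:
        l1, l2 = line.rstrip("\n").split("\t")
        pair_counts[(l1, l2)] += 1
        first_counts[l1] += 1
        second_counts[l2] += 1
    total = sum(pair_counts.values())  # N: number of pair tokens
    return {
        pair: (f, first_counts[pair[0]], second_counts[pair[1]], total)
        for pair, f in pair_counts.items()
    }

# Made-up token stream in the <l1> TAB <l2> format
tokens = ["black\tcat", "black\tcat", "black\tdog", "old\tdog"]
for pair, signature in frequency_signatures(tokens).items():
    print(pair, signature)
```

A real extraction tool would print such lines on standard output, one per pair token, instead of keeping them in a list.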
The UCS/Perl distribution includes example scripts in the System/Perl/contrib/ directory tree that extract cooccurrence data from a corpus encoded in the IMS Corpus Workbench (CWB) [http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/].
When you have installed the CWB, the CWB/Perl interface modules, and the demonstration corpus provided with the CWB (DICKENS), you can re-create the Dickens data set with the following commands:
ucs-tool adj-n-from-cwb penn DICKENS | ucs-make-tables -v -f 3 my-dickens.ds.gz
ucs-info -l my-dickens.ds.gz
You can also import data sets from the Ngram Statistics Package (NSP) [http://ngram.sourceforge.net/]. For instance, if you have a file named bigrams.cnt that was created with NSP's count.pl tool, the following command converts it into a UCS data set:
ucs-tool nsp2ucs -v bigrams.cnt bigrams.ds.gz
ucs-info -l bigrams.ds.gz
Note that there are usually no manual pages for such ``contributed'' scripts. Run them with the option -h for a short description of their purpose and usage information:
ucs-tool adj-n-from-cwb -h
ucs-tool segment-from-cwb -h
ucs-tool nsp2ucs -h
ucs-tool make-dummy-ds -h
ucs-tool count-collocates -h
ucs-tool dispersion-test -h
(When a manual page is available, it can be displayed with the --doc option, e.g. ucs-tool --doc nsp2ucs). You can list all contributed scripts with
ucs-tool --list
or all scripts that import data sets from external programs with
ucs-tool --list --category=Import