ucs-join - Join rows and variables from two UCS data sets


  ucs-join [--match-on var1,var2,...] data1.ds.gz data2.ds.gz

  ucs-join [--add] [--update] [--multiple] [-m var1,var2,...]
           data1.ds.gz data2.ds.gz INTO new.ds.gz

  ucs-join [--add] [--update] [--multiple] [-m var1,var2,...]
           data1.ds.gz WITH am.% FROM data2.ds.gz INTO new.ds.gz


This program can be invoked in three different ways. The short form

  ucs-join  [-v] [-m <var>,...]  <ds1>  <ds2>

compares two data sets <ds1> and <ds2>. In particular, the number of rows common to both data sets and the numbers of rows unique to either one of the data sets are reported. Rows are matched on the pair types they represent, i.e. the variables l1 and l2. Differences in the id value or any other annotations are ignored. The coverage is the proportion of pair types in <ds1> that are also contained in <ds2>.

With the --verbose (or -v) switch, some progress information is displayed while the program is running. The --match-on (or -m) flag specifies a comma-separated list of variables to use for matching rows (instead of l1 and l2). Note that the combination of their values must be unique for every row within each data set.
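The comparison described above can be pictured with a small Python sketch. It is not the ucs-join implementation: rows are modelled as plain dictionaries, the function names key_index and compare are hypothetical, and raising an error on a duplicate key is an assumption about how a violation of the uniqueness requirement might be handled.

```python
def key_index(rows, match_on=("l1", "l2")):
    """Map each key tuple to its row; reject duplicate keys (assumed behaviour)."""
    index = {}
    for row in rows:
        key = tuple(row[var] for var in match_on)
        if key in index:
            raise ValueError(f"duplicate key {key!r}")
        index[key] = row
    return index


def compare(ds1, ds2, match_on=("l1", "l2")):
    """Count common and unique rows; only the match variables are compared."""
    keys1 = set(key_index(ds1, match_on))
    keys2 = set(key_index(ds2, match_on))
    common = keys1 & keys2
    return {
        "common": len(common),
        "only_ds1": len(keys1 - keys2),
        "only_ds2": len(keys2 - keys1),
        # Coverage: proportion of pair types in ds1 that also occur in ds2.
        "coverage": len(common) / len(keys1) if keys1 else 0.0,
    }
```

Because only the match variables enter the key, rows that differ in id or in any other annotation still count as common.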

The second form

  ucs-join  [-v] [--add] [--update] [--multiple] [-m <var>,...]
            <ds1> <ds2> INTO <ds3>

adds variables and/or rows from the data set <ds2> to <ds1>. Rows from the two data sets are matched on the l1 and l2 variables as above, unless this has been changed with the --match-on (or -m) flag. The combination of their values must uniquely identify rows in <ds2>; duplicate rows in <ds1> are allowed only when the --multiple (or -M) option is given.

For matching rows, all variables from <ds2> are added to the annotations in <ds1>. Variables that are common to both data sets are overwritten with the values from <ds2> only when they are undefined (NA) in <ds1>, or when the --update (or -u) option has been given. For backward compatibility, the default setting can be explicitly selected with --no-overwrite (or -n). If --add (or -a) is specified, rows that appear only in <ds2> are added to <ds1> (with all variables that are not defined in <ds2> set to NA). The resulting data set is written to the file <ds3>.
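The overwrite rules are easier to follow in code. The following Python sketch models them under stated assumptions and is not the actual ucs-join code: rows are dictionaries, NA is represented by None, the function name join is hypothetical, the --multiple option is not modelled, and filling new variables with NA in unmatched <ds1> rows is an assumption.

```python
NA = None  # stand-in for the UCS missing value (an assumption of this sketch)


def join(ds1, ds2, match_on=("l1", "l2"), update=False, add=False):
    """Return ds1 extended with variables (and optionally rows) from ds2."""
    def key(row):
        return tuple(row[v] for v in match_on)

    # Keys must uniquely identify rows in ds2.
    lookup = {}
    for row in ds2:
        if key(row) in lookup:
            raise ValueError(f"key {key(row)!r} is not unique in ds2")
        lookup[key(row)] = row

    # Output variables: those of ds1 first, then any new ones from ds2.
    variables = []
    for row in ds1 + ds2:
        for var in row:
            if var not in variables:
                variables.append(var)

    result, seen = [], set()
    for row in ds1:
        merged = {var: row.get(var, NA) for var in variables}
        other = lookup.get(key(row))
        if other is not None:
            seen.add(key(row))
            for var, value in other.items():
                if value is NA:
                    continue  # NA never overwrites an existing annotation
                if merged[var] is NA or update:
                    merged[var] = value
        result.append(merged)

    if add:
        # Rows found only in ds2; variables missing from ds2 become NA.
        for k, other in lookup.items():
            if k not in seen:
                result.append({var: other.get(var, NA) for var in variables})
    return result
```

With update=False, a value already present in <ds1> is kept and only NA cells are filled; update=True corresponds to --update, and add=True to --add.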

The most general form

  ucs-join  [-v] [--add] [--update] [--multiple] [-m <var>,...]
            <ds1> WITH <variables> FROM <ds2> INTO <ds3>

adds selected variables from <ds2> only. <variables> is a whitespace-separated list of variable names and wildcard patterns, which are matched against the variables of <ds2>. Variables can be renamed with specifiers of the form new.name=old.name (of course, wildcard patterns cannot be used here). The --add switch is rarely useful with this form of the ucs-join command.
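As a rough illustration of how such a variable list might be resolved, the Python sketch below expands a specification string against the variables of <ds2>. The function name select_variables is hypothetical, and shell-style matching from Python's fnmatch module is used only as a stand-in: the wildcard syntax that UCS itself accepts may differ.

```python
from fnmatch import fnmatchcase


def select_variables(specs, available):
    """Return (new_name, old_name) pairs selected from `available`."""
    selected = []
    for spec in specs.split():
        if "=" in spec:
            # Rename specifier new.name=old.name; no wildcards allowed here.
            new, old = spec.split("=", 1)
            if old not in available:
                raise KeyError(old)
            selected.append((new, old))
        else:
            # Plain name or pattern (fnmatch syntax is an assumption).
            matches = [v for v in available if fnmatchcase(v, spec)]
            selected.extend((v, v) for v in matches)
    return selected
```

For example, select_variables("b.* quality=b.ok", ["l1", "l2", "b.ok", "b.by", "f"]) selects b.ok and b.by under their own names and b.ok a second time under the new name quality.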


The ucs-join program is often used to add (manual) annotations from an annotation database file (.adb) to a data set, and to update annotation databases. For instance, the UCS distribution includes German PP+verb pairs extracted from the Frankfurter Rundschau corpus (fr-pnv.ds.gz) and an annotation database created by Brigitte Krenn (pnv.adb.gz). In order to check the coverage of the annotation database (i.e., how many of the pair types are already contained in the database), type

  ucs-join -v fr-pnv.ds.gz pnv.adb.gz

This will show a coverage of 100%. Annotations from the database can now be added to the fr-pnv.ds.gz data set (the --update option is only relevant if fr-pnv.ds.gz is already annotated with the relevant variables):

  ucs-join -v --update fr-pnv.ds.gz \
           WITH 'b.*' FROM pnv.adb.gz INTO fr-pnv.annot.ds.gz

When an annotation database contains entries that have not been manually examined so far, these should be annotated with missing values (NA). The database can then be updated from a new file (in the same .adb format, say new-pnv.adb) with the following commands:

  mv pnv.adb.gz pnv.adb.BAK.gz
  ucs-join -v pnv.adb.BAK.gz new-pnv.adb INTO pnv.adb.gz

If the file new-pnv.adb contains additional pair types (that haven't already been entered into the database), you should also specify the --add flag.

Recall that ucs-join will not overwrite existing annotations by default. If you want to correct mistakes in the annotation database, you need to specify the --update option in the command above. Note that missing values (NA) will never overwrite existing annotations in the first data set.


Copyright 2004-2005 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.