ucs-join - Join rows and variables from two UCS data sets
ucs-join [--match-on var1,var2,...] data1.ds.gz data2.ds.gz
ucs-join [--add] [--update] [--multiple] [-m var1,var2,...] data1.ds.gz data2.ds.gz INTO new.ds.gz
ucs-join [--add] [--update] [--multiple] [-m var1,var2,...] data1.ds.gz WITH am.% FROM data2.ds.gz INTO new.ds.gz
This program can be invoked in three different ways. The short form
ucs-join [-v] [-m <var>,...] <ds1> <ds2>
compares two data sets <ds1> and <ds2>. In particular, the number of rows common to both data sets and the numbers of rows unique to either one of the data sets are reported. Rows are matched on the pair types they represent, i.e. the variables l1 and l2. Differences in the id value or any other annotations are ignored. The coverage is the proportion of pair types in <ds1> that are also contained in <ds2>.
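The comparison can be pictured as set arithmetic on the match keys. The following Python sketch is not the ucs-join implementation; it assumes that each data set has already been loaded as a list of dicts (one per row), and the function name compare_datasets as well as the sample rows are hypothetical.

```python
def compare_datasets(ds1, ds2, match_on=("l1", "l2")):
    """Count common and unique rows, keyed on the match_on variables."""
    keys1 = {tuple(row[v] for v in match_on) for row in ds1}
    keys2 = {tuple(row[v] for v in match_on) for row in ds2}
    common = keys1 & keys2
    return {
        "common": len(common),
        "only_ds1": len(keys1 - keys2),
        "only_ds2": len(keys2 - keys1),
        # coverage: share of pair types in ds1 that also occur in ds2
        "coverage": len(common) / len(keys1) if keys1 else 0.0,
    }


ds1 = [
    {"id": 1, "l1": "in", "l2": "gehen"},
    {"id": 2, "l1": "zu", "l2": "kommen"},
]
ds2 = [
    {"id": 7, "l1": "in", "l2": "gehen"},   # same pair type, different id
    {"id": 8, "l1": "auf", "l2": "stehen"},
]
report = compare_datasets(ds1, ds2)
print(report)
```

Because only l1 and l2 enter the key, the differing id values of the first row in each data set do not prevent a match.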
With the --verbose (or -v) switch, some progress information is displayed while the program is running. The --match-on (or -m) flag specifies a comma-separated list of variables to use for matching rows (instead of l1 and l2). Note that the combination of their values must be unique for every row within each data set.
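Before choosing a different set of match variables, it can help to verify that they really identify rows uniquely. This is a small sketch under the same assumption as before (rows as Python dicts); duplicate_keys and the variable pos are hypothetical names used only for illustration.

```python
from collections import Counter


def duplicate_keys(rows, match_on):
    """Return the key combinations that occur in more than one row."""
    counts = Counter(tuple(row[v] for v in match_on) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"l1": "in", "l2": "gehen", "pos": "V"},
    {"l1": "in", "l2": "gehen", "pos": "N"},
    {"l1": "zu", "l2": "kommen", "pos": "V"},
]
dups_default = duplicate_keys(rows, ("l1", "l2"))
dups_with_pos = duplicate_keys(rows, ("l1", "l2", "pos"))
print(dups_default, dups_with_pos)
```

An empty result means the chosen variables satisfy the uniqueness requirement for that data set.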
The second form
ucs-join [-v] [--add] [--update] [--multiple] [-m <var>,...] <ds1> <ds2> INTO <ds3>
adds variables and/or rows from the data set <ds2> to <ds1>. Rows from the two data sets are matched on the l1 and l2 variables as above, unless this has been changed with the --match-on (or -m) flag. The combination of their values must uniquely identify rows in <ds2>, while duplicate rows in <ds1> are allowed in combination with the --multiple (or -M) option.
For matching rows, all variables from <ds2> are added to the annotations in <ds1>. Variables that are common to both data sets are overwritten with the values from <ds2> only when they are undefined (NA) in <ds1>, or when the --update (or -u) option has been given. For backward compatibility, the default setting can be explicitly selected with --no-overwrite (or -n). If --add or -a is specified, rows that appear only in <ds2> are added to <ds1> (with all variables that are not defined in <ds2> set to NA). The resulting data set is written to the file <ds3>.
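The overwrite rules are easier to follow in a small model. The sketch below is an illustration of the behaviour described here, not the actual ucs-join code. It assumes rows are Python dicts, uses None for NA, fills variables that a row lacks with NA, does not model --multiple (duplicate keys in the first data set are simply accepted), and ignores any special treatment of the id variable. The function name join and the variables b.ok and f are hypothetical.

```python
NA = None  # stands in for the UCS missing value NA


def join(ds1, ds2, match_on=("l1", "l2"), update=False, add=False):
    def key(row):
        return tuple(row[v] for v in match_on)

    lookup = {key(row): row for row in ds2}  # keys must be unique in ds2
    # union of all variable names, in order of first appearance
    all_vars = list(dict.fromkeys(v for row in ds1 + ds2 for v in row))
    result, seen = [], set()
    for row in ds1:
        merged = {v: row.get(v, NA) for v in all_vars}
        seen.add(key(row))
        other = lookup.get(key(row))
        if other is not None:
            for var, value in other.items():
                if value is NA:
                    continue  # NA never overwrites an existing value
                if merged[var] is NA or update:
                    merged[var] = value
        result.append(merged)
    if add:  # models --add: append rows found only in ds2
        for row in ds2:
            if key(row) not in seen:
                result.append({v: row.get(v, NA) for v in all_vars})
    return result


ds1 = [
    {"l1": "in", "l2": "gehen", "b.ok": NA, "f": 10},
    {"l1": "zu", "l2": "kommen", "b.ok": "0", "f": 5},
]
ds2 = [
    {"l1": "in", "l2": "gehen", "b.ok": "1"},
    {"l1": "zu", "l2": "kommen", "b.ok": "1"},
    {"l1": "auf", "l2": "stehen", "b.ok": "0"},
]
default = join(ds1, ds2)               # fills NA only
updated = join(ds1, ds2, update=True)  # overwrites existing values too
added = join(ds1, ds2, add=True)       # also appends the third row of ds2
```

In the default case only the first row changes, since its b.ok value was NA. With update=True the second row is overwritten as well, and with add=True the unmatched row from the second data set is appended with f set to NA.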
The most general form
ucs-join [-v] [--add] [--update] [--multiple] [-m <var>,...] <ds1> WITH <variables> FROM <ds2> INTO <ds3>
adds selected variables from <ds2> only. <variables> is a whitespace-separated list of variable names and wildcard patterns, which are matched against the variables of <ds2>. Variables can be renamed with specifiers of the form new.name=old.name (wildcard patterns cannot be used in such specifiers). The --add switch is rarely useful with this form of the ucs-join command.
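How such a variable list might be resolved against the variables of the second data set is sketched below. The exact wildcard syntax is not spelled out in this section, so the sketch assumes, purely for illustration, that both * and % stand for an arbitrary sequence of characters. The function select_variables and the variable names in the example are hypothetical.

```python
import re


def select_variables(specs, available):
    """Map output variable names to source variable names."""
    selected = {}
    for spec in specs.split():
        if "=" in spec:
            # rename specifier: new.name=old.name (no wildcards here)
            new_name, old_name = spec.split("=", 1)
            if old_name in available:
                selected[new_name] = old_name
            continue
        # assumed wildcard syntax: '*' or '%' matches any string
        pattern = "".join(
            ".*" if ch in "*%" else re.escape(ch) for ch in spec
        )
        for var in available:
            if re.fullmatch(pattern, var):
                selected[var] = var
    return selected


available = ["l1", "l2", "b.ok", "b.note", "am.t.score"]
chosen = select_variables("b.* score=am.t.score", available)
print(chosen)
```

On the command line, patterns containing * should be quoted so that the shell does not expand them before ucs-join sees them.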
The ucs-join program is often used to add (manual) annotations from an annotation database file (.adb) to a data set, and to update annotation databases. For instance, the UCS distribution includes German PP+verb pairs extracted from the Frankfurter Rundschau corpus (fr-pnv.ds.gz) and an annotation database created by Brigitte Krenn (pnv.adb.gz). In order to check the coverage of the annotation database (i.e., how many of the pair types are already contained in the database), type
ucs-join -v fr-pnv.ds.gz pnv.adb.gz
This will show a coverage of 100%. Annotations from the database can now be added to the fr-pnv.ds.gz data set (the --update option is only relevant if fr-pnv.ds.gz is already annotated with the relevant variables):
ucs-join -v --update fr-pnv.ds.gz WITH 'b.*' FROM pnv.adb.gz INTO fr-pnv.annot.ds.gz
When an annotation database contains entries that have not been manually examined so far, these should be annotated with missing values (NA). The database can then be updated from a new file (in the same .adb format, say new-pnv.adb) with the following commands:
mv pnv.adb.gz pnv.adb.BAK.gz
ucs-join -v pnv.adb.BAK.gz new-pnv.adb INTO pnv.adb.gz
If the file new-pnv.adb contains additional pair types (that haven't already been entered into the database), you should also specify the --add flag.
Recall that ucs-join will not overwrite existing annotations by default. If you want to correct mistakes in the annotation database, you need to specify the --update option in the command above. Note that missing values (NA) will never overwrite existing annotations in the first data set.
Copyright 2004-2005 Stefan Evert.
This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.