ucs-add - Add variables (association scores) to UCS data set


  ucs-add [-v] [-m] am.t.score am.Fisher.pv TO data.ds.gz INTO new.ds.gz 

  ucs-add [-v] [-m] -x HTest am.%.pv TO data.ds.gz INTO new.ds.gz

  ucs-add [-r] r.% TO data.ds.gz INTO new.ds.gz

  ucs-add [-v] [-m] [-f] '<var> := <expression>' TO data.ds.gz INTO new.ds.gz


This program is used to add variables (association scores, rankings, derived variables, or arbitrary UCS expressions entered on the command line) to a UCS data set. If a variable is already defined in the data set, its values will be overwritten.

The general form of the ucs-add command is

  ucs-add [--verbose | -v] [--memory | -m] [--extra=<list> | -x <list>]
          [--force | -f] <variables> [ TO <input.ds> ] [ INTO <output.ds> ]

where <variables> is a whitespace-separated list of variable specifications (described in the sections below, from Association Scores to User-Defined Expressions). An additional --randomize option is only useful when adding rankings:

  ucs-add [--verbose | -v] [--extra=<list> | -x <list>] [--randomize | -r] 
          <variables> [ TO <input.ds> ] [ INTO <output.ds> ]

The data are read from the file <input.ds>, and the resulting data set with the new annotations is written to the file <output.ds>. When they are not specified, the input and output files default to STDIN and STDOUT, respectively.

Variable specifications and file names may need to be quoted individually (when they contain shell metacharacters or whitespace).

Normally, the ucs-add program processes the data set one row at a time, so that <input.ds> and <output.ds> must not refer to the same file. When --memory (or -m) is specified, the entire data set is read into memory, annotated, and then written back to the output file. In this case, <input.ds> and <output.ds> may be identical. This mode is automatically activated when any rankings are added to the data set.

In both modes of operation, variables are added in the order in which they are given on the command line, so variable specifications (rankings and user-defined expressions) may refer to any of the previously introduced variables.

With the --verbose (or -v) option, some debugging and progress information is displayed while the program is running. The --extra (or -x) option loads additional built-in association measures (see the section on Association Scores below for details).


Association Scores

Variables representing association scores are selected by specifying their variable names (which start with the prefix am.). The names may be given as UCS wildcard patterns (see the ucsexp manpage), which will be matched against the list of all supported association measures. Examples of useful wildcard patterns are am.% (all measures), am.%.pv (all measures that compute probability values), and am.chi.squared.% (all variants of Pearson's chi-squared test).

By default, only the basic association measures defined in UCS::AM are supported. Other AM packages (see the UCS::AM manpage for a list of add-on packages) can be loaded with the --extra (or -x) option. The argument is a comma-separated list of package names (e.g. --extra=HTest,Parametric to load UCS::AM::HTest and UCS::AM::Parametric), which are case-insensitive and may be abbreviated to unique prefixes (so -x htest,par works just as well). Use -x ALL to load all available AM packages.
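
The name matching described above can be illustrated with a short Python sketch. It is not the UCS implementation: the function resolve_packages and the list KNOWN_PACKAGES are hypothetical, and the list contains only the two package names mentioned in this manpage (see the UCS::AM manpage for the real list). Treating ALL as a literal, case-sensitive keyword is also an assumption.

```python
# Hypothetical sketch of prefix matching for --extra package names.
# Only HTest and Parametric are named in this manpage; the real list of
# add-on packages is documented in the UCS::AM manpage.
KNOWN_PACKAGES = ["HTest", "Parametric"]


def resolve_packages(spec, known=KNOWN_PACKAGES):
    """Expand a comma-separated --extra argument into full package names."""
    if spec == "ALL":  # assumption: the keyword is matched literally
        return list(known)
    resolved = []
    for name in spec.split(","):
        key = name.strip().lower()
        # Case-insensitive prefix match against the known package names.
        matches = [p for p in known if p.lower().startswith(key)]
        if len(matches) != 1:
            raise ValueError(f"'{name}' does not identify exactly one AM package")
        resolved.append(matches[0])
    return resolved


print(resolve_packages("htest,par"))  # → ['HTest', 'Parametric']
```

A prefix that matches no package, or more than one, is rejected in this sketch; how ucs-add itself reports such cases is not specified here.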


Rankings

Variables representing association score rankings are selected by specifying their variable names (which start with the prefix r.). In order to compute a ranking, say r.something, the corresponding association scores (am.something) must be annotated in the data set. UCS wildcard patterns are matched against all association scores in the data set (but not against other built-in association measures). Rankings can also be computed for user-defined measures, provided that their association scores are annotated. In order to compute a ranking for a built-in association measure that is not available in the data set, both the association score and the ranking variable must be specified. The example

  ucs-add -m am.% r.% TO data.ds.gz INTO data.ds.gz

adds association scores and rankings for the basic built-in association measures to the data set data.ds.gz.

Ties are not resolved in the rankings, so pair types with identical association scores share the same rank. The rank assigned to such a group of pair types is the lowest free rank (as in the Olympic Games) rather than the average of all ranks in the group (as is often done in statistics). With the --randomize (or -r) option, ties are resolved in a random fashion. When association scores for the random measure are pre-annotated (i.e. the am.random variable is present in the data set), these are used for the randomization so that the ranking is reproducible.
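
The two tie-handling schemes can be made concrete with a minimal Python sketch. It is an illustration only, not the code used by ucs-add: the function names are hypothetical, the sketch assumes that higher association scores receive better (lower) ranks, and the direction in which the secondary key breaks ties is likewise an assumption. The tie_breaker list stands in for pre-annotated am.random values.

```python
def rank_with_ties(scores):
    """Tied scores share the lowest free rank (e.g. 1, 2, 2, 2, 5)."""
    # Indices sorted by descending score (assumption: higher score is better).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for pos, i in enumerate(order, start=1):
        prev = order[pos - 2] if pos > 1 else None
        if prev is not None and scores[prev] == scores[i]:
            ranks[i] = ranks[prev]  # same score as predecessor: share its rank
        else:
            ranks[i] = pos          # new score: take the next free rank
    return ranks


def rank_randomized(scores, tie_breaker):
    """Resolve ties with a secondary key, so every item gets a unique rank."""
    order = sorted(range(len(scores)),
                   key=lambda i: (-scores[i], -tie_breaker[i]))
    ranks = [0] * len(scores)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos
    return ranks


scores = [3.2, 5.0, 3.2, 1.1, 3.2]
tie_breaker = [0.7, 0.1, 0.2, 0.9, 0.5]  # stands in for am.random values

print(rank_with_ties(scores))                 # → [2, 1, 2, 5, 2]
print(rank_randomized(scores, tie_breaker))   # → [2, 1, 4, 5, 3]
```

Because the secondary key is fixed data rather than a fresh random draw, the second ranking comes out the same on every run, which is the point of pre-annotating the random measure.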

Derived Variables

Any variable names or wildcard patterns that do not match one of the built-in association measures are matched against the list of derived variables, which can be computed automatically from the frequency signatures of pair types. See the ucsfile manpage for a complete list of derived variables. Examples of useful patterns are E* (expected frequencies), lp* (logarithmic coordinates), and e b m ((e,b,m)-coordinates).

User-Defined Expressions

A user-defined variable specification is a UCS expression (see the ucsexp manpage) of the form

  <var> := <expression>

where <var> is the name of a user-defined variable, association score, or ranking (without surrounding % characters). This variable is added to the input data set if necessary and set to the values computed by the UCS expression <expression>. The example below computes association scores for a compound measure mixed from the rankings according to two other measures (which must both be annotated in the data set).

  am.mixed := -max(%r.t.score%, %r.dice%)

Note that it isn't possible to compute the corresponding ranking r.mixed directly.

If you want to modify one of the standard or derived variables (l1 l2 f f1 f2 N O11 E11 etc.), you have to specify the --force (or -f) option, since overwriting these variables may create inconsistencies in the data set.
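
As a worked illustration of the am.mixed expression in this section, the following Python sketch evaluates -max(...) for four pair types. The ranking values are made up for the example, and the list names are hypothetical stand-ins for the r.t.score and r.dice variables.

```python
# Hypothetical rankings for four pair types (illustrative values only).
r_t_score = [1, 2, 3, 4]
r_dice = [3, 1, 2, 4]

# am.mixed := -max(%r.t.score%, %r.dice%)
am_mixed = [-max(a, b) for a, b in zip(r_t_score, r_dice)]

print(am_mixed)  # → [-3, -2, -3, -4]
```

Taking the maximum picks the worse of the two ranks for each pair type, and the negation turns it into a score where larger values are better, so the second pair type (worst rank 2) scores highest here.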


Copyright 2004 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.