ucs-merge - Combine cooccurrence counts from multiple (sub)corpora


  ucs-merge [-v] [-f <n>] [-N <size>] part1.ds.gz part2.ds.gz ... [ INTO combined.ds.gz ]


Cooccurrence data sets from large corpora often comprise more than 100 million pair types and cannot be compiled in memory. In this case, it is necessary to compile separate data sets for smaller (sub)corpora and combine them with the ucs-merge program. The input data sets must have been sorted with the --sort (or -s) option of ucs-make-tables; ucs-merge exits with an error message if it detects incorrect ordering.
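
The sorting requirement is what makes merging possible without loading any part into memory. The following Python sketch illustrates the idea on toy data; it is not the ucs-merge implementation. The names checked and stream_merge are hypothetical, each part is represented as an iterable of ((l1, l2), f) items, and the assumed sort key (plain tuple comparison of l1 and l2) need not match the order produced by ucs-make-tables.

```python
import heapq
from itertools import groupby


def checked(rows, name):
    """Pass rows through unchanged, failing on the first out-of-order pair."""
    prev = None
    for key, f in rows:
        if prev is not None and key < prev:
            raise ValueError(f"{name}: pair {key!r} is out of order")
        prev = key
        yield key, f


def stream_merge(named_parts):
    """Merge sorted parts lazily and add up f for identical pair types.

    named_parts is a list of (name, rows) tuples; only one row per part
    needs to be held in memory at any time.
    """
    streams = [checked(rows, name) for name, rows in named_parts]
    merged = heapq.merge(*streams, key=lambda item: item[0])
    for key, group in groupby(merged, key=lambda item: item[0]):
        yield key, sum(f for _, f in group)


part1 = [(("black", "box"), 2), (("black", "cat"), 3)]
part2 = [(("black", "box"), 1), (("red", "cat"), 2)]
combined = list(stream_merge([("part1", part1), ("part2", part2)]))
```

Because both toy parts are sorted, identical pair types arrive next to each other in the merged stream and can be summed on the fly. An unsorted part raises an error as soon as the offending pair is read.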

The input data sets part1.ds.gz, part2.ds.gz, etc. must contain pair types with complete frequency signatures (variables l1 l2 f f1 f2 N). Cooccurrence frequencies are added up over all parts that contain a given pair type, and marginal frequencies over all parts that contain the respective component. All other variables in the input data sets (apart from the dispersion counts described below) are silently discarded.
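
The arithmetic can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the way ucs-merge processes its input: the function merge_parts is hypothetical, each part is a list of dicts standing in for the rows of a data set, and N is assumed to have the same value on every row of a part.

```python
from collections import defaultdict


def merge_parts(parts):
    """Combine frequency signatures (l1 l2 f f1 f2 N) from several parts."""
    f_total = defaultdict(int)   # (l1, l2) -> summed cooccurrence frequency
    f1_total = defaultdict(int)  # l1 -> summed marginal of first component
    f2_total = defaultdict(int)  # l2 -> summed marginal of second component
    n_total = 0

    for rows in parts:
        if not rows:
            continue
        n_total += rows[0]["N"]         # assumed constant within a part
        f1_part, f2_part = {}, {}
        for r in rows:
            f_total[(r["l1"], r["l2"])] += r["f"]
            f1_part[r["l1"]] = r["f1"]  # one marginal per component and part
            f2_part[r["l2"]] = r["f2"]
        for l1, f1 in f1_part.items():
            f1_total[l1] += f1
        for l2, f2 in f2_part.items():
            f2_total[l2] += f2

    return [
        {"l1": l1, "l2": l2, "f": f,
         "f1": f1_total[l1], "f2": f2_total[l2], "N": n_total}
        for (l1, l2), f in sorted(f_total.items())
    ]


part_a = [
    {"l1": "black", "l2": "box", "f": 2, "f1": 5, "f2": 2, "N": 100},
    {"l1": "black", "l2": "cat", "f": 3, "f1": 5, "f2": 3, "N": 100},
]
part_b = [
    {"l1": "black", "l2": "box", "f": 1, "f1": 1, "f2": 1, "N": 50},
    {"l1": "red", "l2": "cat", "f": 2, "f1": 2, "f2": 2, "N": 50},
]
merged = merge_parts([part_a, part_b])
```

In this toy example the pair (black, cat) occurs only in part_a, but its marginal frequencies still include the counts for "black" and "cat" from part_b.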

If the input data sets provide dispersion information (in the n.disp variable), the dispersion counts will also be added up and included in the output. Note that it is an error to mix inputs with and without the n.disp variable.
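
The all-or-nothing rule for n.disp amounts to a simple consistency check, sketched here in Python. The function dispersion_mode is hypothetical, and each part is represented only by the list of its variable names.

```python
def dispersion_mode(part_variables):
    """Return True if every part has n.disp, False if none does.

    Raises ValueError when some parts provide n.disp and others do not.
    """
    flags = ["n.disp" in variables for variables in part_variables]
    if any(flags) and not all(flags):
        raise ValueError("cannot mix inputs with and without n.disp")
    return any(flags)


base = ["l1", "l2", "f", "f1", "f2", "N"]
with_disp = dispersion_mode([base + ["n.disp"], base + ["n.disp"]])
without_disp = dispersion_mode([base, base])
```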

The combined data set is written to the file specified in the INTO clause, or to standard output.

If the --threshold (or -f) option is specified, only pair types whose total cooccurrence frequency is equal to or greater than the requested threshold will be included in the combined data set. Keep in mind that frequency thresholds must not be applied to the input data sets themselves, since a pair type that occurs only once in each part may still have a relatively high total cooccurrence frequency (depending on the number of parts).
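
A small Python sketch with toy counts shows why thresholds belong after the merge rather than before it. The function total_frequencies and its input_threshold parameter are hypothetical and serve only to simulate pre-filtered inputs; each part is a dict mapping a pair type to its cooccurrence frequency.

```python
from collections import Counter


def total_frequencies(parts, input_threshold=1):
    """Add up f per pair type, optionally pre-filtering each part."""
    total = Counter()
    for part in parts:
        for pair, f in part.items():
            if f >= input_threshold:
                total[pair] += f
    return total


# Five parts: ("red", "herring") occurs once in each of them.
parts = [{("red", "herring"): 1} for _ in range(5)]
parts[0][("black", "box")] = 6

threshold = 5
correct = {p for p, f in total_frequencies(parts).items() if f >= threshold}
lossy = {p for p, f in total_frequencies(parts, input_threshold=2).items()
         if f >= threshold}
```

With unfiltered inputs, both pair types reach the threshold of 5. If each part is filtered at f >= 2 beforehand, ("red", "herring") is lost even though its total frequency is 5.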

Sample size is computed by adding up N variables from all parts. If necessary, this value can be overridden with the --size (or -N) option. In order to maintain consistent frequency signatures, the new value should never be smaller than the automatically determined sample size, and ucs-merge will print a warning message if it is.
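
The sample size rule can be sketched in Python as follows. The function resolve_sample_size is hypothetical, and the sketch assumes that an override smaller than the computed value is still used after the warning, which is what the description above suggests but does not state outright.

```python
import warnings


def resolve_sample_size(part_sizes, override=None):
    """Return the sample size N for the combined data set."""
    computed = sum(part_sizes)      # default: add up N from all parts
    if override is None:
        return computed
    if override < computed:
        warnings.warn(
            f"requested sample size {override} is smaller than "
            f"the computed sample size {computed}"
        )
    return override                 # assumption: the override always wins


default_n = resolve_sample_size([100, 50])
larger_n = resolve_sample_size([100, 50], override=200)
```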

With the --verbose (or -v) option, copious progress information is displayed on standard error while the program is running.


Copyright 2004-2010 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.