NAME

ucsfile - The UCS data set file format

INTRODUCTION

UCS data sets are stored in a simple tabular format, similar to that of a statistical table. Each row in the table corresponds to a pair type, and its individual fields (columns) provide various kinds of information about the pair type:

a unique ID number (unique within the data set)
the component lexemes
the pair type's frequency signature
[optional] contingency tables of observed and expected frequencies computed from the frequency signature
[optional] coordinates computed from the frequency signature
association scores and rankings for various association measures
arbitrary user-defined attributes, especially for the manual annotation of true positives in an evaluation study

Following statistical terminology, the table columns are referred to as the variables of a data set (each of which assumes a specific value for each pair type). Columns are separated by a TAB character ("\t"), and the first row lists the variable names as table headings (see the section "VARIABLES" below for naming conventions).
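The layout described above can be illustrated with a minimal Python sketch. It is not part of the UCS distribution: the function name read_table and the sample rows are invented for illustration, and the sketch assumes that any comment header has already been removed from the input.

```python
def read_table(lines):
    """Parse TAB-separated rows; the first row holds the variable names."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    # One dictionary per pair type, keyed by variable name (values stay strings).
    return [dict(zip(header, fields)) for fields in body]

# Hypothetical sample data with the core variables only.
sample = [
    "id\tl1\tl2\tf\tf1\tf2\tN\n",
    "1\tblack\tbox\t12\t120\t80\t10000\n",
    "2\thot dog\tstand\t3\t40\t25\t10000\n",
]
table = read_table(sample)
```

Splitting on TAB only (rather than on arbitrary whitespace) keeps blanks inside string values intact, as in the lexeme "hot dog" above.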

The actual data table may be preceded by an optional header of Perl-style comment lines (beginning with a # character). Lines with the special format

  ##:: <variable> = <value>

define global variables, which may be interpreted by some of the UCS/Perl programs (see the section "GLOBAL VARIABLES" below). The variable name (<variable>) may contain only alphanumeric characters (A-Z a-z 0-9) and the period (.). The value may contain arbitrary characters, including whitespace, but leading and trailing whitespace is ignored. A variable definition must not span multiple lines.
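As a rough sketch of how such definitions might be recognised, the following Python fragment extracts them with a regular expression. The function name parse_globals and the sample header are hypothetical, and the pattern assumes that whitespace around the variable name and the = sign is optional, which the format description does not state explicitly.

```python
import re

# Variable names: alphanumeric characters and periods only (see above).
GLOBAL_RE = re.compile(r"^##::\s*([A-Za-z0-9.]+)\s*=(.*)$")

def parse_globals(header_lines):
    """Collect '##:: <variable> = <value>' definitions from comment lines."""
    result = {}
    for line in header_lines:
        match = GLOBAL_RE.match(line.rstrip("\n"))
        if match:
            # Leading and trailing whitespace of the value is ignored.
            result[match.group(1)] = match.group(2).strip()
    return result

# Hypothetical header; ordinary comment lines are skipped.
header = [
    "# Example header (hypothetical)\n",
    "##:: size = 2\n",
    "##:: b.TP =   manually annotated true positive  \n",
]
globals_found = parse_globals(header)
```

Values are kept as strings here; a caller that needs the size as a number would convert it separately.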

UCS data set files must have the filename extension .ds. They may be compressed with gzip (and they usually are), in which case they carry the extension .ds.gz. UCS library functions will automatically recognise and uncompress data set files with this extension.

Annotation database files, with extension .adb (uncompressed) or .adb.gz (compressed), are a special subtype of data set. Annotation databases omit all frequency information and association scores, listing only component lexemes and user-defined attributes. They are used as repositories of lexical information (such as manually annotated true positives for evaluation purposes) that applies to data sets extracted from different corpora (or with different methods).
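Programs outside the UCS library have to handle the extensions themselves. A minimal Python sketch, assuming UTF-8 as the default encoding (ISO-8859-1 files would need encoding="latin-1"); the function name open_data_set is hypothetical:

```python
import gzip

def open_data_set(path, encoding="utf-8"):
    """Open a .ds/.adb file as text, decompressing .gz variants on the fly."""
    if path.endswith((".ds.gz", ".adb.gz")):
        return gzip.open(path, "rt", encoding=encoding)
    if path.endswith((".ds", ".adb")):
        return open(path, "r", encoding=encoding)
    raise ValueError(f"not a UCS data set file name: {path}")
```

The sketch decides by filename extension only, mirroring the convention above; it does not inspect the file contents.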

GLOBAL VARIABLES

  size        number of pair types in a data set

The only global variable that is currently supported is size, an integer specifying the number of pair types in a data set. Availability of the data set size in the header may give a slight performance improvement when loading data set files into memory. If size is set to an incorrect value, the behaviour of UCS/Perl programs and modules is undefined.

A global variable whose name is identical to that of a variable defined in the data set (i.e. a table column) is interpreted as an explanatory note. Such notes should typically be given for all user-defined variables, and also for user-defined association measures.

Unsupported variables will simply be ignored and will not raise errors or warnings when a data set file is parsed.

DATA TYPES

The UCS system supports four different data types:

  BOOL      a logical (Boolean) value
  INT       a signed integer value (>= 32 bits)
  DOUBLE    a floating-point value (IEEE double precision)
  STRING    an arbitrary string (ISO-8859-1 or UTF-8)

Boolean values are represented by 1 (true) and 0 (false). String values may contain blanks (but no TAB characters) and are neither quoted nor escaped. Full support for Unicode strings (UTF-8) is only available within the UCS/Perl subsystem.

The UCS/R subsystem will interpret Boolean values as logical variables, and strings (except for the component lexemes) as factor variables with a fixed set of levels (which are automatically determined from the data).

User-defined attributes may assume the special value NA for missing values. (Note that the string NA will always be interpreted as a missing value rather than a literal character string!) UCS/R has built-in support for missing values, whereas UCS/Perl represents them by undef entries. Programs that do not support missing values may replace them by 0 (BOOL and INT), 0.0 (DOUBLE), or the empty string "" (STRING).
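The conversion rules for the four data types and the NA convention can be sketched in Python as follows. This is an illustration only, not the UCS implementation: the function name convert is hypothetical, and missing values are mapped to None in the same way that UCS/Perl maps them to undef.

```python
def convert(value, data_type):
    """Convert a raw field to a Python value; 'NA' becomes None."""
    if value == "NA":
        return None  # missing value, regardless of data type
    if data_type == "BOOL":
        if value not in ("0", "1"):
            raise ValueError(f"invalid Boolean value: {value!r}")
        return value == "1"
    if data_type == "INT":
        return int(value)
    if data_type == "DOUBLE":
        return float(value)
    if data_type == "STRING":
        return value  # stored verbatim, neither quoted nor escaped
    raise ValueError(f"unknown data type: {data_type}")
```

Because the NA test comes first, a string variable can never hold the literal text NA, which matches the note above.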

The data type of a variable is uniquely determined by the variable name, as detailed in the section "VARIABLES" below.

VARIABLES

In order to be compatible with the R language, variable names may only contain alphanumeric characters (A-Z a-z 0-9) and periods (.), and they must begin with a letter. The main function of periods is to delimit words in complex variable names, replacing blanks, hyphens, and underscores. UCS variable names are case-sensitive.

Periods are not allowed in Perl variable names, but UCS expressions provide a special syntax for direct access to data set variables (see the ucsexp and UCS::Expression manpages). In the rare case where plain Perl variables are used to store information from a data set, periods should be replaced by underscores (_) in the variable names.

There are strict naming conventions for data set variables, which are detailed in the following subsections. Apart from a fixed list of core variables (whose names do not contain the . character), all variable names begin with a period-separated prefix that determines the data type of the variable.

Core Variables

Core variables represent the minimal amount of information that must be present in a data set file (i.e. evidence for cooccurrences extracted from a corpus). All core variables are mandatory, except in the case of annotation database files (.adb), which omit frequency signatures (f f1 f2 N). For relational cooccurrences, frequency signatures can be computed with the ucs-make-tables utility from a stream of pair tokens (cf. the ucs-make-tables manpage).

  INT    id    a numerical ID value (unique within the data set)
  STRING l1    first component type of the pair
  STRING l2    second component type of the pair

  INT    f     cooccurrence frequency of pair type
  INT    f1    marginal frequency of first component
  INT    f2    marginal frequency of second component
  INT    N     sample size (identical for all pair types)

id is a numerical ID value, which must be unique within a data set. Its intended uses are to identify pair types in subsets selected from a given data set, and to validate line numbers when attributes or association scores are computed by an external program and re-integrated into the data set file.

The lexemes l1 and l2 are the component (word) types that uniquely identify a pair type. Consequently, a data set file must not contain multiple rows with identical l1 and l2 values. UCS/Perl should provide reasonably good support for Unicode strings as lexemes (in UTF-8 encoding), at least when running on Perl version 5.8.0 or newer.
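Both uniqueness constraints (on id and on the pair l1 l2) are easy to check after loading. A minimal Python sketch, assuming the rows are available as dictionaries keyed by variable name; the function name find_duplicates and the sample rows are hypothetical:

```python
from collections import Counter

def find_duplicates(rows):
    """Return duplicated id values and duplicated (l1, l2) pairs."""
    ids = Counter(row["id"] for row in rows)
    pairs = Counter((row["l1"], row["l2"]) for row in rows)
    dup_ids = sorted(key for key, count in ids.items() if count > 1)
    dup_pairs = sorted(key for key, count in pairs.items() if count > 1)
    return dup_ids, dup_pairs

# Hypothetical rows: id 2 and the pair (black, box) each occur twice.
rows = [
    {"id": 1, "l1": "black", "l2": "box"},
    {"id": 2, "l1": "hot", "l2": "dog"},
    {"id": 2, "l1": "black", "l2": "box"},
]
dup_ids, dup_pairs = find_duplicates(rows)
```

A well-formed data set yields two empty lists.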

The quadruple f f1 f2 N is called the frequency signature of a pair type. It contains all the frequency information used by association measures and is equivalent to a contingency table. Note that the sample size N is identical for all pair types in a data set and is included here mainly for convenience (so that association scores can be computed from the row data without reference to a global variable). See Evert (2004) for more information on lexemes and frequency signatures.

Derived Variables

Derived variables can be computed from the frequency signatures of pair types, providing different "views" of the frequency information. Normally, they are not annotated explicitly but are accessible through UCS expressions, which compute the required values automatically (see the ucsexp and UCS::Expression manpages).

  INT    O11   contingency table of observed frequencies
  INT    O12     (computed from frequency signature)
  INT    O21
  INT    O22

  INT    R1    row sums in observed contingency table
  INT    R2
  INT    C1    column sums in observed contingency table
  INT    C2

The variables O11 O12 O21 O22 represent the observed contingency table of a pair type. Note that their frequency information is equivalent to the frequency signature of the pair type. In addition, the row sums (R1 R2) and column sums (C1 C2) of the contingency table are also made available.
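The equivalence can be made concrete with a short Python sketch. It assumes the usual convention (cf. Evert 2004) that rows of the table correspond to the first component and columns to the second, so that R1 = f1 and C1 = f2; the function name observed_table and the example figures are hypothetical.

```python
def observed_table(f, f1, f2, N):
    """Observed contingency table and its marginals from (f, f1, f2, N)."""
    O11 = f                  # both components present
    O12 = f1 - f             # first component without the second
    O21 = f2 - f             # second component without the first
    O22 = N - f1 - f2 + f    # neither component
    return {
        "O11": O11, "O12": O12, "O21": O21, "O22": O22,
        "R1": O11 + O12, "R2": O21 + O22,   # row sums
        "C1": O11 + O21, "C2": O12 + O22,   # column sums
    }

obs = observed_table(f=12, f1=120, f2=80, N=10000)
```

In the other direction, the frequency signature is recovered as f = O11, f1 = R1, f2 = C1, and N = R1 + R2.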

  DOUBLE E11   contingency table of expected frequencies
  DOUBLE E12     under point null hypothesis
  DOUBLE E21     (computed from row and column sums)
  DOUBLE E22

The variables E11 E12 E21 E22 represent the contingency table of expected frequencies, i.e. the expectations of the multinomial sampling distribution under the point null hypothesis of independence. Most association measures compare observed frequencies to expected frequencies in some way.
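Under independence, each expected frequency is the product of the corresponding row and column sums divided by the sample size. The Python sketch below illustrates this, again assuming R1 = f1 and C1 = f2; the function name expected_table and the example figures are hypothetical.

```python
def expected_table(f1, f2, N):
    """Expected frequencies E_ij = R_i * C_j / N under independence."""
    R1, R2 = f1, N - f1   # row sums
    C1, C2 = f2, N - f2   # column sums
    return {
        "E11": R1 * C1 / N, "E12": R1 * C2 / N,
        "E21": R2 * C1 / N, "E22": R2 * C2 / N,
    }

exp = expected_table(f1=120, f2=80, N=10000)
```

Note that the expected frequencies do not depend on the cooccurrence frequency f, and that they sum to N just like the observed frequencies.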

In a geometric interpretation of a data set, each pair type can be interpreted as a point x in a three-dimensional coordinate space P. Since the sample size N is a constant parameter within the data set, the coordinates of x are given by the joint and marginal frequencies f f1 f2.

  DOUBLE lf    logarithmic coordinates 
  DOUBLE lf1     (base 10 logarithm)
  DOUBLE lf2

Since the coordinates usually have a skewed distribution across several orders of magnitude, it is often more convenient to visualise them on a logarithmic scale. The variables lf lf1 lf2 give the base 10 logarithms of the coordinate triple f f1 f2.

  DOUBLE e     ebo-coordinates
  DOUBLE b       (expected, balance, observed)
  DOUBLE o

  DOUBLE le    logarithmic ebo-coordinates
  DOUBLE lb      (base 10 logarithm)
  DOUBLE lo

Theoretical and empirical studies of the properties of association measures will often be based on transformed coordinate systems in the coordinate space. The most useful of these is the system of ebo-coordinates e b o (for expected, balance, observed). All three coordinates range from 0 to infinity (constrained by the sample size parameter N). The base 10 logarithms le lb lo of the ebo-coordinates are convenient for visualisation purposes. le and lb range from -infinity to +infinity, while lo ranges from 0 to infinity (all constrained by N).

For backward compatibility, a transformation of the coordinate system to relative frequencies, which were used in earlier versions of this software, is also supported. The relative cooccurrence (p) and marginal (p1 p2) frequencies are computed from the frequency signature according to the equations p = f/N, p1 = f1/N, and p2 = f2/N. Note that the logarithmic versions lp lp1 lp2 are negative base 10 logarithms, ranging from 0 to infinity.
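A minimal Python sketch of these equations, including the sign convention of the logarithmic versions; the function name relative_frequencies and the example figures are hypothetical, and all frequencies are assumed to be greater than zero.

```python
import math

def relative_frequencies(f, f1, f2, N):
    """Relative frequencies and their negative base 10 logarithms."""
    p, p1, p2 = f / N, f1 / N, f2 / N
    return {
        "p": p, "p1": p1, "p2": p2,
        # Negative logarithms: smaller relative frequencies give larger values.
        "lp": -math.log10(p),
        "lp1": -math.log10(p1),
        "lp2": -math.log10(p2),
    }

rel = relative_frequencies(f=10, f1=1000, f2=100, N=10000)
```

Since relative frequencies never exceed 1, the negated logarithms are never negative.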

Association Scores and Rankings

These variables store association scores and rankings for an arbitrary number of association measures. Each association measure is identified by a key, which is appended to the respective variable name prefix (resulting in the names am.key and r.key). See the UCS::AM manpage (and the manpages of the add-on packages listed there) for a wide range of built-in association measures.

  DOUBLE am.*  association scores from measure identified by *
  INT    r.*   ranking for this measure (ties are allowed)

Rankings are often computed on the fly, but they may also be annotated in data set files. Note that the r.* variables should not break ties but report identical ranks (and skip an appropriate number of subsequent ranks). The ucs-sort program (cf. the ucs-sort manpage) can be used to resolve ties in various ways (using other association scores, lexical sort order, or randomisation).
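The tie convention described above can be sketched in Python as follows. This is an illustration rather than the method used by the UCS tools: the function name rank_with_ties is hypothetical, and the sketch assumes that higher scores indicate stronger association, so that the highest score receives rank 1.

```python
def rank_with_ties(scores):
    """Rank scores in descending order; tied scores share the same rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        previous = order[position - 2] if position > 1 else None
        if previous is not None and scores[i] == scores[previous]:
            ranks[i] = ranks[previous]   # tie: repeat the previous rank
        else:
            ranks[i] = position          # skips ranks used up by earlier ties
    return ranks

ranks = rank_with_ties([7.5, 3.2, 7.5, 1.0])
```

In this example the two pair types with score 7.5 both receive rank 1, rank 2 is skipped, and the remaining pair types receive ranks 3 and 4.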

User-Defined Variables

User-defined variables may contain arbitrary information, which is typically used for filtering data sets and for determining true positives in evaluation tasks. However, some special-purpose association measures may also base their association scores on the values of such variables. In order to allow a minimal amount of automatic processing (such as sorting by user-defined attributes), the variable name prefix of a user-defined variable determines its data type, according to the following list.

  BOOL   b.*   user-defined Boolean variable
  INT    n.*   user-defined integer variable (n=number)
  DOUBLE x.*   user-defined floating-point variable
  STRING f.*   user-defined string variable (f=factor)

User-defined variables with the additional prefix ucs (corresponding to variable names b.ucs.*, n.ucs.*, x.ucs.*, and f.ucs.*) are reserved for internal use by UCS modules and programs.
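Taken together, the naming conventions allow the data type to be looked up from the variable name alone. The following Python sketch shows one way to do this; the function name variable_type and the example names f.pos and b.TP are hypothetical, and the table of names without a prefix covers only the core and derived variables whose types are listed in the tables above.

```python
# Names without a period: core and derived variables.
NO_PREFIX_TYPES = {"id": "INT", "l1": "STRING", "l2": "STRING"}
NO_PREFIX_TYPES.update({name: "INT" for name in (
    "f", "f1", "f2", "N",
    "O11", "O12", "O21", "O22", "R1", "R2", "C1", "C2")})
NO_PREFIX_TYPES.update({name: "DOUBLE" for name in (
    "E11", "E12", "E21", "E22", "lf", "lf1", "lf2",
    "e", "b", "o", "le", "lb", "lo")})

# Names with a period: the part before the first period is the prefix.
PREFIX_TYPES = {"am": "DOUBLE", "r": "INT",
                "b": "BOOL", "n": "INT", "x": "DOUBLE", "f": "STRING"}

def variable_type(name):
    """Return the data type implied by a variable name."""
    table = PREFIX_TYPES if "." in name else NO_PREFIX_TYPES
    key = name.split(".", 1)[0]
    if key not in table:
        raise ValueError(f"unknown variable name: {name}")
    return table[key]
```

The presence of a period is what separates, for instance, the core variable f (INT) from a user-defined string variable such as f.pos, and the balance coordinate b (DOUBLE) from a Boolean variable such as b.TP.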

REFERENCES

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, University of Stuttgart, Germany.

COPYRIGHT

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.