ucsfile - The UCS data set file format

**UCS data sets** are stored in a simple tabular format,
similar to that of a statistical table.
Each row in the table corresponds to a **pair type**,
and its individual fields (columns) provide various kinds of information about the pair type:

- a unique **ID number** (unique within the data set)
- the component **lexemes**
- the pair type's **frequency signature**
- [optional] contingency tables of **observed** and **expected frequencies** computed from the frequency signature
- [optional] **coordinates** computed from the frequency signature
- **association scores** and **rankings** for various association measures
- arbitrary **user-defined attributes**, especially for the manual annotation of *true positives* in an evaluation study

Following statistical terminology,
the table columns are referred to as the **variables** of a data set (each of which assumes a specific value for each pair type).
Columns are separated by a TAB character (`"\t"`),
and the first row lists the **variable names** as table headings (see the section "VARIABLES" below for naming conventions).
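The table layout described above can be read with plain string splitting. The following is a minimal sketch, not part of the UCS system: the function name `read_table` and the sample rows are hypothetical, and the function expects only the table itself (the row of variable names followed by the data rows).

```python
def read_table(lines):
    """Parse TAB-separated lines into (variable_names, list_of_row_dicts)."""
    rows = iter(lines)
    # The first row lists the variable names.
    names = next(rows).rstrip("\n").split("\t")
    table = []
    for line_number, line in enumerate(rows, start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(names):
            raise ValueError(f"row {line_number}: expected {len(names)} fields, got {len(fields)}")
        # Values are kept as strings here; type conversion is a separate step.
        table.append(dict(zip(names, fields)))
    return names, table

# Hypothetical sample data, for illustration only.
sample_lines = [
    "id\tl1\tl2\tf\tf1\tf2\tN\n",
    "1\tblack\tbox\t10\t100\t50\t10000\n",
    "2\tred\ttape\t5\t80\t20\t10000\n",
]
names, table = read_table(sample_lines)
```

Each row becomes a dictionary keyed by variable name, so `table[0]["l1"]` gives the first lexeme of the first pair type.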

The actual data table may be preceded by an optional **header** of Perl-style comment lines (beginning with a `#` character).
Lines with the special format

    ##:: <variable> = <value>

define **global variables**, which may be interpreted by some of the UCS/Perl programs (see the section "GLOBAL VARIABLES" below). The variable name (*variable*) may only contain alphanumeric characters (`A-Z a-z 0-9`) and the period (`.`). The *value* may contain arbitrary characters, including whitespace (but leading and trailing whitespace will be ignored). Variable definitions must not span multiple lines.
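A header of this form could be separated from the data table roughly as follows. This is an illustrative sketch rather than the parser used by UCS/Perl: the function `parse_header`, the regular expression, and the sample lines are assumptions, and the pattern is deliberately lenient about whitespace around the variable name and the `=` sign.

```python
import re

# Matches "##:: <variable> = <value>"; names may contain only
# alphanumeric characters and periods.
GLOBAL_RE = re.compile(r"^##::\s*([A-Za-z0-9.]+)\s*=(.*)$")

def parse_header(lines):
    """Split lines into (global_variables, comment_lines, table_lines)."""
    global_vars = {}
    comments = []
    body = []
    in_header = True
    for line in lines:
        line = line.rstrip("\n")
        if in_header and line.startswith("#"):
            match = GLOBAL_RE.match(line)
            if match:
                # Leading and trailing whitespace of the value is ignored.
                global_vars[match.group(1)] = match.group(2).strip()
            else:
                comments.append(line)
        else:
            # The first non-comment line ends the header.
            in_header = False
            body.append(line)
    return global_vars, comments, body

# Hypothetical sample file content, for illustration only.
sample = [
    "# example data set",
    "##:: size = 2",
    "##:: b.TP =   true positive according to annotator  ",
    "id\tl1\tl2\tf\tf1\tf2\tN",
    "1\tblack\tbox\t10\t100\t50\t10000",
    "2\tred\ttape\t5\t80\t20\t10000",
]
header_vars, header_comments, table_lines = parse_header(sample)
```

Values are returned as strings; a caller interested in a particular variable would convert it as needed.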

UCS data set files must have the filename extension **.ds**. They may be compressed with **gzip** (and they usually are), in which case they carry the extension **.ds.gz**. UCS library functions will automatically recognise and uncompress data set files with this extension.

A special subtype of data sets are the **annotation database** files with extension **.adb** (uncompressed) or **.adb.gz** (compressed). Annotation databases omit all frequency information and association scores, listing only component lexemes and user-defined attributes. They are used as repositories of lexical information (such as manually annotated *true positives* for evaluation purposes) that applies to data sets extracted from different corpora (or with different methods).
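Outside the UCS library, the extension conventions above can be handled with a small helper. The sketch below is hypothetical (the name `open_data_set` is not a UCS function) and assumes UTF-8 encoding; files in ISO-8859-1 would need a different `encoding` argument.

```python
import gzip
import os
import tempfile

def open_data_set(path):
    """Open a .ds/.adb file, or its gzip-compressed variant, for reading text."""
    if path.endswith((".ds.gz", ".adb.gz")):
        # Compressed files are decompressed transparently in text mode.
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith((".ds", ".adb")):
        return open(path, "r", encoding="utf-8")
    raise ValueError(f"not a UCS data set file name: {path}")

# Round trip with a temporary compressed file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ds.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("id\tl1\tl2\n1\tblack\tbox\n")
    with open_data_set(demo_path) as handle:
        demo_lines = handle.read().splitlines()
```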

    size    number of pair types in a data set

The only global variable that is currently supported is **size**, an integer specifying the number of pair types in a data set. Availability of the data set size in the header may give a slight performance improvement when loading data set files into memory. If **size** is set to an incorrect value, the behaviour of UCS/Perl programs and modules is undefined.

A global variable whose name is identical to that of a variable defined in the data set (i.e. a table column) is interpreted as an **explanatory note**. Such notes should typically be given for all user-defined variables, and also for user-defined association measures.

Unsupported variables will simply be ignored and will not raise errors or warnings when a data set file is parsed.

The UCS system supports four different data types:

    BOOL    a logical (Boolean) value
    INT     a signed integer value (>= 32 bits)
    DOUBLE  a floating-point value (IEEE double precision)
    STRING  an arbitrary string (ISO-8859-1 or UTF-8)

**Boolean** values are represented by 1 (true) and 0 (false). **String** values may contain blanks (but no TAB characters) and are neither quoted nor escaped. Full support for Unicode strings (UTF-8) is only available within the UCS/Perl subsystem.

The UCS/R subsystem will interpret Boolean values as logical variables, and strings (except for the component lexemes) as *factor* variables with a fixed set of levels (which are automatically determined from the data).

User-defined attributes may assume the special value `NA` for **missing values**. (Note that the string `NA` will always be interpreted as a missing value rather than a literal character string!) UCS/R has built-in support for missing values, whereas UCS/Perl represents them by **undef** entries. Programs that do not support missing values may replace them by 0 (BOOL and INT), 0.0 (DOUBLE), or the empty string "" (STRING).
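The conversion of table fields to typed values, including the treatment of `NA`, might look like the following sketch. The function `convert_value` is hypothetical. It represents missing values by `None` (analogous to **undef** in UCS/Perl) and, as a simplification, accepts `NA` for any variable rather than for user-defined attributes only.

```python
def convert_value(text, data_type):
    """Convert one table field to a Python value; 'NA' becomes None."""
    if text == "NA":
        # Always a missing value, never the literal string "NA".
        return None
    if data_type == "BOOL":
        if text not in ("0", "1"):
            raise ValueError(f"invalid BOOL value: {text!r}")
        return text == "1"
    if data_type == "INT":
        return int(text)
    if data_type == "DOUBLE":
        return float(text)
    if data_type == "STRING":
        # Strings are neither quoted nor escaped, so they are used as-is.
        return text
    raise ValueError(f"unknown data type: {data_type}")

converted = [
    convert_value("1", "BOOL"),
    convert_value("-3", "INT"),
    convert_value("2.5", "DOUBLE"),
    convert_value("noun phrase", "STRING"),
    convert_value("NA", "STRING"),
]
```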

The **data type** of a variable is uniquely determined by the variable name, as detailed in the section "VARIABLES" below.

In order to be compatible with the **R** language, variable names may only contain alphanumeric characters (`A-Z a-z 0-9`) and periods (`.`), and they must begin with a letter. The main function of periods is to delimit words in complex variable names, replacing blanks, hyphens, and underscores. UCS variable names are case-sensitive.

Periods are not allowed in **Perl** variable names, but **UCS expressions** provide a special syntax for direct access to data set variables (see the ucsexp and UCS::Expression manpages). In the rare case where plain Perl variables are used to store information from a data set, periods should be replaced by underscores (`_`) in the variable names.

There are strict **naming conventions** for data set variables, which are detailed in the following subsections. Apart from a fixed list of core variables (whose names do not contain the `.` character), all variable names begin with a period-separated **prefix** that determines the data type of the variable.

Core variables represent the minimal amount of information that must be present in a data set file (i.e. evidence for cooccurrences extracted from a corpus). All core variables are mandatory, except in the case of annotation database files (.adb), which omit frequency signatures (`f f1 f2 N`). For relational cooccurrences, frequency signatures can be computed with the **ucs-make-tables** utility from a stream of pair tokens (cf. the ucs-make-tables manpage).

    INT     id  a numerical ID value (unique within the data set)
    STRING  l1  first component type of the pair
    STRING  l2  second component type of the pair
    INT     f   cooccurrence frequency of pair type
    INT     f1  marginal frequency of first component
    INT     f2  marginal frequency of second component
    INT     N   sample size (identical for all pair types)

`id` is a numerical ID value, which must be unique within a data set. Its intended uses are to identify pair types in subsets selected from a given data set, and to validate line numbers when attributes or association scores are computed by an external program and re-integrated into the data set file.

The **lexemes** `l1` and `l2` are the component (word) types that uniquely identify a pair type. Consequently, a data set file must not contain multiple rows with identical `l1` and `l2` values. UCS/Perl should provide reasonably good support for Unicode strings as lexemes (in UTF-8 encoding), at least when running on Perl version 5.8.0 or newer.
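The two uniqueness constraints (on `id` and on the combination of `l1` and `l2`) can be checked in a single pass. The sketch below is illustrative only: the function `check_uniqueness` and the sample rows are hypothetical, and rows are assumed to be dictionaries keyed by variable name.

```python
def check_uniqueness(rows):
    """Return a list of problem descriptions for duplicate IDs or lexeme pairs."""
    seen_ids = set()
    seen_pairs = set()
    problems = []
    for row in rows:
        pair = (row["l1"], row["l2"])
        if row["id"] in seen_ids:
            problems.append(f"duplicate id: {row['id']}")
        if pair in seen_pairs:
            problems.append(f"duplicate pair type: {pair[0]} {pair[1]}")
        seen_ids.add(row["id"])
        seen_pairs.add(pair)
    return problems

# Hypothetical rows; the third one violates both constraints.
rows = [
    {"id": 1, "l1": "black", "l2": "box"},
    {"id": 2, "l1": "red", "l2": "tape"},
    {"id": 2, "l1": "black", "l2": "box"},
]
problems = check_uniqueness(rows)
```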

The quadruple `f f1 f2 N` is called the **frequency signature** of a pair type. It contains all the frequency information used by **association measures** and is equivalent to a contingency table. Note that the **sample size** `N` is identical for all pair types in a data set and is included here mainly for convenience' sake (so that association scores can be computed from the row data without reference to a global variable). See (Evert 2004) for more information on lexemes and frequency signatures.

Derived variables can be computed from the frequency signatures of pair types, providing different "views" of the frequency information. Normally, they are not annotated explicitly but are accessible through **UCS expressions**, which compute the required values automatically (see the ucsexp and UCS::Expression manpages).

    INT  O11  contingency table of observed frequencies
    INT  O12  (computed from frequency signature)
    INT  O21
    INT  O22
    INT  R1   row sums in observed contingency table
    INT  R2
    INT  C1   column sums in observed contingency table
    INT  C2

The variables `O11 O12 O21 O22` represent the observed **contingency table** of a pair type. Note that their frequency information is equivalent to the frequency signature of the pair type. In addition, the **row sums** (`R1 R2`) and **column sums** (`C1 C2`) of the contingency table are also made available.
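The equivalence between frequency signature and contingency table can be made concrete with a short sketch. It assumes the usual convention in which `f1` is the first row sum and `f2` the first column sum (cf. Evert 2004); the function name `observed_table` and the example numbers are hypothetical.

```python
def observed_table(f, f1, f2, N):
    """Compute the observed contingency table and its marginals from f, f1, f2, N."""
    O11 = f                 # both components present
    O12 = f1 - f            # first component without the second
    O21 = f2 - f            # second component without the first
    O22 = N - f1 - f2 + f   # neither component
    return {
        "O11": O11, "O12": O12, "O21": O21, "O22": O22,
        "R1": O11 + O12, "R2": O21 + O22,   # row sums
        "C1": O11 + O21, "C2": O12 + O22,   # column sums
    }

obs = observed_table(f=10, f1=100, f2=50, N=10000)
```

Under this convention `R1` equals `f1`, `C1` equals `f2`, and the four cells sum to `N`, which is why the two representations carry the same information.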

    DOUBLE  E11  contingency table of expected frequencies
    DOUBLE  E12  under point null hypothesis
    DOUBLE  E21  (computed from row and column sums)
    DOUBLE  E22

The variables `E11 E12 E21 E22` represent the contingency table of **expected frequencies**, i.e. the expectations of the multinomial sampling distribution under the point null hypothesis of independence. Most association measures compare observed frequencies to expected frequencies in some way.
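Under the independence hypothesis, each expected frequency is the product of the corresponding row and column sums divided by the sample size. The sketch below illustrates this standard formula; the function name `expected_table` and the example numbers are hypothetical, and it assumes that `f1` and `f2` are the first row and column sums respectively.

```python
def expected_table(f1, f2, N):
    """Compute expected frequencies E_ij = R_i * C_j / N under independence."""
    R1, R2 = f1, N - f1   # row sums
    C1, C2 = f2, N - f2   # column sums
    return {
        "E11": R1 * C1 / N,
        "E12": R1 * C2 / N,
        "E21": R2 * C1 / N,
        "E22": R2 * C2 / N,
    }

exp = expected_table(f1=100, f2=50, N=10000)
```

With these numbers `E11` is 0.5, so a pair type observed 10 times would occur about 20 times more often than expected under independence.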

In a **geometric interpretation** of a data set, each pair type can be interpreted as a point *x* in a three-dimensional **coordinate space** **P**. Since the sample size `N` is a constant parameter within the data set, the coordinates of *x* are given by the triple `f f1 f2`.

    DOUBLE  lf   logarithmic coordinates
    DOUBLE  lf1  (base 10 logarithm)
    DOUBLE  lf2

Since the coordinates usually have a skewed distribution across several orders of magnitude, it is often more convenient to visualise them on a logarithmic scale. The variables `lf lf1 lf2` give the **base ten logarithms** of the coordinate triple `f f1 f2`.
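As a brief illustration, the logarithmic coordinates can be computed as follows. The function name `log_coordinates` is hypothetical, and the sketch assumes that all three frequencies are positive.

```python
import math

def log_coordinates(f, f1, f2):
    """Return (lf, lf1, lf2), the base 10 logarithms of the coordinate triple."""
    return tuple(math.log10(value) for value in (f, f1, f2))

lf, lf1, lf2 = log_coordinates(10, 100, 50)
```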

    DOUBLE  e   ebo-coordinates
    DOUBLE  b   (expected, balance, observed)
    DOUBLE  o
    DOUBLE  le  logarithmic ebo-coordinates
    DOUBLE  lb  (base 10 logarithm)
    DOUBLE  lo

Theoretical and empirical studies of the properties of association measures will often be based on transformed coordinate systems in the coordinate space. The most useful such system is formed by the **ebo-coordinates** `e b o` (for *expected*, *balance*, *observed*). All three coordinates range from 0 to infinity (constrained by the sample size parameter `N`). The base 10 logarithms `le lb lo` of the ebo-coordinates are convenient for visualisation purposes. `le` and `lb` range from -infinity to +infinity, while `lo` ranges from 0 to infinity (all constrained by `N`).

For backward compatibility, a transformation of the coordinate system to **relative frequencies**, which were used in earlier versions of this software, is also supported. The relative cooccurrence (`p`) and marginal (`p1 p2`) frequencies are computed from the frequency signature according to the equations `p = f/N`, `p1 = f1/N`, and `p2 = f2/N`. Note that the logarithmic versions `lp lp1 lp2` are *negative* base 10 logarithms, ranging from 0 to infinity.
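The equations above translate directly into code. In this sketch the function name `relative_frequencies` and the example numbers are hypothetical, and all frequencies are assumed to be positive so that the logarithms are defined.

```python
import math

def relative_frequencies(f, f1, f2, N):
    """Compute p, p1, p2 and their negative base 10 logarithms lp, lp1, lp2."""
    p, p1, p2 = f / N, f1 / N, f2 / N
    return {
        "p": p, "p1": p1, "p2": p2,
        # Negative logarithms, so smaller relative frequencies give larger values.
        "lp": -math.log10(p),
        "lp1": -math.log10(p1),
        "lp2": -math.log10(p2),
    }

rel = relative_frequencies(f=10, f1=100, f2=50, N=10000)
```

Because relative frequencies never exceed 1, the negated logarithms are never negative, which matches the stated range of 0 to infinity.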

These variables store association scores and rankings for an arbitrary number of **association measures**. Each association measure is identified by a *key*, which is appended to the respective variable name prefix (resulting in the names `am.`*key* and `r.`*key*). See the UCS::AM manpage (and the manpages of the add-on packages listed there) for a wide range of built-in association measures.

    DOUBLE  am.*  association scores from measure identified by *
    INT     r.*   ranking for this measure (ties are allowed)

Rankings are often computed on the fly, but they may also be annotated in data set files. Note that the `r.*` variables should *not* break ties but report identical ranks (and skip an appropriate number of subsequent ranks). The **ucs-sort** program (cf. the ucs-sort manpage) can be used to resolve ties in various ways (using other association scores, lexical sort order, or randomisation).
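The tie handling described above (identical ranks for tied scores, with subsequent ranks skipped) can be sketched as follows. This is not the UCS implementation: the function `rank_with_ties` is hypothetical, and it assumes that higher association scores receive better (smaller) ranks.

```python
def rank_with_ties(scores):
    """Rank scores in descending order; ties share a rank and later ranks are skipped."""
    # Indices of the scores, sorted from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        previous = order[position - 2] if position > 1 else None
        if previous is not None and scores[index] == scores[previous]:
            # Tied with the preceding item: reuse its rank.
            ranks[index] = ranks[previous]
        else:
            # Otherwise the rank is the sort position, which skips over ties.
            ranks[index] = position
    return ranks

ranks = rank_with_ties([3.2, 7.5, 3.2, 1.0])
```

In this example the two pair types with score 3.2 share rank 2, and the next pair type receives rank 4 rather than 3.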

User-defined variables may contain arbitrary information, which is typically used for filtering data sets and to determine true positives in evaluation tasks. However, some special-purpose association measures may also base their association scores on their values. In order to allow a minimal amount of automatic processing (such as sorting by user-defined attributes), the variable name prefix of a user-defined variable is used to determine its data type, according to the following list.

    BOOL    b.*  user-defined Boolean variable
    INT     n.*  user-defined integer variable (n=number)
    DOUBLE  x.*  user-defined floating-point variable
    STRING  f.*  user-defined string variable (f=factor)

User-defined variables with the additional prefix `ucs` (corresponding to variable names `b.ucs.*`, `n.ucs.*`, `x.ucs.*`, and `f.ucs.*`) are reserved for internal use by UCS modules and programs.
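Since the data type follows from the variable name, a lookup can be written in a few lines. The sketch below is hypothetical (neither `variable_type` nor `is_reserved` is a UCS function) and covers only the core variables and the prefixed association-measure and user-defined variables; derived variables such as `O11` or `lf` are omitted for brevity and raise an error here.

```python
# Core variables have fixed names without a period.
CORE_TYPES = {
    "id": "INT", "l1": "STRING", "l2": "STRING",
    "f": "INT", "f1": "INT", "f2": "INT", "N": "INT",
}
# All other variables handled here are typed by their period-separated prefix.
PREFIX_TYPES = {
    "am": "DOUBLE", "r": "INT",                              # association measures
    "b": "BOOL", "n": "INT", "x": "DOUBLE", "f": "STRING",   # user-defined
}
USER_PREFIXES = ("b", "n", "x", "f")

def variable_type(name):
    """Return the data type implied by a variable name."""
    if name in CORE_TYPES:
        return CORE_TYPES[name]
    prefix, separator, rest = name.partition(".")
    if separator and rest and prefix in PREFIX_TYPES:
        return PREFIX_TYPES[prefix]
    raise ValueError(f"unrecognised variable name: {name}")

def is_reserved(name):
    """True for user-defined variables carrying the additional 'ucs' prefix."""
    parts = name.split(".")
    return len(parts) >= 3 and parts[0] in USER_PREFIXES and parts[1] == "ucs"

types = {name: variable_type(name) for name in ("f", "f.pos", "am.t.score", "r.t.score", "b.TP")}
```

The variable names `f.pos`, `am.t.score`, and `b.TP` are made up for this example. Note that the core variable `f` (an integer frequency) and the user-defined prefix `f.` (a string) are distinguished only by the presence of the period.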

Evert, Stefan (2004). *The Statistics of Word Cooccurrences: Word Pairs and Collocations.* PhD Thesis, University of Stuttgart, Germany.

Copyright (C) 2004 by Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.