ucsexp - Introduction to UCS expressions and wildcard patterns


UCS expressions and wildcard patterns are two central features of the UCS/Perl system and account for much of its convenience and flexibility.

UCS wildcard patterns are used by most command-line tools to select data set variables with the help of shell-like wildcard characters (?, *, and %). A programmer interface is provided by the UCS::Match function from the UCS module (see the UCS manpage).

UCS expressions give easy access to data set variables from Perl code. With only a basic knowledge of Perl syntax, users can compute association scores and select rows from a data set (using the ucs-add and ucs-select utilities). The programmer interface is provided by the UCS::Expression module (see the UCS::Expression manpage for details). Before reading "UCS EXPRESSIONS", you should become familiar with the UCS data set format and variable naming conventions as described in the ucsfile manpage.

When used on the command line, wildcard patterns usually have to be quoted to keep the shell from expanding them as file name patterns. (GNU Bash passes a pattern through unchanged when no file in the current directory matches it, but it is safer not to rely on this.) Note that when a list of variable names and patterns is passed to one of the UCS/Perl utilities, each name or wildcard pattern has to be quoted individually. UCS expressions (almost) always have to be quoted on the command line. Single quotes ('...') are highly recommended to avoid interpolation of variables and other meta-characters. The UCS/Perl utilities expect a UCS expression to be passed as a single argument, so the expression must be written as one string. In particular, any expression containing whitespace must be quoted.

UCS WILDCARD PATTERNS

As described in the ucsfile manpage, UCS variable names may only contain the alphanumeric characters (A-Z a-z 0-9) and the period (.), which serves as a general-purpose word delimiter. There is a fixed set of core variables, whose names do not contain a period. All other variable names must begin with a prefix (one of am. r. b. n. x. f.) that determines the data type of the variable. The three wildcard characters take the special role of the period into account. Their meanings are

  ? ... a single character, except "."
  * ... a string that does NOT contain a "."
  % ... an arbitrary string of characters

The % wildcard is typically used to select variable names with a specific prefix or suffix, while * matches the individual words (or parts of words) in a complex variable name.
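
The meanings of the three wildcards can be illustrated by translating a pattern into a Perl regular expression. The following is a minimal sketch, not the implementation of UCS::Match: the helpers wildcard_to_regex and match_names are hypothetical, the list of variable names is a made-up sample, and it is assumed that a pattern has to match the entire variable name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of UCS): translate a UCS wildcard pattern
# into an anchored Perl regular expression.
sub wildcard_to_regex {
    my ($pattern) = @_;
    my %wildcard = (
        '?' => '[^.]',     # a single character, except "."
        '*' => '[^.]*',    # a string that does not contain a "."
        '%' => '.*',       # an arbitrary string of characters
    );
    # The capturing group makes split() return the wildcards as tokens, too.
    my $body = join '',
        map { exists $wildcard{$_} ? $wildcard{$_} : quotemeta($_) }
        split /([?*%])/, $pattern;
    return qr/\A$body\z/;
}

# Hypothetical helper: return all names that match the pattern.
sub match_names {
    my ($pattern, @names) = @_;
    my $regex = wildcard_to_regex($pattern);
    return grep { $_ =~ $regex } @names;
}

# Made-up sample of variable names, for illustration only.
my @vars = qw(l1 l2 f f1 f2 N am.MI am.t.score am.log.likelihood r.t.score);

print join(' ', match_names('am.%',    @vars)), "\n";  # am.MI am.t.score am.log.likelihood
print join(' ', match_names('am.*',    @vars)), "\n";  # am.MI
print join(' ', match_names('%.score', @vars)), "\n";  # am.t.score r.t.score
print join(' ', match_names('f?',      @vars)), "\n";  # f1 f2
```

Note how am.* stops at the next period, whereas am.% selects every name with the am. prefix.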


UCS EXPRESSIONS

A UCS expression consists of ordinary Perl code extended with a special syntax to access data set variables. This code is compiled on the fly and applied to the rows of a data set one at a time. The return value of a UCS expression is the value of the last statement executed, unless there is an explicit return statement. When the expression is used as a condition to select rows from a data set, it evaluates to true or false according to the usual Perl rules (undef, the empty string '', the number 0 and the string '0' are false, everything else is true).
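
The rules for return values and truth values are those of plain Perl and can be tried out without UCS. In the sketch below, compile_snippet and is_true are hypothetical helpers; compiling a code string with eval only stands in for what UCS::Expression does internally, which may well differ.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a code string into an anonymous subroutine.
sub compile_snippet {
    my ($code) = @_;
    my $sub = eval "sub { $code }";
    die "compilation failed: $@" unless defined $sub;
    return $sub;
}

# Without an explicit return, the value of the last statement is returned.
my $last_statement = compile_snippet('my $x = 3; $x * 2');
print $last_statement->(), "\n";      # 6

# An explicit return statement ends evaluation early.
my $explicit = compile_snippet('return "early" if $_[0]; "late"');
print $explicit->(1), "\n";           # early
print $explicit->(0), "\n";           # late

# Perl truth rules, as applied to a condition.
sub is_true {
    my ($value) = @_;
    return $value ? 1 : 0;
}

for my $value (undef, 0, '0', '', '0.0', 'a') {
    my $label = defined $value ? "'$value'" : 'undef';
    printf "%-6s is %s\n", $label, is_true($value) ? 'true' : 'false';
}
```

The string '0.0' is true even though it is numerically zero, which matters when a condition is applied to a value that has not been used as a number.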

Data set variables are accessed by their variable name enclosed in % characters. They evaluate to the respective value for the current row in the data set and can be used like ordinary scalar variables in Perl. Thus, %f% corresponds to the cooccurrence frequency f of a pair type, %l1% and %l2% to its component lexemes, and %am.log.likelihood% to an association score from the log-likelihood measure. Derived variables (see the ucsfile manpage) do not have to be annotated explicitly in a data set. When necessary, they are computed on the fly from a pair type's frequency signature. Variable references should be treated as read-only (they are automatically localised so that assigning a new value to a UCS variable reference does not modify the original data set).
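
To make the idea of variable references more concrete, the following sketch rewrites each %name% reference into a lookup in a per-row hash and compiles the result. This is only an illustration under stated assumptions: compile_expression is a hypothetical helper, the rows and their values are invented, and the actual UCS::Expression module may represent rows and references quite differently (for instance, it also computes derived variables and localises the values).

```perl
use strict;
use warnings;

# Hypothetical helper: replace %name% by a hash lookup on the current row
# (passed as the first argument) and compile the resulting Perl code.
sub compile_expression {
    my ($source) = @_;
    my $code = $source;
    $code =~ s/%([A-Za-z0-9.]+)%/'$_[0]->{"' . $1 . '"}'/ge;
    my $sub = eval "sub { $code }";
    die "cannot compile '$source': $@" unless defined $sub;
    return $sub;
}

# Invented sample rows; the numbers carry no meaning.
my @rows = (
    { l1 => 'black',  l2 => 'box',  f => 12, 'am.t.score' => 3.2 },
    { l1 => 'red',    l2 => 'tape', f => 3,  'am.t.score' => 1.5 },
    { l1 => 'strong', l2 => 'tea',  f => 7,  'am.t.score' => 2.6 },
);

my $condition = compile_expression('%f% >= 5 and %am.t.score% > 2');
my $label     = compile_expression('%l1% . " " . %l2%');

# Apply the condition to one row at a time and print the selected pairs.
for my $row (grep { $condition->($_) } @rows) {
    print $label->($row), "\n";
}
# prints "black box" and "strong tea"
```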

Any temporary variables needed by the Perl code should be made lexical by declaring them with the my keyword. Variable names beginning with an underscore (such as $_f or $_n_total) are reserved for internal use. Please don't use global variables, which pollute the namespaces and might interfere with other parts of the program. If you feel that you absolutely need a variable to carry information from one row to the next, use a fully qualified variable name in your own namespace.
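
The difference between a lexical temporary and a fully qualified variable can be sketched in plain Perl. The subroutine below stands in for a UCS expression that is applied to one row after another; the name is_new_maximum and the namespace MyNamespace are made up for this example.

```perl
use strict;
use warnings;

# Returns 1 if the frequency is larger than all frequencies seen so far.
sub is_new_maximum {
    my ($f) = @_;
    # Lexical temporary: created afresh on every call.
    my $previous = $MyNamespace::max_f;
    if (!defined $previous or $f > $previous) {
        # Fully qualified variable: keeps its value from one call to the next.
        $MyNamespace::max_f = $f;
        return 1;
    }
    return 0;
}

my @flags = map { is_new_maximum($_) } (3, 7, 5, 9);
print "@flags\n";    # 1 1 0 1
```

Because the variable lives in its own namespace, it cannot clash with variables used by other parts of the program.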

Since a UCS expression is compiled by the Perl interpreter, it offers the full power and flexibility of Perl, but it also shares its idiosyncrasies and traps for the unwary. You should have a good working knowledge of Perl in order to write UCS expressions. If you don't know the difference between == and eq, now is the time to type perldoc perl and start reading the Perl documentation.
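
As a reminder of why the distinction matters, here is a short plain-Perl sketch. The helper names same_number and same_string are made up; the numeric warning is switched off inside same_number only so that the example runs quietly.

```perl
use strict;
use warnings;

# == compares its operands as numbers.
sub same_number {
    my ($x, $y) = @_;
    no warnings 'numeric';    # a non-numeric string silently counts as 0 here
    return $x == $y ? 1 : 0;
}

# eq compares its operands as strings.
sub same_string {
    my ($x, $y) = @_;
    return $x eq $y ? 1 : 0;
}

for my $pair (['1.0', '1'], ['black', 'white'], ['black', 'black']) {
    printf "%-6s %-6s  ==: %d  eq: %d\n",
        @$pair, same_number(@$pair), same_string(@$pair);
}
# 1.0    1       ==: 1  eq: 0
# black  white   ==: 1  eq: 0
# black  black   ==: 1  eq: 1
```

In particular, comparing lexemes with == treats every non-numeric string as 0, so all such comparisons come out true.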

Just as in Perl, data types are automatically converted as necessary. Missing values (which appear as NA in data set files) are represented by undef in Perl. When there may be missing values in a data set, test for definedness (e.g. with defined(%b.colloc%)) to avoid warning messages. UCS expressions can use all standard Perl functions (described in the perlfunc manpage). In addition, the utility functions from UCS::Expression::Func (see the UCS::Expression::Func manpage for a detailed description) and a range of special mathematical and statistical functions defined in the UCS::SFunc module (see the UCS::SFunc manpage for a complete listing and details) are imported automatically and can be used without qualification.
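
The advice on missing values can be sketched in plain Perl as follows. Both helpers are hypothetical: parse_field mimics the mapping from NA to undef, and is_marked corresponds to a condition that might be written as defined(%b.colloc%) and %b.colloc% in a UCS expression.

```perl
use strict;
use warnings;

# Hypothetical helper: map the NA marker of a data set file to undef.
sub parse_field {
    my ($text) = @_;
    return $text eq 'NA' ? undef() : $text;
}

# Test for definedness first, so that a missing value never reaches
# an operation that would trigger a warning.
sub is_marked {
    my ($flag) = @_;
    return (defined $flag and $flag) ? 1 : 0;
}

for my $text (qw(1 0 NA)) {
    printf "%-2s -> %d\n", $text, is_marked(parse_field($text));
}
# 1  -> 1
# 0  -> 0
# NA -> 0
```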

UCS Expressions for Programmers

The programmer interface to UCS expressions is provided by the UCS::Expression module (see the UCS::Expression manpage), with functions for compiling and evaluating UCS expressions. The UCS::DS::Memory module includes several methods that apply a UCS expression to the in-memory representation of a UCS data set. Note that all built-in association measures are implemented as UCS expressions (see the UCS and UCS::AM manpages for more information, or have a look at the source files).

When you want to use external functions (either defined by your own module or imported from a separate module), they must be fully qualified. For instance, you must write Math::Trig::atan(1) instead of just atan(1). Make sure that the module is loaded (with use Math::Trig;) before the expression is evaluated for the first time. You can just put the use statement in the Perl script or module where the UCS expression is defined, and it is probably also safe to include it in the expression itself (which allows you to use external libraries even in UCS expressions typed on the command line).
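
The following plain-Perl sketch shows the fully qualified call from the example above. The code string is compiled with eval as a stand-in for a UCS expression; how UCS::Expression compiles its code, and in which package, is not shown here.

```perl
use strict;
use warnings;
use Math::Trig;    # loaded before the code string below is compiled

# The function is called with its fully qualified name, so the call does
# not depend on the package in which the code string is compiled.
my $four_atan = eval 'sub { 4 * Math::Trig::atan(1) }'
    or die "compilation failed: $@";

printf "%.5f\n", $four_atan->();    # 3.14159
```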

An advanced feature of UCS expressions that is only available through the programmer interface is the use of parameters. Parameters play the role of constants in UCS expressions: they can be accessed like data set variables, but their values are fixed and stored within the UCS::Expression object. Parameter names must be valid UCS identifiers and should be all uppercase in order to avoid conflicts with variable names. Parameters must be declared and initialised when the UCS expression is compiled. Their values can be changed with the set_param method. See the UCS::Expression manpage for more information.


Dirty Tricks

Things not to do ...


Copyright (C) 2004 by Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.