Eckhard Bick, vislcg how-to 6/2006

Basic how-to for vislcg

1.Command-line usage:

standard call: vislcg --grammar rulesfile
without mapping rules: --no-mapping
with rule-number traces for debugging: --verbosity minimal
restrict the run to the first n (i.e. least heuristic) constraint sections: --sections=n
special mapping prefix (default = '@'): --prefix='...'

Ordinarily input is piped from a lexicon-based morphological multitagger, but input from probabilistic taggers (Treetagger, TnT, Brill etc.) can also be used, in which case the first rule section typically will be a correction grammar rather than a morphological disambiguation grammar. In order to prevent syntactic rules from interfering with morphological ones (by being run on morphologically not-yet disambiguated input), it is recommended to run vislcg twice - first without, then with syntactic mapping. Finally, disambiguated/tagged output can be piped directly to a file, or processed with layout filters or further grammars in other formalisms (constituent grammar, dependency grammar, field grammar etc.).

cat textfile | multitagger | vislcg --grammar rulesfile --no-mapping | vislcg --grammar rulesfile | postfilter > textfile.cg
with tracing: cat textfile | multitagger | vislcg --grammar rulesfile --no-mapping --verbosity minimal | vislcg --grammar rulesfile --verbosity minimal | postfilter > textfile.cg

The multitagger or other input source has to deliver so-called verticalized text, i.e. one token per line, with non-punctuation tokens followed by a cohort of one or more possible analyses, indented, one per line. Conventionally, cohort lines start with the lexeme or base form (in quotes), followed by word class (PoS) and inflexion tags in upper case. Secondary tags, which are meant to be used as disambiguation context but are not intended for disambiguation themselves (such as subclass, valency and semantic tags), should be placed in <...> brackets between the lexeme and the word class tags:

"<wordform>"
	"lexeme-1" <valency> .. <semantics> .. POS-1 INFLEXION
	"lexeme-1" <valency> .. <semantics> .. POS-2 INFLEXION
	"lexeme-2" <valency> .. <semantics> .. POS-3 INFLEXION
	"lexeme-2" <valency> .. <semantics> .. POS-4 INFLEXION
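
As a concrete illustration, the cohort for the Portuguese word form "como" might look roughly as follows. The readings "comer" V PR 1S IND and "como" ADV are the ones used in the REMOVE description in section 3.4, and <rel> and <interr> appear in the sample rules file. The valency tag <vt> is only an assumed example, and the actual tag inventory depends on the multitagger used.

```cg
"<como>"
	"comer" <vt> V PR 1S IND
	"como" <rel> ADV
	"como" <interr> ADV
```

Here the word form line is unindented, and each indented line is one reading competing for the same token.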

2.The rules file

A vislcg rules file consists of the following sections:

DELIMITERS (1 line, defines sentence boundaries)

SETS (1 or more sections of set definitions, compiled as one)

MAPPINGS (1 section of mapping rules, adding tags at the end of a reading line)

CORRECTIONS (1 section of correction rules, replacing tags anywhere in a reading)

CONSTRAINTS (1 or more sections of REMOVE or SELECT rules, with each section compiled and run separately)

END

Set sections contain LIST definitions of sets, written as lists of ORed tags or tag chains (in parentheses). Once defined, sets may be combined into new sets with a SET definition.

Mapping and Correction sections have MAP/ADD and SUBSTITUTE rules, respectively. These rules are applied in strict sequential order. But while MAP/ADD rules can't "see" in their context conditions what earlier mapping rules have mapped, this is not true of SUBSTITUTE rules, which do interact with the results of previous substitution rules.

Constraint sections will be interpreted as heuristicity batches, with safer rules in the first sections and more heuristic rules in later sections. Each section is repeated until none of its rules can be instantiated any longer (i.e. meet their context conditions). Then the next section is run, and the first section is re-run after second-section disambiguation to check for changed contexts. After that, a third section is run, and the lower ones are re-run, etc.

Within one and the same constraint section, rules should be regarded as "simultaneous", since their order may be changed by the compiler for optimisation purposes. However, word form rules will be run first, and SELECT rules (due to their greater disambiguation potential) have priority over REMOVE rules with the same target.

Each set definition or rule is terminated with a semicolon, but can run over several lines. As in several programming languages, the # symbol marks the rest of a line as a comment.

3.The individual operators

3.2.Delimiters

The vislcg compiler applies rules within a certain context window, defined by delimiters. Typically, delimiters will be sentence boundary markers (i.e. punctuation), but paragraphs, corpus section markers or even specific stop-words could be used. Rules can refer to the boundaries with the reserved symbols >>> (left boundary) and <<< (right boundary).

DELIMITERS = "<.>" "<!>" "<?>" ;

The example defines a full stop, exclamation mark or question mark as a delimiter. Note that punctuation notation follows word form notation, with quotes and angle brackets.
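
A minimal sketch of how the boundary symbols might appear in context conditions, assuming they can be tested like ordinary tags at a relative position. The rules are purely illustrative and not taken from an existing grammar. CLB is assumed to be a predefined clause boundary set, as in the sample rules file.

```cg
SELECT (PROP) IF (-1 >>>) ; # hypothetical: prefer a proper noun reading directly after the left boundary
REMOVE (KS) IF (*1 <<< BARRIER CLB) ; # hypothetical: the right boundary is reached without crossing a clause boundary
```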

3.3.Set definitions

In both their targets and context conditions, CG rules can refer not only to words, lexemes and tags, but also to sets of words, lexemes or tags, or even combinations of these three types. Two kinds of set definitions are used:

(a) LIST set-name =

followed by a list of tags or tag combinations (the latter in parentheses), separated by spaces. The list constitutes the set, and a rule targeting a set is equivalent to a batch of rules targeting each set element separately.

(b) SET set-name =

defining a new set as a mathematical operation on existing sets. Sets used in a SET definition must occur earlier in the grammar. Tags can be used as sets on the fly by enclosing them in parentheses.

A set element can be:

(1) a tag, word form or lexeme, e.g. N [for noun], "<bought>" [word form] or "buy" [lexeme]
(2) a combination of (1), as a kind of "snapshot" from a reading, in parentheses. The snapshot may have "holes" (i.e. intervening tags appearing in the reading but not in the set element). For instance, (N M P) [for noun masculine plural], or ("eat" INF).

In a SET definition (b), sets can be combined with the following operators:

union: OR or | , e.g. set1 OR set2 OR (tag3) OR (N F S)

concatenation: + , e.g. set1 + set2, yields all possible combinations of the 2 sets' elements. Thus, a concatenation of LIST set1 = V and LIST set2 = INF GER PCP covers all non-finite verb forms: (V INF) (V GER) (V PCP).

negation: - , e.g. set1 - set2 (set1 but not set2), means set1 as long as the reading in question does not contain elements from set2. Thus, rather than just removing set2 elements from the set1 list (i.e. set difference, as used in Tapanainen's cg2), vislcg interprets the minus operation as a kind of NOT condition, so the presence of a set2 element in a reading will block and override the presence of a set1 element. Thus, (N) - (P) means non-plural nouns. In the upcoming visl-cg3, a clear distinction will be made between negation and set difference.

The + and - operators have precedence over OR.

Note that the same operators, as well as the parenthesis convention for creating sets on-the-fly, can be used in targets and context conditions of REMOVE and SELECT rules in the CONSTRAINTS section.
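
The following fragment sketches how the two definition types and the three operators could be combined in a SETS section. The set names NONFIN, V-NONFIN, N-NONPL and HEAD are hypothetical and chosen only for illustration. The tags are those used elsewhere in this how-to.

```cg
LIST V = V ;                  # a set with a single tag
LIST NONFIN = INF GER PCP ;   # three ORed tags
LIST NOMINAL = N PROP ADJ PCP ;
LIST P = P S/P ;              # plural, including S/P

SET V-NONFIN = V + NONFIN ;   # concatenation: (V INF) (V GER) (V PCP)
SET N-NONPL = (N) - P ;       # negation: noun readings that contain no element of P
SET HEAD = NOMINAL - (ADJ) OR V-NONFIN ; # + and - are evaluated before OR
```

Note that every set used in a SET line is defined further up, and that the single tags N and ADJ are turned into sets on the fly with parentheses.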

3.4.Constraints

Constraint rules are ordered in sections, usually in order to separate safer rules (to be used earlier) from more heuristic rules (to be used later). Within one section, rules should be regarded as simultaneous, though REMOVE rules will be used after SELECT rules. One and the same grammar can be run at different levels of heuristicity by using the --sections=n flag when calling vislcg, meaning that only the first (=safest) n constraint sections of the grammar will be used.

A CG rule has the following general form, with [] brackets indicating optional elements:

["<Wordform>"] OPERATION TARGET [[IF] (CONTEXT-1) (CONTEXT-2) ...] ;

OPERATION:

(a) REMOVE

Removes a reading from a cohort if it contains a TARGETed tag, unless this reading is the last surviving reading. In the case of a morphological or PoS tag, this means that one (entire) reading line, in a cohort of readings for a given token, will be removed. For instance, the reading line "comer" V PR 1S IND will be removed from the analysis cohort of "como" if either the V (verb) or PR (present tense) tag is TARGETed by a successful REMOVE rule, leaving the "como" ADV reading to survive. If the target is a MAPped tag (i.e. an @-tag), it is removed from the reading line, and if it is the only or last surviving MAPped tag, the whole reading line will be removed.

(b) SELECT

Selects a reading if it contains a TARGETed tag. In practice, selection is equivalent to a removal of all other readings. In the case of an @-tag target, the reading line is cleared of all other @-tags.
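
To make the difference concrete, consider the "como" cohort with the two readings "comer" V PR 1S IND and "como" ADV. The two rules sketched below would both leave only the ADV reading. The context condition is an assumption made up for this illustration and is not taken from a real grammar.

```cg
"<como>" REMOVE (V) IF (1C (VFIN)) ;  # discards every reading of "como" that contains V
"<como>" SELECT (ADV) IF (1C (VFIN)) ; # keeps readings containing ADV and discards all others
```

With only two readings in the cohort the results coincide. With three or more readings, REMOVE discards only the targeted readings, whereas SELECT discards everything except the targeted ones.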

WORDFORM:

Optional part of a rule, restricting the rule to the wordform in question. Since the operation is case sensitive, preprocessing (lowercasing) is necessary, if a rule targeting e.g. an English noun also is to apply if the noun occurs in sentence-initial position. VISL grammars use lowercasing of initials, storing the uppercase information as a tag (<*>) instead.

WORDFORM may be a set of word forms, but the set must not include other tag types. Otherwise, the WORDFORM condition works like a context condition for position 0 (self).

TARGET:

Obligatory part of a rule. A target is always a set, either a predefined set from the SETS section, or a tag string defined as a set on the fly by using parentheses, e.g. NOMINAL (defined by LIST NOMINAL = N ADJ PCP) or (N) or (N F P). Using predefined sets as targets effectively fuses what in the cg-1 formalism was a same-context batch of multiple rules into one rule:

SELECT NOMINAL IF (-1C DET) ;

(same as 3 rules targeting (N), (ADJ) and (PCP) separately).

CONTEXT:

One or more contexts can be used, but (heuristic) rules without any context are allowed, too. Each context is enclosed in parentheses. Contexts are applied as AND-linked conditions, i.e. all conditions of a given rule must be true ("instantiated") for the rule to apply. A context condition may contain the following elements:

(1) An obligatory position marker, consisting of a number indicating relative distance in tokens. The default (positive number) is a right context, while a negative number indicates a left context. A context can be negated by using NOT in front of the position marker. An asterisk (*) prefixed to the position marker number means "unbounded context". In this case, the context condition is tested all the way to the left (-) or right (+) sentence boundary, even if the context search should cross the TARGET position (position 0). A positive unbounded context condition is instantiated at the closest possible position, unless a double asterisk (**) is used, which will allow instantiation at the second or a later occurrence. Later instantiation is relevant only in the presence of LINKed contexts (which might not be true of the first occurrence of the original condition, but true of a later one). An at-sign (@) in front of a position number means absolute context, e.g. @1 for the first token/cohort, @2 for the second, and @-2 for the second-but-last token/cohort in the sentence.
(2) An obligatory context condition, consisting of a (position-restricted) set (or set-ified tags or tag sequences). As elsewhere, sets may be combined by set operators: OR (union), + (concatenation in one and the same reading) or AND (intersection, both tags in the same cohort). A C (careful) marker attached to the position number means that the context condition has to be a safe (i.e. the only) reading of the cohort in question. For instance, (-1C N) denotes an unambiguous noun one position to the left (i.e. left adjacent). A word with both a noun (N) and a verb (V) reading in this position would not fulfill the context condition.
(3) An optional linked context, where the word LINK chains 2 contexts (within the same context parenthesis). The second, linked context condition is written in the same fashion as the first one, but its relative position is calculated from the instantiated first context rather than from the rule target. In other words, each LINK resets the context position to 0. In this way, it is possible to create arbitrarily long chains of LINKed context conditions. In practice, all links in a chain point to the same side (i.e. either right or left), but in theory, a change of direction is allowed.
(4) An optional barrier context, where the word BARRIER is used right after an unbounded context (*-context). A barrier context blocks the preceding context search if the barrier condition is instantiated before the unbounded context can be instantiated. As usual, barrier contexts may consist of sets, set-ified tags or set combinations, but do not need a position marker. For instance, (*1 VFIN BARRIER CLB) looks for a finite verb (VFIN) anywhere to the right (*1), but only if there is no interfering clause boundary (CLB) in between. A subordinator or comma would thus block further VFIN-searching.
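
The context elements described above can be sketched in a handful of rules. These rules are illustrative assumptions rather than rules from an existing grammar. CLB is assumed to be a predefined clause boundary set, as in the sample rules file, and all other tags are set-ified on the fly.

```cg
SELECT (N) IF (-1C (DET)) ; # relative position, careful: an unambiguous determiner directly to the left
REMOVE (VFIN) IF (NOT -1 (KS)) (*1 (VFIN) BARRIER CLB) ; # two AND-linked contexts, the first negated, the second unbounded with a barrier
REMOVE (V) IF (*1 (N) LINK 1 (VFIN)) ; # the closest noun to the right must be directly followed by a finite verb
REMOVE (V) IF (**1 (N) LINK 1 (VFIN)) ; # any noun to the right may satisfy the linked condition
SELECT (ADV) IF (@1 (KS)) ; # absolute position: the first cohort of the sentence has a subordinator reading
```

The third and fourth rules differ only in the asterisks. With a single asterisk the search stops at the first noun found, so the LINKed condition is tested only there. With a double asterisk later nouns are tried as well.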

3.5.Mappings

A MAPPING-rule has the following general layout:

OPERATION (MAPTAG-1 MAPTAG-2 ...) (TARGET) IF (CONTEXT-1) ... (CONTEXT-n)

Mapping rules add tags to a cohort line (i.e. reading), if that line contains a certain TARGET and if certain optional CONTEXTs are fulfilled. Context conditions are expressed as in the CONSTRAINT section, and sets are used and constructed in the usual way. Any kind of tag may be added. However, only mapped tags with a special mapping-prefix (by default, @) will be treated as real @tags. @tags are traditionally syntactic tags, added and disambiguated on one cohort line (itself representing a PoS/inflexion reading). During disambiguation, @tags will be cut down to the last reading. If there is only one reading in the cohort, this last @tag is untouchable, otherwise the whole reading line dies together with its last @tag. When calling a grammar with vislcg, the @-prefix may be changed by using the --prefix='...' flag.

The following OPERATIONs are allowed in mapping rules:

MAP: This is the general mapping operator. It is a feature of the special @tags, that MAP rules cannot apply if the targeted cohort line already contains one or more @tags (from an earlier MAP rule or the lexicon). Thus, if ambiguity is desired, the @tags in question have to be MAPped at the same time (i.e. by the same rule). In order to allow further mapping, ADD rules have to be used instead of MAP rules.
ADD: Mapping of @tags is performed independently of the presence of other @tags on the cohort line. Thus, @-mapping may continue until a MAP rule "closes" the @tag-list for a given cohort line.
REPLACE: This is a CG-2 operator deprecated in vislcg in favour of the more powerful SUBSTITUTE operator. REPLACE deletes all tags but the first one (normally the lexeme tag), and adds the mapped tags instead.

Unlike constraint rules, mapping-rules are applied in exactly the order they are given in. Mapping rules in the same grammar (section?) cannot use earlier mapped tags as contexts.
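
The interplay of MAP and ADD might be sketched as follows. The first two variants are alternatives, not rules meant to be used together. The contexts are simplified assumptions, and CLB is assumed to be a predefined clause boundary set, as in the sample rules file.

```cg
# Variant 1: one MAP rule maps both tags at the same time, so the ambiguity survives
MAP (@SUBJ> @ACC>) TARGET (PROP) IF (*1C (VFIN) BARRIER CLB) ;

# Variant 2: two ADD rules build up the same ambiguity step by step
ADD (@SUBJ>) TARGET (PROP) IF (*1C (VFIN) BARRIER CLB) ;
ADD (@ACC>) TARGET (PROP) IF (*1C (VFIN) BARRIER CLB) ;

# A later MAP rule only reaches PROP lines that still carry no @tag
MAP (@<SUBJ) TARGET (PROP) IF (*-1C (VFIN) BARRIER CLB) ;
```

Because mapping rules are applied in the order given, the final MAP rule acts as a fallback for proper nouns that none of the earlier rules has tagged.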

3.6.Corrections

Correction rules are used to correct faulty input - for instance from a probabilistic tagger, or in a spell checker - by replacing tags with other tags. Deletion can be handled by nil-replacements, and insertion by replacing a tag with an appended version containing also the new, inserted tag.

The general shape of a correction rule is the following:

SUBSTITUTE (TAG-1) (TAG-2) TARGET (TAG-3) IF (CONTEXT-1) ... (CONTEXT-n)

Here, TAG-1 is replaced with TAG-2 in cohort lines that contain the target tag TAG-3, with (optional) context conditions structured in the usual fashion. As usual, on-the-fly sets (as in the example) can be used on par with predefined or combined sets.
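
Two hedged examples of what correction rules could look like, following the general shape given above. The contexts are invented for illustration, and <heur> is a hypothetical secondary tag. Deletion by nil-replacement is not shown, since its exact notation is not given in this how-to.

```cg
SUBSTITUTE (V) (N) TARGET (V) IF (-1C (DET)) ; # replacement: retag a verb reading as a noun after a safe determiner
SUBSTITUTE (PROP) (<heur> PROP) TARGET (PROP) IF (-1 >>>) ; # insertion: PROP is replaced with a version that also carries <heur>
```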

4.Sample rules file

DELIMITERS = "<.>" "<!>" "<?>" ; # sentence window

SETS

LIST NOMINAL = N PROP ADJ PCP ; # nominals, i.e. potential nominal heads

LIST PRE-N = DET ADJ PCP ; # prenominals

LIST P = P S/P ; # plural

SET PRE-N-P = PRE-N + P ; # plural prenominals, equivalent to (DET P) (DET S/P) (ADJ P) (ADJ S/P) (PCP P) (PCP S/P)

LIST CLB = "<,>" KS (ADV <rel>) (ADV <interr>) ; # clause boundaries

LIST ALL = N PROP ADJ DET PERS SPEC ADV V PRP KS KC IN ; # all word classes

LIST V-SPEAK = "dizer" "falar" "propor" ; # speech verbs

LIST @MV = @FMV @IMV ; # main verbs

CONSTRAINTS

REMOVE (N S) IF (-1C PRE-N-P) ; # remove a singular noun reading if there is a safe plural prenominal directly to the left.

REMOVE NOMINAL IF (NOT 0 P) (-1C (DET) + P) ; # remove a nominal if it isn't plural but preceded by a safe plural determiner.

REMOVE (VFIN) IF (*1 VFIN BARRIER CLB OR (KC) LINK *1 VFIN BARRIER CLB OR (KC)) ; # remove a finite verb reading if there are two more finite verbs to the right, neither of them barred by a clause boundary (CLB) or coordinating conjunction (KC).

"<que>" SELECT (KS) (*-1 V-SPEAK BARRIER ALL - (ADV)) ; # select the conjunction reading for the word form 'que', if there is a speech verb to the left with nothing but adverbs in between.

MAPPINGS

MAP (@SUBJ> @ACC>) TARGET (PROP) IF (*1C VFIN BARRIER ALL - (ADV)) (NOT -1 PROP OR PRP) (NOT *-1 VFIN) ; # a proper noun can be either forward subject or forward direct object, if a finite verb follows to the right with nothing but adverbs in between, provided there is no proper noun or preposition directly to the left, and no finite verb anywhere to the left.

CONSTRAINTS

REMOVE (@SUBJ>) IF (*1 @MV BARRIER CLB LINK *1C @<SUBJ BARRIER @MV) ; # remove a forward subject if there is a safe backward subject to the right with only one main verb in between