Eckhard Bick, vislcg how-to 6/2006

Basic how-to for vislcg

1. Command-line usage:
Ordinarily input is piped from a lexicon-based morphological multitagger, but input from probabilistic taggers (Treetagger, TnT, Brill etc.) can also be used, in which case the first rule section typically will be a correction grammar rather than a morphological disambiguation grammar. In order to prevent syntactic rules from interfering with morphological ones (by being run on morphologically not-yet disambiguated input), it is recommended to run vislcg twice - first without, then with syntactic mapping. Finally, disambiguated/tagged output can be piped directly to a file, or processed with layout filters or further grammars in other formalisms (constituent grammar, dependency grammar, field grammar etc.).
Multitagger or other input has to deliver so-called verticalized text, i.e. one token per line, with non-punctuation tokens followed by a cohort of one or more possible analyses, indented, one per line. Conventionally, cohort lines start with the lexeme or base-form (in quotes), followed by word class (PoS) and inflexion tags in upper case. Secondary tags, meant to be used as disambiguation context but not intended for disambiguation themselves, such as subclass, valency and semantic tags, should be placed in <...> brackets between the lexeme and the word class tags:
wordform
    "lexeme-1" <valency> .. <semantics> .. POS-1 INFLEXION
    "lexeme-1" <valency> .. <semantics> .. POS-2 INFLEXION
    "lexeme-2" <valency> .. <semantics> .. POS-3 INFLEXION
    "lexeme-2" <valency> .. <semantics> .. POS-4 INFLEXION

2. The rules file

A vislcg rules file consists of the following sections:
DELIMITERS (1 line, defines sentence boundaries)
SETS (1 or more sections of set definitions, compiled as one)
MAPPINGS (1 section of mapping rules, adding tags at the end of a reading line)
CORRECTIONS (1 section of correction rules, replacing tags anywhere in a reading)
CONSTRAINTS (1 or more sections of REMOVE or SELECT rules, with each section compiled and run separately)
END

Set sections contain LIST definitions of sets, written as lists of ORed tags or tag chains (in parentheses). Once defined, sets may be combined into new sets with a SET definition.

Mapping and Correction sections have MAP/ADD and SUBSTITUTE rules, respectively. These rules are applied in strict sequential order. But while MAP/ADD rules can't "see" in their context conditions what earlier mapping rules have mapped, this is not true of SUBSTITUTE rules, which do interact with the result of previous substitution rules.

Constraint sections will be interpreted as heuristicity batches, with safer rules in the first sections, and more heuristic rules in later sections. Each section is repeated until none of its rules can be instantiated any further (i.e. meet their context conditions); then the next section is run, and the first section is re-run after second-section disambiguation to check for changed contexts. After that, a third section is run, and the earlier ones are re-run, etc. Within one and the same constraint section, rules should be regarded as "simultaneous", since their order may be changed by the compiler for optimisation purposes. However, word form rules will be run first, and SELECT rules (due to their greater disambiguation potential) have priority over REMOVE rules with the same target.

Each set definition or rule is terminated with a semicolon, but can run over several lines. As in several programming languages, the #-symbol marks the rest of a line as a comment.

3. The individual operators

3.2. Delimiters

The vislcg compiler applies rules within a certain context window, defined by delimiters.
Typically, delimiters will be sentence boundary markers (i.e. punctuation), but paragraphs, corpus section markers or even specific stop-words could be used. Rules can refer to the boundaries with the reserved symbols >>> (left boundary) and <<< (right boundary).

DELIMITERS = "<.>" "<!>" "<?>" ;

The example defines a full stop, exclamation mark or question mark as a delimiter. Note that punctuation notation follows wordform notation, with quotes and angle brackets.

3.3. Set definitions

In both their targets and context conditions, CG rules can refer not only to words, lexemes and tags, but also to sets of words, lexemes or tags, or even combinations of these three types. Two kinds of set definitions are used:

(a) LIST set-name =
followed by a list of tags or tag combinations (the latter in parentheses), separated by spaces. The list constitutes the set, and a rule targeting a set is equivalent to a batch of rules targeting each set element separately.

(b) SET set-name =
defining a new set as a mathematical operation on existing sets. Sets used in a SET definition must occur earlier in the grammar. Tags can be used as sets on the fly by enclosing them in parentheses.

A set element can be:
In a SET definition (b), sets can be combined with the following operators:

union: OR or | , e.g. set1 OR set2 OR (tag3) OR (N F S)

concatenation: + , e.g. set1 + set2, yields all possible combinations of the two sets' elements. Thus, a concatenation of LIST set1 = V and LIST set2 = INF GER PCP covers all non-finite verb forms: (V INF) (V GER) (V PCP).

negation: - , e.g. set1 - set2 (set1 but not set2), means set1 as long as the reading in question does not contain elements from set2. Thus, rather than just a removal of set2 elements from the set1 list (i.e. set difference, as used in Tapanainen's cg2), vislcg interprets the minus operation as a kind of NOT condition, so the presence of a set2 element in a reading will block and override the presence of a set1 element in that reading. Thus, (N) - (P) means non-plural nouns. In the upcoming visl-cg3, a clear distinction will be made between negation and set difference.

The + and - operators have precedence over OR. Note that the same operators, as well as the parenthesis convention for creating sets on the fly, can be used in targets and context conditions of REMOVE and SELECT rules in the CONSTRAINTS section.

3.4. Constraints

Constraint rules are ordered in sections, usually in order to separate safer rules (to be used earlier) from more heuristic rules (to be used later). Within one section, rules should be regarded as simultaneous, though REMOVE rules will be used after SELECT rules. One and the same grammar can be run at different levels of heuristicity by using the --sections=n flag when calling vislcg, meaning that only the first (=safest) n constraint sections of the grammar will be used.

A CG rule has the following general form, with [] brackets indicating optional elements:

["<Wordform>"] OPERATION TARGET [[IF] (CONTEXT-1) (CONTEXT-2) ...] ;

OPERATION:

(a) REMOVE
Removes a reading from a cohort, if it contains a TARGETed tag - unless this reading is the last surviving reading.
In the case of a morphological or PoS tag this means that one (entire) reading line, in a cohort of readings for a given token, will be removed - for instance the reading line "comer" V PR 1S IND will be removed from the analysis cohort of "como", if either the V (verb) or PR (present tense) tags are TARGETed by a successful REMOVE rule, leaving the "como" ADV reading to survive. If the target is a MAPped tag (i.e. an @-tag), it is removed from the reading line, and if it is the only or last surviving MAPped tag, the whole reading line will be removed.

(b) SELECT
Selects a reading, if it contains a TARGETed tag. In practice, selection is equivalent to a removal of all other readings. In the case of an @-tag target, the reading line is cleared of all other @-tags.

WORDFORM:
Optional part of a rule, restricting the rule to the wordform in question. Since the operation is case sensitive, preprocessing (lowercasing) is necessary if a rule targeting e.g. an English noun is also to apply when the noun occurs in sentence-initial position. VISL grammars use lowercasing of initials, storing the uppercase information as a tag (<*>) instead. WORDFORM may be a set of wordforms, but the set must not include other tag types. Otherwise, the WORDFORM condition works like a context condition for position 0 (self).

TARGET:
Obligatory part of a rule. A target is always a set, either a predefined set from the SETS section, or a tag string defined as a set on the fly by using parentheses, e.g. NOMINAL (defined by LIST NOMINAL = N ADJ PCP) or (N) or (N F P). Using predefined sets as targets effectively fuses what in the cg-1 formalism was a same-context batch of multiple rules into one rule:

SELECT NOMINAL IF (-1C DET) ;

(same as 3 rules targeting (N), (ADJ) and (PCP) separately).

CONTEXT:
One or more contexts can be used, but (heuristic) rules without any context are allowed, too. Each context is enclosed in parentheses. Contexts are applied as AND-linked conditions, i.e. all conditions of a given rule must be true ("instantiated") for the rule to apply. A context condition may contain the following elements:
3.5. Mappings

A MAPPING rule has the following general layout:

OPERATION (MAPTAG-1 MAPTAG-2 ...) (TARGET) IF (CONTEXT-1) ... (CONTEXT-n)
Mapping rules add tags to a cohort line (i.e. reading), if that line contains a certain TARGET and if certain optional CONTEXTs are fulfilled. Context conditions are expressed as in the CONSTRAINT section, and sets are used and constructed in the usual way. Any kind of tag may be added. However, only mapped tags with a special mapping-prefix (by default, @) will be treated as real @tags. @tags are traditionally syntactic tags, added and disambiguated on one cohort line (itself representing a PoS/inflexion reading). During disambiguation, @tags will be cut down to the last reading. If there is only one reading in the cohort, this last @tag is untouchable, otherwise the whole reading line dies together with its last @tag. When calling a grammar with vislcg, the @-prefix may be changed by using the --prefix='...' flag. The following OPERATIONs are allowed in mapping rules:
Unlike constraint rules, mapping rules are applied in exactly the order in which they are given. Mapping rules in the same grammar (section?) cannot use earlier mapped tags as contexts.

3.6. Corrections

Correction rules are used to correct faulty input - for instance from a probabilistic tagger, or in a spell checker - by replacing tags with other tags. Deletion can be handled by nil-replacements, and insertion by replacing a tag with an appended version that also contains the new, inserted tag.
The general shape of a correction rule is the following:
SUBSTITUTE (TAG-1) (TAG-2) TARGET (TAG-3) IF (CONTEXT-1) ... (CONTEXT-n)
Here, TAG-1 is replaced with TAG-2 in cohort lines that contain the target tag TAG-3, subject to (optional) context conditions structured in the usual fashion. As usual, on-the-fly sets (as in the example) can be used on a par with predefined or combined sets.
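The three uses of a correction rule (replacement, deletion by nil-replacement, insertion by appending) can be pictured with a small Python model. This is only a sketch under simplifying assumptions, not vislcg's implementation: a reading is modelled as a plain list of tag strings, context conditions are ignored, and the function name substitute as well as the tags IMPF and <new> are hypothetical, chosen for illustration.

```python
def substitute(reading, old, new, target):
    """Model of SUBSTITUTE (old) (new) TARGET (target), ignoring contexts.

    reading: list of tag strings for one cohort line (assumption of this sketch).
    old, new, target: lists of tags; new may be empty (nil-replacement).
    """
    # The rule only applies to readings that contain every target tag.
    if not all(tag in reading for tag in target):
        return list(reading)
    n = len(old)
    # Replace the first contiguous occurrence of the old tag sequence.
    for i in range(len(reading) - n + 1):
        if reading[i:i + n] == old:
            return reading[:i] + list(new) + reading[i + n:]
    return list(reading)


reading = ['"comer"', "V", "PR", "1S", "IND"]

# Replacement: PR becomes IMPF (hypothetical tag) in readings containing V.
replaced = substitute(reading, ["PR"], ["IMPF"], ["V"])
# Deletion: IND is replaced with nothing.
deleted = substitute(reading, ["IND"], [], ["V"])
# Insertion: V is replaced with an appended version that also carries <new>.
inserted = substitute(reading, ["V"], ["V", "<new>"], ["V"])
# No effect: the reading does not contain the target tag N.
untouched = substitute(reading, ["PR"], ["IMPF"], ["N"])
```

In this model, insertion is nothing special: it is an ordinary replacement whose new tag list happens to repeat the old tag.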
4. Sample rules file

DELIMITERS = "<.>" "<!>" "<?>" ; # sentence window

SETS

LIST NOMINAL = N PROP ADJ PCP ; # nominals, i.e. potential nominal heads
LIST PRE-N = DET ADJ PCP ; # prenominals
LIST P = P S/P ; # plural
SET PRE-N-P = PRE-N + P ; # plural prenominals, equivalent to (DET P) (DET S/P) (ADJ P) (ADJ S/P) (PCP P) (PCP S/P)
LIST CLB = "<,>" KS (ADV <rel>) (ADV <interr>) ; # clause boundaries
LIST ALL = N PROP ADJ DET PERS SPEC ADV V PRP KS KC IN ; # all word classes
LIST V-SPEAK = "dizer" "falar" "propor" ; # speech verbs
LIST @MV = @FMV @IMV ; # main verbs

CONSTRAINTS

REMOVE (N S) IF (-1C PRE-N-P) ; # remove a singular noun reading if there is a safe plural prenominal directly to the left.

REMOVE NOMINAL IF (NOT 0 P) (-1C (DET) + P) ; # remove a nominal if it isn't plural but preceded by a safe plural determiner.

REMOVE (VFIN) IF (*1 VFIN BARRIER CLB OR (KC) LINK *1 VFIN BARRIER CLB OR (KC)) ; # remove a finite verb reading if there are two more finite verbs to the right,
# none of them barred by a clause boundary (CLB) or coordinating conjunction (KC).

"<que>" SELECT (KS) (*-1 V-SPEAK BARRIER ALL - (ADV)) ; # select the conjunction reading for the word form 'que', if there is a speech verb to the left
# with nothing but adverbs in between.

MAPPINGS

MAP (@SUBJ> @ACC>) TARGET (PROP) IF (*1C VFIN BARRIER ALL - (ADV)) (NOT -1 PROP OR PRP) (NOT *-1 VFIN) ; # a proper noun can be either forward subject or forward direct object,
# if there follows a finite verb to the right with nothing but adverbs in between,
# provided there is no proper noun or preposition directly to the left, and no finite verb anywhere to the left.

CONSTRAINTS

REMOVE (@SUBJ>) IF (*1 @MV BARRIER CLB LINK *1C @<SUBJ BARRIER @MV) ; # remove a forward subject if there is a safe backward subject to the right with only one main verb in between
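The set operators used in the sample file (PRE-N + P, ALL - (ADV)) can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than a description of the vislcg compiler: set elements are modelled as tuples of tags, a reading as a Python set of tags, and the function names concat, matches and minus are hypothetical. The minus function follows the description in section 3.3, where - acts as a NOT condition on the reading rather than as set difference.

```python
from itertools import product


def concat(set1, set2):
    """'+' operator: every combination of a set1 element with a set2 element."""
    return [a + b for a, b in product(set1, set2)]


def matches(reading, tag_set):
    """A reading matches a set if it contains all tags of at least one element."""
    return any(all(tag in reading for tag in element) for element in tag_set)


def minus(reading, set1, set2):
    """'-' operator as a NOT condition: set1 matches and no set2 element is present."""
    return matches(reading, set1) and not matches(reading, set2)


# LIST PRE-N = DET ADJ PCP ;   LIST P = P S/P ;
PRE_N = [("DET",), ("ADJ",), ("PCP",)]
P = [("P",), ("S/P",)]

# SET PRE-N-P = PRE-N + P ;
PRE_N_P = concat(PRE_N, P)

# (N) - (P): non-plural nouns
N = [("N",)]
singular_noun = {"N", "F", "S"}
plural_noun = {"N", "F", "P"}
is_nonplural_sg = minus(singular_noun, N, P)
is_nonplural_pl = minus(plural_noun, N, P)
```

Here PRE_N_P contains the six tag chains listed in the comment on the PRE-N-P definition, and only the singular reading satisfies (N) - (P).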