Chapter 7. Input/Output Stream Format

Table of Contents

Apertium Format
HFST/XFST Format
VISL CG Format
Niceline CG Format
Plain Text Format

The cg-conv tool converts between various stream formats.

Apertium Format

The cg-proc front-end processes the Apertium stream format directly. Alternatively, Apertium input can be converted for use via cg-conv.

HFST/XFST Format

HFST/XFST input can be converted for use via cg-conv.

VISL CG Format

The VISL CG stream format is a verticalized list of word forms with readings and optional plain text in between. For example, the sentence "They went to the zoo to look at the bear." would look something like this in VISL CG format:

        "<They>"
            "they" <*> PRON PERS NOM PL3 SUBJ
        "<went>"
            "go" V PAST VFIN
        "<to>"
            "to" PREP
        "<the>"
            "the" DET CENTRAL ART SG/PL
        "<zoo>"
            "zoo" N NOM SG
        "<to>"
            "to" INFMARK>
        "<look>"
            "look" V INF
        "<at>"
            "at" PREP
        "<the>"
            "the" DET CENTRAL ART SG/PL
        "<bear>"
            "bear" N NOM SG
        "<.>"
      

Or in CG terms:

        "<word form>" static_tags
            "base form" tags
      

Also known as:

        "<surface form>" static_tags
            "lexeme" tags
      

In more formal rules:

  • If the line begins with "< followed by non-quotes and/or escaped quotes followed by >" (regex /^"<(.|\\")*>"/) then it opens a new cohort.

  • If the line begins with whitespace followed by " followed by non-quotes and/or escaped quotes followed by " (regex /^\s+"(.|\\")*"/) then it is parsed as a reading, but only if a cohort is open at the time. Thus, any such lines seen before the first cohort are treated as text.

  • Any line not matching the above is treated as text. Text is handled in two ways: If no cohort is open at the time, then it is output immediately. If a cohort is open, then it is appended to that cohort's buffer and output after the cohort. Note that text between readings will thus be moved to after the readings. Re-arranging cohorts will also re-arrange the text attached to them. Removed cohorts will still output their attached text.

This means that you can embed all kinds of extra information in the stream as long as you don't hit those exact patterns. For example, we use <s id="unique-1234"> </s> tags around sentences to keep track of them for corpus markup.
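The three rules above can be sketched as a small line classifier. This is only an illustration of the described behavior, not the actual CG-3 parser: the function name parse_visl and the dictionary layout are hypothetical, and the patterns follow the prose description (non-quotes and/or escaped quotes), which makes them somewhat stricter than the regexes quoted in the rules.

```python
import re

# Assumed patterns: they follow the prose ("non-quotes and/or escaped quotes"),
# not necessarily the exact expressions used by the real parser.
COHORT_RE = re.compile(r'^"<((?:[^"\\]|\\.)*)>"')
READING_RE = re.compile(r'^\s+"((?:[^"\\]|\\.)*)"')


def parse_visl(lines):
    """Hypothetical helper: return (leading_text, cohorts).

    leading_text holds text seen before any cohort is open (the real tool
    outputs such text immediately). Each cohort carries its own text buffer,
    so text found between readings ends up after the readings.
    """
    leading_text = []
    cohorts = []
    for line in lines:
        line = line.rstrip("\n")
        m = COHORT_RE.match(line)
        if m:
            # Rule 1: the line opens a new cohort.
            cohorts.append({
                "wordform": m.group(1),
                "static_tags": line[m.end():].split(),
                "readings": [],
                "text": [],
            })
            continue
        m = READING_RE.match(line)
        if m and cohorts:
            # Rule 2: a reading, but only while a cohort is open.
            cohorts[-1]["readings"].append(
                {"baseform": m.group(1), "tags": line[m.end():].split()}
            )
        elif cohorts:
            # Rule 3: text attached to the open cohort.
            cohorts[-1]["text"].append(line)
        else:
            # Rule 3: text with no cohort open.
            leading_text.append(line)
    return leading_text, cohorts
```

Because the text buffer travels with its cohort, moving a cohort in this representation also moves the attached text, which mirrors the re-arranging behavior described above.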

Niceline CG Format

Niceline input can be converted for use via cg-conv.

The Niceline format is primarily used in VISL and GrammarSoft chains to make the output more readable. Using the same example as for VISL CG format, that would look like:

        They  [they] <*> PRON PERS NOM PL3 SUBJ
        went  [go] V PAST VFIN
        to    [to] PREP
        the   [the] DET CENTRAL ART SG/PL
        zoo   [zoo] N NOM SG
        to    [to] INFMARK>
        look  [look] V INF
        at    [at] PREP
        the   [the] DET CENTRAL ART SG/PL
        bear  [bear] N NOM SG
        .
      

Or in CG terms:

        word form TAB [base form] tags TAB [base form] tags
        ...or quotes...
        word form TAB "base form" tags TAB "base form" tags
        ...or mixed...
        word form TAB "base form" tags TAB [base form] tags
      

In more formal rules:

  • If the line does not begin with < and contains a tab (\t, 0x09), then it is a cohort. Anything up to the first tab is the word form. Readings are tab delimited, where if the first tag is contained in [] or "" then it is taken as the base form. Tags are otherwise whitespace delimited.

  • Any line not matching the above is treated as text, following the same rules as for VISL CG format. Note that a tab character is required for a line to be a cohort; a word or punctuation mark without the tab will be treated as text.
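As a rough illustration of the two Niceline rules above, the following sketch parses a single line. The name parse_niceline and the returned structure are hypothetical, and the handling of escaped quotes inside a quoted base form is an assumption carried over from the VISL CG rules.

```python
import re

# A base form is the first tag of a reading when wrapped in [] or "".
# Allowing backslash escapes inside quotes is an assumption.
BASEFORM_RE = re.compile(r'^(?:\[([^\]]*)\]|"((?:[^"\\]|\\.)*)")')


def parse_niceline(line):
    """Hypothetical helper: return a cohort dict, or None if the line is text."""
    line = line.rstrip("\n")
    # A cohort line must not begin with < and must contain a tab.
    if line.startswith("<") or "\t" not in line:
        return None
    # Everything up to the first tab is the word form; readings are tab-delimited.
    wordform, *fields = line.split("\t")
    readings = []
    for field in fields:
        field = field.strip()
        if not field:
            continue
        baseform = None
        m = BASEFORM_RE.match(field)
        if m:
            baseform = m.group(1) if m.group(1) is not None else m.group(2)
            field = field[m.end():]
        # Remaining tags are whitespace-delimited.
        readings.append({"baseform": baseform, "tags": field.split()})
    return {"wordform": wordform, "readings": readings}
```

A line such as a lone "." returns None here, matching the note that a token without a tab is treated as text.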

Plain Text Format

Plain text can be tokenized for use via cg-conv. The built-in tokenizer is naive and is included only as a last resort; you should avoid it where possible. Five minutes in any scripting language should give you a much better tokenizer.

The tokenization rules are simple:

  • Split tokens on any kind of whitespace.

  • Split punctuation from the start and end of tokens into tokens. Each punctuation character becomes a separate token.

  • Detect whether the token is ALLUPPER, Firstupper, or MiXeDCaSe and add a tag denoting it.

  • The token then becomes a cohort with one reading with a lower-case variant of the token as base form.
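The four rules above can be approximated in a few lines. This is a sketch under stated assumptions, not the cg-conv implementation: the tag names <allupper>, <firstupper>, and <mixedcase> are hypothetical placeholders, punctuation is limited to ASCII, and a single capital letter such as "I" is arbitrarily classified as Firstupper.

```python
import string

PUNCT = set(string.punctuation)  # ASCII punctuation only (an assumption)


def split_punct(chunk):
    """Peel punctuation off both ends, one character per token."""
    start, end = 0, len(chunk)
    leading, trailing = [], []
    while start < end and chunk[start] in PUNCT:
        leading.append(chunk[start])
        start += 1
    while end > start and chunk[end - 1] in PUNCT:
        trailing.append(chunk[end - 1])
        end -= 1
    trailing.reverse()
    core = [chunk[start:end]] if start < end else []
    return leading + core + trailing


def case_tag(token):
    """Return a hypothetical case tag, or None for all-lowercase tokens."""
    if token[0].isupper() and not any(c.isupper() for c in token[1:]):
        return "<firstupper>"
    if token.isupper():
        return "<allupper>"
    if any(c.isupper() for c in token):
        return "<mixedcase>"
    return None


def tokenize(text):
    """Return one cohort per token, each with a single lower-cased reading."""
    cohorts = []
    for chunk in text.split():  # split on any kind of whitespace
        for token in split_punct(chunk):
            tag = case_tag(token)
            cohorts.append({
                "wordform": token,
                "readings": [{
                    "baseform": token.lower(),
                    "tags": [tag] if tag else [],
                }],
            })
    return cohorts
```

Note that punctuation inside a token, as in "don't", is left untouched, since only the start and end of a token are split.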