University of Southern Denmark
Visual Interactive Syntax Learning (VISL): Corpus Linguistics

Corpus search interface

VISL's grammatical and NLP research is largely corpus based. On the one hand, VISL develops taggers, parsers and computational lexica based on corpus data; on the other hand, these tools, once functional, are used for the grammatical annotation of large running-text corpora, often with or for external partners (project list 1999-2009). The main methodological approach for automatic corpus annotation is Constraint Grammar (CG), a word-based annotation method. Hybrid systems, making use of both function-based phrase structure and dependency grammar, are used to create syntactic treebanks from CG output. VISL is involved in many aspects of corpus linguistics:
  • Corpus compilation
  • Automatic corpus annotation
  • Manual linguistic corpus revision
  • Providing internet access for searching corpora (CorpusEye)
  • Language-specific, corpus-based linguistic research

The following is an overview of ongoing or concluded corpus annotation projects in VISL's research languages, with overall corpus size given in millions of words: Danish (160M), English (334M), Esperanto (19M), Estonian (<1M), French (71M), German (99M), Italian (19M), Norwegian (31M), Portuguese (257M), Romanian (21M), Spanish (53M), Swedish (85M). Below the tables, a number of relevant publications are listed and linked for download.

| Language | Corpus | Type | Size (words) | Grammatical annotation | Manual revision | Partners/Projects |
|---|---|---|---|---|---|---|
| Danish | Corpus90/2000 | News text, prose | 2 x 26 Million | PoS, morphology, syntax, CG-dep. | 400,000 words | DSL |
| Danish | Arboretum | News text, prose | 10 Million | Treebank (dep. & psg, TIGER-compatible) | 400,000 words | Nordic Treebank Network |
| Danish | Information | News text | 80 Million | PoS, morphology, syntax, CG-dep. | - | Dagbladet Information |
| Danish | Europarl-da | Parliamentary debates | 21 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Danish | Wikipedia-da | Encyclopedia | 3.7 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Danish | dfk-Skalk | Journal of Archeology | 600,000 | PoS, morphology, syntax, CG-dep. | - | Skalk |
| Danish | dfk-folketing | Parliamentary debates | 7 Million | PoS, morphology, syntax, CG-dep. | - | Source: Folketing |
| Portuguese | Floresta Sintá(c)tica | Newspaper | 1 Million | Treebank (TIGER-compatible) | 185,000+ words | Linguateca |
| Portuguese | CETEMPúblico | Portuguese newspaper | 192 Million | PoS, morphology, syntax, CG-dep. | cp. Floresta sintá(c)tica | AC/DC project, Linguateca, Ref.: Público |
| Portuguese | CETENFolha | Brazilian newspaper | 24 Million | PoS, morphology, syntax, CG-dep. | cp. Floresta sintá(c)tica | AC/DC project, Linguateca, Ref.: Folha de São Paulo |
| Portuguese | Europarl-pt | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Portuguese | Wikipedia-pt | Encyclopedia | 11.3 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Portuguese | Cartas-LR | Historical letters to/by the editor | 200,000 | PoS, morphology, syntax, treebank | 10,000 words | Ref.: Projeto para a História do Português Brasileiro |
| Portuguese | Various | Dialectal speech data, Historical Portuguese | 70,000 | PoS, morphology, syntax, CG-dep. | - | (1) The CORDIAL-SIN project (2) The Tycho Brahe Project |

| Language | Corpus | Type | Size (words) | Grammatical annotation | Manual revision | Partners/Projects |
|---|---|---|---|---|---|---|
| French | Arboratoire/Freebank | News text, prose | 130,000 | PoS, morphology, syntax, CG-dep. | 30,000 words | ATILF |
| French | ECI-FR1 | Newspaper | 4.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Le Monde, ECI/EACL |
| French | Europarl-fr | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| French | Wikipedia-fr | Encyclopedia | 37.8 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| German | ECI-DE1 | Newspaper (Frankfurter Rundschau) | 34 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Frankfurter Rundschau, ECI/EACL |
| German | BZK-tag | Newspaper | 4 Million | PoS, morphology, syntax, CG-dep. | - | Bonner Zeitungskorpus |
| German | MAK-tag | Newspaper | 3 Million | PoS, morphology, syntax, CG-dep. | - | Mannheimer Korpus |
| German | Europarl-de | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| German | Wikipedia-de | Encyclopedia | 28.7 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| English | BNC-tag | News text, prose | 35 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: British National Corpus |
| English | Europarl-en | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| English | Wikipedia-en | Encyclopedia | 115.1 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| English | KEMPE | Early modern play texts | 8.9 Million | PoS, morphology, syntax, CG-dep. | - | Lene Petersen, University of the West of England |
| English | Chat corpus | Chat logs 2002-2004 | 23.5 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Project JJ |
| English | Enron corpus | E-mails | 83 Million | PoS, morphology, syntax, CG-dep. | - | History & credits |

| Language | Corpus | Type | Size (words) | Grammatical annotation | Manual revision | Partners/Projects |
|---|---|---|---|---|---|---|
| Swedish | Göteborgsposten | Newspaper (1992-2003) | 1.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Göteborgsposten |
| Swedish | Europarl-sv | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Swedish | Leipzig-sv | Internet Corpus | 2.0 Million | PoS, morphology, syntax, CG-dep. | - | Source: Leipzig Corpus Collection |
| Norwegian | Wikipedia-no | Wikipedia | 26 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Norwegian | Leipzig-no | Internet Corpus | 4.65 Million | PoS, morphology, syntax, CG-dep. | - | Source: Leipzig Corpus Collection |
| Spanish | ECI-ES2 | Newspaper | 1.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: El Diario Sur, ECI/EACL |
| Spanish | Europarl-es | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Spanish | Wikipedia-es | Encyclopedia | 22.3 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Esperanto | Monato | News magazine | 2 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Monato |
| Esperanto | Eventoj | Electronic newsletter | 1.6 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Eventoj |
| Esperanto | Wikipedia-eo | Encyclopedia | 3.2 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Esperanto | Elibrejo | Literature | 7 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: eLibrejo |
| Esperanto | Zamenhof | Esperanto Classics | 1.5 Million | PoS, morphology, syntax, CG-dep. | - | - |
| Esperanto | TTT | Internet | 3.6 Million | PoS, morphology, syntax, CG-dep. | - | - |
| Italian | Wikipedia-it | Encyclopedia | 18.9 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Romanian | Adevarul | Business news (1998-2005) | 18.9 Million | PoS, morphology, syntax | - | Source: Adevarul Economic |
| Estonian | Arborest | News text, prose | 3,500 | Treebank (TIGER-compatible) | at CG-level | Nordic Treebank Network, Ref.: CG Annotated corpus of Estonian |
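
Most of the corpora listed above carry Constraint Grammar annotation. The general idea behind CG-style disambiguation can be sketched as follows: each token starts out with all of its possible readings, and context-sensitive rules remove readings that do not fit the neighbouring words. The sketch below is a toy illustration only. The `Cohort` class, the `remove_if_previous` helper, the tag names and both rules are hypothetical and are not taken from VISL's grammars or from any CG compiler.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """One token together with its set of candidate part-of-speech readings."""
    word: str
    readings: set[str]


def remove_if_previous(cohorts: list[Cohort], target: str, prev_tag: str) -> None:
    """Toy rule: drop reading `target` from a token when the preceding token
    is unambiguously `prev_tag`. A token's last remaining reading is never
    removed, so every token keeps at least one analysis."""
    for prev, cur in zip(cohorts, cohorts[1:]):
        if (
            prev.readings == {prev_tag}
            and target in cur.readings
            and len(cur.readings) > 1
        ):
            cur.readings.discard(target)


# Hypothetical input: "walk" and "ends" are each ambiguous between noun and verb.
sentence = [
    Cohort("the", {"DET"}),
    Cohort("walk", {"N", "V"}),
    Cohort("ends", {"N", "V"}),
]

# Illustrative rules only: no verb directly after a determiner,
# and (for this toy sentence) no noun directly after an unambiguous noun.
remove_if_previous(sentence, target="V", prev_tag="DET")
remove_if_previous(sentence, target="N", prev_tag="N")

for cohort in sentence:
    print(cohort.word, sorted(cohort.readings))
```

The second rule only fires because the first one has already disambiguated "walk". This ordering effect, where one rule's output provides safe context for the next, is typical of rule-based reductionist tagging in general.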

    Some relevant publications on the VISL corpora, the Constraint Grammar and Treebank annotation schemes and parsers, the CorpusEye search interface etc.:

     




    Copyright 1996-2014