Parsing Nordic Languages (PaNoLa)

Project Description

PaNoLa will be devoted to Internet development and applications involving the automatic analysis of Danish, Finnish, Norwegian, and Swedish based on the Constraint Grammar (CG) formalism, with a view to contributing to distance education regarding the nature, structure, and use of these four Nordic languages.

Though the present application is geared to a time span of only two years, this will be sufficient, we believe, to put the project on a very solid footing and at the same time produce important and highly visible results — making new information about Nordic languages and language structure freely available, via the Internet, to the international community. This optimistic view is due to the fact that the project will build upon existing systems in two important respects:

  1. For each of the four languages in question, a solid Constraint Grammar basis already exists.
  2. The Internet interface for making use of these grammars is already in place as part of the VISL education infrastructure developed in Odense at the University of Southern Denmark over the past five years.

VISL, which stands for "Visual Interactive Syntax Learning", has received financial support from various Danish government institutions since 1996. During that time, VISL has developed a wide range of teaching, learning, and research tools which are freely available to the world community over the Internet (URL: edu.visl.dk).

The goal of PaNoLa is to enhance the Nordic element within the VISL system. Currently, Danish is the only Nordic language among the fifteen VISL languages — which consist of Arabic, Bosnian, Danish, Dutch, English, Esperanto, French, German, Greek, Italian, Japanese, Latin, Portuguese, Russian, and Spanish.

To this end it is important to link up with research communities in Norway, Sweden, and Finland which also work within the Constraint Grammar paradigm. This can be done by combining the efforts of the four scholars named in this application: Eckhard Bick (University of Southern Denmark), Janne Bondi Johannessen (University of Oslo), Fred Karlsson (University of Helsinki), and Torbjörn Lager (Uppsala University).

Through the joint efforts of these four scholars, existing CG-systems (some portions of which are available from the Finnish firm Lingsoft) can be enhanced and co-ordinated to yield a powerful unified electronic education and research network for these four Nordic languages.

The importance of developing computational tools for language learning and language processing for Nordic languages is reflected in the joint decision by the Council of Europe, the European Union and UNESCO to declare 2001 the European Year of Languages. The Norwegian coordinator Arne Aarseth explains the goal this way (translated from Norwegian): "The purpose of the Year of Languages is to create awareness of, and increase insight into, Europe's linguistic diversity. The aim, through a range of measures, is to motivate all European citizens to learn languages, preferably with an emphasis on the so-called least-used ones. One of these means is a greater emphasis on lifelong language learning."

About the participants and the participating institutions

As can be seen from the descriptions below, all four scholars participating in PaNoLa are well-versed in the CG formalism and have already contributed to the development of CG-systems for their respective languages.

Denmark

The project leader, Dr. Eckhard Bick, is Senior Researcher in the Institute of Language and Communication at the University of Southern Denmark. He has a cand.med. degree from the University of Bonn (1984), a cand.mag. degree in Nordic languages/literature and Portuguese from Aarhus University, Denmark (1993) and a dr.phil. in Linguistics from the same university (2000). His dissertation is entitled The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Since 1996 he has been head of the VISL project at the University of Southern Denmark. Over the past five years he has worked on enhancing his Portuguese CG, written preliminary Spanish and French CGs, upgraded the English CG which the VISL project purchased from Lingsoft, and, for the past year, been engaged in writing a CG-system for Danish.

The chairman of the Institute of Language and Communication has pledged his support to PaNoLa. In particular, he has agreed to make available the necessary working space and computer facilities for the project. In addition, the Dean of Humanities has agreed to contribute DKK 25,000 to the upgrading of the server park to help support PaNoLa activities.

Sweden

Dr. Torbjörn Lager is a member of the Department of Linguistics at Uppsala University. His research in tagging and parsing of Swedish is related to his interest in rule-based taggers in general — an interest and expertise which subsumes Constraint Grammar tagging. He has considerable experience in the design and implementation of software tools for learning and revising rule-based taggers based on input from tagged corpora. This has resulted in a system for generalized Transformation-Based Learning — the MUTBL system — in which rule-based taggers of different kinds can be learned, among them CG-style taggers. A description of the system can be found at the following address: http://stp.ling.uu.se/~torbjorn/mutbl.html. His goals within PaNoLa will be to further refine the MUTBL system, to build an interactive development environment for Constraint Grammars on top of it, and to use this environment to support the development of a sizable Constraint Grammar for Swedish, parts of which already exist.

Norway

Professor Janne Bondi Johannessen is manager of the Text Laboratory at the Department of Linguistics, University of Oslo. She has been in charge of a project for developing a Constraint Grammar tagger for Norwegian (Bokmål and Nynorsk) at the Text Lab. This tagger has achieved very good results, and has been used to tag the Oslo Corpus, which has become very popular among scholars and students of the Norwegian language around the world. The Bokmål corpus contains roughly 18,500,000 words, the Nynorsk corpus about 3,800,000 words (see http://www.tekstlab.uio.no/norsk/bokmaal). The staff at the Text Lab (not least Kristin Hagen, who has been instrumental in developing the Norwegian Constraint Grammar tagger) have solid experience and expertise in CG, and welcome the opportunity PaNoLa will provide for improving the tagger and modifying it in ways which will considerably increase both its range of uses and number of users.

The Linguistics Department at the University of Oslo is willing to support PaNoLa by providing working space and computer facilities. In addition, the Text Laboratory can provide an assistant to the project, worth NOK 30,000. The goals of PaNoLa also align well with those of a newly started project in the Linguistics Department, which aims at making the introductory linguistics course (obligatory for all language students) available for distance learning and teaching. The tools that will be developed by PaNoLa will be directly usable by that project.

Finland

Fred Karlsson is Professor of General Linguistics at the University of Helsinki and Dean of the Faculty of Humanities. He is the inventor of the Constraint Grammar Formalism, which he defined in 1990. He programmed the first full-scale Constraint Grammar engine. Fred Karlsson has designed a morphological analyzer for Swedish (SWETWOL) and a Constraint Grammar for Finnish. He was Director of the Research Unit for Computational Linguistics (1985-1994) and the Research Unit for Multilingual Language Technology (1995-1999), both at the University of Helsinki.

The Department of General Linguistics at the University of Helsinki and Lingsoft, Inc. will support the project with a sum amounting to FIM 20,000.

Distance learning tools

From the point of view of application programs, PaNoLa will tap into and enhance existing VISL technology, which already provides a wide range of tools and services for language learning, language teaching and language research. These tools and services are all freely available to the world community via the Internet. The basis for these services is an ongoing implementation and enhancement of a wide range of language modules at the morphological, syntactic and semantic levels — supported by the development and maintenance of the necessary lexicographic databases for each language. The interaction of these modules has already resulted in a number of concrete applications, primary among which are:

  1. Internet-based grammar teaching interface. This is currently highly functional at the university level, and is under development for primary and secondary schools. Syntactic and morphological analyses of individual sentences are visually displayed in a variety of formats, according to user specifications. A growing number of linguistic glossaries (terms and definitions) are being made available for different languages.
  2. Interactive grammar games and quizzes, as well as interactive course design and implementation. One of the most recent grammar games to be implemented is the "Paintbox", which allows users as young as 4th graders to learn about and improve their understanding of word classes. In the Paintbox, the user dips the cursor into colors on a palette and then paints individual words according to a color code: nouns are blue, verbs are red, and so on. Since the game is based on automatic CG parsing of free text, the game is fully interactive. Users can type in their own sentences and use the computer as their game-master.
  3. Corpus research facilities and Internet-based search tools for use in areas such as on-line investigation of various linguistic phenomena, monitoring of language change, lexicographic updates, and the testing of linguistic hypotheses. Each of these areas has clear relevance for classroom teaching, and for designing and carrying out research projects at all educational levels.
  4. Morphological, syntactic and semantic annotation of running text for use in such enterprises as man-machine interfaces, intelligent websites, education programs, corpus production, tree-banks, information extraction, and linguistic research. Several projects are currently underway using these services — particularly for Portuguese and English; for example, an immense Portuguese tree-bank is being constructed with the support of the Portuguese government, and attribution studies of Shakespearean authorship are being carried out by two Ph.D. students at Bristol University in England. PaNoLa will make such services available for Nordic languages as well.
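
The Paintbox game described in item 2 can be illustrated with a minimal sketch in Python. This is not the VISL implementation: the tag names, the function name, and the hand-tagged sample sentence are all assumptions made for illustration, and the tagged sentence merely stands in for the output of a CG parser. Only the two colour assignments (nouns blue, verbs red) come from the description above.

```python
# Colour code from the Paintbox description: nouns are blue, verbs are red.
# The tag names "N" and "V" are hypothetical placeholders for parser tags.
COLOR_CODE = {"N": "blue", "V": "red"}


def score_painting(tagged_words, painted):
    """Compare the user's colours with the colours implied by the tags.

    tagged_words: list of (word, tag) pairs, standing in for parser output.
    painted: dict mapping a word's position to the colour the user chose.
    Returns (word, expected, chosen, correct) for every word whose tag
    has an entry in COLOR_CODE; other words are skipped.
    """
    results = []
    for position, (word, tag) in enumerate(tagged_words):
        expected = COLOR_CODE.get(tag)
        if expected is None:
            continue  # no colour defined for this word class in the sketch
        chosen = painted.get(position)
        results.append((word, expected, chosen, chosen == expected))
    return results


# Hand-tagged sample sentence (an assumption, not real parser output).
sentence = [("dogs", "N"), ("chase", "V"), ("cats", "N")]
# The user painted every word blue, so the verb is marked as wrong.
user_choice = {0: "blue", 1: "blue", 2: "blue"}

for row in score_painting(sentence, user_choice):
    print(row)
```

Because the expected colours are derived from tags rather than stored per sentence, the same check would apply to any sentence the user types in, which is the property the description attributes to automatic parsing of free text.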

Current educational uses

The educational materials and software currently available at the VISL website for fifteen languages are already in worldwide use. Until recently, the users were primarily university teachers and students, and members of adult education classes. However, the tools are increasingly being modified to appeal to younger users, with the result that both secondary and primary school teachers have begun to take an interest in the system and to introduce the tools in their classrooms. For example, an IT-subcommittee under the administrative board for the island of Funen in Denmark contracted last autumn for members of the VISL group to provide a 30-hour course for secondary school teachers of English to upgrade their competence in English sentence analysis using the VISL tools. On an even larger scale, representatives from the Danish Ministry of Education arranged with the VISL group to hold a day-long seminar in Odense on January 11, 2001 for 21 secondary school teachers from all over Denmark to be introduced to VISL's educational software. The group of teachers represented eight different languages currently being taught in Danish secondary schools: Danish, English, French, German, Greek, Latin, Russian, and Spanish. As a result of this seminar, interest in introducing these tools into Danish classrooms at the secondary school level is growing rapidly. The most recent link between VISL and Danish educational institutions was established on July 1, 2001, when a subcommittee under the Danish Ministry of Education provided financial support over a two-year period to help introduce the Danish and English VISL systems to students and staff in Denmark's 54 business schools at the secondary school level (HHX). If PaNoLa is implemented, a strong Nordic language component, integrated into the VISL system, will be made available to users at all educational levels, both inside and outside Scandinavia.

Timetable

The following timetable provides an estimate of the individual milestones and deadlines that make up the project as a whole:

Months 1-6

  • Hold a 3-day organizational and planning seminar in Denmark for all project participants.
  • Work out a contract with Lingsoft for the acquisition of copies of their Swedish and Finnish CG-systems.
  • Begin integrating the existing CG-systems for Norwegian, Swedish, and Finnish into the VISL educational network.
  • Start improving and enhancing the CG-taggers for all four Nordic languages.
  • Adopt a common symbol set for the display of linguistic information for the four languages.
  • Start refining the Transformation-Based Learning system MUTBL, adding to it an interactive module for Constraint Grammars.

Months 7-12

  • Hold a 3-day seminar in Norway for all project participants to begin brainstorming additional language games, quizzes, and other graphical applications, and to establish partial prototypes (programmed solutions) of these new modules for use with the four Nordic languages.
  • Begin work on the databases containing the pre-analyzed sentences for each of the four languages.
  • Test and continue the improvement of the four CG-taggers and the MUTBL-modules.

Months 13-18

  • Hold a 3-day seminar in Finland for all project participants to evaluate the results and achievements of the first year's activities, and to plan the developmental stages for the second year.
  • For those taggers which are far enough advanced by this stage, begin the addition of semantic tags to lexical items with a view to the later incorporation of this information in CG-rules for further syntactic and semantic disambiguation.
  • Begin testing the robustness of the broadband servicing system — at provider, user, and connection line levels.
  • Upgrade the VISL server structure (multiple systems involving a load-balanced server cluster, program optimization, etc.) to deal with the increased service load.
  • Test and upgrade the computer programs for the new applications.
  • Begin the construction of on-line language glossaries for all four languages.

Months 19-24

  • Continue upgrading and extending all four CG-taggers and the MUTBL-modules.
  • Proofread and test all sentences in the four databases.
  • Test and upgrade the programs for all graphical applications.
  • Prepare on-line documentation, user manuals, and help menus for the growing array of PaNoLa tools.
  • Work out suggestions for local hardware configuration solutions for schools, and evaluate possible Internet bottlenecks relating to education networks at all levels.
  • Hold a 3-day seminar in Sweden for all project participants, with the goal of planning for the future of the project after the termination of NorFA funding.
  • Inaugurate a public relations phase aimed at informing educational institutions interested in distance learning about the PaNoLa tools, including institutions in countries outside Scandinavia where Nordic languages are taught.

Implementation

It is important to stress that PaNoLa is not a project which will start from scratch, nor is it a project that will end when NorFA funding ends. Since CG-taggers already exist, in varying degrees of readiness, for each of the four Nordic languages, and since the VISL educational infrastructure is already in place on the Internet, the integration of CG-taggers for Nordic languages within the VISL framework will be a vital and ongoing process. NorFA funding will initiate the production of new education software for Danish, Finnish, Norwegian and Swedish, making it possible to have a rich and varied selection of new Nordic language materials freely available via the Internet by January 2003 — not only to Scandinavian users of all ages, but to the world community at large.

 

Budget (January 1, 2002 to December 31, 2003)

Budget item                                 Amount (NOK)
1) For each language group:
     Danish                                      190,000
     Finnish                                     190,000
     Norwegian                                   190,000
     Swedish                                     190,000
2) System development at the VISL site           156,000
3) Secretarial help (project co-ordinator)        70,000
4) Materials (server)                             60,000
5) Purchases from Lingsoft                        81,000
6) Travel                                        104,000
Subtotal                                       1,231,000
7) Overhead (10%)                                123,100
Total NorFA budget                             1,354,100

Additional support has been promised from the following institutions:

  • Dean of the Humanities, University of Southern Denmark: 25,000 DKK
  • Linguistics Department, University of Helsinki, and Lingsoft, Finland: 20,000 FIM
  • The Text Laboratory, University of Oslo, Norway: 30,000 NOK


Budgetary notes:

  1. The figure of 190,000 NOK corresponds to the Norwegian salary level 44 (including social fees) for one person working full-time over a six-month period. Each language group can distribute this sum as it sees fit. For example, a researcher could be hired part-time over the full two years, or hired full-time for two three-month periods (one period for each year of the project).
  2. System development at the VISL site involves design, implementation and maintenance of the website, as well as software development for maximizing user-friendly access to the Nordic language materials. The figure of 156,000 NOK is arrived at as follows: 650 hrs/year over two years = 1,300 hrs x 120 NOK/hr = 156,000 NOK.
  3. Secretarial help and project co-ordination (Anette Wulff, SDU-Odense): This involves, among other activities, a) the maintenance of smooth and continued communication among all members of the four Nordic language groups, b) planning and arranging on-site visits and seminars which will bring the project members together at strategic periods during the project, c) documentation, record-keeping and writing of minutes in connection with project meetings, d) coordinating system development at the VISL site, e) handling payrolls and other expenses, and f) overseeing the fulfillment of project deadlines and commitments. The figure of 70,000 NOK is arrived at as follows: 140 hrs/year over two years = 280 hrs x 250 NOK/hr = 70,000 NOK.
  4. Server: The Institute of Language and Communication at SDU-Odense will put its IT-Center and its server at the disposal of the project, but the addition of new services and users from all the Nordic countries will necessitate increasing both the speed and the capacity of the system. This will cost about 80,000 DKK. The Dean of the Humanities at SDU-Odense has offered to contribute 25,000 DKK toward this goal (see under "Additional support" in the budget), leaving 55,000 DKK, or approximately 60,000 NOK.
  5. Lingsoft: 81,000 NOK will be used to purchase existing CG-software from Lingsoft and other commercial suppliers. In particular, Lingsoft's systems for Swedish and Finnish will provide solid bases for the development of educational tools in these two languages.
  6. Travel: During the course of the two-year project, we have planned four 3-day seminars, one in each of the participating Nordic countries. For each seminar we have allotted 26,000 NOK, which will cover most of the travel and accommodation expenses for the participants. 26,000 x 4 = 104,000 NOK.
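
The arithmetic behind the budget can be checked with a short script. The following is a minimal sketch in Python; the variable names are illustrative choices, while every figure is copied from the budget table and the budgetary notes. The server and Lingsoft amounts are entered as stated, since the notes give them as totals rather than as products.

```python
# All amounts are in NOK and are taken from the budget table and notes.
LANGUAGE_GROUPS = ["Danish", "Finnish", "Norwegian", "Swedish"]
PER_LANGUAGE_GROUP = 190_000          # note 1: six months full-time per group

language_groups = len(LANGUAGE_GROUPS) * PER_LANGUAGE_GROUP
system_development = 650 * 2 * 120    # note 2: 650 hrs/year, 2 years, 120 NOK/hr
secretarial_help = 140 * 2 * 250      # note 3: 140 hrs/year, 2 years, 250 NOK/hr
server = 60_000                       # note 4: stated remainder after the Dean's contribution
lingsoft = 81_000                     # note 5: stated purchase amount
travel = 4 * 26_000                   # note 6: four seminars at 26,000 NOK each

subtotal = (language_groups + system_development + secretarial_help
            + server + lingsoft + travel)
overhead = subtotal * 10 // 100       # item 7: 10% overhead, integer arithmetic
total = subtotal + overhead

print(f"Subtotal: {subtotal:,} NOK")
print(f"Overhead: {overhead:,} NOK")
print(f"Total:    {total:,} NOK")
```

Run as written, the script reproduces the subtotal of 1,231,000 NOK, the overhead of 123,100 NOK, and the total of 1,354,100 NOK given in the table.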

     

     

     

Information about the applicants

Denmark

Eckhard Bick (lead applicant)
Institute of Language and Communication
University of Southern Denmark, Odense Campus
Campusvej 55
5230 Odense M
Denmark
phone: (45) 86 28 35 24
fax: 86 28 13 97
e-mail: lineb@hum.au.dk

Finland

Fred Karlsson
Department of General Linguistics
University of Helsinki
P.O. Box 4
Finland
phone: (358) 9 19 12 35 12
e-mail: fkarlsso@ling.helsinki.fi

Norway

Janne Bondi Johannessen
The Text Laboratory
Department of Linguistics
University of Oslo
P.O. Box 1102 Blindern
N-0137 Oslo
Norway
phone: (47) 22 85 68 14
e-mail: jannebj@hedda.uio.no

Sweden

Torbjörn Lager
Department of Linguistics
Uppsala University
S-751 20 Uppsala
Sweden
phone: (46) 18 471 7860
e-mail: Torbjorn.Lager@ling.uu.se

Some relevant publications and presentations

Bick, Eckhard. 1996. "Automatic parsing of Portuguese". Proceedings of the Second Workshop on Computational Processing of Written Portuguese. Curitiba, Brazil.

Bick, Eckhard. 1997a. "Dependensstrukturer i Constraint Grammar syntaks for Portugisisk". In: Brøndsted, Tom and Inger Lytje (eds.), Sprog og multimedier. Aalborg Universitetsforlag, Denmark, pp. 39-57.

Bick, Eckhard. 1997b. "Automatisk analyse af portugisisk skriftsprog". In: Jensen, Per Anker, Stig W. Jørgensen and Annette Hørning (eds.), Danske ph.d.-projekter i datalingvistik, formel lingvistik og sprogteknologi. Kolding, Denmark, pp. 22-30.

Bick, Eckhard. 1997c. "Internet-based grammar teaching". In: Christoffersen, Ellen and Bradley Music (eds.), Datalingvistisk Forenings Årsmøde 1997 i Kolding. Proceedings, pp. 86-106.

Bick, Eckhard. 1998. "Structural lexical heuristics in the automatic analysis of Portuguese". In: Maegaard, Bente (ed.), Proceedings of the 11th Nordic Conference on Computational Linguistics (NODALIDA-98). Copenhagen, January 28-29, pp. 44-56.

Bick, Eckhard and Diana Santos. 2000. "Providing Internet access to Portuguese corpora: the AC/DC project". In: Maria Gavrilidou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000). Athens, 31 May - 2 June 2000, pp. 205-210.

Bick, Eckhard. 2000. The parsing system "Palavras": automatic grammatical analysis of Portuguese in a constraint grammar framework. Aarhus University Press, Denmark.

Johannessen, Janne Bondi and Kristin Hagen. 1998. "Disambiguering uten syntaks." In: Faarlund, J. T., B. Mæhlum and T. Nordgård (eds.), MONS 7, pp. 68-79. Novus forlag, Oslo.

Johannessen, Janne Bondi. 1998. "Tagging and the case of pronouns". Computers and the Humanities 32, pp. 1-38.

Johannessen, Janne Bondi and Helge Hauglin. 1998. "An automatic analysis of compounds". In: T. Haukioja (ed.), Papers from the 16th Scandinavian Conference of Linguistics, Turku, Finland (1996), pp. 209-220.

Johannessen, Janne Bondi. 1998. Coordination. Oxford University Press, New York, Oxford.

Johannessen, Janne Bondi, Kristin Hagen and Anders Nøklestad. 2000. "A Constraint-based tagger for Norwegian". In: Lindberg, Carl-Erik and Steffen Nordahl Lund (eds.), 17th Scandinavian Conference of Linguistics, Odense Working Papers in Language and Communication 19, University of Southern Denmark, Odense, Denmark (1998), Vol. 1, pp. 31-47.

Johannessen, Janne Bondi, Kristin Hagen and Anders Nøklestad. 2000. "A web-based advanced and user friendly system: the Oslo corpus of tagged Norwegian texts." In: Gavrilidou, M., G. Carayannis, S. Markantonatou, S. Piperidis and G. Stainhaouer (eds.). Proceedings, Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, pp. 1725-1729.

Johannessen, Janne Bondi. 2001. "Sammensatte ord". Norsk Lingvistisk Tidsskrift, pp. 59-92.

Karlsson, Fred. 1994. "Robust parsing of unconstrained text". In: P. de Haan and N. Oostdijk (eds.), Corpus-based research into language. In honour of Jan Aarts. Rodopi, Amsterdam and Atlanta, pp. 121-142.

Karlsson, Fred, Atro Voutilainen, Juha Heikkilä, and Arto Anttila (eds.). 1995a. Constraint Grammar: a language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin and New York.

Karlsson, Fred. 1995b. "Designing a parser for unrestricted text". In Karlsson et al. (eds.) 1995a, pp. 1-40.

Karlsson, Fred. 1995c. "The formalism and environment of Constraint Grammar Parsing". In Karlsson et al. (eds.) 1995a, pp. 41-88.

Karlsson, Fred and Lauri Karttunen. 1997. "Sub-sentential processing". In: G. B. Varile and A. Zampolli (eds.), Survey of the state of the art in human language technology. Cambridge University Press.

Karlsson, Fred. 2000. Finnish: an essential grammar. 2nd ed. (1st ed. 1999). Routledge, London and New York.

Karlsson, Fred, Even Hovdhaugen, Carol Henriksen and Bengt Sigurd. 2000. The history of linguistics in the Nordic countries. Societas Scientiarum Fennica. Helsinki.

Lager, Torbjörn. 1995. A logical approach to computational corpus linguistics. Doctoral dissertation, University of Göteborg: Department of Linguistics.

Lager, Torbjörn. 1998. "Logic for part of speech tagging and shallow parsing". In: Proceedings of the 11th Nordic Conference on Computational Linguistics (NODALIDA-98), Copenhagen, January 28-29, 1998.

Lager, Torbjörn. 1999. "The µ-TBL system: logic programming tools for Transformation-Based Learning". In: Proceedings of the Third International Workshop on Computational Natural Language Learning (CoNLL-99), Bergen, June 12, 1999.

Lager, Torbjörn. 1999. "µ-TBL lite: a small, extensible Transformation-Based Learner". In: Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), Bergen, June 8-12, 1999.

Lager, Torbjörn and Natalia Zinovjeva. 1999. "Training a dialogue act tagger with the µ-TBL system". In: Proceedings of the Third Swedish Symposium on Multimodal Communication, Linköping University Natural Language Processing Laboratory (NLPLAB), Linköping, October 16-17, 1999.

Lager, Torbjörn. 2000. "A logic programming approach to word expert engineering." In: Proceedings of ACIDCA 2000: Workshop on Corpora and Natural Language Processing. Monastir, Tunisia, March 22-24, 2000, pp. 182-189.

Lager, Torbjörn and Joakim Nivre. 2001. "Part of speech tagging from a logical point of view". In: P. de Groote, G. Morrill and C. Retoré (eds.), Logical Aspects of Computational Linguistics. Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin-Heidelberg-New York.