The Project

The INTERA project rests essentially on two pillars: (1) building an integrated European language resource area by connecting international, national and regional data centres and (2) producing new multilingual language resources.

The first goal involves integrating a critical mass of different types of language resources with the help of metadata descriptions, and interlinking the resulting distributed resource repository with an existing tool repository, so that users can directly start suitable tools on the included resources. INTERA anticipates that this integrated and interlinked metadata description domain will facilitate access to language resources in Europe, help professionals in industry, the eContent business, research and education, and increase the usage of the resources already available.

The second goal addresses the lack of high-quality multilingual resources, especially for less widely spoken languages, including Balkan ones, which are of crucial importance to the development of the eContent business. INTERA goes further by developing exemplary methods for producing such resources in a commercially attractive way.

The Results

For more details, see the main website of the INTERA project: www.elda.org/intera

This site briefly summarizes the essential results of Work Packages 2 and 3.

INTERA Goals

The Integrated European language data Repository Area (INTERA) is an eContent Project based essentially on two pillars:

·        to build an integrated European language resource area by connecting international, national and regional data centres,

·        to produce new multilingual language resources.

Summary of WP2 and WP3 Activities

The INTERA project started in January 2003 and ran for two years. The major achievements of WP2 and WP3 within the lifetime of the project include:

WP2: Integrated Resource Domain

Major Goals

·        Development of a mature IMDI metadata framework consisting of a metadata set, controlled vocabularies, and creation and exploration tools.

·        Creation of a large metadata domain of language resources, now with about 50 participating institutions worldwide in total. Within INTERA, some major data centres contributed to this domain as subcontractors.

Work on Metadata Sets and Tools

·        The new IMDI version was demonstrated and explained at various meetings. As of the end of INTERA, a complete, mature and robust metadata framework has been delivered.

·        The controlled vocabularies were completed.

·        Special profiles were developed for the Sign Language community and for the Dutch Spoken Corpus.

·        The IMDI Editor has been finished and supports all IMDI features.

·        The IMDI Browser has been finished and also supports all IMDI features.

·        The IMDI search component has been finished and also supports all IMDI features.

·        The manuals were extended to contain the full specifications.

·        A system was developed for handling IPR (intellectual property rights) and for turning IPR requirements into access rights efficiently. In this first phase, the access rights management shell operates on a central server (a distributed version will be developed in the coming DAM-LR project). The mechanisms are based on the IMDI metadata framework.

·        A first version of a Tree-Building was developed.
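
The interplay between a metadata set, controlled vocabularies and IPR-derived access rights listed above can be sketched roughly as follows. This is only an illustration in Python: the field names, the vocabulary values, the IPR categories and the user groups are all assumptions made for the example and are not taken from the IMDI specification or from the INTERA access rights shell.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical controlled vocabulary; the real IMDI vocabularies are
# defined in the IMDI specification and are not reproduced here.
MODALITY_VOCAB = {"speech", "text", "sign", "multimodal"}

# Hypothetical mapping from an IPR requirement to the user groups
# that are allowed to access a resource.
IPR_TO_GROUPS = {
    "open": {"anonymous", "registered", "depositor"},
    "research-only": {"registered", "depositor"},
    "restricted": {"depositor"},
}


@dataclass(frozen=True)
class ResourceDescription:
    """Illustrative metadata record for one language resource."""
    name: str
    modality: str
    ipr: str

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is valid."""
        problems = []
        if self.modality not in MODALITY_VOCAB:
            problems.append(f"unknown modality: {self.modality}")
        if self.ipr not in IPR_TO_GROUPS:
            problems.append(f"unknown IPR category: {self.ipr}")
        return problems


def may_access(record: ResourceDescription, user_group: str) -> bool:
    """Derive the access decision from the IPR requirement in the metadata."""
    return user_group in IPR_TO_GROUPS.get(record.ipr, set())


corpus = ResourceDescription(name="Example corpus", modality="speech", ipr="research-only")
```

In this sketch, access rights are never stored separately: they are computed from the IPR value in the metadata record, which mirrors the idea that the mechanisms are based on the metadata framework.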

Creation of IMDI Metadata Domain

The following matrix gives an overview of the resources that are being integrated within the INTERA project, partly through subcontracting:

Partner   Subcontractor   Corpus                       Type               Status
MPI       BAS             Smartkom                     multimodal         integrated
MPI       BAS             Verbmobil and others         speech, text       integrated
MPI       Meertens        Dialect Corpus               speech             integrated
MPI       U Florence      Lablita                      speech text        integrated
MPI       U Florence      CORAL ROM                    Semantics ext      integrated
MPI       -               Dutch Spoken Corpus          speech text        integrated
MPI       -               Gesture corpus               multimodal         integrated
MPI       -               ESF Second Learner Corpus    speech text        integrated
MPI       -               PMOLL Corpus                 speech text        integrated
MPI       -               various others               sign speech text   integrated
USAAR     DFKI            Negra, Tiger                 annotated text     to be integrated
USAAR     CLPP Bulg       HPSG                         treebank           to be integrated
USAAR     U Iasi          1984                         text               to be integrated
LORIA     ATILF           Frantext, etc                text               to be integrated
ELDA      -               catalogue resources          various            integrated
ILC       -               lexica                       various            integrated
ILSP      -               Parallel corpora             various            to be integrated

In total, about 50 institutions worldwide are now known to produce IMDI metadata descriptions and thereby contribute to the searchable domain. The OLAC harvester has about 35 registered metadata providers, one of which is the IMDI domain. Taken together, about 85 institutions make their language resources visible. That is not yet sufficient, but it is a good start. The following institutions produce IMDI metadata descriptions in some form:


Europe

•          ELRA Paris

•          INALF Nancy

•          DFKI Saarbrücken

•          University of Saarland

•          Bavarian Speech Archive Munich

•          Meertens Institute Amsterdam

•          University of Florence

•          ILSP Athens

•          ILC Pisa

•          University of Madrid 

•          Max-Planck-Institute Nijmegen

•          University of Kiel

•          University of Bochum

•          Free University of Berlin

•          University of Bonn

•          University of Bielefeld

•          University of Helsinki

•          Phonogrammarchiv Vienna

•          University of Groningen

•          Kotus Project Helsinki

•          Sweden’s National Dialect Archive Lund

•          European Sign Language Communities (Se, UK, NL, D)

•          University of Utrecht

•          University of Uppsala

•          University of Stavanger

•          University of Lund

•          University of Leipzig

•          University of Erfurt

•          University of Leiden

•          University of Frankfurt

•          …

International

•          Federal University of Rio de Janeiro

•          University of Colorado

•          University of Buenos Aires

•          University of Kansas

•          University of Victoria

•          University of Sydney

•          University of Melbourne

•          E Michigan University

•          Wayne State University

•          AILLA Austin

•          …


The IMDI concept was disseminated widely through workshop and conference contributions, and many training courses were organized. The major events were:

·        Open Forum for Metadata Registries, Santa Fe, January 2003

·        SOAS Workshop, London, March 2003

·        E-Meld Conference, Ypsilanti – Michigan, July 2003

·        International Linguistics Congress, Prague, July 2003

·        DRH Conference, Cheltenham, August 2003

·        ENABLER Workshop, August 2003

·        International Sign Language Meeting, Nijmegen, January 2004

·        ISO TC37/SC4 Meeting, Jeju – Korea, February 2004

·        UNESCO Training Course on Digital Archiving, Vilnius, March 2004

·        Lingua Pax Conference, Barcelona, May 2004

·        LREC Conference, Lisbon, May 2004

·        E-Meld Conference, Detroit, July 2004

·        ACL Conference, Barcelona, July 2004

·        IASA/IAML Conference, Oslo, August 2004

·        DOBES Conference and Summer School, Frankfurt, September 2004

·        ISO TC37/SC4 Meeting, Pisa, November 2004

·        International DELAMAN Workshop, Nijmegen, November 2004

The MPI team organized or co-organized the following workshops, in which IMDI-related concepts played a major part:

·        INTERA Preparation Workshop on IMDI, Nijmegen, November 2001

·        Sign Language Workshop, Nijmegen, May 2003

·        Meeting on Bilingual Databases, Nijmegen, May 2003

·        Sign Language Meeting, Nijmegen, January 2004

·        Lexicon Workshop, Hahn, March 2004

·        Endangered Languages Training Courses, Nijmegen, May 2003/2004

·        DOBES Summer School, Frankfurt, September 2004

·        International DELAMAN Workshop, Nijmegen, November 2004

WP3: Standardised Descriptions

Goals

·        Drafting of a stable proposal for the representation of metadata information for language resources and tools.

This draft was submitted to ISO committee TC 37/SC 3 on November 1st as a three-month DIS (Draft International Standard) for project ISO 12620-1 (Data Category Registry).

·        Preparation of the IMDI specification to make it as conformant as possible to the document submitted to ISO, so that the various project partners can localise this specification.

·        Implementation of an on-line tool dedicated to the browsing and selection of metadata descriptors, to allow an international dissemination of the work achieved in the project.

Standardised Descriptions

The following major work has been carried out:

·        an ISO TC37/SC4 metadata requirements document has been worked out

·        the definition of the ISO TC37/SC4 Data Category Registry was finished and a web-accessible framework for manipulating the DCR was built

·        all IMDI categories were entered into the ISO TC37/SC4 Data Category Registry

·        the IMDI set is now available in 8 languages (D, F, E, NL, Se, I, Gr, Sp)
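
As a rough illustration of what a data category with labels in several languages might look like, the following Python sketch models one registry entry. The class, its field names, the identifier and the placeholder labels are assumptions made for this example; they do not reproduce the actual structure of the ISO TC37/SC4 Data Category Registry or the real IMDI translations. Only the language codes are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Language codes as listed above (D, F, E, NL, Se, I, Gr, Sp).
LANGUAGES = ["D", "F", "E", "NL", "Se", "I", "Gr", "Sp"]


@dataclass
class DataCategory:
    """Hypothetical registry entry: one identifier, one label per language."""
    identifier: str
    labels: Dict[str, str] = field(default_factory=dict)

    def label(self, lang: str, fallback: str = "E") -> str:
        """Return the label for lang, falling back to the fallback language."""
        return self.labels.get(lang, self.labels[fallback])

    def missing_languages(self) -> List[str]:
        """List the languages for which no label has been entered yet."""
        return [lang for lang in LANGUAGES if lang not in self.labels]


# Placeholder labels for illustration only; not the actual IMDI translations.
category = DataCategory(
    identifier="example-category",
    labels={"E": "Genre", "D": "Genre (D)", "F": "Genre (F)"},
)
```

A check such as `category.missing_languages()` shows at a glance which localisations are still outstanding for a given category.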

Further Information

Main INTERA website: http://www.elda.org/intera