For more details, see the main website of the INTERA project: www.elda.org/intera

This site briefly summarizes the essential results of Work Packages 2 and 3.

INTERA Goals

The Integrated European language data Repository Area (INTERA) is an eContent project built essentially on two pillars:

· to build an integrated European language resource area by connecting international, national and regional data centres,

· to produce new multilingual language resources.

The first goal involves integrating a critical mass of different types of language resources with the help of metadata descriptions, and interlinking the resulting distributed resource repository with an existing tool repository so that users can directly start suitable tools on the included resources. INTERA anticipates that this integrated and interlinked metadata description domain will facilitate access to language resources in Europe, help professionals in industry, the eContent business, research and education, and increase the usage of the resources already available.

The second goal addresses the lack of high-quality multilingual resources, especially for the less widely spoken languages, including the Balkan ones, which are of crucial importance to the development of the eContent business. INTERA goes further by developing exemplary methods for producing such resources in a commercially attractive way.

Summary of WP2 and WP3 Activities

The INTERA project started in January 2003 and lasted two years. The major achievements of WP2 and WP3 within the lifetime of the project include:

WP2: Integrated Resource Domain

Major Goals

· Development of a mature IMDI metadata framework consisting of a metadata set, controlled vocabularies, and creation and exploration tools

· Creation of a large metadata domain of language resources, now with about 50 participating institutions worldwide in total. Within INTERA, some major data centers contributed to this domain as subcontractors.

Work on Metadata Sets and Tools

· The new IMDI version was demonstrated and explained at various meetings. With the end of the INTERA project, we can say that a complete, mature and robust metadata framework was delivered.

· The controlled vocabularies were completed.

· Special profiles were developed for the Sign Language community and for the Dutch Spoken Corpus.

· The IMDI Editor has been finished and supports all IMDI features.

· The IMDI Browser has been finished and also supports all IMDI features.

· The IMDI search component has been finished and also supports all IMDI features.

· The manuals were extended to contain the full specifications.

· A system was developed to handle intellectual property rights (IPR) and to turn IPR requirements into access rights efficiently. In this first phase, the access rights management shell operates on a central server (a distributed version will be developed in the coming DAM-LR project). The mechanisms are based on the IMDI metadata framework.

· A first version of a tree-building tool was developed.

Creation of IMDI Metadata Domain

The following matrix gives an overview of the resources that are being integrated within the INTERA project, partly through subcontracting:

Partner	Subcontractor	Corpus	Type	Status
MPI	BAS	Smartkom	multimodal	integrated
MPI	BAS	Verbmobil and others	Speech, text	integrated
MPI	Meertens	Dialect Corpus	speech	integrated
MPI	U Florence	Lablita	speech text	integrated
MPI	U Florence	CORAL ROM	Semantics ext	integrated
MPI		Dutch Spoken Corpus	speech text	integrated
MPI		Gesture corpus	multimodal	integrated
MPI		ESF Second Learner Corpus	speech text	integrated
MPI		PMOLL Corpus	speech text	integrated
MPI		various others	sign speech text	integrated
USAAR	DFKI	Negra, Tiger	annotated text	to be integrated
USAAR	CLPP Bulg	HPSG	treebank	to be integrated
USAAR	U Iasi	1984	text	to be integrated
LORIA	ATILF	Frantext, etc	text	to be integrated
ELDA		catalogue resources	various	integrated
ILC		lexica	various	integrated
ILSP		Parallel corpora	Various	to be integrated
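
As a reading aid, the matrix can also be held as simple records and tallied by integration status. The following Python sketch copies the rows above verbatim; the names RESOURCES, count_by_status and corpora_for_partner are illustrative assumptions and are not part of any INTERA or IMDI tool.

```python
from collections import Counter

# Rows copied from the matrix above:
# (partner, subcontractor, corpus, type, status).
# An empty string marks a row without a subcontractor.
RESOURCES = [
    ("MPI", "BAS", "Smartkom", "multimodal", "integrated"),
    ("MPI", "BAS", "Verbmobil and others", "Speech, text", "integrated"),
    ("MPI", "Meertens", "Dialect Corpus", "speech", "integrated"),
    ("MPI", "U Florence", "Lablita", "speech text", "integrated"),
    ("MPI", "U Florence", "CORAL ROM", "Semantics ext", "integrated"),
    ("MPI", "", "Dutch Spoken Corpus", "speech text", "integrated"),
    ("MPI", "", "Gesture corpus", "multimodal", "integrated"),
    ("MPI", "", "ESF Second Learner Corpus", "speech text", "integrated"),
    ("MPI", "", "PMOLL Corpus", "speech text", "integrated"),
    ("MPI", "", "various others", "sign speech text", "integrated"),
    ("USAAR", "DFKI", "Negra, Tiger", "annotated text", "to be integrated"),
    ("USAAR", "CLPP Bulg", "HPSG", "treebank", "to be integrated"),
    ("USAAR", "U Iasi", "1984", "text", "to be integrated"),
    ("LORIA", "ATILF", "Frantext, etc", "text", "to be integrated"),
    ("ELDA", "", "catalogue resources", "various", "integrated"),
    ("ILC", "", "lexica", "various", "integrated"),
    ("ILSP", "", "Parallel corpora", "Various", "to be integrated"),
]


def count_by_status(rows):
    """Count rows per integration status (the last field of each row)."""
    return Counter(row[-1] for row in rows)


def corpora_for_partner(rows, partner):
    """Return the corpus names listed for one partner, in table order."""
    return [row[2] for row in rows if row[0] == partner]


if __name__ == "__main__":
    for status, n in count_by_status(RESOURCES).items():
        print(f"{status}: {n}")
```

Run as a script, this prints one line per status with the number of rows carrying it; corpora_for_partner(RESOURCES, "USAAR") lists the three USAAR entries.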

In total, about 50 institutions worldwide are now known to produce IMDI metadata descriptions and therefore contribute to the searchable domain. The OLAC harvesting machine has about 35 registered metadata providers, one of which is the IMDI domain. Altogether, about 85 institutions make their language resources visible. That is not sufficient, but it is a good start. The following institutions produce IMDI metadata descriptions in some form:

Europe

• ELRA Paris

• INALF Nancy

• DFKI Saarbrücken

• University of Saarland

• Bavarian Speech Archive Munich

• Meertens Institute Amsterdam

• University of Florence

• ILSP Athens

• ILC Pisa

• University of Madrid

• Max-Planck-Institute Nijmegen

• University of Kiel

• University of Bochum

• Free University of Berlin

• University of Bonn

• University of Bielefeld

• University of Helsinki

• Phonogrammarchiv Vienna

• University of Groningen

• Kotus Project Helsinki

• Sweden’s National Dialect Archive Lund

• European Sign Language Communities (Se, UK, NL, D)

• University of Utrecht

• University of Uppsala

• University of Stavanger

• University of Lund

• University of Leipzig

• University of Erfurt

• University of Leiden

• University of Frankfurt

• …

International

• Federal University of Rio de Janeiro

• University of Colorado

• University of Buenos Aires

• University of Kansas

• University of Victoria

• University of Sydney

• University of Melbourne

• Eastern Michigan University

• Wayne State University

• AILLA Austin

• …

The IMDI concept was widely disseminated through workshop and conference contributions, and many training courses were organized. The major events were:

· Open Forum for Metadata Registries, Santa Fe, January 2003

· SOAS Workshop, London, March 2003

· E-Meld Conference, Ypsilanti – Michigan, July 2003

· International Linguistics Congress, Prague, July 2003

· DRH Conference, Cheltenham, August 2003

· ENABLER Workshop, August 2003

· International Sign Language Meeting, Nijmegen, January 2004

· ISO TC37/SC4 Meeting, Jeju – Korea, February 2004

· UNESCO Training Course on Digital Archiving, Vilnius, March 2004

· Lingua Pax Conference, Barcelona, May 2004

· LREC Conference, Lisbon, May 2004

· E-Meld Conference, Detroit, July 2004

· ACL Conference, Barcelona, July 2004

· IASA/IAML Conference, Oslo, August 2004

· DOBES Conference and Summer School, Frankfurt, September 2004

· ISO TC37/SC4 Meeting, Pisa, November 2004

· International DELAMAN Workshop, Nijmegen, November 2004

The MPI team organized or co-organized the following workshops, in which IMDI-related concepts played a major part:

· INTERA Preparation Workshop on IMDI, Nijmegen, November 2001

· Sign Language Workshop, Nijmegen, May 2003

· Meeting on Bilingual Databases, Nijmegen, May 2003

· Sign Language Meeting, Nijmegen, January 2004

· Lexicon Workshop, Hahn, March 2004

· Endangered Languages Training Courses, Nijmegen, May 2003/2004

· DOBES Summer School, Frankfurt, September 2004

· International DELAMAN Workshop, Nijmegen, November 2004

WP3: Standardised Descriptions

Goals

· Drafting of a stable proposal for the representation of metadata information for language resources and tools.

This draft was submitted to ISO committee TC 37/SC 3 on November 1st as a three-month DIS (Draft International Standard) for project ISO 12620-1 (Data Category Registry).

· Preparation of the IMDI specification to make it as conformant as possible to the document submitted to ISO, so that the various partners in the project can localise this specification.

· Implementation of an on-line tool dedicated to the browsing and selection of metadata descriptors, to allow an international dissemination of the work achieved in the project.

Standardised Descriptions

The following major work has been carried out:

· an ISO TC37/SC4 metadata requirements document has been worked out

· the definition of the ISO TC37/SC4 Data Category Registry was finished and a web-accessible framework for manipulating the DCR was built

· all IMDI categories were entered into the ISO TC37/SC4 Data Category Registry

· the IMDI set is now available in 8 languages (D, F, E, NL, Se, I, Gr, Sp)

Further Information

Main INTERA website: http://www.elda.org/intera