First EAGLES/ISLE Workshop on

Meta-Descriptions and Annotation Schemes for Multimodal/Multimedia Language Resources and 
Data Architectures and Software Support for Large Corpora

LREC 2000 Pre-Conference Workshops

Athens, Greece
29/30 May 2000



1. Workshop Outline

Currently, we can identify a number of trends in the community dealing with multimodal/multimedia language resources:

The joint EC/NSF funded EAGLES/ISLE initiative, as discussed in the white paper, aims to create standards and guidelines for natural interactivity and multimodal language resources (e.g. speech, gesture, facial expressions, sign languages) that support the creation, use and re-use of such resources, and access to them. As part of this initiative, the workshop will address current trends and discuss structures which could simplify and assist the creation and use of annotated multimodal/multimedia resources, the process of finding suitable resources, and accessing them, for instance, via the Web. The workshop will address three related areas: annotation schemas, meta-descriptions for multimodal/multimedia language resources, and tools and annotation environments.


Meta-Descriptions for Multimodal/Multimedia Language Resources (MMLR)

It is time to parallel the metadata activities occurring in the various groups by bringing the far-flung community of multimedia language resource users together to start a discussion about the meta schemas to describe their resources. The goal is to be able to add linked meta-descriptions to the available multimedia language resources to form a browsable and searchable universe open to the Internet. A known portal, standardised meta-descriptions and suitable tools would make it easier for users to find the right resources for the task at hand. This interest unifies people from science, industry, and the wider community who have to use annotated multimedia resources for scientific analysis, training or commercial applications.

Part of the proposed workshop will be dedicated to discussing the need for such a universe of linked meta-descriptions, the scope of the community, and existing work in this area. The nature of the meta-descriptions has to be discussed in detail with an emphasis on questions such as: (1) Which elements describe the various language resources? (2) Is a minimal schema preferred or do we need something extensible? (3) How can we achieve flexibility within the standard meta-description? (4) How can we automatically derive meta-descriptions to make general annotation feasible?

The workshop will also discuss whether we can benefit from existing standards such as Dublin Core from the digital libraries community, and whether initiatives in the telecommunication and broadcasting community like the W3C Resource Description Framework are relevant to our goals.


Annotation Schemas for MMLR

A second session of the workshop will be dedicated to discussing annotation schemas for multimodal/multimedia language resources. Until now the community has largely worked with text-only corpora based on orthographical transcriptions (with all their limitations) and with corpora covering speech data typically associated with one layer of orthographic transcription specifically tailored to the needs of Automatic Speech Recognition systems. As computers have become more powerful, people have started to build corpora based on several video and sound tracks with rich multi-layer annotation (up to 50 layers and more). The layers of annotation can have complex time relationships and intricate dependencies between and within layers. It seems clear that many such complex structured corpora will be created and that the community needs guidelines to restrict the heterogeneity of such corpora.

At the Granada LREC conference we heard about initial projects having implemented "Abstract Data Models" for such multimedia corpora. In the meantime a broad discussion about the underlying universal structure for such annotations has also been initiated. A number of projects in the US and Europe have been funded to develop annotation and exploitation tools to cope with complex multimedia databases. A specialised workshop dedicated to annotation schemas is now overdue if we want to get good interoperability between resources and unified access to resources. Without an agreed standard for annotation schemas we risk an explosive proliferation of the access tools needed to exploit such databases.

The emergence of multimedia on computers makes it possible to supersede the traditional approach, because direct media access allows us to refer to media time, which will never change, instead of referring to transcriptions, which can be modified and are often inadequate for coding complex time relationships.
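The idea of anchoring annotations to media time rather than to a transcription can be illustrated with a minimal sketch. It assumes a hypothetical `Annotation` record with millisecond offsets and made-up tier names; none of these names come from an existing annotation tool or from the projects discussed at the workshop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation anchored to media time (milliseconds from media start)."""
    tier: str      # hypothetical layer name, e.g. "orthography" or "gesture"
    start_ms: int
    end_ms: int
    value: str


def overlapping(annotations, start_ms, end_ms):
    """Return annotations on any tier whose interval overlaps [start_ms, end_ms)."""
    return [a for a in annotations if a.start_ms < end_ms and a.end_ms > start_ms]


annotations = [
    Annotation("orthography", 0, 1200, "hello there"),
    Annotation("gesture", 800, 1500, "wave"),
    Annotation("orthography", 1200, 2000, "how are you"),
]

# Correcting a transcription replaces the value but keeps the time anchors,
# so annotations on other tiers stay aligned with it.
annotations[0] = Annotation("orthography", 0, 1200, "hello, there")

hits = overlapping(annotations, 1000, 1300)
print([(a.tier, a.value) for a in hits])
```

Because every layer points at the same media timeline, relations between layers (overlap, containment, precedence) can be computed from the time anchors alone, without consulting the text of any transcription.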

The session will not only address theoretical matters such as the underlying common structure and abstract data models, but will also discuss suitable representation formats important for implementation. Formats suitable for open exchange and long-term archiving will not be the optimal choice for all types of program access and vice versa. We expect that modern tools will have to handle several co-existing representation formats. We also have to deal with the question of how we can integrate existing text-based corpora, or corpora which are progressively annotated after collection.


Tools and annotation environments

There are a number of annotation schemes in general use, and there is a requirement for tools that can handle assemblies of language resources where individual resources don't all use the same scheme. Access to large corpora becomes impractically cumbersome without specialised tools and data structures. The session will report on current activities in this area.


Data Architectures and Software Support for Large Corpora

Several software systems for linguistic annotation, search, and retrieval of large corpora have been developed within the natural language processing community over the past several years, including LT-XML (Edinburgh), GATE (Sheffield), IMS Corpus Workbench (Stuttgart), Alembic Workbench (MITRE), MATE (Edinburgh/Odense/Stuttgart), Silfide (Loria/CNRS), SARA (BNC), and several others. Related to and in support of this development, there have also been efforts to develop standards for encoding and various kinds of linguistic annotation, as well as data architectures (e.g., TIPSTER, TalkBank). Still other developments, such as the introduction of XML and the powerful XSL transformation language and work on semi-structured data (e.g., the work of the Lore group at Stanford), have also affected the ways in which corpora and other linguistic resources can be represented, stored, and accessed.

Approaches to the fundamental design of the formats, data, and tools are varied among current systems for the annotation and exploitation of linguistic corpora. A primary reason for this diversity is that most developers are concerned with only one aspect of the creation/annotation/exploitation process. However, in order to work effectively toward commonality, the phases of the process must be considered as a whole. This demands bringing together researchers and developers from a variety of domains in text, speech, video, etc., many of whom have previously had little or no contact. 

This workshop is intended to bring these groups together to look broadly at the technical issues that bear on the development of software systems for the annotation and exploitation of linguistic resources. The goal is to lay the groundwork for the definition of a data and system architecture to support corpus annotation and exploitation that can be widely adopted within the community. Among the issues to be addressed are: 

The motivation for this workshop is the American National Corpus (ANC) effort, which should begin corpus creation within the year. We anticipate that the ANC will provide a significant resource for natural language processing, and we therefore seek to identify state-of-the-art methods for its creation, annotation, and exploitation. Also, as a national and freely available resource, the data and system architecture of the ANC is likely to become a de facto standard. We therefore hope to draw together leading researchers and developers to establish a basis for the design of a system to support the creation and use of the ANC. 

A "Birds of a Feather" session for those interested in the ANC project will be held immediately following the workshop. 


2. Proceedings

The workshop organizers have produced proceedings. We have received final versions of most of the papers accepted for the workshop, and these were sent to the printers on May 18th. These are available on the program page.


3. Organizational Issues

Organizers of the workshop

H. Cunningham, Department of Computer Science, University of Sheffield

D. Roy, Natural Interactive Systems Laboratory, Faculty of Science and Engineering, University of Southern Denmark Odense

P. Wittenburg, Technical Department, Max-Planck-Institute for Psycholinguistics, Nijmegen


Program Committee

Monday, May 29th - Afternoon

Meta-Descriptions for Multimodal/Multimedia Language Resources

This table lists times, speakers and titles; click on the speaker name to get the abstract;
this usually gives all the authors and their affiliations.
Times for coffee breaks may change.
14:30 14:50 Wittenburg Meta-Descriptions for Language Resources 
14:50 15:20 Thompson All Data is Meta-Data: Rich Architectures for Rich Resources
15:20 15:40 Heid Querying Meta and Object Data - Problems and Elements of Solutions
15:40 16:00 Stromqvist Optional extensions - a proposal for a flexible annotation system
16:00 16:20 Oostdijk Meta-Data in the Dutch Spoken Corpus Project
16:20 16:40 Coffee Break
16:40 17:00 Suihkonen On Meta Descriptions for Cross-Linguistic Electronic Linguistic Data
17:00 17:20 Broeder A Browseable Corpus: Accessing linguistic resources the easy way
17:20 17:40 Choukri Meta-Data from ELRA Perspective
17:40 18:00 Discussion + Summary

Tuesday, May 30th - Morning

Annotation Schemes for Multimodal/Multimedia Language Resources 
9:00 9:20 Wittenburg Terminology for Annotation Schemes
9:20 9:40 Martin Types of Cooperation and Referenceable Objects
9:40 10:00 Steininger Transliteration of Language and Labeling of Emotion and Gestures in SMARTKOM
10:00 10:20 Villasenor A Multimodal Dialogue Contribution Coding Scheme
10:20 10:40 Salmon-Alt Increasing the Genericity of the MATE Annotation Framework
10:40 11:00 Delmonte Towards an annotated Database for Anaphora Resolution
11:00 11:30 Coffee Break
11:30 11:50 Brugman The EUDICO project, multi-media annotation over the Internet
11:50 12:10 Ghorbel Semi-Automatic Annotation of Multimedia Documents via Adaptive Interfaces
12:10 12:30 Vollmann Annotation of Sound(/video) data in the Multimedia Language Documentation and Language Research Laboratory

Tuesday, May 30th - Afternoon

Data Architectures and Software Support for Large Corpora 
14:30 14:50 Ide Requirements, Tools and Architectures for Annotated Corpora
14:50 15:10 Bird ATLAS: A flexible and extensible Architecture for Linguistic Annotation
15:10 15:30 Simons Cellar: A data modeling system for linguistic annotation
15:30 15:50 Folch Semantic Tagging of a Corpus using the Topic Navigation Map Standard
15:50 16:10 Romary A Framework for multi-level linguistic annotation
16:10 16:30 Coffee Break
16:30 16:50 Ide The XML Framework and Its Implications for Corpus Access and Use
16:50 17:10 Dybkjaer The MATE Workbench 
17:10 17:30 Fafiotte A simulation and collection platform on the Internet for multi-modal translated spoken dialogues
17:30 18:30 Ide & Wittenburg Summarizing Panel and Discussion
The panel covers all three parts of the EAGLES/ISLE workshop.

