Mission

The goal of the EAGLES/ISLE Meta Data Initiative is to propose a standard for meta-data descriptions of Multi-Media/Multi-Modal Language resources. Such a standard will make it possible to create a browsable and searchable universe of these resources on the Internet, enabling interested parties to locate suitable resources efficiently and thus increasing their reusability.

Currently, many language resources are being generated in disciplines such as corpus linguistics, anthropology, and language and speech engineering, but only a few of them are available through the catalogues of well-known agencies such as LDC and ELRA. Most of these resources are not available in any "public" way at all, and only very few people know about them. Even within the institutions where these resources are generated, exchanging information about them in a systematic way appears to be problematic.

The situation sketched above is the reason for starting the ISLE Meta Data Initiative. The Language Resource community needs a standard for describing the main characteristics of resources; in the case of corpora, for example, the language spoken and the speakers' age, sex and educational background. The community also needs tools: tools that help generate such meta descriptions easily, preferably while the resources are being created; tools that make these descriptions available on the Internet and integrate them into the emerging universe of meta descriptions; and tools that allow users to browse and search that universe and finally access the resources themselves.
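As a rough illustration of what such a meta description and a search over it might look like, the following Python sketch models a corpus resource with the characteristics named above. All class names, field names and URLs are hypothetical and chosen only for this example; they are not the element set proposed by the initiative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Speaker:
    # Hypothetical speaker attributes, taken from the examples in the text.
    age: Optional[int] = None
    sex: Optional[str] = None
    education: Optional[str] = None


@dataclass
class ResourceDescription:
    # Hypothetical meta description of a single corpus resource.
    name: str
    language: str
    url: str  # where the resource itself can be accessed
    speakers: List[Speaker] = field(default_factory=list)


def search(descriptions, language=None, max_speaker_age=None):
    """Return the descriptions that match every criterion that was given."""
    hits = []
    for d in descriptions:
        if language is not None and d.language != language:
            continue
        if max_speaker_age is not None and not any(
            s.age is not None and s.age <= max_speaker_age for s in d.speakers
        ):
            continue
        hits.append(d)
    return hits


# Illustrative catalogue with made-up entries.
catalogue = [
    ResourceDescription("session-a", "Dutch", "http://example.org/a",
                        [Speaker(age=4, sex="f")]),
    ResourceDescription("session-b", "Finnish", "http://example.org/b",
                        [Speaker(age=35, sex="m")]),
]

dutch_resources = search(catalogue, language="Dutch")
```

Here `search` only filters an in-memory list; a real browsable universe would need the distributed storage and linking that the initiative still has to specify.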

The project will be based partly on existing conventions and standards in the language resource community. In existing corpora such as Childes or the ESF Second Learner Corpus, each corpus file includes a so-called header, in a proprietary format, with information that we would now describe as the resource's meta-data. Important initiatives such as TEI and CES/xCES have also worked out sets of tags that describe a whole transcription file; within this initiative such data is called meta-data. Some institutions, such as Helsinki University, have started to build web sites with corpus samples where hyperlinks and commentary text containing typical meta-data allow the user to navigate easily between the samples.

Recently the MPI Browsable Corpus project and the ICE project came up with suggestions that come close to what the ISLE initiative wants to achieve: huge distributed sets of linked meta descriptions of resources that can be parsed and navigated by suitable browsers. We believe that this is the direction in which the Internet will move, and we refer to the work of the W3C.

To achieve its goal, the ISLE Meta Data Initiative has to answer a number of questions. First, we have to define the scope of language resources and the community that we target with our proposal. Next, we have to define the set of meta-data elements together with their domains and semantics, and check whether we can borrow from other meta-data initiatives such as Dublin Core. Third, we have to choose an implementation form for our meta descriptions and check whether it is possible and advantageous to use existing frameworks such as RDF.
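The question of borrowing from Dublin Core can be pictured as a mapping exercise: some elements of a language resource description have an obvious Dublin Core counterpart, while others remain specific to the community. The sketch below assumes hypothetical source field names (`name`, `url`, `speaker_age`) and a hypothetical `dc:` prefix convention; only the target names `title`, `language` and `identifier` are actual Dublin Core elements.

```python
# Hypothetical mapping from illustrative field names to Dublin Core element names.
DC_MAPPING = {"name": "title", "language": "language", "url": "identifier"}


def to_dublin_core(record):
    """Split a flat record into borrowed Dublin Core elements and the rest."""
    borrowed = {}
    specific = {}
    for key, value in record.items():
        if key in DC_MAPPING:
            borrowed["dc:" + DC_MAPPING[key]] = value
        else:
            # No Dublin Core counterpart assumed; stays community-specific.
            specific[key] = value
    return borrowed, specific


record = {
    "name": "session-a",
    "language": "Dutch",
    "url": "http://example.org/a",
    "speaker_age": 4,
}
borrowed, specific = to_dublin_core(record)
```

Whatever ends up in `specific` indicates where a general-purpose element set would have to be extended for language resources.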

In addition, the initiative has to specify requirements for the tools to be used, and practicable usage scenarios have to be defined. These have to address issues such as where to store meta descriptions, how to link them and how to guarantee their quality.

More details of the project are described in the White Paper.