


Linguistic Applications at the MPI

For the past couple of years, the Max-Planck-Institute has been working to develop tools that allow linguists, anthropologists, and psychologists to work flexibly with the corpora of data they have collected. These corpora form the center of all work based on observational data at the MPI. They increasingly include multi-media data, as researchers become interested in topics such as the alignment of syntactic structure with prosody and intonation, or the alignment of speech and gesture. Currently, the Institute is investing considerable effort in digitizing speech and video signals and storing them on powerful media servers. The linguistic tools we eventually want to create can be illustrated with the following diagram:

[Diagram: the envisaged tool architecture, linking meta-descriptions, corpora, lexicons, viewers/editors, and search tools]

By browsing through or searching a set of meta-descriptions, the user can easily find the resources of interest within a universe of MPI resources and resources from other institutes. Corpora and lexicons are closely linked so that information flows in both directions. Flexible viewers/editors help the researcher analyze, modify, or create resources; these viewers support immediate access to all types of information. Powerful search tools help the researcher find particular fragments of a resource and can produce output that, for example, might be useful in typological studies. Further processing can be carried out on the search output; in fact, it can be used as a new resource.
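This intended workflow can be made concrete with a small sketch. The Java fragment below is only an illustration of the idea, not any actual MPI tool API: all class and method names (MetaDescription, Fragment, searchMetadata, searchCorpus) are hypothetical. It shows how a selection made on meta-descriptions feeds a search on the selected corpora, and how the search output can itself be treated as a new resource.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the workflow described above; the names are
// illustrative only and do not correspond to the actual MPI tools.
class ResourceWorkflowSketch {

    // A meta-description summarizes a corpus or lexicon (name, language, media types, ...).
    record MetaDescription(String name, String language, boolean hasVideo) {}

    // A fragment returned by a corpus search; the collection of fragments can
    // itself be used as a new resource for further processing.
    record Fragment(String resourceName, String tier, String content) {}

    // Step 1: browse/search the universe of meta-descriptions to select resources.
    static List<MetaDescription> searchMetadata(List<MetaDescription> universe, String language) {
        List<MetaDescription> hits = new ArrayList<>();
        for (MetaDescription md : universe) {
            if (md.language().equals(language)) {
                hits.add(md);
            }
        }
        return hits;
    }

    // Step 2: search within the selected resources; a real tool would open each
    // corpus, scan its annotation tiers, and collect the matching fragments.
    static List<Fragment> searchCorpus(List<MetaDescription> selected, String pattern) {
        List<Fragment> output = new ArrayList<>();
        // ... corpus access omitted in this sketch ...
        return output;
    }

    public static void main(String[] args) {
        List<MetaDescription> universe = List.of(
                new MetaDescription("gesture-study-1", "Dutch", true),
                new MetaDescription("field-recordings", "Spanish", false));
        List<MetaDescription> selected = searchMetadata(universe, "Dutch");
        List<Fragment> result = searchCorpus(selected, "pointing");
        System.out.println(selected.size() + " corpora selected, " + result.size() + " fragments found");
    }
}
```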

The Institute has several tools available that were developed to build the overall architecture step by step. The tools are briefly described in the following diagram; it should be mentioned that Shoebox is a program distributed by SIL. The blue boxes indicate the tools that have been in operation for some time. The browsing part of BC (Browsable Corpus) is now ready; searching on the meta-information still has to be added. MED will be extended with the Search tool so that it can be used interactively within the selected environment. FSearch is a tool currently under development with which we are testing how to perform searches on very large resources; if successful, this technology will be integrated into the other tools. EUDICO is the main tool for the future: we intend to make it a general linguistic tool for working with multi-media corpora, in particular for the description of multi-modal data. The first version of EUDICO is ready, i.e. users can view multi-media corpora via local and wide area networks.

MED (analysis and transcription/coding tool): platform-independent speech analysis environment with synchronized text and speech domains
Search (search using the structure of the corpus): supports complex searches, platform-independent, supports some corpus formats
TED (fieldwork transcription/coding tool): supports PC-based video control and category definition
MT (analysis and transcription/coding tool, see CAVA): Mac-based true multi-media tool, immediate access from the code list to the video fragment, relational database to store codes, flexible database structure and query tool
BC (browsing on meta-information): browsing through a world of XML-based meta-descriptions, platform-independent
FSearch (fast search): index-based fast search for very large corpora (a minimal sketch of the technique follows this list)
EUDICO (analysis and transcription/coding tool): internet-capable interactive multi-media tool, Java-based, uses a common corpus model, format-independent, platform for future extensions
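To give an impression of the index-based technique behind FSearch, the sketch below builds a simple inverted index (token → positions) so that a lookup does not have to scan the whole corpus. This is a minimal illustration of the general approach and assumes nothing about FSearch's actual index format or implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of index-based search for very large corpora; it illustrates
// the general technique only, not FSearch's actual design.
class InvertedIndexSketch {

    // token -> positions at which the token occurs in the corpus
    private final Map<String, List<Integer>> index = new HashMap<>();

    void build(String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos].toLowerCase(), t -> new ArrayList<>()).add(pos);
        }
    }

    // Look up a token directly in the index instead of scanning the text.
    List<Integer> lookup(String token) {
        return index.getOrDefault(token.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        String[] corpus = "the speaker points to the map while describing the route".split(" ");
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.build(corpus);
        System.out.println("positions of 'the': " + idx.lookup("the"));
    }
}
```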

The following table gives an overview of the current functionality of the tools and indicates what is planned for the coming year.

Functionality | MED | MT | EUDICO | Search | FSearch
--- | --- | --- | --- | --- | ---
state | ready, extend | finished | 1st version | finished | soon
platform independence | yes | no | yes | yes | yes
local operation | yes | yes | yes | yes | yes
internet operation | no | no | yes | (yes) | no
format | chat-like | rDB | various | various | various
common internal corpus model | no | no | yes | no | no
support of annotation layers | some | many | many | many | no
common exchange format | no | no | planned | no | no
corpus-lexicon integration | no | no | planned | - | -
lexically based auto-coding | no | no | planned | - | -
efficient back-end format | no | yes | (yes) | no | yes
SGML I/O filter | no | no | planned | no | yes
meta-data | (yes) | no | yes | (yes) | yes
browsable corpus | yes | no | yes | no | no
search on meta-data | planned | no | planned | no | yes
search on corpus data | yes | yes | planned | yes | yes
structure support in search | planned | no | planned | yes | no
regexp / logical expr. / within / dependencies etc. in search | planned | partly | planned | yes | (some)
incremental search | ? | yes | planned | yes | yes
multi-modal search | no | no | planned | no | no
UNICODE | ? | no | planned | no | no
synchronized multi-media presentations | yes | yes | yes | - | -
various presentation formats | no | no | yes | - | -
partitur presentation format | no | no | yes | - | -
tree presentation format | no | no | planned | - | -
color-highlighting presentation format | no | no | planned | - | -
concordance presentation format | no | no | planned | - | yes
application of LT tools | no | no | planned | - | -
access to speech signal | immediate | immediate | immediate | - | immediate
speech analysis functions | many | no | planned | - | -
access to video | no | immediate | immediate | - | -
sorting on results | no | no | ? | - | yes

Some of the tools mentioned above are still under development, and the corresponding information on these web pages will be subject to continuous change.

A highly interesting discussion has recently arisen about proper annotation formats for linguistic resources. Since the MPI had to break new ground (our resources are multi-medial and have to be offered via networks), much time was spent thinking about appropriate formats (see the list above). In March 1999, S. Bird and M. Liberman from the University of Pennsylvania created a very interesting overview page about available tools, and they have written an excellent paper: A Formal Framework for Linguistic Annotation. There is also an interesting debate about annotations of multi-media resources within the MPEG community; the emerging MPEG-7 standard is meant to deal with the structure of such annotations.
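The framework proposed by Bird and Liberman models annotations as labeled arcs between time-anchored nodes (annotation graphs), which maps naturally onto multi-modal data. The sketch below is only a minimal, hypothetical rendering of that idea, not the authors' reference implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch in the spirit of Bird & Liberman's annotation graphs:
// labeled arcs span time-anchored nodes, so word and gesture tiers can
// refer to the same stretch of signal. Illustration only, not their
// reference implementation.
class AnnotationGraphSketch {

    // A node anchored to a time offset (in seconds) in the media signal.
    record Node(int id, double time) {}

    // An arc carries a typed label (a word, a gesture category, ...) between two nodes.
    record Arc(Node from, Node to, String type, String label) {}

    public static void main(String[] args) {
        Node n0 = new Node(0, 0.00);
        Node n1 = new Node(1, 0.35);
        Node n2 = new Node(2, 0.80);

        List<Arc> arcs = new ArrayList<>();
        arcs.add(new Arc(n0, n1, "word", "hello"));
        arcs.add(new Arc(n1, n2, "word", "there"));
        arcs.add(new Arc(n0, n2, "gesture", "wave")); // a multi-modal tier over the same span

        for (Arc a : arcs) {
            System.out.printf("%s '%s' from %.2fs to %.2fs%n",
                    a.type(), a.label(), a.from().time(), a.to().time());
        }
    }
}
```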

For any questions regarding these pages, please contact Peter Wittenburg of the Max-Planck-Institute.

 

Last updated: December 27, 2000 15:34
