


Linguistic Applications at the MPI

For the past couple of years, the Max-Planck-Institute has been working to develop tools that allow linguists, anthropologists, and psychologists to work flexibly with the corpora of data they have collected. These corpora form the center of all work based on observational data at the MPI. They increasingly include multi-media data, as researchers become interested in topics such as the alignment of syntactic structure with prosody and intonation, or the alignment of speech and gesture. Currently, the Institute is investing considerable effort in digitizing speech and video signals and storing them on powerful media servers. The linguistic tools we eventually want to create can be illustrated with the following diagram:

[Diagram: the envisaged tool architecture, linking meta-descriptions, corpora, lexicons, viewers/editors, and search tools]

By browsing through or searching a set of meta-descriptions, the user can easily find the resources of interest within a universe of MPI resources and resources from other institutes. Corpora and lexicons are closely linked so that information flows in both directions. Flexible viewers/editors help the researcher analyze, modify, or create resources; these viewers support immediate access to all types of information. Powerful search tools help the researcher find particular fragments of a resource and can produce output that, for example, might be useful in typological studies. Further processing can be carried out on the search output; in fact, it can be used as a new resource.
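This intended workflow can be made concrete with a small sketch. The Java fragment below is only an illustration of the idea, not any actual MPI tool API: all class and method names (MetaDescription, Fragment, searchMetadata, searchCorpus) are hypothetical. It shows how a selection made on meta-descriptions feeds a search on the selected corpora, and how the search output can itself be treated as a new resource.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the workflow described above; the names are
// illustrative only and do not correspond to the actual MPI tools.
class ResourceWorkflowSketch {

    // A meta-description summarizes a corpus or lexicon (name, language, media types, ...).
    record MetaDescription(String name, String language, boolean hasVideo) {}

    // A fragment returned by a corpus search; the collection of fragments can
    // itself be used as a new resource for further processing.
    record Fragment(String resourceName, String tier, String content) {}

    // Step 1: browse/search the universe of meta-descriptions to select resources.
    static List<MetaDescription> searchMetadata(List<MetaDescription> universe, String language) {
        List<MetaDescription> hits = new ArrayList<>();
        for (MetaDescription md : universe) {
            if (md.language().equals(language)) {
                hits.add(md);
            }
        }
        return hits;
    }

    // Step 2: search within the selected resources; a real tool would open each
    // corpus, scan its annotation tiers, and collect the matching fragments.
    static List<Fragment> searchCorpus(List<MetaDescription> selected, String pattern) {
        List<Fragment> output = new ArrayList<>();
        // ... corpus access omitted in this sketch ...
        return output;
    }

    public static void main(String[] args) {
        List<MetaDescription> universe = List.of(
                new MetaDescription("gesture-study-1", "Dutch", true),
                new MetaDescription("field-recordings", "Spanish", false));
        List<MetaDescription> selected = searchMetadata(universe, "Dutch");
        List<Fragment> result = searchCorpus(selected, "pointing");
        System.out.println(selected.size() + " corpora selected, " + result.size() + " fragments found");
    }
}
```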

The Institute has several tools available that were developed to build the overall architecture step by step. The tools are briefly described in the following diagram; it should be mentioned that Shoebox is a program distributed by SIL. The blue boxes indicate the tools that have been in operation for some time. The browsing part of BC (Browsable Corpus) is now ready; searching on the meta-information still has to be added. MED will be extended with the Search tool so that it can be used interactively within the selected environment. FSearch is a tool currently under development with which we are testing how to perform searches on very large resources; if successful, this technology will be integrated into the other tools. EUDICO is the main tool for the future: we intend to make it a general linguistic tool for working with multi-media corpora, in particular for the description of multi-modal data. The first version of EUDICO is ready, i.e. users can view multi-media corpora via local and wide area networks.

MED (analysis and transcription/coding tool): platform-independent speech analysis environment with synchronized text and speech domains
Search (search using the structure of the corpus): supports complex searches, platform-independent, supports some corpus formats
TED (fieldwork transcription/coding tool): supports PC-based video control and category definition
MT (analysis and transcription/coding tool, see CAVA): Mac-based true multi-media tool, immediate access from the code list to the video fragment, relational database to store codes, flexible database structure and query tool
BC (browsing on meta-information): browsing through a world of XML-based meta-descriptions, platform-independent
FSearch (fast search): index-based fast search for very large corpora (a minimal sketch of the technique follows this list)
EUDICO (analysis and transcription/coding tool): internet-capable interactive multi-media tool, Java-based, uses a common corpus model, format-independent, platform for future extensions
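To give an impression of the index-based technique behind FSearch, the sketch below builds a simple inverted index (token → positions) so that a lookup does not have to scan the whole corpus. This is a minimal illustration of the general approach and assumes nothing about FSearch's actual index format or implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of index-based search for very large corpora; it illustrates
// the general technique only, not FSearch's actual design.
class InvertedIndexSketch {

    // token -> positions at which the token occurs in the corpus
    private final Map<String, List<Integer>> index = new HashMap<>();

    void build(String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos].toLowerCase(), t -> new ArrayList<>()).add(pos);
        }
    }

    // Look up a token directly in the index instead of scanning the text.
    List<Integer> lookup(String token) {
        return index.getOrDefault(token.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        String[] corpus = "the speaker points to the map while describing the route".split(" ");
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.build(corpus);
        System.out.println("positions of 'the': " + idx.lookup("the"));
    }
}
```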

The following table gives an overview of the current functionality of the tools and indicates what is planned for the coming year.

Functionality | MED | MT | EUDICO | Search | FSearch
--- | --- | --- | --- | --- | ---
state | ready, extend | finished | 1st version | finished | soon
platform independence | yes | no | yes | yes | yes
local operation | yes | yes | yes | yes | yes
internet operation | no | no | yes | (yes) | no
format | chat-like | rDB | various | various | various
common internal corpus model | no | no | yes | no | no
support of annotation layers | some | many | many | many | no
common exchange format | no | no | planned | no | no
corpus-lexicon integration | no | no | planned | - | -
lexically based auto-coding | no | no | planned | - | -
efficient back-end format | no | yes | (yes) | no | yes
SGML I/O filter | no | no | planned | no | yes
meta-data | (yes) | no | yes | (yes) | yes
browsable corpus | yes | no | yes | no | no
search on meta-data | planned | no | planned | no | yes
search on corpus data | yes | yes | planned | yes | yes
structure support in search | planned | no | planned | yes | no
regexp / logical expr. / within / dependencies etc. in search | planned | partly | planned | yes | (some)
incremental search | ? | yes | planned | yes | yes
multi-modal search | no | no | planned | no | no
UNICODE | ? | no | planned | no | no
synchronized multi-media presentations | yes | yes | yes | - | -
various presentation formats | no | no | yes | - | -
partitur presentation format | no | no | yes | - | -
tree presentation format | no | no | planned | - | -
color-highlighting presentation format | no | no | planned | - | -
concordance presentation format | no | no | planned | - | yes
application of LT tools | no | no | planned | - | -
access to speech signal | immediate | immediate | immediate | - | immediate
speech analysis functions | many | no | planned | - | -
access to video | no | immediate | immediate | - | -
sorting on results | no | no | ? | - | yes

Some of the tools mentioned above are still under development, and the corresponding information on these web pages will be subject to continuous change.

A highly interesting discussion has recently arisen about proper annotation formats for linguistic resources. Since the MPI had to break new ground (our resources are multi-medial and have to be offered via networks), much time was spent thinking about appropriate formats (see the list above). In March 1999, S. Bird and M. Liberman from the University of Pennsylvania created a very interesting overview page about available tools, and they have written an excellent paper: A Formal Framework for Linguistic Annotation. There is also an interesting debate about annotations of multi-media resources within the MPEG community; the emerging MPEG-7 standard is meant to deal with the structure of such annotations.
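The framework proposed by Bird and Liberman models annotations as labeled arcs between time-anchored nodes (annotation graphs), which maps naturally onto multi-modal data. The sketch below is only a minimal, hypothetical rendering of that idea, not the authors' reference implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch in the spirit of Bird & Liberman's annotation graphs:
// labeled arcs span time-anchored nodes, so word and gesture tiers can
// refer to the same stretch of signal. Illustration only, not their
// reference implementation.
class AnnotationGraphSketch {

    // A node anchored to a time offset (in seconds) in the media signal.
    record Node(int id, double time) {}

    // An arc carries a typed label (a word, a gesture category, ...) between two nodes.
    record Arc(Node from, Node to, String type, String label) {}

    public static void main(String[] args) {
        Node n0 = new Node(0, 0.00);
        Node n1 = new Node(1, 0.35);
        Node n2 = new Node(2, 0.80);

        List<Arc> arcs = new ArrayList<>();
        arcs.add(new Arc(n0, n1, "word", "hello"));
        arcs.add(new Arc(n1, n2, "word", "there"));
        arcs.add(new Arc(n0, n2, "gesture", "wave")); // a multi-modal tier over the same span

        for (Arc a : arcs) {
            System.out.printf("%s '%s' from %.2fs to %.2fs%n",
                    a.type(), a.label(), a.from().time(), a.to().time());
        }
    }
}
```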

For any questions regarding these pages, please contact Peter Wittenburg of the Max-Planck-Institute.

 

Last updated: December 27, 2000 15:34
