



Browsable Corpus

BrowsableCorpus is a tool intended to help researchers navigate the universe of corpora at the MPI and, eventually, a global universe of corpora. It rests on two foundations: (1) a directory structure that defines a hierarchical ordering scheme for storing and finding files belonging to the corpus world (raw media files, transcripts, code files, documents, …); and (2) a bundle of description files containing meta-data about the corpus files, such as names of interviewers and subjects, age of subjects, languages, etc.

Using abstraction, we will be able to detect commonalities between description files and form hierarchies and clusters. For example, if n corpora share the same language, this information is extracted and a new virtual node is created whose description file holds the common language, clickable pointers to its parent files, and clickable pointers to its newly created children. By doing so, and by defining some groupings a priori together with their description files, we expect to obtain a dense and browsable structure of files with meta-information.

The following figure gives an impression of the browser. The left side shows the selected path, and the right side shows an extract of the meta-information associated with the marked node. The green symbol [S] stands for a session. This is the node under which the actual corpus files (in this case a speech file, a CHAT-formatted transcript, and an ESF-formatted transcript) are found. The black symbols denote tags found in the XML description file.
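The abstraction step described in this section, where corpora sharing an attribute value are grouped under a new virtual node, might look roughly like the following Python sketch. It is an illustration only, not the BrowsableCorpus implementation: the `Node` class, the `abstract_by` function, and the attribute key `language` are hypothetical names chosen for the example, and in-memory dictionaries stand in for the XML description files.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node and its description file."""
    name: str
    attributes: dict
    parent: "Node | None" = None
    children: list = field(default_factory=list)


def abstract_by(parent, key):
    """Group children of `parent` that share a value for `key`
    under newly created virtual nodes; return the new nodes."""
    groups = defaultdict(list)
    for child in parent.children:
        if key in child.attributes:
            groups[child.attributes[key]].append(child)

    virtual_nodes = []
    moved = set()
    for value, members in groups.items():
        if len(members) < 2:
            continue  # a single corpus shares nothing with others
        # The virtual node carries only the common attribute.
        virtual = Node(f"{key}={value}", {key: value}, parent=parent)
        for member in members:
            member.parent = virtual          # pointer to the new parent
            virtual.children.append(member)  # pointer to the child
            moved.add(id(member))
        virtual_nodes.append(virtual)

    # Ungrouped children stay directly below the original parent.
    remaining = [c for c in parent.children if id(c) not in moved]
    parent.children = virtual_nodes + remaining
    return virtual_nodes
```

Calling `abstract_by(root, "language")` on a root with two Dutch corpora and one German corpus would create one virtual node for Dutch and leave the German corpus where it was. Applying the function repeatedly with different keys gives a denser hierarchy.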

Example

Once a node in a session has been marked, the operators available for the kind of information it contains are offered. For a typical corpus file one can, for example, directly start MED, which results in the screen layout shown in the following figure. The corpus fragment is visualised by the corpus editor, and above it the corresponding speech fragment is displayed using a professional speech environment.

All the meta description files are XML-structured, i.e. they use tags specified in a DTD. The specifications leave enough room for scientists to enter special descriptions if they wish. A browser with a built-in XML parser was developed; it allows the user to navigate this universe of meta-files and to render existing HTML-based, non-structured descriptions. When navigating with this tool, the user eventually arrives at a "session" node, whose children are all the corpus files. At this point the user can directly start the corresponding viewer/analysis tools, such as MED, to operate on them. A search facility operating on the meta-data will be added in 1999. A data entry form (HTML form) was added to the MEID intranet so that the user does not have to deal with XML directly. On submission, a Perl script generates the XML-structured files. We hope that this initiative leads to world-wide co-operation, since the technology used can operate on the Internet as well.
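To make the form-to-XML step and the "session" node more concrete, here is a minimal sketch. It is written in Python rather than Perl, and it does not reproduce the actual DTD: the tag names `session`, `files`, and `file`, the `href` attribute, and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_session_description(fields, corpus_files):
    """Turn submitted form fields into an XML session description.

    `fields` maps hypothetical tag names (which must be valid XML
    names) to their text; `corpus_files` lists the files that become
    children of the session node.
    """
    session = ET.Element("session")
    for tag, value in fields.items():
        ET.SubElement(session, tag).text = value
    files = ET.SubElement(session, "files")
    for path in corpus_files:
        ET.SubElement(files, "file", {"href": path})
    return ET.tostring(session, encoding="unicode")


def list_corpus_files(xml_text):
    """Parse a session description and return its corpus file paths,
    i.e. what a browser would offer to a viewer or analysis tool."""
    root = ET.fromstring(xml_text)
    return [f.get("href") for f in root.iter("file")]
```

A description built with `build_session_description({"language": "Dutch"}, ["s1.wav", "s1.cha"])` can be passed to `list_corpus_files` to recover the two file paths. A real implementation would also validate the output against the DTD, which this sketch omits.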


Last updated: February 15, 2000 13:38
