Most existing encoding solutions and tools are dedicated to a single kind of annotation, in particular because it is difficult to incorporate different annotation schemes within one single hierarchy. Moreover, the growing size of available corpora makes them difficult to exploit and visualize. In addition, maintaining their structure becomes very difficult as the annotations grow more complex.
The XML encoding formalism allows flexibility, portability and easy interchange of Linguistic Resources (LR). Still, XML has two main limitations for the general encoding of complex multilevel LR:
The requirements for multilevel encoding of corpora are presented, for example, in [Bird and Liberman1999], [Cristea et al.1998] and [Dybkjær et al.1998]. The different kinds of LR involved in the annotation schemes are generally stored in different XML documents and linked to a reference textual document (the corpus), resulting in an acyclic graph structure. This representation, and in particular the relations it expresses, can be matched against the abstract model of a relational database to allow efficient storage of and access to the corresponding data. Indeed, the general workbench developed by the MATE project makes use of such a relational database for the internal representation of XML-encoded information [Dybkjær et al.1999].
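To make the mapping between separately stored annotation documents and relational tables more concrete, here is a minimal sketch in Python. It is an illustration only, not the MATE workbench's schema: the element names (`w`, `ann`), the attributes (`id`, `target`, `value`, `level`), the table layout and the helper functions `load` and `annotations_for` are all assumptions made for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical reference corpus: every token carries an identifier.
CORPUS_XML = """<corpus>
  <w id="w1">Le</w><w id="w2">Monde</w><w id="w3">publie</w>
</corpus>"""

# Hypothetical annotation levels, one document per level, each
# pointing back to token identifiers of the reference corpus.
POS_XML = """<annotations level="pos">
  <ann target="w1" value="DET"/>
  <ann target="w2" value="NOUN"/>
  <ann target="w3" value="VERB"/>
</annotations>"""

CHUNK_XML = """<annotations level="chunk">
  <ann target="w1" value="NP"/>
  <ann target="w2" value="NP"/>
</annotations>"""


def load(conn, corpus_xml, annotation_xmls):
    """Store the corpus and its annotation levels in two tables."""
    conn.execute(
        "CREATE TABLE token (id TEXT PRIMARY KEY, position INTEGER, form TEXT)"
    )
    conn.execute(
        "CREATE TABLE annotation "
        "(level TEXT, target TEXT REFERENCES token(id), value TEXT)"
    )
    for position, w in enumerate(ET.fromstring(corpus_xml).iter("w")):
        conn.execute(
            "INSERT INTO token VALUES (?, ?, ?)",
            (w.get("id"), position, w.text),
        )
    for document in annotation_xmls:
        root = ET.fromstring(document)
        for ann in root.iter("ann"):
            conn.execute(
                "INSERT INTO annotation VALUES (?, ?, ?)",
                (root.get("level"), ann.get("target"), ann.get("value")),
            )


def annotations_for(conn, form):
    """Return (level, value) pairs attached to tokens with this form."""
    rows = conn.execute(
        "SELECT a.level, a.value FROM annotation a "
        "JOIN token t ON t.id = a.target "
        "WHERE t.form = ? ORDER BY a.level",
        (form,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn, CORPUS_XML, [POS_XML, CHUNK_XML])
    print(annotations_for(conn, "Monde"))
```

In this sketch the links from each annotation level to the reference corpus become foreign keys, so collecting every annotation attached to a token is a single join rather than a traversal of several XML documents.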
The present paper intends to interleave XML and relational database approaches in order to obtain a general methodology and models for complex LR exchange and exploitation. We claim that (1) an additional abstract level similar to the one used in relational databases can be useful to define XML encoding principles and (2) a lightweight relational database based on FSA inferred from the XML encoding can be particularly efficient for internal computation.
Applied to multilevel annotation, the preliminary abstract model has to express the relations between the reference corpus and the different annotation levels. The solution we propose aims to combine the multilayer encoding of multiple views of the same corpus with the encoding of information redundancies. These redundancies, obtained from the XML structure, would allow the design of internal representations that can be optimized in time and space on the basis of FSA (Finite State Automata). We argue that a high level of structural organization of the LR is likely to lead to efficient processing, through the identification of similar factors and shared properties.
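As a rough illustration of how shared structure can be factored with a finite state automaton, the following Python sketch builds a deterministic automaton in which sequences with a common prefix share states and transitions. This is only one simple form of factoring (a prefix tree) and is not claimed to be the construction used in this paper; the element paths and the function names `build_prefix_automaton` and `accepts` are hypothetical.

```python
def build_prefix_automaton(sequences):
    """Build a deterministic automaton in which common prefixes are shared.

    Returns (transitions, finals), where transitions maps
    (state, symbol) -> state and state 0 is the initial state.
    """
    transitions = {}
    finals = set()
    next_state = 1
    for sequence in sequences:
        state = 0
        for symbol in sequence:
            key = (state, symbol)
            if key not in transitions:
                # First time this symbol is seen after this prefix.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)
    return transitions, finals


def accepts(transitions, finals, sequence):
    """Check whether the automaton recognizes the given sequence."""
    state = 0
    for symbol in sequence:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in finals


# Hypothetical element paths taken from an annotated document.
PATHS = [
    ("article", "paragraph", "sentence", "np"),
    ("article", "paragraph", "sentence", "vp"),
    ("article", "title", "np"),
]

if __name__ == "__main__":
    transitions, finals = build_prefix_automaton(PATHS)
    total_symbols = sum(len(path) for path in PATHS)
    print(total_symbols, "symbols stored as", len(transitions), "transitions")
```

Each repeated prefix is stored once, which is the kind of space saving that redundancy in the XML structure makes possible. A minimized automaton would additionally share common suffixes.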
In the next section, we introduce our preliminary abstract level and present the model for multilevel annotations of textual corpora. We then show how to derive an XML encoding scheme from this model. In section 4, we suggest an internal representation inferred from the XML encoding and based on FSA. Finally, a first implementation of these principles, tested on a corpus of newspaper articles (Le Monde), is described.