Principles for multilevel annotated textual corpus

The principle of virtual resources consists in describing a document as a set of elementary links (possibly unary) to subordinate documents. These links are occurrences of a given information type and do not duplicate any content described in the documents they refer to. For instance, an HTML sub-document that would otherwise be duplicated can be reached from several different WWW pages. The resulting representation is an acyclic graph, which is relevant for representing multilayer annotations of textual corpora, as shown by [Bird and Liberman1999]. Their abstract representation consists of a main axis to which the different annotation levels refer by means of edges.
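
As a minimal sketch of the virtual resource principle, the following Python fragment shares one sub-document between two pages by reference rather than by copy. The `Document` class, the `resolve` helper and the sample pages are hypothetical names introduced for illustration only; they are not part of the model described here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Document:
    """A document: its own content plus links to subordinate documents."""
    name: str
    content: str = ""
    # Links hold references to other documents, never copies of their content.
    links: list[Document] = field(default_factory=list)


def resolve(doc: Document) -> str:
    """Expand a document by following its links depth-first.

    Assumes the link graph is acyclic, as stated in the text.
    """
    parts = [doc.content] if doc.content else []
    parts.extend(resolve(child) for child in doc.links)
    return " ".join(parts)


# One shared sub-document, reached from two different pages.
footer = Document("footer", "Contact: see the home page.")
page_a = Document("page_a", "Page A.", [footer])
page_b = Document("page_b", "Page B.", [footer])

# Both pages point at the same object: the footer text is stored once.
assert page_a.links[0] is page_b.links[0]
print(resolve(page_a))
```

Because two pages can share the same sub-document, the structure is a directed acyclic graph rather than a tree.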

To use this principle for multilevel linguistic annotation, we must correctly identify redundant sub-documents. We have carried out this identification with a RROM. Each level of annotation (morpho-syntactic tags, phrase structure, referring expressions, dialogue acts, topics...) becomes a different view on the same text. Here, a particular annotation is a relation to a reference word or to a sequence of reference words. We generalize this approach by considering the words themselves as tags expressed in an independent sub-document. These word tags are then linked to a reference axis. Any kind of combined annotation, such as gestures or sounds, can be integrated according to this linking principle. This is particularly useful for encoding multimodal dialogues, which may include gestures, visual scenes, speech and reference representations.

The minimal unit of description of a textual corpus, used to simulate the reference axis, is the event. Events are ordered by a strict order relation. An event corresponds to a point on the reference axis and can be identified with a key. This axis is similar to the one of [Bird and Liberman1999] and is adequate for representing a temporal axis. For each level of information, an annotation is a relation between a tag and one or more events. Each level of annotation imposes its own semantics on the link relations. A single link from an event to a word can be interpreted as an occurrence of the word in a textual corpus. Two links from a gesture tag to two events can signify the beginning and the end of the gesture. Our experience has shown that a direct encoding of these links with the XML link machinery can result in documents that are difficult to develop, interpret and maintain. By using a RROM to represent the relations between the various annotations and the reference axis, as shown in figure 3, we can express a multilevel annotation system with a precise and comprehensive abstract model that leads to an efficient use of virtual resources.
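
The event and annotation notions above can be sketched as follows. This is only an illustration under stated assumptions: integer keys stand in for event keys, and the class names, the `view` and `span` helpers, and the sample utterance and gesture are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Event:
    """A point on the reference axis, identified by a key."""
    key: int  # assumption: integer keys give the strict order


@dataclass(frozen=True)
class Annotation:
    """A relation between a tag and one or more events, at a given level."""
    level: str
    tag: str
    events: tuple


# Reference axis: strictly ordered events.
e1, e2, e3, e4 = (Event(k) for k in range(1, 5))

annotations = [
    # Two links from a gesture tag: beginning and end of the gesture.
    Annotation("gesture", "pointing", (e2, e4)),
    # One link between a word tag and an event: an occurrence of that word.
    Annotation("word", "there", (e3,)),
    Annotation("word", "put", (e1,)),
    Annotation("word", "that", (e2,)),
]


def view(level):
    """Return one annotation level as a view on the shared axis, in axis order."""
    selected = [a for a in annotations if a.level == level]
    return sorted(selected, key=lambda a: a.events[0])


def span(annotation):
    """Interpret a two-event annotation as a (begin, end) pair of keys."""
    begin, end = annotation.events
    return begin.key, end.key


print([a.tag for a in view("word")])
print(span(view("gesture")[0]))
```

Each level decides how its links are read: the word level treats a single link as an occurrence, while the gesture level reads its two links as an interval on the same axis.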

In figure 3, we see that events are linked with occurrences of a word or a compound. Compounds can also be viewed as a relation between several words (n-edge). Each occurrence (of a single word or a compound) is linked with a morphosyntactic tag. Dependency relations between two occurrences make it possible to build a full dependency tree. Finally, a phrase tag can also be linked to an occurrence relation in order to give the category (VP, NP, ...) of the phrase dominated by this occurrence.
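
A small sketch of these relations, using plain Python dictionaries, is given below. All identifiers, tag names and the sample sentence are made up for illustration, and linking a compound occurrence to the events of both of its words is an assumption, not something the figure is known to prescribe.

```python
# Entities (all values are made-up illustration data).
words = {"w1": "New", "w2": "York", "w3": "sleeps"}
# A compound is an n-ary relation between several words.
compounds = {"c1": ("w1", "w2")}

# Occurrence relations: occurrence id -> (word or compound id, event keys).
occurrences = {
    "o1": ("c1", (1, 2)),  # assumption: a compound occurrence spans two events
    "o2": ("w3", (3,)),
}
# Each occurrence is linked with a morphosyntactic tag.
morpho_tags = {"o1": "proper-noun", "o2": "verb"}
# Dependency relations between two occurrences: (head, dependent).
dependencies = [("o2", "o1")]
# A phrase tag gives the category of the phrase dominated by an occurrence.
phrase_tags = {"o1": "NP"}


def surface(occ_id):
    """Return the surface string of a word or compound occurrence."""
    target, _events = occurrences[occ_id]
    if target in compounds:
        return " ".join(words[w] for w in compounds[target])
    return words[target]


def root():
    """Return the single occurrence that is never a dependent."""
    dependents = {dep for _, dep in dependencies}
    roots = [o for o in occurrences if o not in dependents]
    if len(roots) != 1:
        raise ValueError("expected exactly one root occurrence")
    return roots[0]


def dependency_tree(head):
    """Build the dependency tree below `head` as nested tuples."""
    children = [dep for h, dep in dependencies if h == head]
    return (surface(head), morpho_tags[head],
            [dependency_tree(c) for c in children])


def dominated(head):
    """Return all occurrences in the subtree of `head`, in axis order."""
    found = [head]
    for h, dep in dependencies:
        if h == head:
            found.extend(dominated(dep))
    return sorted(found, key=lambda o: occurrences[o][1][0])


print(dependency_tree(root()))
```

Every level here is a separate table of links to occurrences, so adding a new annotation level does not require touching the existing ones.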

Table: Example of a classical textual annotation and XML documents for Resource Entities.

Table: XML documents for Resource Relations.

Figure: Internal Finite State Representation.

One can note the similarity between the RROM and the entity/relation models used in relational databases. The main difference is that the links do not need to contain their own attributes: the model can be realized with the current specifications of the XML standard. It can also use DTDs to express constraints on resources. Our proposal can be seen as an attempt to combine the well-specified methodology of relational databases with the portability, expressivity and adequacy of XML for textual data, and with the efficiency of finite state representations for internal computation.
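
To make the analogy with relational databases concrete, the sketch below keeps entities and relations in two separate XML fragments and checks a simple referential constraint between them with Python's standard `xml.etree.ElementTree` module. The element names (`entities`, `relations`, `occurrence`, `ref`) and the `dangling_refs` helper are hypothetical and do not reproduce the encoding shown in the tables; the check is a hand-written stand-in for a constraint, not DTD validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity document: events and word tags, each with an id.
ENTITIES = """<entities>
  <event id="e1"/>
  <event id="e2"/>
  <word id="w1">hello</word>
  <word id="w2">world</word>
</entities>"""

# Hypothetical relation document: each link only lists what it connects.
RELATIONS = """<relations>
  <occurrence><ref>w1</ref><ref>e1</ref></occurrence>
  <occurrence><ref>w2</ref><ref>e2</ref></occurrence>
</relations>"""


def dangling_refs(entities_xml, relations_xml):
    """Return the references that do not point at a declared entity."""
    ids = {el.get("id") for el in ET.fromstring(entities_xml)}
    refs = [r.text for r in ET.fromstring(relations_xml).iter("ref")]
    return [r for r in refs if r not in ids]


def occurrences(entities_xml, relations_xml):
    """Resolve each occurrence link to a (word text, event id) pair."""
    by_id = {el.get("id"): el for el in ET.fromstring(entities_xml)}
    pairs = []
    for occ in ET.fromstring(relations_xml).iter("occurrence"):
        word_ref, event_ref = (r.text for r in occ.findall("ref"))
        pairs.append((by_id[word_ref].text, event_ref))
    return pairs


print(dangling_refs(ENTITIES, RELATIONS))
print(occurrences(ENTITIES, RELATIONS))
```

As with foreign keys in a relational schema, a relation is only meaningful if every reference it contains resolves to an existing entity.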

Patrice Lopez
Thu Apr 13 09:23:20 MET DST 2000