next up previous
Next: Efficient internal representations of Up: A Framework for Multilevel Previous: Principles for multilevel annotated

XML encoding for Multilevel annotated corpora

XML encoding and its tools offer several advantages over databases: standardization (data exchange, Unicode), existing dedicated tools (parsers, style sheets for document conversion), and the properties inherited from SGML (a formalism oriented towards textual and linguistic resources, header specification, the Text Encoding Initiative (TEI) specifications, ...). Moreover, XML now provides useful structuring features through XML links and XML path specifications.

The second step of our representation, the XML encoding, follows directly from the organisation model of the resources. As usual, each annotation consists of a main tag and a list of attributes, and each annotation carries a unique identifier (id). The whole set of possible annotations for a given RE is gathered in a single document called the auxiliary resource document. For each RE there is one auxiliary resource document, which lists exactly once every tag needed to annotate the corpus. For each annotation level there is a relational document, which specifies the links between the tags given in the different auxiliary documents and, possibly, the events of the reference axis. These documents are built with XML link tags, which link the identifiers of the resource tags to the keys (integers here) of the related events. The reference axis is not represented explicitly by a document; it is given implicitly by the list of event identifiers.
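The organisation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (`resource`, `item`, `relation`, `link`), the `event` attribute, the file name `words.xml` and the sample tokens are hypothetical and do not come from the paper; only the XLink namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"  # standard XLink namespace
ET.register_namespace("xlink", XLINK)


def build_resource_document(prefix, values):
    """Auxiliary resource document: each distinct value is listed once, with an id."""
    root = ET.Element("resource", {"name": prefix})  # hypothetical element names
    ids = {}
    for value in values:
        if value not in ids:
            ids[value] = f"{prefix}{len(ids) + 1}"
            item = ET.SubElement(root, "item", {"id": ids[value]})
            item.text = value
    return root, ids


def build_relation_document(level, href_file, ids, events):
    """Relational document: one link per event key, pointing at a resource id."""
    root = ET.Element("relation", {"level": level})
    for key, value in events:
        ET.SubElement(root, "link", {
            "event": str(key),  # integer key on the implicit reference axis
            f"{{{XLINK}}}href": f"{href_file}#{ids[value]}",
        })
    return root


# Hypothetical five-token corpus; "the" occurs twice but is stored once.
tokens = ["the", "cat", "sees", "the", "dog"]
words_doc, word_ids = build_resource_document("w", tokens)
links_doc = build_relation_document(
    "word", "words.xml", word_ids, enumerate(tokens, start=1)
)

print(ET.tostring(words_doc, encoding="unicode"))
print(ET.tostring(links_doc, encoding="unicode"))
```

In this sketch the reference axis never appears as a document of its own: it exists only as the sequence of `event` keys carried by the links, which matches the implicit representation described above.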

Each auxiliary resource document assumes that a DTD (Document Type Definition) specifies the constraints on the expression of the corresponding annotation resources. Given the relation model introduced previously and one DTD for each RE, the corresponding XML encoding can be specified unambiguously.
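To make the role of the DTD concrete, here is a minimal sketch of what such a definition might look like for a tag-set resource document, together with a hand-written check of the constraints it expresses. The DTD content and all element and attribute names are assumptions made for illustration. Python's standard library does not validate against a DTD, so the function below re-implements the few constraints by hand and is not a general validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for one auxiliary resource document (names are assumptions).
# It is shown for reference only; the check below mirrors it manually.
TAGSET_DTD = """
<!ELEMENT resource (item*)>
<!ATTLIST resource name CDATA #REQUIRED>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item id ID #REQUIRED>
"""


def check_resource(xml_text):
    """Return the list of violations of the constraints expressed by TAGSET_DTD."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "resource" or "name" not in root.attrib:
        problems.append("root must be <resource name=...>")
    seen = set()
    for child in root:
        if child.tag != "item":
            problems.append(f"unexpected element <{child.tag}>")
            continue
        if len(child):
            problems.append("<item> must contain only text")
        item_id = child.get("id")
        if item_id is None:
            problems.append("<item> without id")
        elif item_id in seen:
            # An attribute of type ID must be unique within the document.
            problems.append(f"duplicate id {item_id}")
        else:
            seen.add(item_id)
    return problems


good = '<resource name="tag"><item id="t1">Det</item><item id="t2">Np</item></resource>'
bad = '<resource name="tag"><item id="t1">Det</item><item id="t1">Np</item></resource>'
print(check_resource(good))
print(check_resource(bad))
```

The uniqueness of identifiers is the constraint that matters most here, since the relational documents rely on each id designating exactly one annotation.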

Table 1 gives an example of a classical annotated corpus and the new encoding documents. Each RE is encoded in an independent XML document: the document corresponding to the RE word of figure 3 can be viewed as the dictionary of the corpus, and the document for the RE morphosyntactic tag as the tag set. An additional XML document, not shown here, gives an explicit label for each tag (for instance, Np stands for plural noun). Each element of these documents carries an identifier so that it can be linked. Each RR is also defined in an XML document using XML links, as shown in table 2.
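The relationship between the two views can be illustrated by resolving the links back into a classical inline annotation. The sketch below assumes a hypothetical markup (elements `resource`, `item`, `relation`, `link`; files `words.xml` and `tags.xml`) and a three-word sample; none of these reproduce the paper's tables, and only the tag Np comes from the text above.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical auxiliary resource documents: the dictionary and the tag set.
WORDS = """<resource name="word">
  <item id="w1">the</item><item id="w2">cats</item><item id="w3">sleep</item>
</resource>"""
TAGS = """<resource name="tag">
  <item id="t1">Det</item><item id="t2">Np</item><item id="t3">V</item>
</resource>"""

# Hypothetical relational documents: event keys linked to resource ids.
WORD_LINKS = """<relation level="word" xmlns:xlink="http://www.w3.org/1999/xlink">
  <link event="1" xlink:href="words.xml#w1"/>
  <link event="2" xlink:href="words.xml#w2"/>
  <link event="3" xlink:href="words.xml#w3"/>
</relation>"""
TAG_LINKS = """<relation level="morphosyntax" xmlns:xlink="http://www.w3.org/1999/xlink">
  <link event="1" xlink:href="tags.xml#t1"/>
  <link event="2" xlink:href="tags.xml#t2"/>
  <link event="3" xlink:href="tags.xml#t3"/>
</relation>"""


def load_resource(xml_text):
    """Index a resource document by id."""
    return {item.get("id"): item.text for item in ET.fromstring(xml_text).iter("item")}


def resolve(relation_xml, resources):
    """Map each event key to the annotation value its link points at."""
    resolved = {}
    for link in ET.fromstring(relation_xml).iter("link"):
        document, _, item_id = link.get(XLINK_HREF).partition("#")
        resolved[int(link.get("event"))] = resources[document][item_id]
    return resolved


resources = {"words.xml": load_resource(WORDS), "tags.xml": load_resource(TAGS)}
word_at = resolve(WORD_LINKS, resources)
tag_at = resolve(TAG_LINKS, resources)

# Rebuild the classical word/tag view by walking the event keys in order.
inline = " ".join(f"{word_at[k]}/{tag_at[k]}" for k in sorted(word_at))
print(inline)  # the/Det cats/Np sleep/V
```

Because the two annotation levels only share event keys, either relational document can be replaced or extended without touching the other or the resource documents.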



Patrice Lopez
Thu Apr 13 09:23:20 MET DST 2000