The usual output of an XML parser requires a lot of memory and generally offers inefficient access mechanisms for links. The event-based XML parser SAX gives access to the relevant data without loading the full document, but it is slower than a classical parser and has the same drawbacks concerning reversibility of access. Our proposal is to use finite-state automaton (FSA) techniques for the internal representation of the XML document. FSA techniques offer time and space optimisations and efficient reversible access. The efficiency of this representation comes from exploiting the redundancy of the information. By identifying and encoding this redundancy with, respectively, a RROM and an XML document, we can obtain this efficient internal representation in a straightforward way.
Figure: Screen shot of the workbench.
Figure: Dependency tree.
Each auxiliary resource document is compiled into an automaton with prefix sharing (a lexicographic tree). Each XML relational document gives the transitions between the different automata obtained from the auxiliary resource documents. A relational document can be compiled into a transducer that links auxiliary resource entries identified by their XML id. Edges of this transducer are labelled with the pair of names in relation (see figure 4) in order to allow fully reversible access. Reading any tag gives, in linear time, the links to all auxiliary resources related to this tag. The identifiers used in the XML encoding serve only to build this representation.
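As an illustration only, the following Python sketch shows one possible way to realise the two ingredients described above: a lexicographic tree in which entries share their common prefixes, and a relation that can be traversed in both directions. The class names `Trie` and `Relation`, the entry ids, and the sample entries are assumptions made for this sketch; they are not taken from the actual implementation, which compiles true automata and transducers.

```python
from collections import defaultdict


class Trie:
    """Lexicographic tree: entries with a common prefix share states."""

    def __init__(self):
        self.root = {}
        self.n_states = 1  # the root state

    def add(self, entry, entry_id):
        node = self.root
        for ch in entry:
            if ch not in node:
                node[ch] = {}
                self.n_states += 1
            node = node[ch]
        node[""] = entry_id  # "" marks a final state and carries the id

    def lookup(self, entry):
        """Return the id of `entry`, reading one character per step."""
        node = self.root
        for ch in entry:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("")


class Relation:
    """Links between entries of two resources, readable in both directions.

    The label is the pair of resource names (hypothetical names here).
    """

    def __init__(self, left_name, right_name):
        self.label = (left_name, right_name)
        self.forward = defaultdict(set)   # left id  -> right ids
        self.backward = defaultdict(set)  # right id -> left ids

    def link(self, left_id, right_id):
        self.forward[left_id].add(right_id)
        self.backward[right_id].add(left_id)


# Hypothetical auxiliary resources: a word list and a tag list.
words = Trie()
for i, w in enumerate(["cat", "car", "dog"]):
    words.add(w, i)

tags = Trie()
for i, t in enumerate(["NOUN", "VERB"]):
    tags.add(t, i)

word_tag = Relation("word", "tag")
for w in ["cat", "car", "dog"]:
    word_tag.link(words.lookup(w), tags.lookup("NOUN"))
```

In this sketch "cat" and "car" share the states for the prefix "ca", so the word tree needs 8 states rather than the 10 that three independent paths would require. Looking up an entry costs one step per character, and the `backward` table gives the reverse direction of each link without a search.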
In figure 4, reading an event key gives access to all levels of annotation linked to the corresponding event. Given a word, the following word is simply the word associated with the next event key. Accessing a given word (or a given tag) yields, still in linear time, the list of event keys corresponding to all occurrences of that word (or tag) on the reference axis (i.e. in the corpus).
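The access patterns just described can be sketched as follows. This is a simplified model, assuming that event keys are consecutive integers and that each event carries one value per annotation level; the class `ReferenceAxis`, its method names, and the sample sentence are hypothetical and stand in for the automaton-based representation.

```python
from collections import defaultdict


class ReferenceAxis:
    """Toy reference axis: event keys are positions 0, 1, 2, ... (assumed)."""

    def __init__(self):
        self.events = []                      # event key -> {level: value}
        self.occurrences = defaultdict(list)  # (level, value) -> event keys

    def append(self, **annotations):
        """Add an event and index each of its annotations; return its key."""
        key = len(self.events)
        self.events.append(annotations)
        for level, value in annotations.items():
            self.occurrences[(level, value)].append(key)
        return key

    def annotations(self, key):
        """All annotation levels linked to one event."""
        return self.events[key]

    def next_word(self, key):
        """The word associated with the next event key, if any."""
        nxt = key + 1
        return self.events[nxt]["word"] if nxt < len(self.events) else None

    def find(self, level, value):
        """Event keys of all occurrences of a word or a tag."""
        return list(self.occurrences.get((level, value), []))


axis = ReferenceAxis()
for word, tag in [("the", "DET"), ("cat", "NOUN"), ("sees", "VERB"),
                  ("the", "DET"), ("dog", "NOUN")]:
    axis.append(word=word, tag=tag)

print(axis.annotations(1))         # {'word': 'cat', 'tag': 'NOUN'}
print(axis.next_word(0))           # cat
print(axis.find("word", "the"))    # [0, 3]
print(axis.find("tag", "NOUN"))    # [1, 4]
```

The same table serves both directions: from an event key to its annotations, and from a word or a tag back to every event key where it occurs.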
Even for very large corpora, the size of the auxiliary resource automata remains limited. By contrast, the internal representation of the reference axis and of the various transitions to words and tags can become very large when there are millions of events. In this case, cache techniques with temporary files may be necessary.