Enhanced suffix arrays as language models: Virtual k-testable languages
Stehouwer, H., & van Zaanen, M.
Enhanced suffix arrays as language models: Virtual k-testable languages. In J. M. Sempere, & P. García (Eds.),
Grammatical inference: Theoretical results and applications. 10th International Colloquium, ICGI 2010, Valencia, Spain, September 13-16, 2010. Proceedings
(pp. 305-308). Berlin: Springer.
In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited
size n. This approach, which is used with synchronous back-off, allows
us to distinguish between alternative sequences using large contexts. We
also show that we can build models of this kind with additional information for each symbol, such as part-of-speech tags and dependency information.
The approach can also be viewed as a collection of virtual k-testable
automata. Once built, we can directly access the results of any k-testable
automaton generated from the input training data. Synchronous back-off
automatically identifies the k-testable automaton with the largest
feasible k. We have used this approach in several classification tasks.
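
To make the idea concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a plain (not enhanced) suffix array built by naive sorting, uses only left context, and reads synchronous back-off as: shorten the context for all alternatives at the same time until at least one of them occurs in the training data. The function names build_suffix_array, count_ngram and synchronous_backoff are hypothetical and chosen for illustration only.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens` (a list),
    sorted lexicographically. Naive construction for illustration; an
    enhanced suffix array would add auxiliary tables such as LCP values."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, sa, ngram):
    """Count the occurrences of `ngram` in `tokens` by binary search over
    the suffix array `sa`. Works for any n without rebuilding anything."""
    ngram = list(ngram)
    n = len(ngram)

    def first_index(strict):
        # First position in `sa` whose n-token prefix is >= ngram
        # (strict=False) or > ngram (strict=True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]:sa[mid] + n]
            if prefix < ngram or (strict and prefix == ngram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)


def synchronous_backoff(tokens, sa, context, alternatives):
    """Choose between `alternatives` that could follow `context`.

    All alternatives are scored with the same context length k. If none
    of them is attested, k is reduced by one for all of them together.
    Returns (best_alternative, n), where n = k + 1 is the n-gram size at
    which the decision was made, or (None, 0) if nothing is attested."""
    context = list(context)
    for k in range(len(context), -1, -1):
        window = context[len(context) - k:]
        counts = {alt: count_ngram(tokens, sa, window + [alt])
                  for alt in alternatives}
        if any(counts.values()):
            return max(counts, key=counts.get), k + 1
    return None, 0
```

Under these assumptions, each call to count_ngram with an n-gram of length k plays the role of querying a k-testable model built from the same training data, and the loop in synchronous_backoff stops at the largest k for which some alternative is attested. Ties between alternatives are broken by their order in the input list, which is an arbitrary choice of this sketch.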