Token merging in language model-based confusible disambiguation
Stehouwer, H., & Van Zaanen, M.
In T. Calders, K. Tuyls, & M. Pechenizkiy (Eds.), Proceedings of the 21st Benelux Conference on Artificial Intelligence
In the context of confusible disambiguation (spelling correction that requires context), the synchronous
back-off strategy combined with traditional n-gram language models performs well. However, when
alternatives consist of a different number of tokens, this classification technique cannot be applied directly,
because the computation of the probabilities is skewed. Previous work already showed that probabilities
based on different order n-grams should not be compared directly.
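To see why such probabilities are skewed, consider a toy maximum-likelihood model (the corpus, smoothing-free estimates, and function names below are illustrative assumptions, not the paper's setup): a one-token alternative scored with a single trigram conditional is systematically on a different scale than a two-token alternative scored as a product of two bigram conditionals.

```python
from collections import Counter

# Hypothetical toy corpus; counts are for illustration only.
corpus = "the cat sat on the mat the cat lay on the mat".split()

def ngram_counts(tokens, n):
    """Count all n-grams of order n in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

def p_trigram(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1 w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

def p_bigram(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[(w1,)]

# A one-token alternative scored with one trigram conditional versus a
# two-token alternative scored as a product of two bigram conditionals:
# the product of two probabilities is systematically smaller, so the
# two scores live on different scales and cannot be compared directly.
score_one_token = p_trigram("cat", "sat", "on")
score_two_tokens = p_bigram("cat", "lay") * p_bigram("lay", "on")
```

Even when both alternatives fit their contexts equally often, the multi-token alternative is penalized simply for being a product of more conditional probabilities.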
In this article, we propose new probability metrics in which the n-gram order n is varied according to the
number of tokens in the confusible alternative. This requires access to n-grams of variable length. Results
show that the synchronous back-off method is extremely robust.
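One way to realize such a metric can be sketched as follows (a minimal illustration under our own assumptions, not the paper's implementation): each alternative is embedded in the same fixed left and right context, so a k-token alternative is scored over a longer n-gram, and the alternatives are compared over identical contexts.

```python
from collections import Counter

# Illustrative toy corpus; the confusible slot, contexts, and scoring
# function are assumptions made for this sketch.
corpus = ("i should of course have known better "
          "i should have known better of course").split()

def ngram_counts(tokens, n):
    """Count all n-grams of order n in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def score(tokens, left, alternative, right):
    """Relative frequency of left + alternative + right among all
    n-grams of the same (variable) order: the order grows with the
    number of tokens in the alternative, while the context is fixed."""
    ngram = tuple(left) + tuple(alternative) + tuple(right)
    counts = ngram_counts(tokens, len(ngram))
    total = sum(counts.values())
    return counts[ngram] / total if total else 0.0

# A one-token and a two-token alternative for the same slot, scored
# with identical one-word contexts on each side.
alternatives = [["have"], ["of", "course"]]
scores = {tuple(a): score(corpus, ["should"], a, ["known"])
          for a in alternatives}
```

Because every candidate is scored over the same surrounding context, the comparison no longer penalizes an alternative merely for containing more tokens.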
We discuss the use of suffix trees as a technique to store variable-length n-gram information efficiently.
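The appeal of such structures is that one index answers count queries for n-grams of any order. As a rough illustration (using a token-level suffix array, a simpler relative of the suffix tree, rather than the paper's actual data structure), all suffixes sharing an n-gram prefix are adjacent after sorting, so a binary search suffices:

```python
import bisect

# Toy corpus; a real implementation would search the structure directly
# instead of rebuilding the key list per query as done here.
corpus = "the cat sat on the mat the cat lay on the mat".split()

# Sort all suffix start positions of the token sequence once.
suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def count(ngram):
    """Occurrences of ngram, of any length, via binary search: every
    suffix starting with ngram sits in one contiguous sorted run."""
    ngram = list(ngram)
    keys = [corpus[i:i + len(ngram)] for i in suffixes]
    lo = bisect.bisect_left(keys, ngram)
    hi = bisect.bisect_right(keys, ngram)
    return hi - lo
```

The same structure serves unigram, bigram, and trigram queries alike, which is exactly what variable-order scoring requires, instead of one count table per n-gram order.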