The Whitespace Text analyzer

This analyzer splits the input text it receives into tokens at white space. It can also be configured to treat specific characters, such as punctuation marks, in a particular way.

Figure 350. Whitespace analyzer configuration panel


The + (Add) and - (Remove) buttons add or remove a category of characters; each category is represented by one row in the table. A category can contain one or more characters; if it contains more than one, each character is handled separately according to the setting for that category. The table has two columns: Marks, where the special characters or marks are entered, and Action, which specifies how those characters are handled during tokenization. Clicking a cell in the Action column opens a dropdown list of predefined actions:

Changes take effect only after the Apply button is clicked, which passes the new settings to the analyzer.
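
The behaviour described in this section can be illustrated with a minimal Python sketch. It is not the analyzer's actual implementation: the action names (keep, remove, separate) and the functions build_mark_table and tokenize are hypothetical, chosen only to show how a table of Marks and Action rows might be applied character by character after splitting on white space. Characters not listed in any category are assumed to stay attached to their token.

```python
# Hypothetical action names; these are assumptions, not the entries of the
# real Action dropdown.
KEEP = "keep"          # leave the mark attached to its token
REMOVE = "remove"      # drop the mark entirely
SEPARATE = "separate"  # emit the mark as a token of its own


def build_mark_table(categories):
    """Expand (marks, action) rows into a per-character lookup table."""
    table = {}
    for marks, action in categories:
        for ch in marks:  # each character in a category is treated separately
            table[ch] = action
    return table


def tokenize(text, categories):
    """Split text on white space, then apply the per-mark actions."""
    table = build_mark_table(categories)
    tokens = []
    for chunk in text.split():  # split() without arguments splits on any white space
        current = ""
        for ch in chunk:
            action = table.get(ch, KEEP)  # assumption: unlisted characters are kept
            if action == REMOVE:
                continue
            if action == SEPARATE:
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


# Two categories, corresponding to two rows in the table.
rows = [(".,", SEPARATE), ("\"'", REMOVE)]
print(tokenize('He said, "stop." Then left.', rows))
```

In this sketch, the first row causes periods and commas to become tokens of their own, while the second row strips quotation marks. Rebuilding the lookup table on every call mirrors the role of the Apply button: the tokenizer only sees the rows it is given at that moment.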