1.4. Regular Expressions

Regular expressions allow users to create complicated queries. Below is a list of the most commonly used regular expressions, together with explanations and some potential uses.

The following tables were created by a user of ELAN (an annotation tool that has the same search mechanism as TROVA). Other users may also find them useful, since they offer a simple and clear overview of the main symbols used in regular expressions (partly different from the ones just seen), with a short explanation and an example for each. Bear in mind that the examples are taken from the language the user is researching, so pay attention to how the regular expressions work rather than to the meaning of the words.

Table 1.1. Symbols

Symbol     Position                                  Meaning
\b         at the beginning and/or end of a string   word boundary
\w+        at the end of a string                    variable end of word
.          anywhere                                  any letter
.*         between spaces                            any string of letters between spaces/any word
.*\        between spaces                            any string of words
(x|y)      anywhere                                  either x or y
[^x]       place at the beginning                    not x
(....)\1   anywhere                                  words with four reduplicated letters
?          after a letter                            the preceding letter is optional
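
The symbols in Table 1.1 can also be tried out outside the search tool. The following is a minimal sketch in Python, on the assumption that Python's `re` module handles these particular constructs the same way as the search mechanism described here. The word list and the helper `matching` are illustrative and are not part of ELAN or TROVA. The reduplication symbol is written with the backreference `\1`, which is the form Python accepts.

```python
import re

# Illustrative word list (forms taken from the examples in this section).
words = ["sa", "saka", "sahata", "vasaku", "pakupaku", "vaha", "vahaa"]

def matching(pattern, items):
    """Return the items in which the pattern is found at any position."""
    return [item for item in items if re.search(pattern, item)]

starts_with_sa = matching(r"\bsa", words)      # \b: word boundary before sa
sa_plus_two = matching(r"\bsa..\b", words)     # . : exactly one character each
ku_or_ha = matching(r"(ku|ha)", words)         # (x|y): either ku or ha
not_s_then_a = matching(r"\b[^s]a", words)     # [^s]: any character except s
optional_a = matching(r"\bvahaa?\b", words)    # ?: the final a is optional
reduplicated = matching(r"(....)\1", words)    # \1: repeats the four captured characters

print(starts_with_sa)   # ['sa', 'saka', 'sahata']
print(sa_plus_two)      # ['saka']
print(ku_or_ha)         # ['sahata', 'vasaku', 'pakupaku', 'vaha', 'vahaa']
print(not_s_then_a)     # ['vasaku', 'pakupaku', 'vaha', 'vahaa']
print(optional_a)       # ['vaha', 'vahaa']
print(reduplicated)     # ['pakupaku']
```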

Table 1.2. Search for particular word forms (examples)

Expression        Meaning                                             Examples
sa                all words containing the string sa                  sa, vasaku, sahata, tisa
\bsa              all words starting with sa                          sa, sahata, sana; NOT vasaku, tisa
\bsa\b            all occurrences of the word sa                      sa
\bsa..\b          all words consisting of sa followed by two letters  saka, saku, sana
\bsa\w+           all words beginning with sa, but not the word sa    sahata, sana
                  by itself
\b.*ana\b         all words ending in ana                             sinana, tamuana, sana, bana, maana
(....)\1          all words with four reduplicated letters            pakupaku, vapakupaku, mahumahun, vamahumahun
\b(....)\1        all words beginning with four reduplicated letters  pakupaku; NOT vapakupaku
\b(....)\1ana\b   all words beginning with four reduplicated letters  vasuvasuana, hunuhunuana
                  and ending in ana
\bva(....)\1      all words consisting of the prefix va- + four       vapakupaku, vagunagunaha
                  reduplicated letters
\bvahaa?\b        all tokens of vahaa and vaha                        vahaa, vaha
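
The rows of Table 1.2 can be checked mechanically. The sketch below is only an illustration, again assuming that Python's `re` module behaves like the search mechanism for these constructs. It tests each expression against the example forms listed for it, with the repeated group written as the backreference `\1`. The name `row_holds` and the layout of `rows` are hypothetical, and the forms that should not match are only those the table itself names.

```python
import re

# Rows of Table 1.2: (expression, forms that should match, forms that should not).
rows = [
    (r"sa", ["sa", "vasaku", "sahata", "tisa"], []),
    (r"\bsa", ["sa", "sahata", "sana"], ["vasaku", "tisa"]),
    (r"\bsa\b", ["sa"], []),
    (r"\bsa..\b", ["saka", "saku", "sana"], []),
    (r"\bsa\w+", ["sahata", "sana"], ["sa"]),
    (r"\b.*ana\b", ["sinana", "tamuana", "sana", "bana", "maana"], []),
    (r"(....)\1", ["pakupaku", "vapakupaku", "mahumahun", "vamahumahun"], []),
    (r"\b(....)\1", ["pakupaku"], ["vapakupaku"]),
    (r"\b(....)\1ana\b", ["vasuvasuana", "hunuhunuana"], []),
    (r"\bva(....)\1", ["vapakupaku", "vagunagunaha"], []),
    (r"\bvahaa?\b", ["vahaa", "vaha"], []),
]

def row_holds(expression, should_match, should_not_match):
    """True if every listed form behaves as the table says."""
    found = all(re.search(expression, form) for form in should_match)
    rejected = not any(re.search(expression, form) for form in should_not_match)
    return found and rejected

results = {expression: row_holds(expression, yes, no) for expression, yes, no in rows}
for expression, ok in results.items():
    print(expression, "OK" if ok else "MISMATCH")
```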

Table 1.3. Search for particular sequences of words (examples)

Expression                        Meaning                                             Examples
\bsaka\b .* \bhaa                 string of 3 words: (1) saka; (2) any word;          saka antee haa; saka abana haari;
                                  (3) the word haa by itself or with suffixes         saka kabuu haana
saka .* \bhaa\w+                  string of 3 words: (1) saka; (2) any word;          saka abana haari; saka kabuu haana
                                  (3) a word beginning with haa, but NOT the word
                                  haa by itself
(\bsaka\b|\bsa\b) \bpaku\b        2-word string consisting of saka or sa and paku     saka paku; sa paku
(\bsaka\b|\bsa\b) .* \bvaha\b     strings of 3 words: (1) saka or sa; (2) any word;   saka tii vaha; sa tapaku vaha
                                  (3) vaha
(\bsaka\b|\bsa\b) (....)\1 \bhaa  strings of 3 words: (1) saka or sa; (2) any word    sa natanata haa; saka natanata haana
                                  with four reduplicated letters; (3) the word haa
                                  or a word beginning with haa