3.5. Interlinearization mode

Interlinearization mode is a text-oriented mode designed for adding parsing and glossing annotations to one or more lines of interlinearized text. This can be done manually or with the help of one or more so-called Analyzers. The segmentation and (typically) the transcription of speech events need to be done in one or more of the other modes before interlinearization can be added in this mode.

Analyzers are software modules that accept an annotation as input and produce suggestions for one or more annotations, on one or more tiers, as output. Examples of the kinds of processing analyzers can perform are tokenization, morphological parsing and the lookup of glosses. The behavior of some analyzers can be configured in a settings panel. Some analyzers need a connection to a lexicon; others can perform their task based on the input alone. Analyzers are implemented as extensions, so that third-party users and developers can create and add their own analyzers, at least eventually: the API that makes this possible, called LEXAN, still has to be finalized, documented and published.

Part of the user interface of this mode is a Lexicon panel, the front end of a Lexicon Component module. It allows the user to create, import and edit a lexicon and its entries. Lexicons are stored separately from annotation data, in a new data format. These are the lexicons that analyzers can access.

To start the Interlinearization mode, click Options > Interlinearization Mode from the main window.


Figure 3.21. Select Interlinearization Mode


The main screen consists of four panels. The panels on the left side hold global settings, not tied to any particular transcription; the panels on the right side of the screen contain more specific settings and the transcriptions.


Figure 3.22. Interlinearization mode main view


To start working in Interlinearization Mode, you need to have already set up a tier structure and to have some segmentations (annotations on a top-level tier). In this mode the values of annotations can be edited, and annotations on dependent tiers, including subdivisions, can be created, but primary segmentations on top-level, independent tiers cannot; those have to be created in Annotation mode and/or Segmentation mode. It is still possible to add new tier types and tiers in this mode (please refer to Section 2.3 and Section 2.4 for more information about tier structures).

If you want to use an analyzer that requires a connection to a lexicon, you should first create or import a lexicon and link one or more tier types to specific fields of a lexical entry.

3.5.1. Types of analyzers and their settings

The following analyzers are distributed with ELAN:

  • Parse Analyzer

  • Gloss Analyzer

  • Lexicon Analyzer (a combination of the Parse and Gloss analyzers)

  • Whitespace Analyzer


Figure 3.23. Analyzer settings configuration panel


The names are somewhat misleading: the Parse, Gloss and Lexicon analyzers all require access to a lexicon. The Parse analyzer morphologically parses annotations on a word (or token) level tier, based on the lexical units (prefixes, stems, suffixes etc.) available in the lexicon (internally the parser is implemented as a state machine with a stack). The results are shown as parse suggestions in a suggestion window, from which the user can select one. This analyzer requires one source tier and one target tier, where the target is of a subdivision tier type.
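The manual does not document the parser's internals beyond "a state machine with a stack", but the general idea of affix-based matching can be sketched as follows. This is a much-simplified, invented illustration in Python, not ELAN's actual LEXAN code; the toy lexicon and the hyphen convention for marking prefixes and suffixes are assumptions for the example:

```python
# Toy lexicon: "un-" and "re-" are prefixes, "-ed" and "-s" suffixes,
# the remaining forms are stems.
LEXICON = {"un-", "re-", "-ed", "-s", "lock", "walk"}

def parse(token):
    """Return every way to split `token` into known prefixes,
    one stem and known suffixes."""
    results = []

    def step(rest, parts, have_stem):
        if not rest:
            if have_stem:                    # a valid parse needs a stem
                results.append(tuple(parts))
            return
        for i in range(1, len(rest) + 1):
            piece = rest[:i]
            if not have_stem and piece + "-" in LEXICON:   # try a prefix
                step(rest[i:], parts + [piece + "-"], False)
            if not have_stem and piece in LEXICON:         # try the stem
                step(rest[i:], parts + [piece], True)
            if have_stem and "-" + piece in LEXICON:       # try a suffix
                step(rest[i:], parts + ["-" + piece], True)

    step(token, [], False)
    return results
```

Each resulting tuple would correspond to one suggestion presented to the user; for example `parse("unlocked")` yields `[('un-', 'lock', '-ed')]`.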

The Gloss analyzer looks up the source annotation in the lexicon and lists all glosses found in the matched entries. The results are again presented as suggestions from which the user can select one. This analyzer requires one source tier and one target tier, where the target is of a symbolic association tier type.

The Lexicon analyzer is a combination of the parse and the gloss analyzer. By configuring the lexicon analyzer, the source tier containing the annotations will both be parsed and glossed in one action. This analyzer requires one source tier and two target tiers.

The Whitespace analyzer splits the selected source annotation at white spaces and places the resulting tokens on the target tier. It does not need any user confirmation. This analyzer requires one source tier and one target tier, where the target is of a subdivision tier type. Currently the behavior of this analyzer cannot be configured (e.g. with respect to the treatment of punctuation marks); this might be added in the future.
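In essence, the Whitespace analyzer's behavior amounts to little more than the following (a hypothetical Python sketch, not ELAN code); note how punctuation simply stays attached to the tokens:

```python
def whitespace_analyze(annotation_value):
    # Split the source annotation at runs of white space; each token
    # would become a child annotation on the (subdivision) target tier.
    return annotation_value.split()
```

For example, `whitespace_analyze("he unlocked the door.")` returns `["he", "unlocked", "the", "door."]`; the full stop is not separated from the last word.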

When configuring analyzers and their source and target tiers, the target tier of one analyzer can be the source tier of the next. The configuration of the tiers is based on tier types rather than on individual tiers.

[Note]Note

Configuration on the basis of individual tiers might be added later as an option as well.

3.5.1.1. The Lexicon analyzer

The Lexicon analyzer is a combination of the Parse and the Gloss analyzer. When a lexical entry matches a part of the input token during the matching process (and thus becomes part of one of the suggestions), the glosses of that entry are added to the suggestions too (these "glosses" can come from any field of the entry, depending on the tier type configuration). By configuring the Lexicon analyzer, the annotations on the source tier are both parsed and glossed in one action. This analyzer requires one source tier and two target tiers. (The LEXAN API currently limits the number of target tiers to two; this might be too restrictive and may need to be reconsidered in a future release.)

The Lexicon analyzer supports the following configurable settings (see Figure 3.23):

  • Include variants in the parsing process: if this option is checked, the parser also takes the variant field into account when matching morphemes from the lexicon against parts of the word or token it has received as input.

  • Match longer prefixes/suffixes first: by default the parser tries to match shorter prefixes before longer ones; this option reverses that, which affects the order of the suggestions.

  • Exclude aborted parses from results: if the parser has not finished (one iteration of) the matching process within the maximum number of steps, it adds an "++ABORT++" label at the position in the suggestions where it stopped. This option allows such parses to be filtered out of the presented results.

  • Case sensitive matching: tells the analyzer whether or not to ignore case in the matching process.

  • Maximum number of parse steps: determines when the parser should stop the matching process, to prevent an unusable number of suggestions.

  • Affix marker character: by default the analyzer assumes that the character used to mark a lexical entry as a prefix (a-) or suffix (-a) is a hyphen. This can be changed here (ideally this information would be an accessible property of the lexicon). Apart from this marker, the analyzer has hardcoded, built-in support for the morpheme types "prefix", "suffix", "root" and "stem" to determine what to try to match in the parsing process.

  • String for missing values: sets the text the analyzer should use to indicate that a part (e.g. a gloss) is missing from the lexicon.

  • "Replace" field in the lexicon this analyzer supports replacement of a matched morph by one or more characters to make the next parse step (more) successful. This replacement text should be in the lexical entry and by default the analyzer looks for a (custom) field "replace". If it is in another field, it can be specified here.

Changes in these settings will only be passed to the analyzer after clicking Apply Settings!
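The affix-marker setting can be made concrete with a small sketch (an invented Python helper, not part of ELAN): with the default hyphen, "un-" is treated as a prefix and "-ed" as a suffix, and changing the marker character changes the classification accordingly.

```python
def classify(entry_form, marker="-"):
    # Classify a lexical entry form as prefix, suffix or stem, based on
    # the position of the (configurable) affix marker character.
    if entry_form.endswith(marker):
        return "prefix"   # e.g. "un-"
    if entry_form.startswith(marker):
        return "suffix"   # e.g. "-ed"
    return "stem"         # no marker: a root or stem
```

With `marker="="`, for instance, `classify("=ed", marker="=")` returns `"suffix"`.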

3.5.1.2. The Parse analyzer

This analyzer is the same as the parser part of the Lexicon analyzer, with the same configurable settings.

3.5.1.3. The Gloss analyzer

This analyzer performs a look-up of the input token in the lexicon and returns all values of the lexical entry field it is configured for (via the tier type). This does not have to be the "gloss" field of the lexical entries; it can be any field.

This analyzer supports the following configurable setting:

  • String for missing values: sets the text the analyzer should use if the specified field is not found in the matched entries in the lexicon.
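The look-up this analyzer performs could be pictured as follows (a hypothetical Python sketch; the lexicon structure, field names and the default missing-value string are all invented for the example):

```python
def gloss_lookup(token, lexicon, field="gloss", missing="???"):
    # Collect the configured field from every entry matching the token;
    # fall back to the "missing values" string when an entry lacks it.
    matches = [entry for entry in lexicon if entry.get("form") == token]
    return [entry.get(field, missing) for entry in matches]
```

With a lexicon like `[{"form": "lock", "gloss": "fasten"}, {"form": "lock", "pos": "n"}]`, `gloss_lookup("lock", ...)` returns `["fasten", "???"]`: one suggestion per matched entry, with the missing-value string standing in for the entry that has no gloss field.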

3.5.1.4. The Whitespace Text analyzer

This analyzer splits the input text it receives into multiple tokens based on white spaces. Currently this analyzer cannot be configured (e.g. in how punctuation marks should be treated).