Multimodal interaction in a model of visual world phenomena
Smith, A. C., Monaghan, P., & Huettig, F. (2012). Multimodal interaction in a model of visual world phenomena. Poster presented at the 18th Annual Conference on Architectures and Mechanisms for Language Processing (AMLaP 2012), Riva del Garda, Italy.
Existing computational models of the Visual World Paradigm (VWP) have simulated the connection between language processing and eye gaze behaviour, and have consequently provided insight into the cognitive processes underlying lexical and sentence comprehension. Allopenna, Magnuson and Tanenhaus (1998) demonstrated that fixation probabilities during spoken word processing can be predicted by lexical activations in the TRACE model of spoken word recognition. Recent computational models have extended this work to predict fixation behaviour during sentence processing from the integration of visual and linguistic information.
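The link between lexical activations and fixation probabilities can be illustrated with a Luce-style choice rule, a common linking hypothesis for this kind of simulation. The following is a minimal sketch, not the procedure of any specific model: the function name, the scaling constant `k`, and the activation values are all assumptions chosen for illustration.

```python
import math

def fixation_probabilities(activations, k=7.0):
    """Convert item activations to choice probabilities.

    Each activation is exponentiated with a scaling constant k (an assumed
    value), then normalised so the probabilities sum to 1.
    """
    strengths = {name: math.exp(k * a) for name, a in activations.items()}
    total = sum(strengths.values())
    return {name: s / total for name, s in strengths.items()}

# Hypothetical activations at one time step for four displayed objects.
activations = {"beaker": 0.6, "beetle": 0.4, "speaker": 0.2, "carriage": 0.0}
probs = fixation_probabilities(activations)

for name, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{name}: {p:.3f}")
```

Applying the rule at every time step of a simulation would yield one predicted fixation curve per object, which can then be compared with observed fixation proportions.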
Recent empirical investigations of word-level effects in the VWP support claims that language-mediated eye gaze is influenced not only by phonological overlap (Allopenna, Magnuson & Tanenhaus, 1998) but also by visual and semantic similarity. Huettig and McQueen (2007) found that when participants heard a word and viewed a scene containing objects phonologically, visually, or semantically similar to the target, all competitors exerted an effect on fixations, but fixations to phonological competitors preceded those to other competitors. Current models of the VWP that simulate the interaction between visual and linguistic information do so with representations that are unable to capture fine-grained semantic, phonological or visual feature relationships. They are therefore limited in their ability to examine effects of multimodal interactions in language processing.
Our research extends previous models by implementing representations in each modality that are sufficiently rich to capture similarities and distinctions in visual, phonological and semantic representations. Our aim was to determine the extent to which multimodal interactions in the VWP emerge from the nature of the representations themselves, rather than being determined by architectural constraints. We constructed a recurrent connectionist model, based on hub-and-spoke models of semantic processing, which integrates visual, phonological and semantic information within a central resource. We trained and tested the model on viewing scenes as in Huettig and McQueen’s (2007) study, and found that the model replicated the complex behaviour and time course dynamics of multimodal interaction: the model activated phonological competitors before activating visual and semantic competitors.
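The general shape of a hub-and-spoke network, in which modality-specific layers feed a shared recurrent hub, can be sketched as follows. This is an illustrative, untrained toy and not the authors' implementation: the class name `HubAndSpoke`, the layer sizes, the weight range, and the choice of which layers serve as input and output are all assumptions.

```python
import math
import random

def make_weights(n_in, n_out, rng):
    """Small random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def project(weights, vec):
    """Matrix-vector product: one weighted sum per receiving unit."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class HubAndSpoke:
    """Toy network: visual and phonological spokes feed a recurrent hub,
    which in turn drives a semantic layer (assumed wiring)."""

    def __init__(self, n_visual, n_phon, n_sem, n_hub, seed=0):
        rng = random.Random(seed)
        self.w_vis = make_weights(n_visual, n_hub, rng)
        self.w_phon = make_weights(n_phon, n_hub, rng)
        self.w_rec = make_weights(n_hub, n_hub, rng)  # hub-to-hub recurrence
        self.w_sem = make_weights(n_hub, n_sem, rng)
        self.hub = [0.0] * n_hub

    def step(self, visual, phon):
        """Advance one time step and return semantic-layer activations."""
        vis_in = project(self.w_vis, visual)
        phon_in = project(self.w_phon, phon)
        rec_in = project(self.w_rec, self.hub)
        self.hub = [sigmoid(v + p + r) for v, p, r in zip(vis_in, phon_in, rec_in)]
        return [sigmoid(x) for x in project(self.w_sem, self.hub)]

# Hypothetical sizes and inputs, chosen only to show the data flow.
model = HubAndSpoke(n_visual=4, n_phon=3, n_sem=5, n_hub=6)
semantic_out = model.step(visual=[1, 0, 0, 1], phon=[0, 1, 0])
```

Because the hub state persists between calls to `step`, information from each modality accumulates over time in a single shared resource, which is the property the hub-and-spoke design relies on.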
Our approach enables us to determine that differences in the computational properties of each modality’s representational structure are sufficient to produce behaviour consistent with that observed in the VWP. The componential nature of phonological representations and the holistic structure of visual and semantic representations result in fixations to phonological competitors preceding those to other competitors. Our findings suggest that such language-mediated visual attention phenomena can emerge from the statistics of the problem domain, with observed behaviour arising as a natural consequence of differences in the structure of information within each modality, without requiring additional modality-specific architectural constraints.
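One way to picture the componential property is that a spoken word arrives segment by segment, so an item sharing the target's onset matches the input fully at first and only diverges later. The sketch below illustrates that intuition only and is not taken from the model: the function names are hypothetical, and letters stand in for phonemes.

```python
def positional_match(heard, candidate):
    """Fraction of the segments heard so far that match the candidate
    at the same position."""
    hits = sum(1 for h, c in zip(heard, candidate) if h == c)
    return hits / len(heard)

def match_over_time(target, candidate):
    """Match score after each successive segment of the target is heard."""
    return [positional_match(target[:t], candidate)
            for t in range(1, len(target) + 1)]

# Letters are used as stand-ins for phonemes (an assumption).
target = "beaker"
phon_competitor = "beetle"   # shares the first two segments with the target
unrelated = "carrot"         # shares no segment in the same position

print(match_over_time(target, phon_competitor))
print(match_over_time(target, unrelated))
```

In this toy, the onset competitor is indistinguishable from the target for the first two segments and loses support as more of the word is heard, whereas a holistic representation offers no comparable early partial match.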