What is a speaker about to do when you see her eyebrows raised, her flat hand turning palm-up, and her lips shaping as if to utter a ‘w’-sound? There is a good chance that she is about to ask a question. In conversation, speakers tend to move their face, head, hands, arms, or even their whole body. They may gesture to depict objects or actions, point to an (imaginary) location, or nod, tilt, or shake their head. Speakers also move their eyes, eyebrows, eyelids, and mouths almost continuously. “In face-to-face interaction, the words we speak are embedded in a rich visual context”, says lead author Judith Holler.
Interlocutors need to interpret the meaning of such visual signals quickly, as they take turns to speak. To ensure a smooth conversation, listeners typically start planning their response as their conversational partner is still talking. How do we know which visual signals are meaningful? Does a speaker raise her hand as a gesture or is she merely going to scratch her head? To make the task even more complex, visual signals tend not to align in time with one another or with spoken words. And we need to combine what we see with what we hear (as when we point and say “There!”). Combining information from speech and body into meaningful messages thus seems like a hugely demanding task.
However, researchers were surprised to find that visual signals may actually speed up our understanding of spoken language. For example, in a previous study by Holler and colleagues, listeners responded faster to questions accompanied by hand or head gestures than to questions without gestures. It is almost paradoxical that we seem to process complex multimodal messages (combining speech and visual signals) faster than ‘simpler’ unimodal messages (speech only). How can we explain why complex multimodal messages may be easier to understand?
To answer that question, Holler and Levinson developed a new theoretical framework for multimodal language processing in face-to-face interaction. The psycholinguists propose that people learn to associate specific visual-vocal signals with specific meanings. As visual signals often precede the corresponding spoken parts of the message, the visual information may help to predict what comes next. As a result, multimodal utterances may be easier to understand than unimodal utterances, which involve just speech.
Holler stresses that it is important to explain language processing in face-to-face conversation, as human language has evolved as a multimodal phenomenon, embedded in social interaction. Likewise, children learn their first language(s) in face-to-face social interaction, and adults use language most when communicating face-to-face with other people. Understanding how human language works in this sort of environment is therefore an important scientific step.
“We hope that the new framework will advance our understanding of the cognitive processes underpinning human communication in face-to-face interaction, as well as providing the basis for gaining insight into problems with multimodal language processing in specific populations”, conclude the authors.
Holler, J., & Levinson, S. C. (2019). Multimodal language processing in human communication. Trends in Cognitive Sciences. Advance online publication. doi:10.1016/j.tics.2019.05.006.