Judith Holler

Publications

Displaying 1 - 16 of 16
  • Drijvers, L., & Holler, J. (2023). The multimodal facilitation effect in human communication. Psychonomic Bulletin & Review, 30(2), 792-801. doi:10.3758/s13423-022-02178-x.

    Abstract

    During face-to-face communication, recipients need to rapidly integrate a plethora of auditory and visual signals. This integration of signals from many different bodily articulators, all offset in time, with the information in the speech stream may either tax the cognitive system, thus slowing down language processing, or may result in multimodal facilitation. Using the classical shadowing paradigm, participants shadowed speech from face-to-face, naturalistic dyadic conversations in an audiovisual context, an audiovisual context without visual speech (e.g., lips), and an audio-only context. Our results provide evidence of a multimodal facilitation effect in human communication: participants were faster in shadowing words when seeing multimodal messages compared with when hearing only audio. Also, the more visual context was present, the fewer shadowing errors were made, and the earlier in time participants shadowed predicted lexical items. We propose that the multimodal facilitation effect may contribute to the ease of fast face-to-face conversational interaction.
  • Hamilton, A., & Holler, J. (Eds.). (2023). Face2face: Advancing the science of social interaction [Special Issue]. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences. Retrieved from https://royalsocietypublishing.org/toc/rstb/2023/378/1875.

    Abstract

    Face to face interaction is fundamental to human sociality but is very complex to study in a scientific fashion. This theme issue brings together cutting-edge approaches to the study of face-to-face interaction and showcases how we can make progress in this area. Researchers are now studying interaction in adult conversation, parent-child relationships, neurodiverse groups, interactions with virtual agents and various animal species. The theme issue reveals how new paradigms are leading to more ecologically grounded and comprehensive insights into what social interaction is. Scientific advances in this area can lead to improvements in education and therapy, better understanding of neurodiversity and more engaging artificial agents
  • Hamilton, A., & Holler, J. (2023). Face2face: Advancing the science of social interaction. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 378(1875): 20210470. doi:10.1098/rstb.2021.0470.

    Abstract

    Face-to-face interaction is core to human sociality and its evolution, and provides the environment in which most of human communication occurs. Research into the full complexities that define face-to-face interaction requires a multi-disciplinary, multi-level approach, illuminating from different perspectives how we and other species interact. This special issue showcases a wide range of approaches, bringing together detailed studies of naturalistic social-interactional behaviour with larger scale analyses for generalization, and investigations of socially contextualized cognitive and neural processes that underpin the behaviour we observe. We suggest that this integrative approach will allow us to propel forwards the science of face-to-face interaction by leading us to new paradigms and novel, more ecologically grounded and comprehensive insights into how we interact with one another and with artificial agents, how differences in psychological profiles might affect interaction, and how the capacity to socially interact develops and has evolved in the human and other species. This theme issue makes a first step into this direction, with the aim to break down disciplinary boundaries and emphasizing the value of illuminating the many facets of face-to-face interaction.
  • Hintz, F., Khoe, Y. H., Strauß, A., Psomakas, A. J. A., & Holler, J. (2023). Electrophysiological evidence for the enhancement of gesture-speech integration by linguistic predictability during multimodal discourse comprehension. Cognitive, Affective and Behavioral Neuroscience, 23, 340-353. doi:10.3758/s13415-023-01074-8.

    Abstract

    In face-to-face discourse, listeners exploit cues in the input to generate predictions about upcoming words. Moreover, in addition to speech, speakers produce a multitude of visual signals, such as iconic gestures, which listeners readily integrate with incoming words. Previous studies have shown that processing of target words is facilitated when these are embedded in predictable compared to non-predictable discourses and when accompanied by iconic compared to meaningless gestures. In the present study, we investigated the interaction of both factors. We recorded electroencephalogram from 60 Dutch adults while they were watching videos of an actress producing short discourses. The stimuli consisted of an introductory and a target sentence; the latter contained a target noun. Depending on the preceding discourse, the target noun was either predictable or not. Each target noun was paired with an iconic gesture and a gesture that did not convey meaning. In both conditions, gesture presentation in the video was timed such that the gesture stroke slightly preceded the onset of the spoken target by 130 ms. Our ERP analyses revealed independent facilitatory effects for predictable discourses and iconic gestures. However, the interactive effect of both factors demonstrated that target processing (i.e., gesture-speech integration) was facilitated most when targets were part of predictable discourses and accompanied by an iconic gesture. Our results thus suggest a strong intertwinement of linguistic predictability and non-verbal gesture processing where listeners exploit predictive discourse cues to pre-activate verbal and non-verbal representations of upcoming target words.
  • Kendrick, K. H., Holler, J., & Levinson, S. C. (2023). Turn-taking in human face-to-face interaction is multimodal: Gaze direction and manual gestures aid the coordination of turn transitions. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 378(1875): 20210473. doi:10.1098/rstb.2021.0473.

    Abstract

    Human communicative interaction is characterized by rapid and precise turn-taking. This is achieved by an intricate system that has been elucidated in the field of conversation analysis, based largely on the study of the auditory signal. This model suggests that transitions occur at points of possible completion identified in terms of linguistic units. Despite this, considerable evidence exists that visible bodily actions including gaze and gestures also play a role. To reconcile disparate models and observations in the literature, we combine qualitative and quantitative methods to analyse turn-taking in a corpus of multimodal interaction using eye-trackers and multiple cameras. We show that transitions seem to be inhibited when a speaker averts their gaze at a point of possible turn completion, or when a speaker produces gestures which are beginning or unfinished at such points. We further show that while the direction of a speaker's gaze does not affect the speed of transitions, the production of manual gestures does: turns with gestures have faster transitions. Our findings suggest that the coordination of transitions involves not only linguistic resources but also visual gestural ones and that the transition-relevance places in turns are multimodal in nature.

    Additional information

    supplemental material
  • Mazzini, S., Holler, J., & Drijvers, L. (2023). Studying naturalistic human communication using dual-EEG and audio-visual recordings. STAR Protocols, 4(3): 102370. doi:10.1016/j.xpro.2023.102370.

    Abstract

    We present a protocol to study naturalistic human communication using dual-EEG and audio-visual recordings. We describe preparatory steps for data collection including setup preparation, experiment design, and piloting. We then describe the data collection process in detail which consists of participant recruitment, experiment room preparation, and data collection. We also outline the kinds of research questions that can be addressed with the current protocol, including several analysis possibilities, from conversational to advanced time-frequency analyses.
    For complete details on the use and execution of this protocol, please refer to Drijvers and Holler (2022).
  • Nota, N., Trujillo, J. P., & Holler, J. (2023). Specific facial signals associate with categories of social actions conveyed through questions. PLoS One, 18(7): e0288104. doi:10.1371/journal.pone.0288104.

    Abstract

    The early recognition of fundamental social actions, like questions, is crucial for understanding the speaker’s intended message and planning a timely response in conversation. Questions themselves may express more than one social action category (e.g., an information request “What time is it?”, an invitation “Will you come to my party?” or a criticism “Are you crazy?”). Although human language use occurs predominantly in a multimodal context, prior research on social actions has mainly focused on the verbal modality. This study breaks new ground by investigating how conversational facial signals may map onto the expression of different types of social actions conveyed through questions. The distribution, timing, and temporal organization of facial signals across social actions was analysed in a rich corpus of naturalistic, dyadic face-to-face Dutch conversations. These social actions were: Information Requests, Understanding Checks, Self-Directed questions, Stance or Sentiment questions, Other-Initiated Repairs, Active Participation questions, questions for Structuring, Initiating or Maintaining Conversation, and Plans and Actions questions. This is the first study to reveal differences in distribution and timing of facial signals across different types of social actions. The findings raise the possibility that facial signals may facilitate social action recognition during language processing in multimodal face-to-face interaction.

    Additional information

    supporting information
  • Nota, N., Trujillo, J. P., Jacobs, V., & Holler, J. (2023). Facilitating question identification through natural intensity eyebrow movements in virtual avatars. Scientific Reports, 13: 21295. doi:10.1038/s41598-023-48586-4.

    Abstract

    In conversation, recognizing social actions (similar to ‘speech acts’) early is important to quickly understand the speaker’s intended message and to provide a fast response. Fast turns are typical for fundamental social actions like questions, since a long gap can indicate a dispreferred response. In multimodal face-to-face interaction, visual signals may contribute to this fast dynamic. The face is an important source of visual signalling, and previous research found that prevalent facial signals such as eyebrow movements facilitate the rapid recognition of questions. We aimed to investigate whether early eyebrow movements with natural movement intensities facilitate question identification, and whether specific intensities are more helpful in detecting questions. Participants were instructed to view videos of avatars where the presence of eyebrow movements (eyebrow frown or raise vs. no eyebrow movement) was manipulated, and to indicate whether the utterance in the video was a question or statement. Results showed higher accuracies for questions with eyebrow frowns, and faster response times for questions with eyebrow frowns and eyebrow raises. No additional effect was observed for the specific movement intensity. This suggests that eyebrow movements that are representative of naturalistic multimodal behaviour facilitate question recognition.
  • Nota, N., Trujillo, J. P., & Holler, J. (2023). Conversational eyebrow frowns facilitate question identification: An online study using virtual avatars. Cognitive Science, 47(12): e13392. doi:10.1111/cogs.13392.

    Abstract

    Conversation is a time-pressured environment. Recognizing a social action (the ‘‘speech act,’’ such as a question requesting information) early is crucial in conversation to quickly understand the intended message and plan a timely response. Fast turns between interlocutors are especially relevant for responses to questions since a long gap may be meaningful by itself. Human language is multimodal, involving speech as well as visual signals from the body, including the face. But little is known about how conversational facial signals contribute to the communication of social actions. Some of the most prominent facial signals in conversation are eyebrow movements. Previous studies found links between eyebrow movements and questions, suggesting that these facial signals could contribute to the rapid recognition of questions. Therefore, we aimed to investigate whether early eyebrow movements (eyebrow frown or raise vs. no eyebrow movement) facilitate question identification. Participants were instructed to view videos of avatars where the presence of eyebrow movements accompanying questions was manipulated. Their task was to indicate whether the utterance was a question or a statement as accurately and quickly as possible. Data were collected using the online testing platform Gorilla. Results showed higher accuracies and faster response times for questions with eyebrow frowns, suggesting a facilitative role of eyebrow frowns for question identification. This means that facial signals can critically contribute to the communication of social actions in conversation by signaling social action-specific visual information and providing visual cues to speakers’ intentions.

    Additional information

    link to preprint
  • Trujillo, J. P., & Holler, J. (2023). Interactionally embedded gestalt principles of multimodal human communication. Perspectives on Psychological Science, 18(5), 1136-1159. doi:10.1177/17456916221141422.

    Abstract

    Natural human interaction requires us to produce and process many different signals, including speech, hand and head gestures, and facial expressions. These communicative signals, which occur in a variety of temporal relations with each other (e.g., parallel or temporally misaligned), must be rapidly processed as a coherent message by the receiver. In this contribution, we introduce the notion of interactionally embedded, affordance-driven gestalt perception as a framework that can explain how this rapid processing of multimodal signals is achieved as efficiently as it is. We discuss empirical evidence showing how basic principles of gestalt perception can explain some aspects of unimodal phenomena such as verbal language processing and visual scene perception but require additional features to explain multimodal human communication. We propose a framework in which high-level gestalt predictions are continuously updated by incoming sensory input, such as unfolding speech and visual signals. We outline the constituent processes that shape high-level gestalt perception and their role in perceiving relevance and prägnanz. Finally, we provide testable predictions that arise from this multimodal interactionally embedded gestalt-perception framework. This review and framework therefore provide a theoretically motivated account of how we may understand the highly complex, multimodal behaviors inherent in natural social interaction.
  • Bosker, H. R., Peeters, D., & Holler, J. (2020). How visual cues to speech rate influence speech perception. Quarterly Journal of Experimental Psychology, 73(10), 1523-1536. doi:10.1177/1747021820914564.

    Abstract

    Spoken words are highly variable and therefore listeners interpret speech sounds relative to the surrounding acoustic context, such as the speech rate of a preceding sentence. For instance, a vowel midway between short /ɑ/ and long /a:/ in Dutch is perceived as short /ɑ/ in the context of preceding slow speech, but as long /a:/ if preceded by a fast context. Despite the well-established influence of visual articulatory cues on speech comprehension, it remains unclear whether visual cues to speech rate also influence subsequent spoken word recognition. In two ‘Go Fish’-like experiments, participants were presented with audio-only (auditory speech + fixation cross), visual-only (mute videos of talking head), and audiovisual (speech + videos) context sentences, followed by ambiguous target words containing vowels midway between short /ɑ/ and long /a:/. In Experiment 1, target words were always presented auditorily, without visual articulatory cues. Although the audio-only and audiovisual contexts induced a rate effect (i.e., more long /a:/ responses after fast contexts), the visual-only condition did not. When, in Experiment 2, target words were presented audiovisually, rate effects were observed in all three conditions, including visual-only. This suggests that visual cues to speech rate in a context sentence influence the perception of following visual target cues (e.g., duration of lip aperture), which at an audiovisual integration stage bias participants’ target categorization responses. These findings contribute to a better understanding of how what we see influences what we hear.
  • Macuch Silva, V., Holler, J., Ozyurek, A., & Roberts, S. G. (2020). Multimodality and the origin of a novel communication system in face-to-face interaction. Royal Society Open Science, 7: 182056. doi:10.1098/rsos.182056.

    Abstract

    Face-to-face communication is multimodal at its core: it consists of a combination of vocal and visual signalling. However, current evidence suggests that, in the absence of an established communication system, visual signalling, especially in the form of visible gesture, is a more powerful form of communication than vocalisation, and therefore likely to have played a primary role in the emergence of human language. This argument is based on experimental evidence of how vocal and visual modalities (i.e., gesture) are employed to communicate about familiar concepts when participants cannot use their existing languages. To investigate this further, we introduce an experiment where pairs of participants performed a referential communication task in which they described unfamiliar stimuli in order to reduce reliance on conventional signals. Visual and auditory stimuli were described in three conditions: using visible gestures only, using non-linguistic vocalisations only and given the option to use both (multimodal communication). The results suggest that even in the absence of conventional signals, gesture is a more powerful mode of communication compared to vocalisation, but that there are also advantages to multimodality compared to using gesture alone. Participants with an option to produce multimodal signals had comparable accuracy to those using only gesture, but gained an efficiency advantage. The analysis of the interactions between participants showed that interactants developed novel communication systems for unfamiliar stimuli by deploying different modalities flexibly to suit their needs and by taking advantage of multimodality when required.
  • Ripperda, J., Drijvers, L., & Holler, J. (2020). Speeding up the detection of non-iconic and iconic gestures (SPUDNIG): A toolkit for the automatic detection of hand movements and gestures in video data. Behavior Research Methods, 52(4), 1783-1794. doi:10.3758/s13428-020-01350-2.

    Abstract

    In human face-to-face communication, speech is frequently accompanied by visual signals, especially communicative hand gestures. Analyzing these visual signals requires detailed manual annotation of video data, which is often a labor-intensive and time-consuming process. To facilitate this process, we here present SPUDNIG (SPeeding Up the Detection of Non-iconic and Iconic Gestures), a tool to automatize the detection and annotation of hand movements in video data. We provide a detailed description of how SPUDNIG detects hand movement initiation and termination, as well as open-source code and a short tutorial on an easy-to-use graphical user interface (GUI) of our tool. We then provide a proof-of-principle and validation of our method by comparing SPUDNIG’s output to manual annotations of gestures by a human coder. While the tool does not entirely eliminate the need of a human coder (e.g., for false positives detection), our results demonstrate that SPUDNIG can detect both iconic and non-iconic gestures with very high accuracy, and could successfully detect all iconic gestures in our validation dataset. Importantly, SPUDNIG’s output can directly be imported into commonly used annotation tools such as ELAN and ANVIL. We therefore believe that SPUDNIG will be highly relevant for researchers studying multimodal communication due to its annotations significantly accelerating the analysis of large video corpora.

    Additional information

    data and materials
  • Sekine, K., Schoechl, C., Mulder, K., Holler, J., Kelly, S., Furman, R., & Ozyurek, A. (2020). Evidence for children's online integration of simultaneous information from speech and iconic gestures: An ERP study. Language, Cognition and Neuroscience, 35(10), 1283-1294. doi:10.1080/23273798.2020.1737719.

    Abstract

    Children perceive iconic gestures, along with speech they hear. Previous studies have shown
    that children integrate information from both modalities. Yet it is not known whether children
    can integrate both types of information simultaneously as soon as they are available as adults
    do or processes them separately initially and integrate them later. Using electrophysiological
    measures, we examined the online neurocognitive processing of gesture-speech integration in
    6- to 7-year-old children. We focused on the N400 event-related potentials component which
    is modulated by semantic integration load. Children watched video clips of matching or
    mismatching gesture-speech combinations, which varied the semantic integration load. The
    ERPs showed that the amplitude of the N400 was larger in the mismatching condition than in
    the matching condition. This finding provides the first neural evidence that by the ages of 6
    or 7, children integrate multimodal semantic information in an online fashion comparable to
    that of adults.
  • Ter Bekke, M., Drijvers, L., & Holler, J. (2020). The predictive potential of hand gestures during conversation: An investigation of the timing of gestures in relation to speech. In Proceedings of the 7th GESPIN - Gesture and Speech in Interaction Conference. Stockholm: KTH Royal Institute of Technology.

    Abstract

    In face-to-face conversation, recipients might use the bodily movements of the speaker (e.g. gestures) to facilitate language processing. It has been suggested that one way through which this facilitation may happen is prediction. However, for this to be possible, gestures would need to precede speech, and it is unclear whether this is true during natural conversation.
    In a corpus of Dutch conversations, we annotated hand gestures that represent semantic information and occurred during questions, and the word(s) which corresponded most closely to the gesturally depicted meaning. Thus, we tested whether representational gestures temporally precede their lexical affiliates. Further, to see whether preceding gestures may indeed facilitate language processing, we asked whether the gesture-speech asynchrony predicts the response time to the question the gesture is part of.
    Gestures and their strokes (most meaningful movement component) indeed preceded the corresponding lexical information, thus demonstrating their predictive potential. However, while questions with gestures got faster responses than questions without, there was no evidence that questions with larger gesture-speech asynchronies get faster responses. These results suggest that gestures indeed have the potential to facilitate predictive language processing, but further analyses on larger datasets are needed to test for links between asynchrony and processing advantages.
  • Holler, J., & Beattie, G. (2002). A micro-analytic investigation of how iconic gestures and speech represent core semantic features in talk. Semiotica, 142, 31-69.

Share this page