Displaying 1 - 19 of 19
-
Emmendorfer, A. K., & Holler, J. (2025). Facial signals shape predictions about the nature of upcoming conversational responses. Scientific Reports, 15: 1381. doi:10.1038/s41598-025-85192-y.
Abstract
Increasing evidence suggests that interlocutors use visual communicative signals to form predictions about unfolding utterances, but there is little data on the predictive potential of facial signals in conversation. In an online experiment with virtual agents, we examine whether facial signals produced by an addressee may allow speakers to anticipate the response to a question before it is given. Participants (n = 80) viewed videos of short conversation fragments between two virtual humans. Each fragment ended with the Questioner asking a question, followed by a pause during which the Responder looked either straight at the Questioner (baseline), or averted their gaze, or accompanied the straight gaze with one of the following facial signals: brow raise, brow frown, nose wrinkle, smile, squint, mouth corner pulled back (dimpler). Participants then indicated on a 6-point scale whether they expected a “yes” or “no” response. Analyses revealed that all signals received different ratings relative to the baseline: brow raises, dimplers, and smiles were associated with more positive responses, gaze aversions, brow frowns, nose wrinkles, and squints with more negative responses. Qur findings show that interlocutors may form strong associations between facial signals and upcoming responses to questions, highlighting their predictive potential in face-to-face conversation.Additional information
supplementary materials -
Hömke, P., Levinson, S. C., Emmendorfer, A. K., & Holler, J. (2025). Eyebrow movements as signals of communicative problems in human face-to-face interaction. Royal Society Open Science, 12(3): 241632. doi:10.1098/rsos.241632.
Abstract
Repair is a core building block of human communication, allowing us to address problems of understanding in conversation. Past research has uncovered the basic mechanisms by which interactants signal and solve such problems. However, the focus has been on verbal interaction, neglecting the fact that human communication is inherently multimodal. Here, we focus on a visual signal particularly prevalent in signalling problems of understanding: eyebrow furrows and raises. We present, first, a corpus study showing that differences in eyebrow actions (furrows versus raises) were systematically associated with differences in the format of verbal repair initiations. Second, we present a follow-up study using an avatar that allowed us to test the causal consequences of addressee eyebrow movements, zooming into the effect of eyebrow furrows as signals of trouble in understanding in particular. The results revealed that addressees’ eyebrow furrows have a striking effect on speakers’ speech, leading speakers to produce answers to questions several seconds longer than when not perceiving addressee eyebrow furrows while speaking. Together, the findings demonstrate that eyebrow movements play a communicative role in initiating repair during conversation rather than being merely epiphenomenal and that their occurrence can critically influence linguistic behaviour. Thus, eyebrow movements should be considered core coordination devices in human conversational interaction.Additional information
link to preprint -
Ter Bekke, M., Drijvers, L., & Holler, J. (2025). Co-speech hand gestures are used to predict upcoming meaning. Psychological Science. Advance online publication. doi:10.1177/09567976251331041.
Abstract
In face-to-face conversation, people use speech and gesture to convey meaning. Seeing gestures alongside speech facilitates comprehenders’ language processing, but crucially, the mechanisms underlying this facilitation remain unclear. We investigated whether comprehenders use the semantic information in gestures, typically preceding related speech, to predict upcoming meaning. Dutch adults listened to questions asked by a virtual avatar. Questions were accompanied by an iconic gesture (e.g., typing) or meaningless control movement (e.g., arm scratch) followed by a short pause and target word (e.g., “type”). A Cloze experiment showed that gestures improved explicit predictions of upcoming target words. Moreover, an EEG experiment showed that gestures reduced alpha and beta power during the pause, indicating anticipation, and reduced N400 amplitudes, demonstrating facilitated semantic processing. Thus, comprehenders use iconic gestures to predict upcoming meaning. Theories of linguistic prediction should incorporate communicative bodily signals as predictive cues to capture how language is processed in face-to-face interaction.Additional information
supplementary material -
Tilston, O., Holler, J., & Bangerter, A. (2025). Opening social interactions: The coordination of approach, gaze, speech and handshakes during greetings. Cognitive Science, 49(2): e70049. doi:10.1111/cogs.70049.
Abstract
Despite the importance of greetings for opening social interactions, their multimodal coordination processes remain poorly understood. We used a naturalistic, lab-based setup where pairs of unacquainted participants approached and greeted each other while unaware their greeting behavior was studied. We measured the prevalence and time course of multimodal behaviors potentially culminating in a handshake, including motor behaviors (e.g., walking, standing up, hand movements like raise, grasp, and retraction), gaze patterns (using eye tracking glasses), and speech (close and distant verbal salutations). We further manipulated the visibility of partners’ eyes to test its effect on gaze. Our findings reveal that gaze to a partner's face increases over the course of a greeting, but is partly averted during approach and is influenced by the visibility of partners’ eyes. Gaze helps coordinate handshakes, by signaling intent and guiding the grasp. The timing of adjacency pairs in verbal salutations is comparable to the precision of floor transitions in the main body of conversations, and varies according to greeting phase, with distant salutation pair parts featuring more gaps and close salutation pair parts featuring more overlap. Gender composition and a range of multimodal behaviors affect whether pairs chose to shake hands or not. These findings fill several gaps in our understanding of greetings and provide avenues for future research, including advancements in social robotics and human−robot interaction. -
Trujillo, J. P., Dyer, R. M. K., & Holler, J. (2025). Dyadic differences in empathy scores are associated with kinematic similarity during conversational question-answer pairs. Discourse Processes, 62(3), 195-213. doi:10.1080/0163853X.2025.2467605.
Abstract
During conversation, speakers coordinate and synergize their behaviors at multiple levels, and in different ways. The extent to which individuals converge or diverge in their behaviors during interaction may relate to interpersonal differences relevant to social interaction, such as empathy as measured by the empathy quotient (EQ). An association between interpersonal difference in empathy and interpersonal entrainment could help to throw light on how interlocutor characteristics influence interpersonal entrainment. We investigated this possibility in a corpus of unconstrained conversation between dyads. We used dynamic time warping to quantify entrainment between interlocutors of head motion, hand motion, and maximum speech f0 during question–response sequences. We additionally calculated interlocutor differences in EQ scores. We found that, for both head and hand motion, greater difference in EQ was associated with higher entrainment. Thus, we consider that people who are dissimilar in EQ may need to “ground” their interaction with low-level movement entrainment. There was no significant relationship between f0 entrainment and EQ score differences. -
Trujillo, J. P., & Holler, J. (2025). Multimodal information density is highest in question beginnings, and early entropy is associated with fewer but longer visual signals. Discourse Processes, 62(2), 69-88. doi:10.1080/0163853X.2024.2413314.
Abstract
When engaged in spoken conversation, speakers convey meaning using both speech and visual signals, such as facial expressions and manual gestures. An important question is how information is distributed in utterances during face-to-face interaction when information from visual signals is also present. In a corpus of casual Dutch face-to-face conversations, we focus on spoken questions in particular because they occur frequently, thus constituting core building blocks of conversation. We quantified information density (i.e. lexical entropy and surprisal) and the number and relative duration of facial and manual signals. We tested whether lexical information density or the number of visual signals differed between the first and last halves of questions, as well as whether the number of visual signals occurring in the less-predictable portion of a question was associated with the lexical information density of the same portion of the question in a systematic manner. We found that information density, as well as number of visual signals, were higher in the first half of questions, and specifically lexical entropy was associated with fewer, but longer visual signals. The multimodal front-loading of questions and the complementary distribution of visual signals and high entropy words in Dutch casual face-to-face conversations may have implications for the parallel processes of utterance comprehension and response planning during turn-taking.Additional information
supplemental material -
Drijvers, L., & Holler, J. (2022). Face-to-face spatial orientation fine-tunes the brain for neurocognitive processing in conversation. iScience, 25(11): 105413. doi:10.1016/j.isci.2022.105413.
Abstract
We here demonstrate that face-to-face spatial orientation induces a special ‘social mode’ for neurocognitive processing during conversation, even in the absence of visibility. Participants conversed face-to-face, face-to-face but visually occluded, and back-to-back to tease apart effects caused by seeing visual communicative signals and by spatial orientation. Using dual-EEG, we found that 1) listeners’ brains engaged more strongly while conversing in face-to-face than back-to-back, irrespective of the visibility of communicative signals, 2) listeners attended to speech more strongly in a back-to-back compared to a face-to-face spatial orientation without visibility; visual signals further reduced the attention needed; 3) the brains of interlocutors were more in sync in a face-to-face compared to a back-to-back spatial orientation, even when they could not see each other; visual signals further enhanced this pattern. Communicating in face-to-face spatial orientation is thus sufficient to induce a special ‘social mode’ which fine-tunes the brain for neurocognitive processing in conversation. -
Eijk, L., Rasenberg, M., Arnese, F., Blokpoel, M., Dingemanse, M., Doeller, C. F., Ernestus, M., Holler, J., Milivojevic, B., Özyürek, A., Pouw, W., Van Rooij, I., Schriefers, H., Toni, I., Trujillo, J. P., & Bögels, S. (2022). The CABB dataset: A multimodal corpus of communicative interactions for behavioural and neural analyses. NeuroImage, 264: 119734. doi:10.1016/j.neuroimage.2022.119734.
Abstract
We present a dataset of behavioural and fMRI observations acquired in the context of humans involved in multimodal referential communication. The dataset contains audio/video and motion-tracking recordings of face-to-face, task-based communicative interactions in Dutch, as well as behavioural and neural correlates of participants’ representations of dialogue referents. Seventy-one pairs of unacquainted participants performed two interleaved interactional tasks in which they described and located 16 novel geometrical objects (i.e., Fribbles) yielding spontaneous interactions of about one hour. We share high-quality video (from three cameras), audio (from head-mounted microphones), and motion-tracking (Kinect) data, as well as speech transcripts of the interactions. Before and after engaging in the face-to-face communicative interactions, participants’ individual representations of the 16 Fribbles were estimated. Behaviourally, participants provided a written description (one to three words) for each Fribble and positioned them along 29 independent conceptual dimensions (e.g., rounded, human, audible). Neurally, fMRI signal evoked by each Fribble was measured during a one-back working-memory task. To enable functional hyperalignment across participants, the dataset also includes fMRI measurements obtained during visual presentation of eight animated movies (35 minutes total). We present analyses for the various types of data demonstrating their quality and consistency with earlier research. Besides high-resolution multimodal interactional data, this dataset includes different correlates of communicative referents, obtained before and after face-to-face dialogue, allowing for novel investigations into the relation between communicative behaviours and the representational space shared by communicators. This unique combination of data can be used for research in neuroscience, psychology, linguistics, and beyond. -
Holler, J., Drijvers, L., Rafiee, A., & Majid, A. (2022). Embodied space-pitch associations are shaped by language. Cognitive Science, 46(2): e13083. doi:10.1111/cogs.13083.
Abstract
Height-pitch associations are claimed to be universal and independent of language, but this claim remains controversial. The present study sheds new light on this debate with a multimodal analysis of individual sound and melody descriptions obtained in an interactive communication paradigm with speakers of Dutch and Farsi. The findings reveal that, in contrast to Dutch speakers, Farsi speakers do not use a height-pitch metaphor consistently in speech. Both Dutch and Farsi speakers’ co-speech gestures did reveal a mapping of higher pitches to higher space and lower pitches to lower space, and this gesture space-pitch mapping tended to co-occur with corresponding spatial words (high-low). However, this mapping was much weaker in Farsi speakers than Dutch speakers. This suggests that cross-linguistic differences shape the conceptualization of pitch and further calls into question the universality of height-pitch associations.Additional information
supporting information -
Holler, J. (2022). Visual bodily signals as core devices for coordinating minds in interaction. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 377(1859): 20210094. doi:10.1098/rstb.2021.0094.
Abstract
The view put forward here is that visual bodily signals play a core role in human communication and the coordination of minds. Critically, this role goes far beyond referential and propositional meaning. The human communication system that we consider to be the explanandum in the evolution of language thus is not spoken language. It is, instead, a deeply multimodal, multilayered, multifunctional system that developed—and survived—owing to the extraordinary flexibility and adaptability that it endows us with. Beyond their undisputed iconic power, visual bodily signals (manual and head gestures, facial expressions, gaze, torso movements) fundamentally contribute to key pragmatic processes in modern human communication. This contribution becomes particularly evident with a focus that includes non-iconic manual signals, non-manual signals and signal combinations. Such a focus also needs to consider meaning encoded not just via iconic mappings, since kinematic modulations and interaction-bound meaning are additional properties equipping the body with striking pragmatic capacities. Some of these capacities, or its precursors, may have already been present in the last common ancestor we share with the great apes and may qualify as early versions of the components constituting the hypothesized interaction engine. -
Holler, J., Bavelas, J., Woods, J., Geiger, M., & Simons, L. (2022). Given-new effects on the duration of gestures and of words in face-to-face dialogue. Discourse Processes, 59(8), 619-645. doi:10.1080/0163853X.2022.2107859.
Abstract
The given-new contract entails that speakers must distinguish for their addressee whether references are new or already part of their dialogue. Past research had found that, in a monologue to a listener, speakers shortened repeated words. However, the notion of the given-new contract is inherently dialogic, with an addressee and the availability of co-speech gestures. Here, two face-to-face dialogue experiments tested whether gesture duration also follows the given-new contract. In Experiment 1, four experimental sequences confirmed that when speakers repeated their gestures, they shortened the duration significantly. Experiment 2 replicated the effect with spontaneous gestures in a different task. This experiment also extended earlier results with words, confirming that speakers shortened their repeated words significantly in a multimodal dialogue setting, the basic form of language use. Because words and gestures were not necessarily redundant, these results offer another instance in which gestures and words independently serve pragmatic requirements of dialogue. -
Pouw, W., & Holler, J. (2022). Timing in conversation is dynamically adjusted turn by turn in dyadic telephone conversations. Cognition, 222: 105015. doi:10.1016/j.cognition.2022.105015.
Abstract
Conversational turn taking in humans involves incredibly rapid responding. The timing mechanisms underpinning such responses have been heavily debated, including questions such as who is doing the timing. Similar to findings on rhythmic tapping to a metronome, we show that floor transfer offsets (FTOs) in telephone conversations are serially dependent, such that FTOs are lag-1 negatively autocorrelated. Finding this serial dependence on a turn-by-turn basis (lag-1) rather than on the basis of two or more turns, suggests a counter-adjustment mechanism operating at the level of the dyad in FTOs during telephone conversations, rather than a more individualistic self-adjustment within speakers. This finding, if replicated, has major implications for models describing turn taking, and confirms the joint, dyadic nature of human conversational dynamics. Future research is needed to see how pervasive serial dependencies in FTOs are, such as for example in richer communicative face-to-face contexts where visual signals affect conversational timing. -
Schubotz, L., Özyürek, A., & Holler, J. (2022). Individual differences in working memory and semantic fluency predict younger and older adults' multimodal recipient design in an interactive spatial task. Acta Psychologica, 229: 103690. doi:10.1016/j.actpsy.2022.103690.
Abstract
Aging appears to impair the ability to adapt speech and gestures based on knowledge shared with an addressee
(common ground-based recipient design) in narrative settings. Here, we test whether this extends to spatial settings
and is modulated by cognitive abilities. Younger and older adults gave instructions on how to assemble 3D-
models from building blocks on six consecutive trials. We induced mutually shared knowledge by either
showing speaker and addressee the model beforehand, or not. Additionally, shared knowledge accumulated
across the trials. Younger and crucially also older adults provided recipient-designed utterances, indicated by a
significant reduction in the number of words and of gestures when common ground was present. Additionally, we
observed a reduction in semantic content and a shift in cross-modal distribution of information across trials.
Rather than age, individual differences in verbal and visual working memory and semantic fluency predicted the
extent of addressee-based adaptations. Thus, in spatial tasks, individual cognitive abilities modulate the inter-
active language use of both younger and older adulAdditional information
1-s2.0-S0001691822002050-mmc1.docx -
Trujillo, J. P., Levinson, S. C., & Holler, J. (2022). A multi-scale investigation of the human communication system's response to visual disruption. Royal Society Open Science, 9(4): 211489. doi:10.1098/rsos.211489.
Abstract
In human communication, when the speech is disrupted, the visual channel (e.g. manual gestures) can compensate to ensure successful communication. Whether speech also compensates when the visual channel is disrupted is an open question, and one that significantly bears on the status of the gestural modality. We test whether gesture and speech are dynamically co-adapted to meet communicative needs. To this end, we parametrically reduce visibility during casual conversational interaction and measure the effects on speakers' communicative behaviour using motion tracking and manual annotation for kinematic and acoustic analyses. We found that visual signalling effort was flexibly adapted in response to a decrease in visual quality (especially motion energy, gesture rate, size, velocity and hold-time). Interestingly, speech was also affected: speech intensity increased in response to reduced visual quality (particularly in speech-gesture utterances, but independently of kinematics). Our findings highlight that multi-modal communicative behaviours are flexibly adapted at multiple scales of measurement and question the notion that gesture plays an inferior role to speech.Additional information
supplemental material -
Bosker, H. R., Peeters, D., & Holler, J. (2020). How visual cues to speech rate influence speech perception. Quarterly Journal of Experimental Psychology, 73(10), 1523-1536. doi:10.1177/1747021820914564.
Abstract
Spoken words are highly variable and therefore listeners interpret speech sounds relative to the surrounding acoustic context, such as the speech rate of a preceding sentence. For instance, a vowel midway between short /ɑ/ and long /a:/ in Dutch is perceived as short /ɑ/ in the context of preceding slow speech, but as long /a:/ if preceded by a fast context. Despite the well-established influence of visual articulatory cues on speech comprehension, it remains unclear whether visual cues to speech rate also influence subsequent spoken word recognition. In two ‘Go Fish’-like experiments, participants were presented with audio-only (auditory speech + fixation cross), visual-only (mute videos of talking head), and audiovisual (speech + videos) context sentences, followed by ambiguous target words containing vowels midway between short /ɑ/ and long /a:/. In Experiment 1, target words were always presented auditorily, without visual articulatory cues. Although the audio-only and audiovisual contexts induced a rate effect (i.e., more long /a:/ responses after fast contexts), the visual-only condition did not. When, in Experiment 2, target words were presented audiovisually, rate effects were observed in all three conditions, including visual-only. This suggests that visual cues to speech rate in a context sentence influence the perception of following visual target cues (e.g., duration of lip aperture), which at an audiovisual integration stage bias participants’ target categorization responses. These findings contribute to a better understanding of how what we see influences what we hear. -
Macuch Silva, V., Holler, J., Ozyurek, A., & Roberts, S. G. (2020). Multimodality and the origin of a novel communication system in face-to-face interaction. Royal Society Open Science, 7: 182056. doi:10.1098/rsos.182056.
Abstract
Face-to-face communication is multimodal at its core: it consists of a combination of vocal and visual signalling. However, current evidence suggests that, in the absence of an established communication system, visual signalling, especially in the form of visible gesture, is a more powerful form of communication than vocalisation, and therefore likely to have played a primary role in the emergence of human language. This argument is based on experimental evidence of how vocal and visual modalities (i.e., gesture) are employed to communicate about familiar concepts when participants cannot use their existing languages. To investigate this further, we introduce an experiment where pairs of participants performed a referential communication task in which they described unfamiliar stimuli in order to reduce reliance on conventional signals. Visual and auditory stimuli were described in three conditions: using visible gestures only, using non-linguistic vocalisations only and given the option to use both (multimodal communication). The results suggest that even in the absence of conventional signals, gesture is a more powerful mode of communication compared to vocalisation, but that there are also advantages to multimodality compared to using gesture alone. Participants with an option to produce multimodal signals had comparable accuracy to those using only gesture, but gained an efficiency advantage. The analysis of the interactions between participants showed that interactants developed novel communication systems for unfamiliar stimuli by deploying different modalities flexibly to suit their needs and by taking advantage of multimodality when required. -
Ripperda, J., Drijvers, L., & Holler, J. (2020). Speeding up the detection of non-iconic and iconic gestures (SPUDNIG): A toolkit for the automatic detection of hand movements and gestures in video data. Behavior Research Methods, 52(4), 1783-1794. doi:10.3758/s13428-020-01350-2.
Abstract
In human face-to-face communication, speech is frequently accompanied by visual signals, especially communicative hand gestures. Analyzing these visual signals requires detailed manual annotation of video data, which is often a labor-intensive and time-consuming process. To facilitate this process, we here present SPUDNIG (SPeeding Up the Detection of Non-iconic and Iconic Gestures), a tool to automatize the detection and annotation of hand movements in video data. We provide a detailed description of how SPUDNIG detects hand movement initiation and termination, as well as open-source code and a short tutorial on an easy-to-use graphical user interface (GUI) of our tool. We then provide a proof-of-principle and validation of our method by comparing SPUDNIG’s output to manual annotations of gestures by a human coder. While the tool does not entirely eliminate the need of a human coder (e.g., for false positives detection), our results demonstrate that SPUDNIG can detect both iconic and non-iconic gestures with very high accuracy, and could successfully detect all iconic gestures in our validation dataset. Importantly, SPUDNIG’s output can directly be imported into commonly used annotation tools such as ELAN and ANVIL. We therefore believe that SPUDNIG will be highly relevant for researchers studying multimodal communication due to its annotations significantly accelerating the analysis of large video corpora.Additional information
data and materials -
Sekine, K., Schoechl, C., Mulder, K., Holler, J., Kelly, S., Furman, R., & Ozyurek, A. (2020). Evidence for children's online integration of simultaneous information from speech and iconic gestures: An ERP study. Language, Cognition and Neuroscience, 35(10), 1283-1294. doi:10.1080/23273798.2020.1737719.
Abstract
Children perceive iconic gestures, along with speech they hear. Previous studies have shown
that children integrate information from both modalities. Yet it is not known whether children
can integrate both types of information simultaneously as soon as they are available as adults
do or processes them separately initially and integrate them later. Using electrophysiological
measures, we examined the online neurocognitive processing of gesture-speech integration in
6- to 7-year-old children. We focused on the N400 event-related potentials component which
is modulated by semantic integration load. Children watched video clips of matching or
mismatching gesture-speech combinations, which varied the semantic integration load. The
ERPs showed that the amplitude of the N400 was larger in the mismatching condition than in
the matching condition. This finding provides the first neural evidence that by the ages of 6
or 7, children integrate multimodal semantic information in an online fashion comparable to
that of adults. -
Ter Bekke, M., Drijvers, L., & Holler, J. (2020). The predictive potential of hand gestures during conversation: An investigation of the timing of gestures in relation to speech. In Proceedings of the 7th GESPIN - Gesture and Speech in Interaction Conference. Stockholm: KTH Royal Institute of Technology.
Abstract
In face-to-face conversation, recipients might use the bodily movements of the speaker (e.g. gestures) to facilitate language processing. It has been suggested that one way through which this facilitation may happen is prediction. However, for this to be possible, gestures would need to precede speech, and it is unclear whether this is true during natural conversation.
In a corpus of Dutch conversations, we annotated hand gestures that represent semantic information and occurred during questions, and the word(s) which corresponded most closely to the gesturally depicted meaning. Thus, we tested whether representational gestures temporally precede their lexical affiliates. Further, to see whether preceding gestures may indeed facilitate language processing, we asked whether the gesture-speech asynchrony predicts the response time to the question the gesture is part of.
Gestures and their strokes (most meaningful movement component) indeed preceded the corresponding lexical information, thus demonstrating their predictive potential. However, while questions with gestures got faster responses than questions without, there was no evidence that questions with larger gesture-speech asynchronies get faster responses. These results suggest that gestures indeed have the potential to facilitate predictive language processing, but further analyses on larger datasets are needed to test for links between asynchrony and processing advantages.
Share this page