Research at the Multimodal Language department
Research at the MLD aims to understand the complex architecture of the human language faculty as a multimodal system (e.g., in speech, gesture, and sign) and its role in communication and cognition. We conduct our research across five integrated Research Lines that subsume several individual and collective projects. There are also Cluster Groups that focus on a subtheme cross-cutting several projects within the Research Lines (e.g., Multimodal Reference, Multimodal Prosody, Multimodal Modeling). Finally, Focus Groups, each led by a couple of researchers, aim to advance specific methodologies in multimodal language research (e.g., the Kinematics FG and the Bayesian Statistics FG). Below we highlight some of the projects conducted under the Research Lines, Cluster Groups, and Focus Groups; for a full list, please visit the individual People pages.
- Research Lines
- Multimodal Language Structure and Typology
The Multimodal Language Structure and Typology research line investigates how modality-specific aspects of visual expression, such as visible iconicity, indexicality (pointing, eye gaze), and simultaneity, are recruited in different languages, in order to understand the limits of variability and universality in multimodal language structures across languages and typologies. Currently, we focus on demonstratives, negation, prosody, and iconicity.
- The Multimodal Demonstratives project, conducted under the Multimodal Reference Cluster (see Clusters below) by Dr. Paula Rubio-Fernandez, investigates how online pointing, listeners' and producers' eye gaze, joint attention, and the location of referents determine the choice of multimodal referring expressions (e.g., this, that, that green book) in different languages, in both production and comprehension. We also investigate the role demonstratives play for children as multimodal pathways to word learning. Here, we use VR, kinematic analysis, eye tracking, machine learning, and modeling to advance our investigations in this domain. We conduct research in Turkish, Dutch, and Spanish, and plan to extend this to other signed and spoken languages (see also Bahar Tarakci's PhD project).
Example publication:
Rubio-Fernandez, P., Berke, M. D., & Jara-Ettinger, J. (2025). Tracking minds in communication. Trends in Cognitive Sciences, 29(3), 269-281. https://doi.org/10.1016/j.tics.2024.11.005
The Multimodal Negation project is conducted in both spoken and signed languages.
The spoken-language project is led by Dr. Hatice Zora and investigates how prosody and gesture form part of negation in a spoken Turkish corpus, as well as its neural processing, using EEG with virtual agents as stimuli. Here, we test whether the negation gesture is processed as part of grammar, that is, in relation to a morphological marker of negation, and how it interacts with prosody. This project also runs under the Multimodal Prosody Cluster (see below).
Here, we also investigate how negation is expressed with manual and non-manual articulators in the Turkish Sign Language corpus, as part of a PhD project led by one of our deaf PhD students, Hasan Dikyuva, in collaboration with Dr. Beyza Sumer and Dr. Roland Pfau at the University of Amsterdam (UvA).
Turkish Sign Language Corpus of Negation | Provided by: Hasan Dikyuva
The Multimodal Prosody project aims to understand the relations between prosody, information structure marking (e.g., preverbal and contrastive information), and manual gestures in narrative corpora of Turkish and Dutch speakers (the PhD project of Beyza Nur Cay), as well as in a conversational corpus with Dr. Asli Gurer, within the Multimodal Prosody Cluster.
- In the Iconicity project, we investigate how visual iconicity patterns in the lexicons, grammatical structures, and discourse structures of sign languages, as well as its role in first and second language learning, in sign language emergence, and in communicative efficiency (with Dr. Beyza Sümer, Dr. Dilay Karadöller, METU, and Dr. Anita Slonimska).
Example publication:
Karadöller, D. Z., Peeters, D., Manhardt, F., Özyürek, A., & Ortega, G. (2024). Iconicity and gesture jointly facilitate learning of second language signs at first exposure in hearing non-signers. Language Learning, 74(4), 781-813. https://doi.org/10.1111/lang.12636
- Multimodal Language (Neuro-)Cognitive Processing
The Multimodal Language (Neuro-)Cognitive Processing research line asks whether and how multimodal language structures interact with (neuro)cognitive processing and, in turn, how general cognitive constraints (memory, sensory processes), individual variation, and diversity shape multimodal language use.
The Predictive Processing in Context project investigates the role of gestures and facial expressions in the predictive processing of language in conversation. Our researchers look for converging evidence from different studies (led by senior investigator Prof. Judith Holler), using corpus studies, VR, and EEG, that visual expressions such as manual gestures and facial expressions (e.g., eyebrow movements) precede related words in conversation and thus have the potential to facilitate the processing of upcoming utterance elements (e.g., referents, verbs, speech acts) and enable their prediction.
We also investigate the role of iconic gestures in predictive language processing using the visual world paradigm across languages (e.g., Chinese and English), and the role of pointing and social eye gaze in VR.
- The Diverse Populations project conducts research with blind people and with patients with different types of brain damage, such as Alzheimer's disease, mild cognitive impairment, and aphasia, to see how different types of experience and biological, cognitive, or linguistic constraints shape the nature of multimodal expressions (see the webpages of Dr. Sharice Clough, Dr. Ezgi Mamus, and PhD student Martina Mellana). Here, we use state-of-the-art kinematic measures to quantify gesture spaces, locations, sizes, and vectors, and relate them to semantic expressions and vectors in spoken language; we also use VR.
Example of a VR setup to be used for people with Aphasia | Provided by: Martina Mellana
Example publication:
Mamus, E., Speed, L. J., Rissman, L., Majid, A., & Özyürek, A. (2023). Lack of visual experience affects multimodal language production: Evidence from congenitally blind and sighted people. Cognitive Science, 47(1), e13228. https://doi.org/10.1111/cogs.13228
Expressions of motion events by different populations | Provided by: Ezgi Mamus
- Multimodal Language Use in Interaction
The Multimodal Language Use in Interaction research line aims to understand how using language across different modalities interfaces with the interactive and communicative constraints of language use (e.g., communicative efficiency, relevance, alignment, turn-taking) and enables its flexibility.
The Cross-Modal Alignment in Conversation project is currently conducted as a PhD project (Sho Akamine), co-supervised with the Psychology of Language Department and the Centre for Language Studies, Radboud University. Using corpora (face-to-face and Zoom interactions) and experimental manipulations, we aim to understand how priming, grounding, and shared conceptualisation processes underlie multimodal language alignment. Here we also use corpus-driven multimodal computational modeling and kinematic gesture vectors to understand which factors in conversation can simulate alignment (see also the Multimodal Modeling Cluster with Dr. Esam Ghaleb).
Example publication:
Akamine, S., et al. (2024). Speakers align both their gestures and words not only to establish but also to maintain reference to create shared labels for novel objects in interaction.
An example of multimodal (lexical + gestural) alignment over the six rounds of a referential communication game.
The Contextual Modulation project, led by Dr. Anita Slonimska in collaboration with Dr. Emanuela Campisi (University of Catania), investigates how gestures, eye gaze, and signs enable communicative efficiency and audience design, and how they mark the pragmatic relevance of gestures in language use across different contexts and languages (e.g., Dutch and Italian).
- We also investigate how and which multimodal aspects of conversation play a role in human-robot interaction, using the Furhat robot (with Dr. Chinmaya Mishra).
Various experimental feedback behaviors exhibited by Furhat robot during face-to-face conversation
- Multimodal Language Acquisition, Learning and Evolution
The Multimodal Language Acquisition, Learning and Evolution research line investigates how the multimodal nature of language, in children's own expressions and in the interactive input they receive, optimizes and scaffolds language acquisition and learning processes (e.g., visual attention, memory) for L1 and L2.
Recently, our team members have written two position papers, which received 16 commentaries, summarizing the MLD's position on how language acquisition can be viewed within a multimodal language framework, considering recent findings on speech, gesture, and sign development (Karadöller, Sümer, & Özyürek, 2024, 2025).
The Iconicity in Language Learning project, with Dr. Dilay Karadöller, Dr. Ercenur Ünal, and Dr. Beyza Sümer, investigates if and how children's use of iconic gestures and signs can help reduce the difficulties in mapping cognitively complex spatial relations (e.g., left-right) onto locative terms, in both speaking and signing children. We also investigate, using eye tracking, whether patterns of children's eye gaze to spatial scenes predict whether they will gesture along with their speech when expressing left-right relations, and predict their memory for these relations. Here we are also developing a Turkish Sign Language lexicon database (TID-LEX) to further investigate the effects of different types and levels of iconicity on vocabulary learning in early and late signing children (link to the METU webpage).
Example publication:
Ünal, E., Karadöller, D. Z., & Özyürek, A. (2026). Children sustain their attention on spatial scenes when planning to describe spatial relations multimodally in speech and gesture. Developmental Science, 29(2), e70128. https://doi.org/10.1111/desc.70128
A) Example of stimuli for spatial expressions; B) Fixations to the target picture over time
The Pointing, Eye Gaze and Demonstratives in Language Learning project, conducted in collaboration with the Language Development Department and combining corpus and computational methods with infants' mobile eye-tracking, investigates whether parents' use of demonstratives is more effective than object labels and content words in guiding infants' attention to objects and in creating optimal word- and concept-learning moments (see also the Multimodal Reference Cluster, under which this project is conducted).
- In the Multimodal Language Evolution project, we investigate the role multimodality plays in language evolution and emergence. Currently, in a study conducted with the LEADs group at the MPI and PhD student Lois Dona, we investigate how, in particular, the use of facial expressions might facilitate language emergence, studied through communication games.
- Multimodal Language Modeling and Innovations
The Multimodal Language Modeling and Innovations research line aims to develop new technological tools, infrastructure, and modeling approaches to advance the study of language multimodally.
The department recently set up a state-of-the-art Vicon Motion Capture lab and produced novel pipelines for tracking movements, facial expressions, and eye gaze movements from human interactions and transferring these to VR agents for interactive experiments.
The Interaction lab was redesigned with AI-enhanced wearable eye-trackers synchronised with multiple cameras and audio recordings to be able to record interactive and conversational data sets at the dyadic or group levels.
Recently, this research line has also started to investigate how modality-specific aspects of multimodal language can be represented and generated in computational models, using machine learning and AI as new ways to understand the nature of human multimodal language representations and processing. This work is led by Dr. Esam Ghaleb in the Multimodal Modeling Cluster.
Here we also have individual projects, such as building co-speech gesture generation models with our PhD student Lanmiou Liu in collaboration with Utrecht University, and testing how VLLMs understand demonstratives and iconicity in signs compared to humans (with PhD students Onur Keles and Tianai Dong).
Example publication:
Liu, L., Ghaleb, E., Özyürek, A., & Yumak, Z. (in press). SemGes: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning. In Proceedings of the International Conference on Computer Vision (ICCV 2025).
- Under the Kinematics Focus Group tab, one can also find the recent methodological innovations we have made using kinematic measures to quantify different aspects of gestures and signs, such as velocity, segmentation, size, and similarity. The team has developed accessible toolkits for the automatic extraction of gestures, eye gaze movements, and speech using machine learning algorithms (see the Kinematics Focus Group and the open-access documentation on the departmental GitHub).
- Cluster Groups
Cluster Groups focus on specific research themes within each line, connecting researchers who share theoretical interests and encouraging collaboration, joint projects, and conceptual development.
- Reference Cluster
Leader: Paula Rubio-Fernández
Purpose: The Multimodal Reference Cluster investigates the interdependence between language and social cognition by studying multimodal referential communication from four complementary approaches:
Reference production: How do speakers of different languages synchronize gaze, pointing, and speech in face-to-face referential communication? We address this question by comparing the use of demonstratives and other referential expressions in Turkish, Japanese, and Spanish. Our participants wear eye-tracking glasses to monitor their gaze coordination, while external cameras record their speech and pointing gestures.
Reference comprehension: How do listeners integrate the speaker's gaze, pointing, and speech when they interpret a referential expression? To accurately measure listeners' responses (including their looking behavior via eye-tracking), we immerse our participants in a virtual reality environment where they perform a referential communication task with a human-animated avatar. We are conducting the first experiment in Dutch at the MPI, but using mobile technology that will allow us to run the experiment in other languages during fieldwork.
Reference development: Infants break into language through pointing, and soon after start using demonstratives to establish joint attention with their caregivers. Despite the universality of these milestones, little is known about how children acquire the meaning of demonstratives across languages. To address this question, we investigate mother-infant interaction during naturalistic toy play, focusing on the mother’s use of demonstratives and definite articles (our baseline). We use head-mounted eye-tracking to monitor gaze coordination between mother and infant, and external cameras to track their object manipulation during reference.
Reference modelling: Recent advances in multimodal language models (MLMs) have enabled systems to use text and image so naturally that users often perceive them as real conversational partners. However, existing evaluations of MLMs have largely focused on their use of vocabulary and syntax, while overlooking a fundamental class of grammatical words: indexicals. We have recently completed the first study of humans’ and MLMs’ use of indexicals in simulated face-to-face referential communication. The results confirmed the predicted difficulty hierarchy (vocabulary < possessives < demonstratives) in both groups. However, the difference between content words and indexicals was larger in MLMs, suggesting limitations in perspective-taking and spatial reasoning.
The Multimodal Reference Cluster has monthly meetings, alternating between Update Meetings (where all members give updates on their ongoing work and get feedback from others) and a Journal Club (where one of the junior members leads the discussion of a published paper on multimodal reference).
- Modeling cluster
Leader: Esam Ghaleb
Purpose: The Multimodal Modelling Group develops and applies computational methods to understand how language is realised and processed across multiple modalities, including speech, text, gesture, sign, facial expressions, eye gaze, and whole-body movement. The group treats modelling as a central methodological resource: we build tools, models, and theoretical frameworks that help reveal how multimodal signals jointly encode meaning, and how this varies across languages, communities, and interactional settings.
A core activity is the use of machine learning and AI to advance the segmentation, coding, and representation of visual communicative signals. We work with motion capture and video-based data to derive structured representations of multimodal behaviour, and we develop models that capture patterns and regularities in these signals.
The group investigates how multimodal representations learned by computational models can serve as testbeds for theories of multimodal language. By comparing model behaviour with human data, we examine how information is distributed across modalities and how different channels of communication interact.
We also conduct experiments with large language models and multimodal language models to investigate how these systems process, integrate, and generate multimodal language, and how this process compares to human communicative abilities.
Additionally, the group develops generative models for gestures and signs, which are used to drive virtual agents and avatars in controlled experimental settings. This work enables systematic studies of language in interaction, including how interlocutors respond to different kinds of multimodal behaviour and how kinematic properties affect comprehension and learning.
Overall, the group supports the broader research environment by providing reusable software, models, and analysis workflows for multimodal corpora, and by facilitating the integration of experimental, corpus-based, and modelling approaches. The overarching aim is to contribute computational methods and insights that establish multimodality and cross-linguistic diversity as fundamental design features in theories of the human language faculty.
- Prosody Cluster
Leader: Hatice Zora
Purpose: The group seeks to bring together researchers working on, or interested in, prosody across modalities, languages, and functions. Its objectives are to establish a shared foundation across research traditions, to identify and address central research challenges, and to explore strategies for advancing the field. The group will provide a forum for discussing experimental ideas, applying a range of methodologies, from behavioral and computational approaches to brain imaging techniques, and for developing theoretical frameworks.
Main topics: In particular, the group will focus on:
Architecture of prosody: How prosodic cues interact structurally and operationally, with an emphasis on their multimodality and multifunctionality.
Neurobiology of prosody: Neural mechanisms underlying prosody, with an emphasis on developing neural network models of interaction.
- Focus Groups
Focus Groups focus on methods and tools, helping researchers build technical expertise and develop shared approaches that strengthen research across all themes.
- Bayesian/Statistics
Leader: Sho Akamine
Purpose: The aim of the statistics focus group is to enhance our understanding of statistics (e.g., mixed-effects regression models) and to cultivate the critical thinking necessary for accurately interpreting statistical analysis outcomes. Our primary focus is on Bayesian inference because: (i) it is gaining popularity, (ii) the knowledge gained can be applied to the traditional frequentist approach, and (iii) Bayesian regressions demand a solid understanding of statistics, which can be overlooked in the frequentist approach, leading to potential misuse and misinterpretation of statistical models.
Main topics: (Generalized) linear mixed-effects regression, Bayesian statistics, causal inference
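As a flavor of what such models look like in practice, here is a minimal sketch of a Bayesian mixed-effects regression in Python using the bambi and ArviZ libraries (the dataset, variable names, and formula are hypothetical; the focus group works with whatever tools fit each project):

```python
import arviz as az
import bambi as bmb
import pandas as pd

# Hypothetical trial-level dataset: reaction time, condition, and identifiers
# for participant and item.
data = pd.read_csv("trials.csv")

# Bayesian mixed-effects regression: a fixed effect of condition with random
# intercepts for participants and items, using bambi's default weakly
# informative priors.
model = bmb.Model("rt ~ condition + (1|participant) + (1|item)", data)
idata = model.fit(draws=2000, chains=4)

# Posterior summaries: estimates with credible intervals rather than p-values.
print(az.summary(idata))
```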
- Concepts
Leaders: Ezgi Mamus and Marius Peelen
Purpose: This focus group is a new initiative of the MPI and Donders Institute, aimed at bringing people together across centers and themes around a topic of shared interest: concepts. Concepts play an important role in multiple fields of cognitive neuroscience, including language (e.g., linguistic concepts), perception (e.g., object categories), action (e.g., embodied cognition), memory, neuropsychology (e.g., apraxia), and lifespan development (e.g., acquisition of concepts, semantic dementia). Many people at the MPI and Donders Institute share an interest in these concepts, and we believe it would be fruitful to bring them together to discuss core questions, new results, and competing theories, potentially leading to new interdisciplinary collaborations.
The full schedule for the academic year (always 13.30-15.00):
Tuesday, September 23, 2025 – MM 01.620 – Peter Hagoort
Tuesday, November 25, 2025 – MM 01.620 – Marius Peelen
Tuesday, February 3, 2026 – MPI 163 – Floris de Lange
Tuesday, March 24, 2026 – MPI 163 – Asli Ozyurek & Ezgi Mamus
Tuesday, May 26, 2026 – MPI 163 – TBA
- Kinematics
Leaders: Sharice Clough and Mounika Kanakanti
Purpose: To train researchers in the skills and knowledge needed to conduct kinematic analyses of gesture and sign language behavior using video-based motion tracking.
Main Topics: The MLD Kinematics Focus Group gives researchers hands-on experience with coding modules covering topics such as how to extract video-based motion tracking data (e.g., MediaPipe data), how to preprocess the data (e.g., smoothing and normalization), how to merge motion-tracking data with other time series data (e.g., ELAN annotations, eye-tracking data), and how to perform calculations to quantify the spatiotemporal dynamics of a movement signal (e.g., velocity, submovements, holds, vertical amplitude, size, volume). The focus group also provides opportunities for researchers to present their own kinematic analyses for code review and share new research findings. The group is highly interactive with lots of discussion about novel applications of kinematic analyses for gesture and sign language research as well as limitations and practical considerations for applying video-based motion tracking methods to existing and new datasets.
Check the MPI GitHub page for repositories containing scripts for various multimodal analyses.
- MPI Github
This is the code we use to study how visible bodily signals (hands, face, and body posture) coordinate with speech to form multimodal language. Our work spans corpora, experimental design, computational modeling, and kinematic/gesture analysis. On this GitHub landing page, we host repositories containing scripts for various multimodal analyses. Portions of many of these scripts are sourced and adapted from EnvisionBox (https://envisionbox.org), which maintains a library of coding modules for open exchange among researchers.
https://github.com/Multimodal-Language-Department-MPI-NL
Extracting Mediapipe Keypoints
This module shows how to generate motion-tracking data from videos. It uses Google’s MediaPipe library to extract human pose landmarks across all video frames.
https://github.com/Multimodal-Language-Department-MPI-NL/MediaPipe_keypoints_extraction
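A minimal sketch of this kind of extraction step, assuming a local video file and MediaPipe's Pose solution (file names are hypothetical; see the repository for the full module):

```python
import cv2
import mediapipe as mp
import pandas as pd

# Hypothetical input video; the repository's own pipeline may use different I/O.
cap = cv2.VideoCapture("interaction.mp4")
rows = []

with mp.solutions.pose.Pose(static_image_mode=False) as pose:
    frame_idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB images; OpenCV reads frames as BGR.
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            for lm_idx, lm in enumerate(result.pose_landmarks.landmark):
                rows.append({"frame": frame_idx, "landmark": lm_idx,
                             "x": lm.x, "y": lm.y, "z": lm.z,
                             "visibility": lm.visibility})
        frame_idx += 1

cap.release()
pd.DataFrame(rows).to_csv("keypoints.csv", index=False)
```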
Smoothing
This module shows how to smooth motion-tracking data to handle noise due to tracking inaccuracies and how to interpolate missing data.
https://github.com/Multimodal-Language-Department-MPI-NL/Smoothing
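A minimal sketch of the idea, assuming a per-frame keypoint table for a single landmark (file and column names are hypothetical; the module documents the actual workflow):

```python
import pandas as pd
from scipy.signal import savgol_filter

# Hypothetical input: one row per frame with x/y coordinates of one landmark.
df = pd.read_csv("keypoints_wrist.csv")  # columns: frame, x, y

# Fill gaps left by tracking failures via linear interpolation.
df[["x", "y"]] = df[["x", "y"]].interpolate(method="linear", limit_direction="both")

# Savitzky-Golay filter: fits a low-order polynomial in a sliding window,
# suppressing frame-to-frame jitter while preserving the movement shape.
df["x_smooth"] = savgol_filter(df["x"], window_length=11, polyorder=3)
df["y_smooth"] = savgol_filter(df["y"], window_length=11, polyorder=3)

df.to_csv("keypoints_wrist_smoothed.csv", index=False)
```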
Normalization
This module shows how to normalize the size and position of motion-tracking data across files. Normalization ensures that the data for all files is on the same scale so that you can compare movement trajectories across files with different resolutions, camera setups, and participant size.
https://github.com/Multimodal-Language-Department-MPI-NL/Normalization
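A minimal sketch of one common normalization scheme, centering coordinates on the shoulder midpoint and scaling by shoulder width (landmark and column names are hypothetical; the module describes the actual procedure):

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format table: one row per frame, one column per landmark axis.
df = pd.read_csv("keypoints_wide.csv")

# Use the shoulder midpoint as the origin so body position in the frame no
# longer matters.
mid_x = (df["left_shoulder_x"] + df["right_shoulder_x"]) / 2
mid_y = (df["left_shoulder_y"] + df["right_shoulder_y"]) / 2

# Use the mean shoulder distance as a body-size unit so participants and
# camera setups with different scales become comparable.
shoulder_dist = np.sqrt((df["left_shoulder_x"] - df["right_shoulder_x"]) ** 2 +
                        (df["left_shoulder_y"] - df["right_shoulder_y"]) ** 2).mean()

for lm in ["left_wrist", "right_wrist"]:
    df[f"{lm}_x_norm"] = (df[f"{lm}_x"] - mid_x) / shoulder_dist
    df[f"{lm}_y_norm"] = (df[f"{lm}_y"] - mid_y) / shoulder_dist

df.to_csv("keypoints_normalized.csv", index=False)
```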
Merging Elan and MediaPipe
This module shows how to merge motion-tracking data with annotations from ELAN (or other time-series data). This allows you to perform kinematic analyses of gesture strokes or other manually annotated units from ELAN.
https://github.com/Multimodal-Language-Department-MPI-NL/Merging_Motion_ELAN
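A minimal sketch of the merging logic, assuming per-frame keypoints with timestamps and a tab-delimited ELAN export (column layout and file names are hypothetical; the module covers the actual formats):

```python
import pandas as pd

# Hypothetical inputs: per-frame keypoints with a time column (in seconds), and
# an ELAN export with one row per annotation on a gesture tier.
motion = pd.read_csv("keypoints_wrist_smoothed.csv")          # frame, time, x, y
elan = pd.read_csv("gesture_annotations.txt", sep="\t",
                   names=["tier", "begin", "end", "label"])

motion["gesture_label"] = None
for _, ann in elan.iterrows():
    # Mark every frame whose timestamp falls inside the annotated interval.
    inside = (motion["time"] >= ann["begin"]) & (motion["time"] <= ann["end"])
    motion.loc[inside, "gesture_label"] = ann["label"]

# Frames outside any annotation keep the value None; annotated frames can now
# be grouped by label for per-stroke kinematic analysis.
motion.to_csv("keypoints_with_annotations.csv", index=False)
```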
Speed, Acceleration, and Jerk
This module shows how to calculate movement speed. It also calculates acceleration (the rate of change of speed over time) and jerk (the rate of change of acceleration, reflecting sudden movements), which are derivatives of speed.
https://github.com/Multimodal-Language-Department-MPI-NL/Speed_Acceleration_Jerk
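A minimal sketch of these calculations with NumPy (the sampling rate, file, and column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("keypoints_wrist_smoothed.csv")  # hypothetical: time, x, y
fps = 25.0                                        # assumed sampling rate
dt = 1.0 / fps

# Speed: magnitude of the displacement per unit time, from numerical derivatives.
dx = np.gradient(df["x"].to_numpy(), dt)
dy = np.gradient(df["y"].to_numpy(), dt)
df["speed"] = np.sqrt(dx ** 2 + dy ** 2)

# Acceleration: rate of change of speed; jerk: rate of change of acceleration.
df["acceleration"] = np.gradient(df["speed"].to_numpy(), dt)
df["jerk"] = np.gradient(df["acceleration"].to_numpy(), dt)

df.to_csv("wrist_kinematics.csv", index=False)
```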
Submovements and Holds
This module shows how to calculate the number of submovements of a movement signal based on peak speed and how to detect movement holds (i.e., pauses) below a certain speed threshold. These measures relate to how complex and/or segmented a movement signal is.
https://github.com/Multimodal-Language-Department-MPI-NL/Submovements_Holds
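A minimal sketch of both measures (the prominence and speed thresholds and the minimum hold duration are hypothetical placeholders and depend on the data and normalization used):

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

df = pd.read_csv("wrist_kinematics.csv")   # hypothetical: time, speed
speed = df["speed"].to_numpy()
fps = 25.0

# Submovements: count local maxima in the speed profile above a minimum
# prominence, so small jitter peaks are not counted.
peaks, _ = find_peaks(speed, prominence=0.05)
n_submovements = len(peaks)

# Holds: stretches where speed stays below a threshold for a minimum duration.
below = speed < 0.02
min_hold_frames = int(0.2 * fps)           # e.g., at least 200 ms
holds, run_start = [], None
for i, low in enumerate(np.append(below, False)):
    if low and run_start is None:
        run_start = i
    elif not low and run_start is not None:
        if i - run_start >= min_hold_frames:
            holds.append((run_start, i - 1))
        run_start = None

print(f"{n_submovements} submovements, {len(holds)} holds")
```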
Gesture Space, Size, and Volume
This module shows how to calculate maximum vertical amplitude (gesture height), characterize the location of gestures based on McNeillian space (McNeill, 1992), and calculate the 2D size or 3D volume of gesture (or sign) space.
https://github.com/Multimodal-Language-Department-MPI-NL/Gesture_Space_Size_and_Volume
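A minimal sketch of the amplitude and size calculations using a convex hull (file and column names are hypothetical; the McNeillian space coding is omitted here):

```python
import pandas as pd
from scipy.spatial import ConvexHull

# Hypothetical normalized wrist positions for one gesture stroke.
df = pd.read_csv("stroke_keypoints.csv")   # columns: x, y (and optionally z)

# Maximum vertical amplitude: highest point reached relative to the lowest.
vertical_amplitude = df["y"].max() - df["y"].min()

# 2D gesture size: area of the convex hull around all wrist positions;
# with 3D data, a ConvexHull on (x, y, z) gives a volume instead.
hull_2d = ConvexHull(df[["x", "y"]].to_numpy())
gesture_area = hull_2d.volume   # for 2D input, .volume is the enclosed area

print(f"vertical amplitude: {vertical_amplitude:.3f}, area: {gesture_area:.3f}")
```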
Heatmap Visualization
This module shows how to generate a scatterplot density heatmap depicting the location of participants’ wrist keypoints during a given movement signal. This visualization shows how participants use space during sign or gesture production.
https://github.com/Multimodal-Language-Department-MPI-NL/Heatmap
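A minimal sketch of such a visualization with matplotlib (column and file names are hypothetical and assume normalized wrist coordinates):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical table of wrist positions pooled over a recording.
df = pd.read_csv("keypoints_normalized.csv")

fig, ax = plt.subplots(figsize=(5, 5))
# 2D histogram: brighter bins mean the wrist spent more frames at that location.
h = ax.hist2d(df["right_wrist_x_norm"], df["right_wrist_y_norm"],
              bins=50, cmap="viridis")
fig.colorbar(h[3], ax=ax, label="number of frames")
ax.invert_yaxis()   # image coordinates: y grows downward
ax.set_xlabel("x (shoulder-width units)")
ax.set_ylabel("y (shoulder-width units)")
ax.set_title("Right-wrist location density")
fig.savefig("wrist_heatmap.png", dpi=200)
```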
- Reading Groups
Groups for collective exploration of current literature and theoretical developments.
- Theory of Multimodal Language
Leaders: Ercenur Ünal & Neil Cohn
For over a century, language has been considered an amodal capacity, flowing into different modalities while maintaining speech as the primary modality. However, research over the past half-century has revealed problems with this speech-centric, amodal conception of language, particularly given the pervasiveness of multimodal communication. In these meetings, we discuss readings that challenge the predominant amodal paradigm of language and propose alternative, multimodal theoretical frameworks.
We discuss issues like:
- What is language, particularly in relation to other behaviors like gesture, drawing, or music? What is a modality?
- Where do iconicity, indexicality, and symbolicity fit within the language architecture?
- How do we characterize the varying complexity of grammars and their interactions both within modalities (like in bilingual codeswitching) and across modalities (like in multimodality)?
- How does multimodality change conceptions of linguistic universals, evolution, or relativity?
Altogether, this discussion aims to provide a theoretical foundation for reconfiguring the grounding principles of the language sciences for a Multimodal Paradigm.