Burchardt, L., Van de Sande, Y., Kehy, M., Gamba, M., Ravignani, A., & Pouw, W. (2024). A toolkit for the dynamic study of air sacs in siamang and other elastic circular structures. PLOS Computational Biology, 20(6): e1012222. doi:10.1371/journal.pcbi.1012222.
Abstract
Biological structures are defined by rigid elements, such as bones, and elastic elements, like muscles and membranes. Computer vision advances have enabled automatic tracking of moving animal skeletal poses. Such developments provide insights into complex time-varying dynamics of biological motion. Conversely, the elastic soft tissues of organisms, like the nose of elephant seals, or the buccal sac of frogs, are poorly studied and no computer vision methods have been proposed. This leaves major gaps in different areas of biology. In primatology, most critically, the function of air sacs is widely debated; many open questions on the role of air sacs in the evolution of animal communication, including human speech, remain unanswered. To support the dynamic study of soft-tissue structures, we present a toolkit for the automated tracking of semi-circular elastic structures in biological video data. The toolkit contains unsupervised computer vision tools (using Hough transform) and supervised deep learning (by adapting DeepLabCut) methodology to track inflation of laryngeal air sacs or other biological spherical objects (e.g., gular cavities). Confirming the value of elastic kinematic analysis, we show that air sac inflation correlates with acoustic markers that likely inform about body size. Finally, we present a pre-processed audiovisual-kinematic dataset of 7+ hours of close-up audiovisual recordings of siamang (Symphalangus syndactylus) singing. This toolkit (https://github.com/WimPouw/AirSacTracker) aims to revitalize the study of non-skeletal morphological structures across multiple species.
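As a rough illustration of the unsupervised route the abstract describes, the sketch below detects a roughly circular structure in each video frame with OpenCV's Hough transform and reports its radius over time; the file name and all parameter values are placeholders, not settings taken from the AirSacTracker toolkit.

```python
# Sketch: per-frame circle detection with the Hough transform (OpenCV),
# in the spirit of the unsupervised branch of the toolkit.
# "siamang_clip.mp4" and all parameter values are illustrative, not from the paper.
import cv2

def track_circles(video_path):
    """Yield (frame_index, x, y, radius) for the first detected circle per frame."""
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.medianBlur(gray, 5)  # reduce noise before edge detection
        circles = cv2.HoughCircles(
            gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=100,
            param1=100, param2=30, minRadius=20, maxRadius=200)
        if circles is not None:
            x, y, r = circles[0][0]  # first circle candidate
            yield frame_idx, float(x), float(y), float(r)
        frame_idx += 1
    cap.release()

if __name__ == "__main__":
    for i, x, y, r in track_circles("siamang_clip.mp4"):
        print(i, x, y, r)  # radius over time approximates air sac inflation
```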
Ghaleb, E., Rasenberg, M., Pouw, W., Toni, I., Holler, J., Özyürek, A., & Fernandez, R. (2024). Analysing cross-speaker convergence through the lens of automatically detected shared linguistic constructions. In L. K. Samuelson, S. L. Frank, A. Mackey, & E. Hazeltine (Eds.), Proceedings of the 46th Annual Meeting of the Cognitive Science Society (CogSci 2024) (pp. 1717-1723).
Abstract
Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions (expressions with a common lexical core used by both speakers within a dialogue) and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the number of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.
Additional information
Link to eScholarship
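For a concrete starting point, here is a minimal sketch of the general idea of shared-construction detection (intersecting the lemma n-grams used by both speakers in a dialogue); the input format, n-gram range, and toy example are assumptions, and the paper's actual pipeline involves further steps not reproduced here.

```python
# Sketch: find "shared constructions" as lemma n-grams that both speakers in a
# dialogue use at least once. Input format and n-gram range are illustrative;
# the paper's detection procedure is more involved (e.g., lexical-core filtering).
from collections import Counter
from itertools import chain

def ngrams(lemmas, n):
    return zip(*(lemmas[i:] for i in range(n)))

def shared_constructions(utterances_a, utterances_b, n_range=(1, 2, 3)):
    """utterances_a/b: lists of lemmatized utterances (lists of lemma strings)."""
    counts = []
    for utts in (utterances_a, utterances_b):
        c = Counter(chain.from_iterable(
            ngrams(u, n) for u in utts for n in n_range))
        counts.append(c)
    shared = set(counts[0]) & set(counts[1])
    # total frequency of each shared construction across both speakers
    return {gram: counts[0][gram] + counts[1][gram] for gram in shared}

# illustrative toy input (already lemmatized)
a = [["the", "round", "spiky", "one"], ["spiky", "one", "again"]]
b = [["yes", "the", "spiky", "one"]]
print(shared_constructions(a, b))
```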
Ghaleb, E., Khaertdinov, B., Pouw, W., Rasenberg, M., Holler, J., Özyürek, A., & Fernandez, R. (2024). Learning co-speech gesture representations in dialogue through contrastive learning: An intrinsic evaluation. In Proceedings of the 26th International Conference on Multimodal Interaction (ICMI 2024) (pp. 274-283).
Abstract
In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations considering gestures’ variability and relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations through comparison with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess the possibility of recovering interpretable gesture features from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.
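As a loose sketch of what a multimodal contrastive objective looks like (not the architecture or configuration used in the paper), the snippet below computes a symmetric InfoNCE-style loss between paired gesture and speech embeddings; the encoder outputs, dimensions, and temperature are placeholders.

```python
# Sketch: a generic multimodal contrastive (InfoNCE-style) objective that pulls a
# gesture embedding toward the embedding of its co-occurring speech segment and
# pushes it away from the other speech segments in the batch. All values are
# placeholders, not the paper's configuration.
import torch
import torch.nn.functional as F

def multimodal_info_nce(gesture_emb, speech_emb, temperature=0.07):
    """gesture_emb, speech_emb: (batch, dim) embeddings of paired segments."""
    g = F.normalize(gesture_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = g @ s.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(g.size(0))         # i-th gesture pairs with i-th speech
    # symmetric loss: gesture-to-speech and speech-to-gesture
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage with random stand-ins for encoder outputs
gesture = torch.randn(8, 128)  # e.g., pooled skeleton-encoder output
speech = torch.randn(8, 128)   # e.g., pooled speech-encoder output
print(multimodal_info_nce(gesture, speech).item())
```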
Ghaleb, E., Burenko, I., Rasenberg, M., Pouw, W., Uhrig, P., Holler, J., Toni, I., Özyürek, A., & Fernandez, R. (2024). Co-speech gesture detection through multi-phase sequence labeling. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024) (pp. 4007-4015).
Abstract
Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework’s capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.
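A stripped-down sketch of the sequence-labeling idea is given below: a Transformer encoder over a window of skeletal features emits per-frame gesture-phase logits. The label set and all dimensions are placeholders, and the Conditional Random Field decoding layer the paper uses on top is omitted.

```python
# Sketch: per-frame gesture-phase labeling over a window of skeletal features with
# a Transformer encoder. Hyperparameters and the phase label set are placeholders;
# the paper's full model additionally decodes the label sequence with a CRF.
import torch
import torch.nn as nn

PHASES = ["neutral", "preparation", "stroke", "retraction"]  # illustrative labels

class PhaseLabeler(nn.Module):
    def __init__(self, feat_dim=50, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, len(PHASES))

    def forward(self, x):
        """x: (batch, frames, feat_dim) skeletal features -> (batch, frames, phases)."""
        h = self.encoder(self.proj(x))
        return self.head(h)  # per-frame phase logits (CRF decoding would follow here)

# toy usage: a batch of 2 windows, 60 frames, 50 keypoint features each
logits = PhaseLabeler()(torch.randn(2, 60, 50))
print(logits.shape)  # torch.Size([2, 60, 4])
```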
Leonetti, S., Ravignani, A., & Pouw, W. (2024). A cross-species framework for classifying sound-movement couplings. Neuroscience and Biobehavioral Reviews, 167: 105911. doi:10.1016/j.neubiorev.2024.105911.
Abstract
Sound and movement are entangled in animal communication. This is obviously true in the case of sound-constituting vibratory movements of biological structures which generate acoustic waves. A little less obvious is that other moving structures produce the energy required to sustain these vibrations. In many species, the respiratory system moves to generate the expiratory flow which powers the sound-constituting movements (sound-powering movements). The sound may acquire additional structure via upper tract movements, such as articulatory movements or head raising (sound-filtering movements). Some movements are not necessary for sound production, but when produced, impinge on the sound-producing process due to weak biomechanical coupling with body parts (e.g., respiratory system) that are necessary for sound production (sound-impinging movements). Animals also produce sounds contingent with movement, requiring neurophysiological control regimes that allow movements to be flexibly coupled to a produced sound, or to a perceived external sound (sound-contingent movement). Here, we compare and classify the variety of ways sound and movements are coupled in animal communication; our proposed framework should help structure previous and future studies on this topic.
Dowell, C., Hajnal, A., Pouw, W., & Wagman, J. B. (2020). Visual and haptic perception of affordances of feelies. Perception, 49(9), 905-925. doi:10.1177/0301006620946532.
Abstract
Most objects have well-defined affordances. Investigating perception of affordances of objects that were not created for a specific purpose would provide insight into how affordances are perceived. In addition, comparison of perception of affordances for such objects across different exploratory modalities (visual vs. haptic) would offer a strong test of the lawfulness of information about affordances (i.e., the invariance of such information over transformation). Along these lines, “feelies”—objects created by Gibson with no obvious function and unlike any common object—could shed light on the processes underlying affordance perception. This study showed that when observers reported potential uses for feelies, modality significantly influenced what kind of affordances were perceived. Specifically, visual exploration resulted in more noun labels (e.g., “toy”) than haptic exploration, which resulted in more verb labels (e.g., “throw”). These results suggested that overlapping, but distinct, classes of action possibilities are perceivable using vision and haptics. Semantic network analyses revealed that visual exploration resulted in object-oriented responses focused on object identification, whereas haptic exploration resulted in action-oriented responses. Cluster analyses confirmed these results. Affordance labels produced in the visual condition were more consistent, used fewer descriptors, and were less diverse but more novel than in the haptic condition.
Eielts, C., Pouw, W., Ouwehand, K., Van Gog, T., Zwaan, R. A., & Paas, F. (2020). Co-thought gesturing supports more complex problem solving in subjects with lower visual working-memory capacity. Psychological Research, 84, 502-513. doi:10.1007/s00426-018-1065-9.
Abstract
During silent problem solving, hand gestures arise that have no communicative intent. The role of such co-thought gestures in cognition has been understudied in cognitive research as compared to co-speech gestures. We investigated whether gesticulation during silent problem solving supported subsequent performance in a Tower of Hanoi problem-solving task, in relation to visual working-memory capacity and task complexity. Seventy-six participants were assigned to either an instructed gesture condition or a condition that allowed them to gesture, but without explicit instructions to do so. This resulted in three gesture groups: (1) non-gesturing; (2) spontaneous gesturing; (3) instructed gesturing. In line with the embedded/extended cognition perspective on gesture, gesturing benefited complex problem-solving performance for participants with a lower visual working-memory capacity, but not for participants with a lower spatial working-memory capacity.
Hostetter, A. B., Pouw, W., & Wakefield, E. M. (2020). Learning from gesture and action: An investigation of memory for where objects went and how they got there. Cognitive Science, 44(9): e12889. doi:10.1111/cogs.12889.
Abstract
Speakers often use gesture to demonstrate how to perform actions—for example, they might show how to open the top of a jar by making a twisting motion above the jar. Yet it is unclear whether listeners learn as much from seeing such gestures as they learn from seeing actions that physically change the position of objects (i.e., actually opening the jar). Here, we examined participants' implicit and explicit understanding about a series of movements that demonstrated how to move a set of objects. The movements were either shown with actions that physically relocated each object or with gestures that represented the relocation without touching the objects. Further, the end location that was indicated for each object covaried with whether the object was grasped with one or two hands. We found that memory for the end location of each object was better after seeing the physical relocation of the objects, that is, after seeing action, than after seeing gesture, regardless of whether speech was absent (Experiment 1) or present (Experiment 2). However, gesture and action built similar implicit understanding of how a particular handgrasp corresponded with a particular end location. Although gestures miss the benefit of showing the end state of objects that have been acted upon, the data show that gestures are as good as action in building knowledge of how to perform an action.
Pouw, W., Paxton, A., Harrison, S. J., & Dixon, J. A. (2020). Reply to Ravignani and Kotz: Physical impulses from upper-limb movements impact the respiratory–vocal system. Proceedings of the National Academy of Sciences of the United States of America, 117(38), 23225-23226. doi:10.1073/pnas.2015452117.
Additional information
This article has a letter
Pouw, W., Paxton, A., Harrison, S. J., & Dixon, J. A. (2020). Acoustic information about upper limb movement in voicing. Proceedings of the National Academy of Sciences of the United States of America, 117(21), 11364-11367. doi:10.1073/pnas.2004163117.
Abstract
We show that the human voice has complex acoustic qualities that are directly coupled to peripheral musculoskeletal tensioning of the body, such as subtle wrist movements. In this study, human vocalizers produced a steady-state vocalization while rhythmically moving the wrist or the arm at different tempos. Although listeners could only hear but not see the vocalizer, they were able to completely synchronize their own rhythmic wrist or arm movement with the movement of the vocalizer which they perceived in the voice acoustics. This study corroborates recent evidence suggesting that the human voice is constrained by bodily tensioning affecting the respiratory-vocal system. The current results show that the human voice contains a bodily imprint that is directly informative for the interpersonal perception of another’s dynamic physical states.
Additional information
This article has a letter by Ravignani and Kotz
This article has a reply to Ravignani and Kotz
Pouw, W., Wassenburg, S. I., Hostetter, A. B., De Koning, B. B., & Paas, F. (2020). Does gesture strengthen sensorimotor knowledge of objects? The case of the size-weight illusion. Psychological Research, 84(4), 966-980. doi:10.1007/s00426-018-1128-y.
Abstract
Co-speech gestures have been proposed to strengthen sensorimotor knowledge related to objects’ weight and manipulability. This pre-registered study (https://www.osf.io/9uh6q/) was designed to explore how gestures affect memory for sensorimotor information through the application of the visual-haptic size-weight illusion (i.e., objects weigh the same, but are experienced as different in weight). With this paradigm, a discrepancy can be induced between participants’ conscious illusory perception of objects’ weight and their implicit sensorimotor knowledge (i.e., veridical motor coordination). Depending on whether gestures reflect and strengthen either of these types of knowledge, gestures may respectively decrease or increase the magnitude of the size-weight illusion. Participants (N = 159) practiced a problem-solving task with small and large objects that were designed to induce a size-weight illusion, and then explained the task with or without co-speech gesture or completed a control task. Afterwards, participants judged the heaviness of objects from memory and then while holding them. Confirmatory analyses revealed an inverted size-weight illusion based on heaviness judgments from memory and we found gesturing did not affect judgments. However, exploratory analyses showed reliable correlations between participants’ heaviness judgments from memory and (a) the number of gestures produced that simulated actions, and (b) the kinematics of the lifting phases of those gestures. These findings suggest that gestures emerge as sensorimotor imaginings that are governed by the agent’s conscious renderings about the actions they describe, rather than implicit motor routines.
Pouw, W., Harrison, S. J., Esteve-Gibert, N., & Dixon, J. A. (2020). Energy flows in gesture-speech physics: The respiratory-vocal system and its coupling with hand gestures. The Journal of the Acoustical Society of America, 148(3): 1231. doi:10.1121/10.0001730.
Abstract
Expressive moments in communicative hand gestures often align with emphatic stress in speech. It has recently been found that acoustic markers of emphatic stress arise naturally during steady-state phonation when upper-limb movements impart physical impulses on the body, most likely affecting acoustics via respiratory activity. In this confirmatory study, participants (N = 29) repeatedly uttered consonant-vowel (/pa/) mono-syllables while moving in particular phase relations with speech, or not moving the upper limbs. This study shows that respiration-related activity is affected by (especially high-impulse) gesturing when vocalizations occur near peaks in physical impulse. This study further shows that gesture-induced moments of bodily impulses increase the amplitude envelope of speech, while not similarly affecting the Fundamental Frequency (F0). Finally, tight relations between respiration-related activity and vocalization were observed, even in the absence of movement, but even more so when upper-limb movement was present. The current findings expand a developing line of research showing that speech is modulated by functional biomechanical linkages between hand gestures and the respiratory system. This identification of gesture-speech biomechanics promises to provide an alternative phylogenetic, ontogenetic, and mechanistic explanatory route of why communicative upper limb movements co-occur with speech in humans.
Additional information
Link to Preprint on OSF
Pouw, W., & Dixon, J. A. (2020). Gesture networks: Introducing dynamic time warping and network analysis for the kinematic study of gesture ensembles. Discourse Processes, 57(4), 301-319. doi:10.1080/0163853X.2019.1678967.
Abstract
We introduce applications of established methods in time-series and network analysis that we jointly apply here for the kinematic study of gesture ensembles. We define a gesture ensemble as the set of gestures produced during discourse by a single person or a group of persons. Here we are interested in how gestures kinematically relate to one another. We use a bivariate time-series analysis called dynamic time warping to assess how similar each gesture is to other gestures in the ensemble in terms of their velocity profiles (as well as studying multivariate cases with gesture velocity and speech amplitude envelope profiles). By relating each gesture event to all other gesture events produced in the ensemble, we obtain a weighted matrix that essentially represents a network of similarity relationships. We can therefore apply network analysis that can gauge, for example, how diverse or coherent certain gestures are with respect to the gesture ensemble. We believe these analyses promise to be of great value for gesture studies, as we can come to understand how low-level gesture features (kinematics of gesture) relate to the higher-order organizational structures present at the level of discourse.
Additional information
Open Data OSF
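To make the pipeline concrete, here is a minimal sketch that compares gesture velocity profiles pairwise with dynamic time warping and treats the resulting similarity matrix as a weighted graph for network analysis; the toy profiles and the distance-to-similarity transform are illustrative, not the paper's exact normalization.

```python
# Sketch: build a gesture "network" by comparing velocity profiles pairwise with
# dynamic time warping (plain NumPy dynamic programming) and loading the resulting
# similarities into a weighted graph. Toy data and the similarity transform are
# illustrative only.
import numpy as np
import networkx as nx

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def gesture_network(velocity_profiles):
    """velocity_profiles: list of 1-D arrays (one per gesture event)."""
    k = len(velocity_profiles)
    G = nx.Graph()
    for i in range(k):
        for j in range(i + 1, k):
            d = dtw_distance(velocity_profiles[i], velocity_profiles[j])
            G.add_edge(i, j, weight=1.0 / (1.0 + d))  # similarity as edge weight
    return G

# toy usage: three short velocity profiles of different lengths
profiles = [np.abs(np.sin(np.linspace(0, np.pi, t))) for t in (30, 40, 35)]
G = gesture_network(profiles)
strength = dict(G.degree(weight="weight"))  # how coherent each gesture is with the ensemble
print(strength)
```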
Pouw, W., Harrison, S. J., & Dixon, J. A. (2020). Gesture–speech physics: The biomechanical basis for the emergence of gesture–speech synchrony. Journal of Experimental Psychology: General, 149(2), 391-404. doi:10.1037/xge0000646.
Abstract
The phenomenon of gesture–speech synchrony involves tight coupling of prosodic contrasts in gesture movement (e.g., peak velocity) and speech (e.g., peaks in fundamental frequency; F0). Gesture–speech synchrony has been understood as completely governed by sophisticated neural-cognitive mechanisms. However, gesture–speech synchrony may have its original basis in the resonating forces that travel through the body. In the current preregistered study, movements with high physical impact affected phonation in line with gesture–speech synchrony as observed in natural contexts. Rhythmic beating of the arms entrained phonation acoustics (F0 and the amplitude envelope). Such effects were absent for a condition with low-impetus movements (wrist movements) and a condition without movement. Further, movement–phonation synchrony was more pronounced when participants were standing as opposed to sitting, indicating a mediating role for postural stability. We conclude that gesture–speech synchrony has a biomechanical basis, which will have implications for our cognitive, ontogenetic, and phylogenetic understanding of multimodal language.
Additional information
Data availability: analysis scripts and pre-registration
Pouw, W., Trujillo, J. P., & Dixon, J. A. (2020). The quantification of gesture–speech synchrony: A tutorial and validation of multimodal data acquisition using device-based and video-based motion tracking. Behavior Research Methods, 52, 723-740. doi:10.3758/s13428-019-01271-9.
Abstract
There is increasing evidence that hand gestures and speech synchronize their activity on multiple dimensions and timescales. For example, gesture’s kinematic peaks (e.g., maximum speed) are coupled with prosodic markers in speech. Such coupling operates on very short timescales at the level of syllables (200 ms), and therefore requires high-resolution measurement of gesture kinematics and speech acoustics. High-resolution speech analysis is common for gesture studies, given that field’s classic ties with (psycho)linguistics. However, the field has lagged behind in the objective study of gesture kinematics (e.g., as compared to research on instrumental action). Often kinematic peaks in gesture are measured by eye, where a “moment of maximum effort” is determined by several raters. In the present article, we provide a tutorial on more efficient methods to quantify the temporal properties of gesture kinematics, in which we focus on common challenges and possible solutions that come with the complexities of studying multimodal language. We further introduce and compare, using an actual gesture dataset (392 gesture events), the performance of two video-based motion-tracking methods (deep learning vs. pixel change) against a high-performance wired motion-tracking system (Polhemus Liberty). We show that the videography methods perform well in the temporal estimation of kinematic peaks, and thus provide a cheap alternative to expensive motion-tracking systems. We hope that the present article incites gesture researchers to embark on the widespread objective study of gesture kinematics and their relation to speech.
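As a small, self-contained illustration of the kind of kinematic peak quantification the tutorial advocates (with made-up signals, sampling rates, and smoothing settings rather than the article's data or code), the sketch below computes a smoothed speed profile from a tracked keypoint and extracts peak times that can be compared with peaks in a speech amplitude envelope.

```python
# Sketch: locate kinematic peaks (maximum speed) in a tracked keypoint trajectory
# and compare their timing with peaks in a speech amplitude envelope. Sampling
# rates, smoothing settings, and the toy signals are illustrative only.
import numpy as np
from scipy.signal import find_peaks, savgol_filter

def speed_profile(xy, fps):
    """xy: (frames, 2) keypoint positions -> smoothed speed in units per second."""
    speed = np.linalg.norm(np.diff(xy, axis=0), axis=1) * fps
    return savgol_filter(speed, window_length=11, polyorder=3)

def peak_times(signal, rate, min_height=None):
    """Return peak locations in seconds for a 1-D signal sampled at `rate` Hz."""
    peaks, _ = find_peaks(signal, height=min_height)
    return peaks / rate

# toy data: a rhythmically beating wrist at 100 fps and a 100 Hz speech envelope
t = np.linspace(0, 2, 200)
wrist = np.stack([np.sin(2 * np.pi * 1.5 * t), np.zeros_like(t)], axis=1)
envelope = 0.5 + 0.5 * np.sin(2 * np.pi * 1.5 * t)

gesture_peaks = peak_times(speed_profile(wrist, fps=100), rate=100)
speech_peaks = peak_times(envelope, rate=100)
print(gesture_peaks, speech_peaks)  # peak timings to compare across modalities
```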