Publications

  • Burchardt, L., Van de Sande, Y., Kehy, M., Gamba, M., Ravignani, A., & Pouw, W. (2024). A toolkit for the dynamic study of air sacs in siamang and other elastic circular structures. PLOS Computational Biology, 20(6): e1012222. doi:10.1371/journal.pcbi.1012222.

    Abstract

    Biological structures are defined by rigid elements, such as bones, and elastic elements, like muscles and membranes. Computer vision advances have enabled automatic tracking of moving animal skeletal poses. Such developments provide insights into complex time-varying dynamics of biological motion. Conversely, the elastic soft-tissues of organisms, like the nose of elephant seals, or the buccal sac of frogs, are poorly studied and no computer vision methods have been proposed. This leaves major gaps in different areas of biology. In primatology, most critically, the function of air sacs is widely debated; many open questions on the role of air sacs in the evolution of animal communication, including human speech, remain unanswered. To support the dynamic study of soft-tissue structures, we present a toolkit for the automated tracking of semi-circular elastic structures in biological video data. The toolkit contains unsupervised computer vision tools (using Hough transform) and supervised deep learning (by adapting DeepLabCut) methodology to track inflation of laryngeal air sacs or other biological spherical objects (e.g., gular cavities). Confirming the value of elastic kinematic analysis, we show that air sac inflation correlates with acoustic markers that likely inform about body size. Finally, we present a pre-processed audiovisual-kinematic dataset of 7+ hours of closeup audiovisual recordings of siamang (Symphalangus syndactylus) singing. This toolkit (https://github.com/WimPouw/AirSacTracker) aims to revitalize the study of non-skeletal morphological structures across multiple species.
  • Ghaleb, E., Rasenberg, M., Pouw, W., Toni, I., Holler, J., Özyürek, A., & Fernandez, R. (2024). Analysing cross-speaker convergence through the lens of automatically detected shared linguistic constructions. In L. K. Samuelson, S. L. Frank, A. Mackey, & E. Hazeltine (Eds.), Proceedings of the 46th Annual Meeting of the Cognitive Science Society (CogSci 2024) (pp. 1717-1723).

    Abstract

    Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions (expressions with a common lexical core used by both speakers within a dialogue) and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the number of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.

  • Ghaleb, E., Khaertdinov, B., Pouw, W., Rasenberg, M., Holler, J., Ozyurek, A., & Fernandez, R. (2024). Learning co-speech gesture representations in dialogue through contrastive learning: An intrinsic evaluation. In Proceedings of the 26th International Conference on Multimodal Interaction (ICMI 2024) (pp. 274-283).

    Abstract

    In face-to-face dialogues, the form-meaning relationship of co-speech gestures varies depending on contextual factors such as what the gestures refer to and the individual characteristics of speakers. These factors make co-speech gesture representation learning challenging. How can we learn meaningful gesture representations considering gestures’ variability and relationship with speech? This paper tackles this challenge by employing self-supervised contrastive learning techniques to learn gesture representations from skeletal and speech information. We propose an approach that includes both unimodal and multimodal pre-training to ground gesture representations in co-occurring speech. For training, we utilize a face-to-face dialogue dataset rich with representational iconic gestures. We conduct thorough intrinsic evaluations of the learned representations through comparison with human-annotated pairwise gesture similarity. Moreover, we perform a diagnostic probing analysis to assess the possibility of recovering interpretable gesture features from the learned representations. Our results show a significant positive correlation with human-annotated gesture similarity and reveal that the similarity between the learned representations is consistent with well-motivated patterns related to the dynamics of dialogue interaction. Moreover, our findings demonstrate that several features concerning the form of gestures can be recovered from the latent representations. Overall, this study shows that multimodal contrastive learning is a promising approach for learning gesture representations, which opens the door to using such representations in larger-scale gesture analysis studies.
  • Ghaleb, E., Burenko, I., Rasenberg, M., Pouw, W., Uhrig, P., Holler, J., Toni, I., Ozyurek, A., & Fernandez, R. (2024). Co-speech gesture detection through multi-phase sequence labeling. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024) (pp. 4007-4015).

    Abstract

    Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework’s capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.
  • Leonetti, S., Ravignani, A., & Pouw, W. (2024). A cross-species framework for classifying sound-movement couplings. Neuroscience and Biobehavioral Reviews, 167: 105911. doi:10.1016/j.neubiorev.2024.105911.

    Abstract

    Sound and movement are entangled in animal communication. This is obviously true in the case of sound-constituting vibratory movements of biological structures which generate acoustic waves. A little less obvious is that other moving structures produce the energy required to sustain these vibrations. In many species, the respiratory system moves to generate the expiratory flow which powers the sound-constituting movements (sound-powering movements). The sound may acquire additional structure via upper tract movements, such as articulatory movements or head raising (sound-filtering movements). Some movements are not necessary for sound production, but when produced, impinge on the sound-producing process due to weak biomechanical coupling with body parts (e.g., respiratory system) that are necessary for sound production (sound-impinging movements). Animals also produce sounds contingent with movement, requiring neuro-physiological control regimes that allow movements to be flexibly coupled to a produced sound or to a perceived external sound (sound-contingent movement). Here, we compare and classify the variety of ways sound and movements are coupled in animal communication; our proposed framework should help structure previous and future studies on this topic.
  • Kamermans, K. L., Pouw, W., Mast, F. W., & Paas, F. (2019). Reinterpretation in visual imagery is possible without visual cues: A validation of previous research. Psychological Research, 83(6), 1237-1250. doi:10.1007/s00426-017-0956-5.

    Abstract

    Is visual reinterpretation of bistable figures (e.g., duck/rabbit figure) in visual imagery possible? Current consensus suggests that it is in principle possible because of converging evidence of quasi-pictorial functioning of visual imagery. Yet, studies that have directly tested and found evidence for reinterpretation in visual imagery allow for the possibility that reinterpretation was already achieved during memorization of the figure(s). One study resolved this issue, providing evidence for reinterpretation in visual imagery (Mast and Kosslyn, Cognition 86:57-70, 2002). However, participants in that study performed reinterpretations with the aid of visual cues. Hence, reinterpretation was not performed with mental imagery alone. Therefore, in this study we assessed the possibility of reinterpretation without visual support. We further explored the possible role of haptic cues to assess the multimodal nature of mental imagery. Fifty-three participants were consecutively presented three to-be-remembered bistable 2-D figures (reinterpretable when rotated 180 degrees), two of which were visually inspected and one was explored haptically. After memorization of the figures, a visually bistable exemplar figure was presented to ensure understanding of the concept of visual bistability. During recall, 11 participants (out of 36; 30.6%) who did not spot bistability during memorization successfully performed reinterpretations when instructed to mentally rotate their visual image, but additional haptic cues during mental imagery did not inflate reinterpretation ability. This study validates previous findings that reinterpretation in visual imagery is possible.
  • Kamermans, K. L., Pouw, W., Fassi, L., Aslanidou, A., Paas, F., & Hostetter, A. B. (2019). The role of gesture as simulated action in reinterpretation of mental imagery. Acta Psychologica, 197, 131-142. doi:10.1016/j.actpsy.2019.05.004.

    Abstract

    In two experiments, we examined the role of gesture in reinterpreting a mental image. In Experiment 1, we found that participants gestured more about a figure they had learned through manual exploration than about a figure they had learned through vision. This supports claims that gestures emerge from the activation of perception-relevant actions during mental imagery. In Experiment 2, we investigated whether such gestures have a causal role in affecting the quality of mental imagery. Participants were randomly assigned to gesture, not gesture, or engage in a manual interference task as they attempted to reinterpret a figure they had learned through manual exploration. We found that manual interference significantly impaired participants' success on the task. Taken together, these results suggest that gestures reflect mental imaginings of interactions with a mental image and that these imaginings are critically important for mental manipulation and reinterpretation of that image. However, our results suggest that enacting the imagined movements in gesture is not critically important on this particular task.
  • Pouw, W., Paxton, A., Harrison, S. J., & Dixon, J. A. (2019). Acoustic specification of upper limb movement in voicing. In A. Grimminger (Ed.), Proceedings of the 6th Gesture and Speech in Interaction – GESPIN 6 (pp. 68-74). Paderborn: Universitaetsbibliothek Paderborn. doi:10.17619/UNIPB/1-812.
  • Pouw, W., & Dixon, J. A. (2019). Entrainment and modulation of gesture-speech synchrony under delayed auditory feedback. Cognitive Science, 43(3): e12721. doi:10.1111/cogs.12721.

    Abstract

    Gesture–speech synchrony re-stabilizes when hand movement or speech is disrupted by a delayed feedback manipulation, suggesting strong bidirectional coupling between gesture and speech. Yet it has also been argued from case studies in perceptual–motor pathology that hand gestures are a special kind of action that does not require closed-loop re-afferent feedback to maintain synchrony with speech. In the current pre-registered within-subject study, we used motion tracking to conceptually replicate McNeill’s (1992) classic study on gesture–speech synchrony under normal and 150 ms delayed auditory feedback of speech conditions (NO DAF vs. DAF). Consistent with, and extending, McNeill’s original results, we obtain evidence that (a) gesture–speech synchrony is more stable under DAF versus NO DAF (i.e., increased coupling effect), (b) gesture and speech variably entrain to the external auditory delay as indicated by a consistent shift in gesture–speech synchrony offsets (i.e., entrainment effect), and (c) the coupling effect and the entrainment effect are codependent. We suggest, therefore, that gesture–speech synchrony provides a way for the cognitive system to stabilize rhythmic activity under interfering conditions.

    Additional information

    https://osf.io/pcde3/
  • Pouw, W., & Dixon, J. A. (2019). Quantifying gesture-speech synchrony. In A. Grimminger (Ed.), Proceedings of the 6th Gesture and Speech in Interaction – GESPIN 6 (pp. 75-80). Paderborn: Universitaetsbibliothek Paderborn. doi:10.17619/UNIPB/1-812.

    Abstract

    Spontaneously occurring speech is often seamlessly accompanied by hand gestures. Detailed observations of video data suggest that speech and gesture are tightly synchronized in time, consistent with a dynamic interplay between body and mind. However, spontaneous gesture-speech synchrony has rarely been objectively quantified beyond analyses of video data, which do not allow for identification of kinematic properties of gestures. Consequently, the point in gesture which is held to couple with speech, the so-called moment of “maximum effort”, has been variably equated with the peak velocity, peak acceleration, peak deceleration, or the onset of the gesture. In the current exploratory report, we provide novel evidence from motion-tracking and acoustic data that peak velocity is closely aligned with, and shortly leads, the peak pitch (F0) of speech.

    Additional information

    https://osf.io/9843h/
  • Pouw, W., Rop, G., De Koning, B., & Paas, F. (2019). The cognitive basis for the split-attention effect. Journal of Experimental Psychology: General, 148(11), 2058-2075. doi:10.1037/xge0000578.

    Abstract

    The split-attention effect entails that learning from spatially separated, but mutually referring information sources (e.g., text and picture), is less effective than learning from the equivalent spatially integrated sources. According to cognitive load theory, impaired learning is caused by the working memory load imposed by the need to distribute attention between the information sources and mentally integrate them. In this study, we directly tested whether the split-attention effect is caused by spatial separation per se. Spatial distance was varied in basic cognitive tasks involving pictures (Experiment 1) and text–picture combinations (Experiment 2; preregistered study), and in more ecologically valid learning materials (Experiment 3). Experiment 1 showed that having to integrate two pictorial stimuli at greater distances diminished performance on a secondary visual working memory task, but did not lead to slower integration. When participants had to integrate a picture and written text in Experiment 2, a greater distance led to slower integration of the stimuli, but not to diminished performance on the secondary task. Experiment 3 showed that presenting spatially separated (compared with integrated) textual and pictorial information yielded fewer integrative eye movements, but this was not further exacerbated when increasing spatial distance even further. This effect on learning processes did not lead to differences in learning outcomes between conditions. In conclusion, we provide evidence that larger distances between spatially separated information sources influence learning processes, but that spatial separation on its own is not likely to be the only, nor a sufficient, condition for impacting learning outcomes.

