Displaying 1 - 15 of 15
-
Bujok, R., Meyer, A. S., & Bosker, H. R. (2025). Audiovisual perception of lexical stress: Beat gestures and articulatory cues. Language and Speech, 68(1), 181-203. doi:10.1177/00238309241258162.
Abstract
Human communication is inherently multimodal. Auditory speech, but also visual cues can be used to understand another talker. Most studies of audiovisual speech perception have focused on the perception of speech segments (i.e., speech sounds). However, less is known about the influence of visual information on the perception of suprasegmental aspects of speech like lexical stress. In two experiments, we investigated the influence of different visual cues (e.g., facial articulatory cues and beat gestures) on the audiovisual perception of lexical stress. We presented auditory lexical stress continua of disyllabic Dutch stress pairs together with videos of a speaker producing stress on the first or second syllable (e.g., articulating VOORnaam or voorNAAM). Moreover, we combined and fully crossed the face of the speaker producing lexical stress on either syllable with a gesturing body producing a beat gesture on either the first or second syllable. Results showed that people successfully used visual articulatory cues to stress in muted videos. However, in audiovisual conditions, we were not able to find an effect of visual articulatory cues. In contrast, we found that the temporal alignment of beat gestures with speech robustly influenced participants' perception of lexical stress. These results highlight the importance of considering suprasegmental aspects of language in multimodal contexts. -
Ye, C., McQueen, J. M., & Bosker, H. R. (2025). Effect of auditory cues to lexical stress on the visual perception of gestural timing. Attention, Perception & Psychophysics. Advance online publication. doi:10.3758/s13414-025-03072-z.
Abstract
Speech is often accompanied by gestures. Since beat gestures—simple nonreferential up-and-down hand movements—frequently co-occur with prosodic prominence, they can indicate stress in a word and hence influence spoken-word recognition. However, little is known about the reverse influence of auditory speech on visual perception. The current study investigated whether lexical stress has an effect on the perceived timing of hand beats. We used videos in which a disyllabic word, embedded in a carrier sentence (Experiment 1) or in isolation (Experiment 2), was coupled with an up-and-down hand beat, while varying their degrees of asynchrony. Results from Experiment 1, a novel beat timing estimation task, revealed that gestures were estimated to occur closer in time to the pitch peak in a stressed syllable than their actual timing, hence reducing the perceived temporal distance between gestures and stress by around 60%. Using a forced-choice task, Experiment 2 further demonstrated that listeners tended to perceive a gesture, falling midway between two syllables, on the syllable receiving stronger cues to stress than the other, and this auditory effect was greater when gestural timing was most ambiguous. Our findings suggest that f0 and intensity are the driving force behind the temporal attraction effect of stress on perceived gestural timing. This study provides new evidence for auditory influences on visual perception, supporting bidirectionality in audiovisual interaction between speech-related signals that occur in everyday face-to-face communication. -
Rohrer, P. L., Bujok, R., Van Maastricht, L., & Bosker, H. R. (2025). From “I dance” to “she danced” with a flick of the hands: Audiovisual stress perception in Spanish. Psychonomic Bulletin & Review. Advance online publication. doi:10.3758/s13423-025-02683-9.
Abstract
When talking, speakers naturally produce hand movements (co-speech gestures) that contribute to communication. Evidence in Dutch suggests that the timing of simple up-and-down, non-referential “beat” gestures influences spoken word recognition: the same auditory stimulus was perceived as CONtent (noun, capitalized letters indicate stressed syllables) when a beat gesture occurred on the first syllable, but as conTENT (adjective) when the gesture occurred on the second syllable. However, these findings were based on a small number of minimal pairs in Dutch, limiting the generalizability of the findings. We therefore tested this effect in Spanish, where lexical stress is highly relevant in the verb conjugation system, distinguishing bailo, “I dance” with word-initial stress from bailó, “she danced” with word-final stress. Testing a larger sample (N = 100), we also assessed whether individual differences in working memory capacity modulated how much individuals relied on the gestures in spoken word recognition. The results showed that, similar to Dutch, Spanish participants were biased to perceive lexical stress on the syllable that visually co-occurred with a beat gesture, with the effect being strongest when the acoustic stress cues were most ambiguous. No evidence was found for by-participant effect sizes to be influenced by individual differences in phonological or visuospatial working memory. These findings reveal gestural-speech coordination impacts lexical stress perception in a language where listeners are regularly confronted with such lexical stress contrasts, highlighting the impact of gestures’ timing on prominence perception and spoken word recognition. -
Bosker, H. R. (2021). Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies. Behavior Research Methods, 53(5), 1945-1953. doi:10.3758/s13428-021-01542-4.
Abstract
Many studies of speech perception assess the intelligibility of spoken sentence stimuli by means
of transcription tasks (‘type out what you hear’). The intelligibility of a given stimulus is then often
expressed in terms of percentage of words correctly reported from the target sentence. Yet scoring
the participants’ raw responses for words correctly identified from the target sentence is a time-
consuming task, and hence resource-intensive. Moreover, there is no consensus among speech
scientists about what specific protocol to use for the human scoring, limiting the reliability of
human scores. The present paper evaluates various forms of fuzzy string matching between
participants’ responses and target sentences, as automated metrics of listener transcript accuracy.
We demonstrate that one particular metric, the Token Sort Ratio, is a consistent, highly efficient,
and accurate metric for automated assessment of listener transcripts, as evidenced by high
correlations with human-generated scores (best correlation: r = 0.940) and a strong relationship to
acoustic markers of speech intelligibility. Thus, fuzzy string matching provides a practical tool for
assessment of listener transcript accuracy in large-scale speech intelligibility studies. See
https://tokensortratio.netlify.app for an online implementation. -
Bosker, H. R., Badaya, E., & Corley, M. (2021). Discourse markers activate their, like, cohort competitors. Discourse Processes, 58(9), 837-851. doi:10.1080/0163853X.2021.1924000.
Abstract
Speech in everyday conversations is riddled with discourse markers (DMs), such as well, you know, and like. However, in many lab-based studies of speech comprehension, such DMs are typically absent from the carefully articulated and highly controlled speech stimuli. As such, little is known about how these DMs influence online word recognition. The present study specifically investigated the online processing of DM like and how it influences the activation of words in the mental lexicon. We specifically targeted the cohort competitor (CC) effect in the Visual World Paradigm: Upon hearing spoken instructions to “pick up the beaker,” human listeners also typically fixate—next to the target object—referents that overlap phonologically with the target word (cohort competitors such as beetle; CCs). However, several studies have argued that CC effects are constrained by syntactic, semantic, pragmatic, and discourse constraints. Therefore, the present study investigated whether DM like influences online word recognition by activating its cohort competitors (e.g., lightbulb). In an eye-tracking experiment using the Visual World Paradigm, we demonstrate that when participants heard spoken instructions such as “Now press the button for the, like … unicycle,” they showed anticipatory looks to the CC referent (lightbulb)well before hearing the target. This CC effect was sustained for a relatively long period of time, even despite hearing disambiguating information (i.e., the /k/ in like). Analysis of the reaction times also showed that participants were significantly faster to select CC targets (lightbulb) when preceded by DM like. These findings suggest that seemingly trivial DMs, such as like, activate their CCs, impacting online word recognition. Thus, we advocate a more holistic perspective on spoken language comprehension in naturalistic communication, including the processing of DMs. -
Bosker, H. R., & Peeters, D. (2021). Beat gestures influence which speech sounds you hear. Proceedings of the Royal Society B: Biological Sciences, 288: 20202419. doi:10.1098/rspb.2020.2419.
Abstract
Beat gestures—spontaneously produced biphasic movements of the hand—
are among the most frequently encountered co-speech gestures in human
communication. They are closely temporally aligned to the prosodic charac-
teristics of the speech signal, typically occurring on lexically stressed
syllables. Despite their prevalence across speakers of the world’s languages,
how beat gestures impact spoken word recognition is unclear. Can these
simple ‘flicks of the hand’ influence speech perception? Across a range
of experiments, we demonstrate that beat gestures influence the explicit
and implicit perception of lexical stress (e.g. distinguishing OBject from
obJECT), and in turn can influence what vowels listeners hear. Thus, we pro-
vide converging evidence for a manual McGurk effect: relatively simple and
widely occurring hand movements influence which speech sounds we hearAdditional information
example stimuli and experimental data -
Bosker, H. R. (2021). The contribution of amplitude modulations in speech to perceived charisma. In B. Weiss, J. Trouvain, M. Barkat-Defradas, & J. J. Ohala (
Eds. ), Voice attractiveness: Prosody, phonology and phonetics (pp. 165-181). Singapore: Springer. doi:10.1007/978-981-15-6627-1_10.Abstract
Speech contains pronounced amplitude modulations in the 1–9 Hz range, correlating with the syllabic rate of speech. Recent models of speech perception propose that this rhythmic nature of speech is central to speech recognition and has beneficial effects on language processing. Here, we investigated the contribution of amplitude modulations to the subjective impression listeners have of public speakers. The speech from US presidential candidates Hillary Clinton and Donald Trump in the three TV debates of 2016 was acoustically analyzed by means of modulation spectra. These indicated that Clinton’s speech had more pronounced amplitude modulations than Trump’s speech, particularly in the 1–9 Hz range. A subsequent perception experiment, with listeners rating the perceived charisma of (low-pass filtered versions of) Clinton’s and Trump’s speech, showed that more pronounced amplitude modulations (i.e., more ‘rhythmic’ speech) increased perceived charisma ratings. These outcomes highlight the important contribution of speech rhythm to charisma perception. -
Rodd, J., Decuyper, C., Bosker, H. R., & Ten Bosch, L. (2021). A tool for efficient and accurate segmentation of speech data: Announcing POnSS. Behavior Research Methods, 53, 744-756. doi:10.3758/s13428-020-01449-6.
Abstract
Despite advances in automatic speech recognition (ASR), human input is still essential to produce research-grade segmentations of speech data. Con- ventional approaches to manual segmentation are very labour-intensive. We introduce POnSS, a browser-based system that is specialized for the task of segmenting the onsets and offsets of words, that combines aspects of ASR with limited human input. In developing POnSS, we identified several sub- tasks of segmentation, and implemented each of these as separate interfaces for the annotators to interact with, to streamline their task as much as possible. We evaluated segmentations made with POnSS against a base- line of segmentations of the same data made conventionally in Praat. We observed that POnSS achieved comparable reliability to segmentation us- ing Praat, but required 23% less annotator time investment. Because of its greater efficiency without sacrificing reliability, POnSS represents a distinct methodological advance for the segmentation of speech data. -
Severijnen, G. G. A., Bosker, H. R., Piai, V., & McQueen, J. M. (2021). Listeners track talker-specific prosody to deal with talker-variability. Brain Research, 1769: 147605. doi:10.1016/j.brainres.2021.147605.
Abstract
One of the challenges in speech perception is that listeners must deal with considerable
segmental and suprasegmental variability in the acoustic signal due to differences between talkers. Most previous studies have focused on how listeners deal with segmental variability.
In this EEG experiment, we investigated whether listeners track talker-specific usage of suprasegmental cues to lexical stress to recognize spoken words correctly. In a three-day training phase, Dutch participants learned to map non-word minimal stress pairs onto different object referents (e.g., USklot meant “lamp”; usKLOT meant “train”). These non-words were
produced by two male talkers. Critically, each talker used only one suprasegmental cue to signal stress (e.g., Talker A used only F0 and Talker B only intensity). We expected participants to learn which talker used which cue to signal stress. In the test phase, participants indicated whether spoken sentences including these non-words were correct (“The word for lamp is…”).
We found that participants were slower to indicate that a stimulus was correct if the non-word was produced with the unexpected cue (e.g., Talker A using intensity). That is, if in training Talker A used F0 to signal stress, participants experienced a mismatch between predicted and perceived phonological word-forms if, at test, Talker A unexpectedly used intensity to cue
stress. In contrast, the N200 amplitude, an event-related potential related to phonological
prediction, was not modulated by the cue mismatch. Theoretical implications of these
contrasting results are discussed. The behavioral findings illustrate talker-specific prediction of prosodic cues, picked up through perceptual learning during training. -
Bosker, H. R., & Ghitza, O. (2018). Entrained theta oscillations guide perception of subsequent speech: Behavioral evidence from rate normalization. Language, Cognition and Neuroscience, 33(8), 955-967. doi:10.1080/23273798.2018.1439179.
Abstract
This psychoacoustic study provides behavioral evidence that neural entrainment in the theta range (3-9 Hz) causally shapes speech perception. Adopting the ‘rate normalization’ paradigm (presenting compressed carrier sentences followed by uncompressed target words), we show that uniform compression of a speech carrier to syllable rates inside the theta range influences perception of subsequent uncompressed targets, but compression outside theta range does not. However, the influence of carriers – compressed outside theta range – on target perception is salvaged when carriers are ‘repackaged’ to have a packet rate inside theta. This suggests that the brain can only successfully entrain to syllable/packet rates within theta range, with a causal influence on the perception of subsequent speech, in line with recent neuroimaging data. Thus, this study points to a central role for sustained theta entrainment in rate normalization and contributes to our understanding of the functional role of brain oscillations in speech perception. -
Bosker, H. R. (2018). Putting Laurel and Yanny in context. The Journal of the Acoustical Society of America, 144(6), EL503-EL508. doi:10.1121/1.5070144.
Abstract
Recently, the world’s attention was caught by an audio clip that was perceived as “Laurel” or “Yanny”. Opinions were sharply split: many could not believe others heard something different from their perception. However, a crowd-source experiment with >500 participants shows that it is possible to make people hear Laurel, where they previously heard Yanny, by manipulating preceding acoustic context. This study is not only the first to reveal within-listener variation in Laurel/Yanny percepts, but also to demonstrate contrast effects for global spectral information in larger frequency regions. Thus, it highlights the intricacies of human perception underlying these social media phenomena. -
Bosker, H. R., & Cooke, M. (2018). Talkers produce more pronounced amplitude modulations when speaking in noise. The Journal of the Acoustical Society of America, 143(2), EL121-EL126. doi:10.1121/1.5024404.
Abstract
Speakers adjust their voice when talking in noise (known as Lombard speech), facilitating speech comprehension. Recent neurobiological models of speech perception emphasize the role of amplitude modulations in speech-in-noise comprehension, helping neural oscillators to ‘track’ the attended speech. This study tested whether talkers produce more pronounced amplitude modulations in noise. Across four different corpora, modulation spectra showed greater power in amplitude modulations below 4 Hz in Lombard speech compared to matching plain speech. This suggests that noise-induced speech contains more pronounced amplitude modulations, potentially helping the listening brain to entrain to the attended talker, aiding comprehension. -
Kösem, A., Bosker, H. R., Takashima, A., Meyer, A. S., Jensen, O., & Hagoort, P. (2018). Neural entrainment determines the words we hear. Current Biology, 28, 2867-2875. doi:10.1016/j.cub.2018.07.023.
Abstract
Low-frequency neural entrainment to rhythmic input
has been hypothesized as a canonical mechanism
that shapes sensory perception in time. Neural
entrainment is deemed particularly relevant for
speech analysis, as it would contribute to the extraction
of discrete linguistic elements from continuous
acoustic signals. However, its causal influence in
speech perception has been difficult to establish.
Here, we provide evidence that oscillations build temporal
predictions about the duration of speech tokens
that affect perception. Using magnetoencephalography
(MEG), we studied neural dynamics during
listening to sentences that changed in speech rate.
Weobserved neural entrainment to preceding speech
rhythms persisting for several cycles after the change
in rate. The sustained entrainment was associated
with changes in the perceived duration of the last
word’s vowel, resulting in the perception of words
with different meanings. These findings support oscillatory
models of speech processing, suggesting that
neural oscillations actively shape speech perception. -
Maslowski, M., Meyer, A. S., & Bosker, H. R. (2018). Listening to yourself is special: Evidence from global speech rate tracking. PLoS One, 13(9): e0203571. doi:10.1371/journal.pone.0203571.
Abstract
Listeners are known to use adjacent contextual speech rate in processing temporally ambiguous speech sounds. For instance, an ambiguous vowel between short /A/ and long /a:/ in Dutch sounds relatively long (i.e., as /a:/) embedded in a fast precursor sentence, but short in a slow sentence. Besides the local speech rate, listeners also track talker-specific global speech rates. However, it is yet unclear whether other talkers' global rates are encoded with reference to a listener's self-produced rate. Three experiments addressed this question. In Experiment 1, one group of participants was instructed to speak fast, whereas another group had to speak slowly. The groups were compared on their perception of ambiguous /A/-/a:/ vowels embedded in neutral rate speech from another talker. In Experiment 2, the same participants listened to playback of their own speech and again evaluated target vowels in neutral rate speech. Neither of these experiments provided support for the involvement of self-produced speech in perception of another talker's speech rate. Experiment 3 repeated Experiment 2 but with a new participant sample that was unfamiliar with the participants from Experiment 2. This experiment revealed fewer /a:/ responses in neutral speech in the group also listening to a fast rate, suggesting that neutral speech sounds slow in the presence of a fast talker and vice versa. Taken together, the findings show that self-produced speech is processed differently from speech produced by others. They carry implications for our understanding of the perceptual and cognitive mechanisms involved in rate-dependent speech perception in dialogue settings. -
Van Bergen, G., & Bosker, H. R. (2018). Linguistic expectation management in online discourse processing: An investigation of Dutch inderdaad 'indeed' and eigenlijk 'actually'. Journal of Memory and Language, 103, 191-209. doi:10.1016/j.jml.2018.08.004.
Abstract
Interpersonal discourse particles (DPs), such as Dutch inderdaad (≈‘indeed’) and eigenlijk (≈‘actually’) are highly frequent in everyday conversational interaction. Despite extensive theoretical descriptions of their polyfunctionality, little is known about how they are used by language comprehenders. In two visual world eye-tracking experiments involving an online dialogue completion task, we asked to what extent inderdaad, confirming an inferred expectation, and eigenlijk, contrasting with an inferred expectation, influence real-time understanding of dialogues. Answers in the dialogues contained a DP or a control adverb, and a critical discourse referent was replaced by a beep; participants chose the most likely dialogue completion by clicking on one of four referents in a display. Results show that listeners make rapid and fine-grained situation-specific inferences about the use of DPs, modulating their expectations about how the dialogue will unfold. Findings further specify and constrain theories about the conversation-managing function and polyfunctionality of DPs.
Share this page