Publications

  • Cos, F., Bujok, R., & Bosker, H. R. (2024). Test-retest reliability of audiovisual lexical stress perception after >1.5 years. In Y. Chen, A. Chen, & A. Arvaniti (Eds.), Proceedings of Speech Prosody 2024 (pp. 871-875). doi:10.21437/SpeechProsody.2024-176.

    Abstract

    In natural communication, we typically both see and hear our conversation partner. Speech comprehension thus requires the integration of auditory and visual information from the speech signal. This is for instance evidenced by the Manual McGurk effect, where the perception of lexical stress is biased towards the syllable that has a beat gesture aligned to it. However, there is considerable individual variation in how heavily gestural timing is weighed as a cue to stress. To assess within-individual consistency, this study investigated the test-retest reliability of the Manual McGurk effect. We reran an earlier Manual McGurk experiment with the same participants, over 1.5 years later. At the group level, we successfully replicated the Manual McGurk effect with a similar effect size. However, a correlation of the by-participant effect sizes in the two identical experiments indicated that there was only a weak correlation between both tests, suggesting that the weighing of gestural information in the perception of lexical stress is stable at the group level, but less so in individuals. Findings are discussed in comparison to other measures of audiovisual integration in speech perception.
    Index Terms: Audiovisual integration, beat gestures, lexical stress, test-retest reliability
  • Matteo, M., & Bosker, H. R. (2024). How to test gesture-speech integration in ten minutes. In Y. Chen, A. Chen, & A. Arvaniti (Eds.), Proceedings of Speech Prosody 2024 (pp. 737-741). doi:10.21437/SpeechProsody.2024-149.

    Abstract

    Human conversations are inherently multimodal, including auditory speech, visual articulatory cues, and hand gestures. Recent studies demonstrated that the timing of a simple up-and-down hand movement, known as a beat gesture, can affect speech perception. A beat gesture falling on the first syllable of a disyllabic word induces a bias to perceive a strong-weak stress pattern (i.e., “CONtent”), while a beat gesture falling on the second syllable combined with the same acoustics biases towards a weak-strong stress pattern (“conTENT”). This effect, termed the “manual McGurk effect”, has been studied in both in-lab and online studies, employing standard experimental sessions lasting approximately forty minutes. The present work tests whether the manual McGurk effect can be observed in an online short version (“mini-test”) of the original paradigm, lasting only ten minutes. Additionally, we employ two different response modalities, namely a two-alternative forced choice and a visual analog scale. A significant manual McGurk effect was observed with both response modalities. Overall, the present study demonstrates the feasibility of employing a ten-minute manual McGurk mini-test to obtain a measure of gesture-speech integration. As such, it may lend itself for inclusion in large-scale test batteries that aim to quantify individual variation in language processing.
  • Motiekaitytė, K., Grosseck, O., Wolf, L., Bosker, H. R., Peeters, D., Perlman, M., Ortega, G., & Raviv, L. (2024). Iconicity and compositionality in emerging vocal communication systems: a Virtual Reality approach. In J. Nölle, L. Raviv, K. E. Graham, S. Hartmann, Y. Jadoul, M. Josserand, T. Matzinger, K. Mudd, M. Pleyer, A. Slonimska, & S. Wacewicz (Eds.), The Evolution of Language: Proceedings of the 15th International Conference (EVOLANG XV) (pp. 387-389). Nijmegen: The Evolution of Language Conferences.
  • Papoutsi*, C., Zimianiti*, E., Bosker, H. R., & Frost, R. L. A. (2024). Statistical learning at a virtual cocktail party. Psychonomic Bulletin & Review, 31, 849-861. doi:10.3758/s13423-023-02384-1.

    Abstract

    * These two authors contributed equally to this study
    Statistical learning – the ability to extract distributional regularities from input – is suggested to be key to language acquisition. Yet, evidence for the human capacity for statistical learning comes mainly from studies conducted in carefully controlled settings without auditory distraction. While such conditions permit careful examination of learning, they do not reflect the naturalistic language learning experience, which is replete with auditory distraction – including competing talkers. Here, we examine how statistical language learning proceeds in a virtual cocktail party environment, where the to-be-learned input is presented alongside a competing speech stream with its own distributional regularities. During exposure, participants in the Dual Talker group concurrently heard two novel languages, one produced by a female talker and one by a male talker, with each talker virtually positioned at opposite sides of the listener (left/right) using binaural acoustic manipulations. Selective attention was manipulated by instructing participants to attend to only one of the two talkers. At test, participants were asked to distinguish words from part-words for both the attended and the unattended languages. Results indicated that participants’ accuracy was significantly higher for trials from the attended vs. unattended language. Further, the performance of this Dual Talker group was no different compared to a control group who heard only one language from a single talker (Single Talker group). We thus conclude that statistical learning is modulated by selective attention, being relatively robust against the additional cognitive load provided by competing speech, emphasizing its efficiency in naturalistic language learning situations.

    Additional information

    supplementary file
  • Rohrer, P. L., Bujok, R., Van Maastricht, L., & Bosker, H. R. (2024). The timing of beat gestures affects lexical stress perception in Spanish. In Y. Chen, A. Chen, & A. Arvaniti (Eds.), Proceedings of Speech Prosody 2024 (pp. 702-706). doi:10.21437/SpeechProsody.2024-142.

    Abstract

    It has been shown that when speakers produce hand gestures, addressees are attentive towards these gestures, using them to facilitate speech processing. Even relatively simple “beat” gestures are taken into account to help process aspects of speech such as prosodic prominence. In fact, recent evidence suggests that the timing of a beat gesture can influence spoken word recognition. In what has been termed the manual McGurk effect, Dutch participants presented with lexical stress minimal pair continua in Dutch were biased to hear lexical stress on the syllable that coincided with a beat gesture. However, little is known about how this manual McGurk effect would surface in languages other than Dutch, with different acoustic cues to prominence and variable gestures. Therefore, this study tests the effect in Spanish, where lexical stress is arguably even more important, being a contrastive cue in the regular verb conjugation system. Results from 24 participants corroborate the effect in Spanish: when given the same auditory stimulus, participants were biased to perceive lexical stress on the syllable that visually co-occurred with a beat gesture. These findings extend the manual McGurk effect to a different language, emphasizing the impact of gestures’ timing on prosody perception and spoken word recognition.
  • Rohrer, P. L., Hong, Y., & Bosker, H. R. (2024). Gestures time to vowel onset and change the acoustics of the word in Mandarin. In Y. Chen, A. Chen, & A. Arvaniti (Eds.), Proceedings of Speech Prosody 2024 (pp. 866-870). doi:10.21437/SpeechProsody.2024-175.

    Abstract

    Recent research on multimodal language production has revealed that prominence in speech and gesture go hand-in-hand. Specifically, peaks in gesture (i.e., the apex) seem to closely coordinate with peaks in fundamental frequency (F0). The nature of this relationship may also be bi-directional, as it has also been shown that the production of gesture directly affects speech acoustics. However, most studies on the topic have largely focused on stress-based languages, where fundamental frequency has a prominence-lending function. Less work has been carried out on lexical tone languages such as Mandarin, where F0 is lexically distinctive. In this study, four native Mandarin speakers were asked to produce single monosyllabic CV words, taken from minimal lexical tone triplets (e.g., /pi1/, /pi2/, /pi3/), either with or without a beat gesture. Our analyses of the timing of the gestures showed that the gesture apex most stably occurred near vowel onset, with consonantal duration being the strongest predictor of apex placement. Acoustic analyses revealed that words produced with gesture showed raised F0 contours, greater intensity, and shorter durations. These findings further our understanding of gesture-speech alignment in typologically diverse languages, and add to the discussion about multimodal prominence.
  • Severijnen, G. G. A., Bosker, H. R., & McQueen, J. M. (2024). Your “VOORnaam” is not my “VOORnaam”: An acoustic analysis of individual talker differences in word stress in Dutch. Journal of Phonetics, 103: 101296. doi:10.1016/j.wocn.2024.101296.

    Abstract

    Different talkers speak differently, even within the same homogeneous group. These differences lead to acoustic variability in speech, causing challenges for correct perception of the intended message. Because previous descriptions of this acoustic variability have focused mostly on segments, talker variability in prosodic structures is not yet well documented. The present study therefore examined acoustic between-talker variability in word stress in Dutch. We recorded 40 native Dutch talkers from a participant sample with minimal dialectal variation and balanced gender, producing segmentally overlapping words (e.g., VOORnaam vs. voorNAAM; ‘first name’ vs. ‘respectable’, capitalization indicates lexical stress), and measured different acoustic cues to stress. Each individual participant’s acoustic measurements were analyzed using Linear Discriminant Analyses, which provide coefficients for each cue, reflecting the strength of each cue in a talker’s productions. On average, talkers primarily used mean F0, intensity, and duration. Moreover, each participant also employed a unique combination of cues, illustrating large prosodic variability between talkers. In fact, classes of cue-weighting tendencies emerged, differing in which cue was used as the main cue. These results offer the most comprehensive acoustic description, to date, of word stress in Dutch, and illustrate that large prosodic variability is present between individual talkers.
  • Uluşahin, O., Bosker, H. R., McQueen, J. M., & Meyer, A. S. (2024). Knowledge of a talker’s f0 affects subsequent perception of voiceless fricatives. In Y. Chen, A. Chen, & A. Arvaniti (Eds.), Proceedings of Speech Prosody 2024 (pp. 432-436).

    Abstract

    The human brain deals with the infinite variability of speech through multiple mechanisms. Some of them rely solely on information in the speech input (i.e., signal-driven) whereas some rely on linguistic or real-world knowledge (i.e., knowledge-driven). Many signal-driven perceptual processes rely on the enhancement of acoustic differences between incoming speech sounds, producing contrastive adjustments. For instance, when an ambiguous voiceless fricative is preceded by a high fundamental frequency (f0) sentence, the fricative is perceived as having a lower spectral center of gravity (CoG). However, it is not clear whether knowledge of a talker’s typical f0 can lead to similar contrastive effects. This study investigated a possible talker f0 effect on fricative CoG perception. In the exposure phase, two groups of participants (N=16 each) heard the same talker at high or low f0 for 20 minutes. Later, in the test phase, participants rated fixed-f0 /?ɔk/ tokens as being /sɔk/ (i.e., high CoG) or /ʃɔk/ (i.e., low CoG), where /?/ represents a fricative from a 5-step /s/-/ʃ/ continuum. Surprisingly, the data revealed the opposite of our contrastive hypothesis, whereby hearing high f0 instead biased perception towards high CoG. Thus, we demonstrated that talker f0 information affects fricative CoG perception.
  • Bosker, H. R. (2021). Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies. Behavior Research Methods, 53(5), 1945-1953. doi:10.3758/s13428-021-01542-4.

    Abstract

    Many studies of speech perception assess the intelligibility of spoken sentence stimuli by means of transcription tasks (‘type out what you hear’). The intelligibility of a given stimulus is then often expressed in terms of percentage of words correctly reported from the target sentence. Yet scoring the participants’ raw responses for words correctly identified from the target sentence is a time-consuming task, and hence resource-intensive. Moreover, there is no consensus among speech scientists about what specific protocol to use for the human scoring, limiting the reliability of human scores. The present paper evaluates various forms of fuzzy string matching between participants’ responses and target sentences, as automated metrics of listener transcript accuracy. We demonstrate that one particular metric, the Token Sort Ratio, is a consistent, highly efficient, and accurate metric for automated assessment of listener transcripts, as evidenced by high correlations with human-generated scores (best correlation: r = 0.940) and a strong relationship to acoustic markers of speech intelligibility. Thus, fuzzy string matching provides a practical tool for assessment of listener transcript accuracy in large-scale speech intelligibility studies. See https://tokensortratio.netlify.app for an online implementation.
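    As an illustrative sketch (not the authors' own code), the idea behind the Token Sort Ratio can be approximated in a few lines of Python using only the standard library: both strings are lowercased, tokenized, and alphabetically sorted before comparison, so differences in word order are ignored. Dedicated fuzzy-matching libraries compute the underlying similarity with Levenshtein distance rather than difflib's matcher, so scores may not match the paper's exactly.

```python
# Sketch of a Token Sort Ratio, assuming a 0-100 similarity scale
# as used in common fuzzy string matching libraries.
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two transcripts,
    ignoring word order (tokens are sorted before comparison)."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return 100 * SequenceMatcher(None, norm(a), norm(b)).ratio()

target = "the boy ate a big apple"
response = "a boy ate the big apple"   # words reordered, content intact
print(round(token_sort_ratio(target, response), 1))  # → 100.0
```

    Because reordered-but-correct responses score 100, the metric is forgiving of the word-order variations that make strict word-by-word scoring of listener transcripts unreliable.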
  • Bosker, H. R., Badaya, E., & Corley, M. (2021). Discourse markers activate their, like, cohort competitors. Discourse Processes, 58(9), 837-851. doi:10.1080/0163853X.2021.1924000.

    Abstract

    Speech in everyday conversations is riddled with discourse markers (DMs), such as well, you know, and like. However, in many lab-based studies of speech comprehension, such DMs are typically absent from the carefully articulated and highly controlled speech stimuli. As such, little is known about how these DMs influence online word recognition. The present study specifically investigated the online processing of DM like and how it influences the activation of words in the mental lexicon. We specifically targeted the cohort competitor (CC) effect in the Visual World Paradigm: Upon hearing spoken instructions to “pick up the beaker,” human listeners also typically fixate—next to the target object—referents that overlap phonologically with the target word (cohort competitors such as beetle; CCs). However, several studies have argued that CC effects are constrained by syntactic, semantic, pragmatic, and discourse constraints. Therefore, the present study investigated whether DM like influences online word recognition by activating its cohort competitors (e.g., lightbulb). In an eye-tracking experiment using the Visual World Paradigm, we demonstrate that when participants heard spoken instructions such as “Now press the button for the, like … unicycle,” they showed anticipatory looks to the CC referent (lightbulb) well before hearing the target. This CC effect was sustained for a relatively long period of time, even after hearing disambiguating information (i.e., the /k/ in like). Analysis of the reaction times also showed that participants were significantly faster to select CC targets (lightbulb) when preceded by DM like. These findings suggest that seemingly trivial DMs, such as like, activate their CCs, impacting online word recognition. Thus, we advocate a more holistic perspective on spoken language comprehension in naturalistic communication, including the processing of DMs.
  • Bosker, H. R., & Peeters, D. (2021). Beat gestures influence which speech sounds you hear. Proceedings of the Royal Society B: Biological Sciences, 288: 20202419. doi:10.1098/rspb.2020.2419.

    Abstract

    Beat gestures—spontaneously produced biphasic movements of the hand—are among the most frequently encountered co-speech gestures in human communication. They are closely temporally aligned to the prosodic characteristics of the speech signal, typically occurring on lexically stressed syllables. Despite their prevalence across speakers of the world’s languages, how beat gestures impact spoken word recognition is unclear. Can these simple ‘flicks of the hand’ influence speech perception? Across a range of experiments, we demonstrate that beat gestures influence the explicit and implicit perception of lexical stress (e.g. distinguishing OBject from obJECT), and in turn can influence what vowels listeners hear. Thus, we provide converging evidence for a manual McGurk effect: relatively simple and widely occurring hand movements influence which speech sounds we hear.

    Additional information

    example stimuli and experimental data
  • Bosker, H. R. (2021). The contribution of amplitude modulations in speech to perceived charisma. In B. Weiss, J. Trouvain, M. Barkat-Defradas, & J. J. Ohala (Eds.), Voice attractiveness: Prosody, phonology and phonetics (pp. 165-181). Singapore: Springer. doi:10.1007/978-981-15-6627-1_10.

    Abstract

    Speech contains pronounced amplitude modulations in the 1–9 Hz range, correlating with the syllabic rate of speech. Recent models of speech perception propose that this rhythmic nature of speech is central to speech recognition and has beneficial effects on language processing. Here, we investigated the contribution of amplitude modulations to the subjective impression listeners have of public speakers. The speech from US presidential candidates Hillary Clinton and Donald Trump in the three TV debates of 2016 was acoustically analyzed by means of modulation spectra. These indicated that Clinton’s speech had more pronounced amplitude modulations than Trump’s speech, particularly in the 1–9 Hz range. A subsequent perception experiment, with listeners rating the perceived charisma of (low-pass filtered versions of) Clinton’s and Trump’s speech, showed that more pronounced amplitude modulations (i.e., more ‘rhythmic’ speech) increased perceived charisma ratings. These outcomes highlight the important contribution of speech rhythm to charisma perception.
  • Rodd, J., Decuyper, C., Bosker, H. R., & Ten Bosch, L. (2021). A tool for efficient and accurate segmentation of speech data: Announcing POnSS. Behavior Research Methods, 53, 744-756. doi:10.3758/s13428-020-01449-6.

    Abstract

    Despite advances in automatic speech recognition (ASR), human input is still essential to produce research-grade segmentations of speech data. Conventional approaches to manual segmentation are very labour-intensive. We introduce POnSS, a browser-based system that is specialized for the task of segmenting the onsets and offsets of words, and that combines aspects of ASR with limited human input. In developing POnSS, we identified several subtasks of segmentation, and implemented each of these as separate interfaces for the annotators to interact with, to streamline their task as much as possible. We evaluated segmentations made with POnSS against a baseline of segmentations of the same data made conventionally in Praat. We observed that POnSS achieved comparable reliability to segmentation using Praat, but required 23% less annotator time investment. Because of its greater efficiency without sacrificing reliability, POnSS represents a distinct methodological advance for the segmentation of speech data.
  • Severijnen, G. G. A., Bosker, H. R., Piai, V., & McQueen, J. M. (2021). Listeners track talker-specific prosody to deal with talker-variability. Brain Research, 1769: 147605. doi:10.1016/j.brainres.2021.147605.

    Abstract

    One of the challenges in speech perception is that listeners must deal with considerable segmental and suprasegmental variability in the acoustic signal due to differences between talkers. Most previous studies have focused on how listeners deal with segmental variability. In this EEG experiment, we investigated whether listeners track talker-specific usage of suprasegmental cues to lexical stress to recognize spoken words correctly. In a three-day training phase, Dutch participants learned to map non-word minimal stress pairs onto different object referents (e.g., USklot meant “lamp”; usKLOT meant “train”). These non-words were produced by two male talkers. Critically, each talker used only one suprasegmental cue to signal stress (e.g., Talker A used only F0 and Talker B only intensity). We expected participants to learn which talker used which cue to signal stress. In the test phase, participants indicated whether spoken sentences including these non-words were correct (“The word for lamp is…”). We found that participants were slower to indicate that a stimulus was correct if the non-word was produced with the unexpected cue (e.g., Talker A using intensity). That is, if in training Talker A used F0 to signal stress, participants experienced a mismatch between predicted and perceived phonological word-forms if, at test, Talker A unexpectedly used intensity to cue stress. In contrast, the N200 amplitude, an event-related potential related to phonological prediction, was not modulated by the cue mismatch. Theoretical implications of these contrasting results are discussed. The behavioral findings illustrate talker-specific prediction of prosodic cues, picked up through perceptual learning during training.
  • Bosker, H. R., Van Os, M., Does, R., & Van Bergen, G. (2019). Counting 'uhm's: how tracking the distribution of native and non-native disfluencies influences online language comprehension. Journal of Memory and Language, 106, 189-202. doi:10.1016/j.jml.2019.02.006.

    Abstract

    Disfluencies, like 'uh', have been shown to help listeners anticipate reference to low-frequency words. The associative account of this 'disfluency bias' proposes that listeners learn to associate disfluency with low-frequency referents based on prior exposure to non-arbitrary disfluency distributions (i.e., greater probability of low-frequency words after disfluencies). However, there is limited evidence for listeners actually tracking disfluency distributions online. The present experiments are the first to show that adult listeners, exposed to a typical or more atypical disfluency distribution (i.e., hearing a talker unexpectedly say uh before high-frequency words), flexibly adjust their predictive strategies to the disfluency distribution at hand (e.g., learn to predict high-frequency referents after disfluency). However, when listeners were presented with the same atypical disfluency distribution but produced by a non-native speaker, no adjustment was observed. This suggests pragmatic inferences can modulate distributional learning, revealing the flexibility of, and constraints on, distributional learning in incremental language comprehension.
  • Maslowski, M., Meyer, A. S., & Bosker, H. R. (2019). How the tracking of habitual rate influences speech perception. Journal of Experimental Psychology: Learning, Memory, and Cognition, 45(1), 128-138. doi:10.1037/xlm0000579.

    Abstract

    Listeners are known to track statistical regularities in speech. Yet, which temporal cues are encoded is unclear. This study tested effects of talker-specific habitual speech rate and talker-independent average speech rate (heard over a longer period of time) on the perception of the temporal Dutch vowel contrast /A/-/a:/. First, Experiment 1 replicated that slow local (surrounding) speech contexts induce fewer long /a:/ responses than faster contexts. Experiment 2 tested effects of long-term habitual speech rate. One high-rate group listened to ambiguous vowels embedded in ‘neutral’ speech from talker A, intermixed with speech from fast talker B. Another low-rate group listened to the same ‘neutral’ speech from talker A, but to talker B being slow. Between-group comparison of the ‘neutral’ trials showed that the high-rate group demonstrated a lower proportion of /a:/ responses, indicating that talker A’s habitual speech rate sounded slower when B was faster. In Experiment 3, both talkers produced speech at both rates, removing the different habitual speech rates of talkers A and B, while maintaining the average rate difference between groups. This time no global rate effect was observed. Taken together, the present experiments show that a talker’s habitual rate is encoded relative to the habitual rate of another talker, carrying implications for episodic and constraint-based models of speech perception.
  • Maslowski, M., Meyer, A. S., & Bosker, H. R. (2019). Listeners normalize speech for contextual speech rate even without an explicit recognition task. The Journal of the Acoustical Society of America, 146(1), 179-188. doi:10.1121/1.5116004.

    Abstract

    Speech can be produced at different rates. Listeners take this rate variation into account by normalizing vowel duration for contextual speech rate: An ambiguous Dutch word /m?t/ is perceived as short /mAt/ when embedded in a slow context, but long /ma:t/ in a fast context. Whilst some have argued that this rate normalization involves low-level automatic perceptual processing, there is also evidence that it arises at higher-level cognitive processing stages, such as decision making. Prior research on rate-dependent speech perception has only used explicit recognition tasks to investigate the phenomenon, involving both perceptual processing and decision making. This study tested whether speech rate normalization can be observed without explicit decision making, using a cross-modal repetition priming paradigm. Results show that a fast precursor sentence makes an embedded ambiguous prime (/m?t/) sound (implicitly) more /a:/-like, facilitating lexical access to the long target word "maat" in a (explicit) lexical decision task. This result suggests that rate normalization is automatic, taking place even in the absence of an explicit recognition task. Thus, rate normalization is placed within the realm of everyday spoken conversation, where explicit categorization of ambiguous sounds is rare.
  • Rodd, J., Bosker, H. R., Ten Bosch, L., & Ernestus, M. (2019). Deriving the onset and offset times of planning units from acoustic and articulatory measurements. The Journal of the Acoustical Society of America, 145(2), EL161-EL167. doi:10.1121/1.5089456.

    Abstract

    Many psycholinguistic models of speech sequence planning make claims about the onset and offset times of planning units, such as words, syllables, and phonemes. These predictions typically go untested, however, since psycholinguists have assumed that the temporal dynamics of the speech signal is a poor index of the temporal dynamics of the underlying speech planning process. This article argues that this problem is tractable, and presents and validates two simple metrics that derive planning unit onset and offset times from the acoustic signal and articulatographic data.
