Publications

Displaying 1 - 17 of 17
  • Bosker, H. R. (2021). Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies. Behavior Research Methods, 53(5), 1945-1953. doi:10.3758/s13428-021-01542-4.

    Abstract

    Many studies of speech perception assess the intelligibility of spoken sentence stimuli by means
    of transcription tasks (‘type out what you hear’). The intelligibility of a given stimulus is then often
    expressed in terms of percentage of words correctly reported from the target sentence. Yet scoring
    the participants’ raw responses for words correctly identified from the target sentence is a time-
    consuming task, and hence resource-intensive. Moreover, there is no consensus among speech
    scientists about what specific protocol to use for the human scoring, limiting the reliability of
    human scores. The present paper evaluates various forms of fuzzy string matching between
    participants’ responses and target sentences, as automated metrics of listener transcript accuracy.
    We demonstrate that one particular metric, the Token Sort Ratio, is a consistent, highly efficient,
    and accurate metric for automated assessment of listener transcripts, as evidenced by high
    correlations with human-generated scores (best correlation: r = 0.940) and a strong relationship to
    acoustic markers of speech intelligibility. Thus, fuzzy string matching provides a practical tool for
    assessment of listener transcript accuracy in large-scale speech intelligibility studies. See
    https://tokensortratio.netlify.app for an online implementation.
  • Bosker, H. R., Badaya, E., & Corley, M. (2021). Discourse markers activate their, like, cohort competitors. Discourse Processes, 58(9), 837-851. doi:10.1080/0163853X.2021.1924000.

    Abstract

    Speech in everyday conversations is riddled with discourse markers (DMs), such as well, you know, and like. However, in many lab-based studies of speech comprehension, such DMs are typically absent from the carefully articulated and highly controlled speech stimuli. As such, little is known about how these DMs influence online word recognition. The present study specifically investigated the online processing of DM like and how it influences the activation of words in the mental lexicon. We specifically targeted the cohort competitor (CC) effect in the Visual World Paradigm: Upon hearing spoken instructions to “pick up the beaker,” human listeners also typically fixate—next to the target object—referents that overlap phonologically with the target word (cohort competitors such as beetle; CCs). However, several studies have argued that CC effects are constrained by syntactic, semantic, pragmatic, and discourse constraints. Therefore, the present study investigated whether DM like influences online word recognition by activating its cohort competitors (e.g., lightbulb). In an eye-tracking experiment using the Visual World Paradigm, we demonstrate that when participants heard spoken instructions such as “Now press the button for the, like … unicycle,” they showed anticipatory looks to the CC referent (lightbulb)well before hearing the target. This CC effect was sustained for a relatively long period of time, even despite hearing disambiguating information (i.e., the /k/ in like). Analysis of the reaction times also showed that participants were significantly faster to select CC targets (lightbulb) when preceded by DM like. These findings suggest that seemingly trivial DMs, such as like, activate their CCs, impacting online word recognition. Thus, we advocate a more holistic perspective on spoken language comprehension in naturalistic communication, including the processing of DMs.
  • Bosker, H. R., & Peeters, D. (2021). Beat gestures influence which speech sounds you hear. Proceedings of the Royal Society B: Biological Sciences, 288: 20202419. doi:10.1098/rspb.2020.2419.

    Abstract

    Beat gestures—spontaneously produced biphasic movements of the hand—
    are among the most frequently encountered co-speech gestures in human
    communication. They are closely temporally aligned to the prosodic charac-
    teristics of the speech signal, typically occurring on lexically stressed
    syllables. Despite their prevalence across speakers of the world’s languages,
    how beat gestures impact spoken word recognition is unclear. Can these
    simple ‘flicks of the hand’ influence speech perception? Across a range
    of experiments, we demonstrate that beat gestures influence the explicit
    and implicit perception of lexical stress (e.g. distinguishing OBject from
    obJECT), and in turn can influence what vowels listeners hear. Thus, we pro-
    vide converging evidence for a manual McGurk effect: relatively simple and
    widely occurring hand movements influence which speech sounds we hear

    Additional information

    example stimuli and experimental data
  • Bosker, H. R. (2021). The contribution of amplitude modulations in speech to perceived charisma. In B. Weiss, J. Trouvain, M. Barkat-Defradas, & J. J. Ohala (Eds.), Voice attractiveness: Prosody, phonology and phonetics (pp. 165-181). Singapore: Springer. doi:10.1007/978-981-15-6627-1_10.

    Abstract

    Speech contains pronounced amplitude modulations in the 1–9 Hz range, correlating with the syllabic rate of speech. Recent models of speech perception propose that this rhythmic nature of speech is central to speech recognition and has beneficial effects on language processing. Here, we investigated the contribution of amplitude modulations to the subjective impression listeners have of public speakers. The speech from US presidential candidates Hillary Clinton and Donald Trump in the three TV debates of 2016 was acoustically analyzed by means of modulation spectra. These indicated that Clinton’s speech had more pronounced amplitude modulations than Trump’s speech, particularly in the 1–9 Hz range. A subsequent perception experiment, with listeners rating the perceived charisma of (low-pass filtered versions of) Clinton’s and Trump’s speech, showed that more pronounced amplitude modulations (i.e., more ‘rhythmic’ speech) increased perceived charisma ratings. These outcomes highlight the important contribution of speech rhythm to charisma perception.
  • Rodd, J., Decuyper, C., Bosker, H. R., & Ten Bosch, L. (2021). A tool for efficient and accurate segmentation of speech data: Announcing POnSS. Behavior Research Methods, 53, 744-756. doi:10.3758/s13428-020-01449-6.

    Abstract

    Despite advances in automatic speech recognition (ASR), human input is still essential to produce research-grade segmentations of speech data. Con- ventional approaches to manual segmentation are very labour-intensive. We introduce POnSS, a browser-based system that is specialized for the task of segmenting the onsets and offsets of words, that combines aspects of ASR with limited human input. In developing POnSS, we identified several sub- tasks of segmentation, and implemented each of these as separate interfaces for the annotators to interact with, to streamline their task as much as possible. We evaluated segmentations made with POnSS against a base- line of segmentations of the same data made conventionally in Praat. We observed that POnSS achieved comparable reliability to segmentation us- ing Praat, but required 23% less annotator time investment. Because of its greater efficiency without sacrificing reliability, POnSS represents a distinct methodological advance for the segmentation of speech data.
  • Severijnen, G. G. A., Bosker, H. R., Piai, V., & McQueen, J. M. (2021). Listeners track talker-specific prosody to deal with talker-variability. Brain Research, 1769: 147605. doi:10.1016/j.brainres.2021.147605.

    Abstract

    One of the challenges in speech perception is that listeners must deal with considerable
    segmental and suprasegmental variability in the acoustic signal due to differences between talkers. Most previous studies have focused on how listeners deal with segmental variability.
    In this EEG experiment, we investigated whether listeners track talker-specific usage of suprasegmental cues to lexical stress to recognize spoken words correctly. In a three-day training phase, Dutch participants learned to map non-word minimal stress pairs onto different object referents (e.g., USklot meant “lamp”; usKLOT meant “train”). These non-words were
    produced by two male talkers. Critically, each talker used only one suprasegmental cue to signal stress (e.g., Talker A used only F0 and Talker B only intensity). We expected participants to learn which talker used which cue to signal stress. In the test phase, participants indicated whether spoken sentences including these non-words were correct (“The word for lamp is…”).
    We found that participants were slower to indicate that a stimulus was correct if the non-word was produced with the unexpected cue (e.g., Talker A using intensity). That is, if in training Talker A used F0 to signal stress, participants experienced a mismatch between predicted and perceived phonological word-forms if, at test, Talker A unexpectedly used intensity to cue
    stress. In contrast, the N200 amplitude, an event-related potential related to phonological
    prediction, was not modulated by the cue mismatch. Theoretical implications of these
    contrasting results are discussed. The behavioral findings illustrate talker-specific prediction of prosodic cues, picked up through perceptual learning during training.
  • Bosker, H. R., Van Os, M., Does, R., & Van Bergen, G. (2019). Counting 'uhm's: how tracking the distribution of native and non-native disfluencies influences online language comprehension. Journal of Memory and Language, 106, 189-202. doi:10.1016/j.jml.2019.02.006.

    Abstract

    Disfluencies, like 'uh', have been shown to help listeners anticipate reference to low-frequency words. The associative account of this 'disfluency bias' proposes that listeners learn to associate disfluency with low-frequency referents based on prior exposure to non-arbitrary disfluency distributions (i.e., greater probability of low-frequency words after disfluencies). However, there is limited evidence for listeners actually tracking disfluency distributions online. The present experiments are the first to show that adult listeners, exposed to a typical or more atypical disfluency distribution (i.e., hearing a talker unexpectedly say uh before high-frequency words), flexibly adjust their predictive strategies to the disfluency distribution at hand (e.g., learn to predict high-frequency referents after disfluency). However, when listeners were presented with the same atypical disfluency distribution but produced by a non-native speaker, no adjustment was observed. This suggests pragmatic inferences can modulate distributional learning, revealing the flexibility of, and constraints on, distributional learning in incremental language comprehension.
  • Maslowski, M., Meyer, A. S., & Bosker, H. R. (2019). How the tracking of habitual rate influences speech perception. Journal of Experimental Psychology: Learning, Memory, and Cognition, 45(1), 128-138. doi:10.1037/xlm0000579.

    Abstract

    Listeners are known to track statistical regularities in speech. Yet, which temporal cues
    are encoded is unclear. This study tested effects of talker-specific habitual speech rate
    and talker-independent average speech rate (heard over a longer period of time) on
    the perception of the temporal Dutch vowel contrast /A/-/a:/. First, Experiment 1
    replicated that slow local (surrounding) speech contexts induce fewer long /a:/
    responses than faster contexts. Experiment 2 tested effects of long-term habitual
    speech rate. One high-rate group listened to ambiguous vowels embedded in `neutral'
    speech from talker A, intermixed with speech from fast talker B. Another low-rate group
    listened to the same `neutral' speech from talker A, but to talker B being slow.
    Between-group comparison of the `neutral' trials showed that the high-rate group
    demonstrated a lower proportion of /a:/ responses, indicating that talker A's habitual
    speech rate sounded slower when B was faster. In Experiment 3, both talkers
    produced speech at both rates, removing the different habitual speech rates of talker A
    and B, while maintaining the average rate differing between groups. This time no
    global rate effect was observed. Taken together, the present experiments show that a
    talker's habitual rate is encoded relative to the habitual rate of another talker, carrying
    implications for episodic and constraint-based models of speech perception.
  • Maslowski, M., Meyer, A. S., & Bosker, H. R. (2019). Listeners normalize speech for contextual speech rate even without an explicit recognition task. The Journal of the Acoustical Society of America, 146(1), 179-188. doi:10.1121/1.5116004.

    Abstract

    Speech can be produced at different rates. Listeners take this rate variation into account by normalizing vowel duration for contextual speech rate: An ambiguous Dutch word /m?t/ is perceived as short /mAt/ when embedded in a slow context, but long /ma:t/ in a fast context. Whilst some have argued that this rate normalization involves low-level automatic perceptual processing, there is also evidence that it arises at higher-level cognitive processing stages, such as decision making. Prior research on rate-dependent speech perception has only used explicit recognition tasks to investigate the phenomenon, involving both perceptual processing and decision making. This study tested whether speech rate normalization can be observed without explicit decision making, using a cross-modal repetition priming paradigm. Results show that a fast precursor sentence makes an embedded ambiguous prime (/m?t/) sound (implicitly) more /a:/-like, facilitating lexical access to the long target word "maat" in a (explicit) lexical decision task. This result suggests that rate normalization is automatic, taking place even in the absence of an explicit recognition task. Thus, rate normalization is placed within the realm of everyday spoken conversation, where explicit categorization of ambiguous sounds is rare.
  • Rodd, J., Bosker, H. R., Ten Bosch, L., & Ernestus, M. (2019). Deriving the onset and offset times of planning units from acoustic and articulatory measurements. The Journal of the Acoustical Society of America, 145(2), EL161-EL167. doi:10.1121/1.5089456.

    Abstract

    Many psycholinguistic models of speech sequence planning make claims about the onset and offset times of planning units, such as words, syllables, and phonemes. These predictions typically go untested, however, since psycholinguists have assumed that the temporal dynamics of the speech signal is a poor index of the temporal dynamics of the underlying speech planning process. This article argues that this problem is tractable, and presents and validates two simple metrics that derive planning unit onset and offset times from the acoustic signal and articulatographic data.
  • Bosker, H. R. (2017). Accounting for rate-dependent category boundary shifts in speech perception. Attention, Perception & Psychophysics, 79, 333-343. doi:10.3758/s13414-016-1206-4.

    Abstract

    The perception of temporal contrasts in speech is known to be influenced by the speech rate in the surrounding context. This rate-dependent perception is suggested to involve general auditory processes since it is also elicited by non-speech contexts, such as pure tone sequences. Two general auditory mechanisms have been proposed to underlie rate-dependent perception: durational contrast and neural entrainment. The present study compares the predictions of these two accounts of rate-dependent speech perception by means of four experiments in which participants heard tone sequences followed by Dutch target words ambiguous between /ɑs/ “ash” and /a:s/ “bait”. Tone sequences varied in the duration of tones (short vs. long) and in the presentation rate of the tones (fast vs. slow). Results show that the duration of preceding tones did not influence target perception in any of the experiments, thus challenging durational contrast as explanatory mechanism behind rate-dependent perception. Instead, the presentation rate consistently elicited a category boundary shift, with faster presentation rates inducing more /a:s/ responses, but only if the tone sequence was isochronous. Therefore, this study proposes an alternative, neurobiologically plausible, account of rate-dependent perception involving neural entrainment of endogenous oscillations to the rate of a rhythmic stimulus.
  • Bosker, H. R., Reinisch, E., & Sjerps, M. J. (2017). Cognitive load makes speech sound fast, but does not modulate acoustic context effects. Journal of Memory and Language, 94, 166-176. doi:10.1016/j.jml.2016.12.002.

    Abstract

    In natural situations, speech perception often takes place during the concurrent execution of other cognitive tasks, such as listening while viewing a visual scene. The execution of a dual task typically has detrimental effects on concurrent speech perception, but how exactly cognitive load disrupts speech encoding is still unclear. The detrimental effect on speech representations may consist of either a general reduction in the robustness of processing of the speech signal (‘noisy encoding’), or, alternatively it may specifically influence the temporal sampling of the sensory input, with listeners missing temporal pulses, thus underestimating segmental durations (‘shrinking of time’). The present study investigated whether and how spectral and temporal cues in a precursor sentence that has been processed under high vs. low cognitive load influence the perception of a subsequent target word. If cognitive load effects are implemented through ‘noisy encoding’, increasing cognitive load during the precursor should attenuate the encoding of both its temporal and spectral cues, and hence reduce the contextual effect that these cues can have on subsequent target sound perception. However, if cognitive load effects are expressed as ‘shrinking of time’, context effects should not be modulated by load, but a main effect would be expected on the perceived duration of the speech signal. Results from two experiments indicate that increasing cognitive load (manipulated through a secondary visual search task) did not modulate temporal (Experiment 1) or spectral context effects (Experiment 2). However, a consistent main effect of cognitive load was found: increasing cognitive load during the precursor induced a perceptual increase in its perceived speech rate, biasing the perception of a following target word towards longer durations. This finding suggests that cognitive load effects in speech perception are implemented via ‘shrinking of time’, in line with a temporal sampling framework. In addition, we argue that our results align with a model in which early (spectral and temporal) normalization is unaffected by attention but later adjustments may be attention-dependent.
  • Bosker, H. R., & Kösem, A. (2017). An entrained rhythm's frequency, not phase, influences temporal sampling of speech. In Proceedings of Interspeech 2017 (pp. 2416-2420). doi:10.21437/Interspeech.2017-73.

    Abstract

    Brain oscillations have been shown to track the slow amplitude fluctuations in speech during comprehension. Moreover, there is evidence that these stimulus-induced cortical rhythms may persist even after the driving stimulus has ceased. However, how exactly this neural entrainment shapes speech perception remains debated. This behavioral study investigated whether and how the frequency and phase of an entrained rhythm would influence the temporal sampling of subsequent speech. In two behavioral experiments, participants were presented with slow and fast isochronous tone sequences, followed by Dutch target words ambiguous between as /ɑs/ “ash” (with a short vowel) and aas /a:s/ “bait” (with a long vowel). Target words were presented at various phases of the entrained rhythm. Both experiments revealed effects of the frequency of the tone sequence on target word perception: fast sequences biased listeners to more long /a:s/ responses. However, no evidence for phase effects could be discerned. These findings show that an entrained rhythm’s frequency, but not phase, influences the temporal sampling of subsequent speech. These outcomes are compatible with theories suggesting that sensory timing is evaluated relative to entrained frequency. Furthermore, they suggest that phase tracking of (syllabic) rhythms by theta oscillations plays a limited role in speech parsing.
  • Bosker, H. R., & Reinisch, E. (2017). Foreign languages sound fast: evidence from implicit rate normalization. Frontiers in Psychology, 8: 1063. doi:10.3389/fpsyg.2017.01063.

    Abstract

    Anecdotal evidence suggests that unfamiliar languages sound faster than one’s native language. Empirical evidence for this impression has, so far, come from explicit rate judgments. The aim of the present study was to test whether such perceived rate differences between native and foreign languages have effects on implicit speech processing. Our measure of implicit rate perception was “normalization for speaking rate”: an ambiguous vowel between short /a/ and long /a:/ is interpreted as /a:/ following a fast but as /a/ following a slow carrier sentence. That is, listeners did not judge speech rate itself; instead, they categorized ambiguous vowels whose perception was implicitly affected by the rate of the context. We asked whether a bias towards long /a:/ might be observed when the context is not actually faster but simply spoken in a foreign language. A fully symmetrical experimental design was used: Dutch and German participants listened to rate matched (fast and slow) sentences in both languages spoken by the same bilingual speaker. Sentences were followed by nonwords that contained vowels from an /a-a:/ duration continuum. Results from Experiments 1 and 2 showed a consistent effect of rate normalization for both listener groups. Moreover, for German listeners, across the two experiments, foreign sentences triggered more /a:/ responses than (rate matched) native sentences, suggesting that foreign sentences were indeed perceived as faster. Moreover, this Foreign Language effect was modulated by participants’ ability to understand the foreign language: those participants that scored higher on a foreign language translation task showed less of a Foreign Language effect. However, opposite effects were found for the Dutch listeners. For them, their native rather than the foreign language induced more /a:/ responses. Nevertheless, this reversed effect could be reduced when additional spectral properties of the context were controlled for. Experiment 3, using explicit rate judgments, replicated the effect for German but not Dutch listeners. We therefore conclude that the subjective impression that foreign languages sound fast may have an effect on implicit speech processing, with implications for how language learners perceive spoken segments in a foreign language.

    Additional information

    data sheet 1.docx
  • Bosker, H. R. (2017). How our own speech rate influences our perception of others. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43(8), 1225-1238. doi:10.1037/xlm0000381.

    Abstract

    In conversation, our own speech and that of others follow each other in rapid succession. Effects of the surrounding context on speech perception are well documented but, despite the ubiquity of the sound of our own voice, it is unknown whether our own speech also influences our perception of other talkers. This study investigated context effects induced by our own speech through six experiments, specifically targeting rate normalization (i.e., perceiving phonetic segments relative to surrounding speech rate). Experiment 1 revealed that hearing pre-recorded fast or slow context sentences altered the perception of ambiguous vowels, replicating earlier work. Experiment 2 demonstrated that talking at a fast or slow rate prior to target presentation also altered target perception, though the effect of preceding speech rate was reduced. Experiment 3 showed that silent talking (i.e., inner speech) at fast or slow rates did not modulate the perception of others, suggesting that the effect of self-produced speech rate in Experiment 2 arose through monitoring of the external speech signal. Experiment 4 demonstrated that, when participants were played back their own (fast/slow) speech, no reduction of the effect of preceding speech rate was observed, suggesting that the additional task of speech production may be responsible for the reduced effect in Experiment 2. Finally, Experiments 5 and 6 replicate Experiments 2 and 3 with new participant samples. Taken together, these results suggest that variation in speech production may induce variation in speech perception, thus carrying implications for our understanding of spoken communication in dialogue settings.
  • Bosker, H. R. (2017). The role of temporal amplitude modulations in the political arena: Hillary Clinton vs. Donald Trump. In Proceedings of Interspeech 2017 (pp. 2228-2232). doi:10.21437/Interspeech.2017-142.

    Abstract

    Speech is an acoustic signal with inherent amplitude modulations in the 1-9 Hz range. Recent models of speech perception propose that this rhythmic nature of speech is central to speech recognition. Moreover, rhythmic amplitude modulations have been shown to have beneficial effects on language processing and the subjective impression listeners have of the speaker. This study investigated the role of amplitude modulations in the political arena by comparing the speech produced by Hillary Clinton and Donald Trump in the three presidential debates of 2016. Inspection of the modulation spectra, revealing the spectral content of the two speakers’ amplitude envelopes after matching for overall intensity, showed considerably greater power in Clinton’s modulation spectra (compared to Trump’s) across the three debates, particularly in the 1-9 Hz range. The findings suggest that Clinton’s speech had a more pronounced temporal envelope with rhythmic amplitude modulations below 9 Hz, with a preference for modulations around 3 Hz. This may be taken as evidence for a more structured temporal organization of syllables in Clinton’s speech, potentially due to more frequent use of preplanned utterances. Outcomes are interpreted in light of the potential beneficial effects of a rhythmic temporal envelope on intelligibility and speaker perception.
  • Maslowski, M., Meyer, A. S., & Bosker, H. R. (2017). Whether long-term tracking of speech rate affects perception depends on who is talking. In Proceedings of Interspeech 2017 (pp. 586-590). doi:10.21437/Interspeech.2017-1517.

    Abstract

    Speech rate is known to modulate perception of temporally ambiguous speech sounds. For instance, a vowel may be perceived as short when the immediate speech context is slow, but as long when the context is fast. Yet, effects of long-term tracking of speech rate are largely unexplored. Two experiments tested whether long-term tracking of rate influences perception of the temporal Dutch vowel contrast /ɑ/-/a:/. In Experiment 1, one low-rate group listened to 'neutral' rate speech from talker A and to slow speech from talker B. Another high-rate group was exposed to the same neutral speech from A, but to fast speech from B. Between-group comparison of the 'neutral' trials revealed that the low-rate group reported a higher proportion of /a:/ in A's 'neutral' speech, indicating that A sounded faster when B was slow. Experiment 2 tested whether one's own speech rate also contributes to effects of long-term tracking of rate. Here, talker B's speech was replaced by playback of participants' own fast or slow speech. No evidence was found that one's own voice affected perception of talker A in larger speech contexts. These results carry implications for our understanding of the mechanisms involved in rate-dependent speech perception and of dialogue.

Share this page