Publications

  • Casillas, M., Foushee, R., Méndez Girón, J., Polian, G., & Brown, P. (2024). Little evidence for a noun bias in Tseltal spontaneous speech. First Language. Advance online publication. doi:10.1177/01427237231216571.

    Abstract

    This study examines whether children acquiring Tseltal (Mayan) demonstrate a noun bias – an overrepresentation of nouns in their early vocabularies. Nouns, specifically concrete and animate nouns, are argued to universally predominate in children’s early vocabularies because their referents are naturally available as bounded concepts to which linguistic labels can be mapped. This early advantage for noun learning has been documented using multiple methods and across a diverse collection of language populations. However, past evidence bearing on a noun bias in Tseltal learners has been mixed. Tseltal grammatical features and child–caregiver interactional patterns dampen the salience of nouns and heighten the salience of verbs, leading to the prediction of a diminished noun bias and perhaps even an early predominance of verbs. We here analyze the use of noun and verb stems in children’s spontaneous speech from egocentric daylong recordings of 29 Tseltal learners between 0;9 and 4;4. We find weak to no evidence for a noun bias using two separate analytical approaches on the same data; one analysis yields a preliminary suggestion of a flipped outcome (i.e. a verb bias). We discuss the implications of these findings for broader theories of learning bias in early lexical development.
  • Lutzenberger, H., Casillas, M., Fikkert, P., Crasborn, O., & De Vos, C. (2024). More than looks: Exploring methods to test phonological discrimination in the sign language Kata Kolok. Language Learning and Development. Advance online publication. doi:10.1080/15475441.2023.2277472.

    Abstract

    The lack of diversity in the language sciences has increasingly been criticized as it holds the potential for producing flawed theories. Research on (i) geographically diverse language communities and (ii) on sign languages is necessary to corroborate, sharpen, and extend existing theories. This study contributes a case study of adapting a well-established paradigm to study the acquisition of sign phonology in Kata Kolok, a sign language of rural Bali, Indonesia. We conducted an experiment modeled after the familiarization paradigm with child signers of Kata Kolok. Traditional analyses of looking time did not yield significant differences between signing and non-signing children. Yet, additional behavioral analyses (attention, eye contact, hand behavior) suggest that children who are signers and those who are non-signers, as well as those who are hearing and those who are deaf, interact differently with the task. This study suggests limitations of the paradigm due to the ecology of sign languages and the sociocultural characteristics of the sample, calling for a mixed-methods approach. Ultimately, this paper aims to elucidate the diversity of adaptations necessary for experimental design, procedure, and analysis, and to offer a critical reflection on the contribution of similar efforts and the diversification of the field.
  • Bergelson, E., Soderstrom, M., Schwarz, I.-C., Rowland, C. F., Ramírez-Esparza, N., Rague Hamrick, L., Marklund, E., Kalashnikova, M., Guez, A., Casillas, M., Benetti, L., Van Alphen, P. M., & Cristia, A. (2023). Everyday language input and production in 1,001 children from six continents. Proceedings of the National Academy of Sciences of the United States of America, 120(52): 2300671120. doi:10.1073/pnas.2300671120.

    Abstract

    Language is a universal human ability, acquired readily by young children, who otherwise struggle with many basics of survival. And yet, language ability is variable across individuals. Naturalistic and experimental observations suggest that children’s linguistic skills vary with factors like socioeconomic status and children’s gender. But which factors really influence children’s day-to-day language use? Here, we leverage speech technology in a big-data approach to report on a unique cross-cultural and diverse data set: >2,500 d-long, child-centered audio-recordings of 1,001 2- to 48-mo-olds from 12 countries spanning six continents across urban, farmer-forager, and subsistence-farming contexts. As expected, age and language-relevant clinical risks and diagnoses predicted how much speech (and speech-like vocalization) children produced. Critically, so too did adult talk in children’s environments: Children who heard more talk from adults produced more speech. In contrast to previous conclusions based on more limited sampling methods and a different set of language proxies, socioeconomic status (operationalized as maternal education) was not significantly associated with children’s productions over the first 4 y of life, and neither were gender or multilingualism. These findings from large-scale naturalistic data advance our understanding of which factors are robust predictors of variability in the speech behaviors of young learners in a wide range of everyday contexts.
  • De Vos, C., Casillas, M., Uittenbogert, T., Crasborn, O., & Levinson, S. C. (2022). Predicting conversational turns: Signers’ and non-signers’ sensitivity to language-specific and globally accessible cues. Language, 98(1), 35-62. doi:10.1353/lan.2021.0085.

    Abstract

    Precision turn-taking may constitute a crucial part of the human endowment for communication. If so, it should be implemented similarly across language modalities, as in signed vs. spoken language. Here in the first experimental study of turn-end prediction in sign language, we find support for the idea that signed language, like spoken language, involves turn-type prediction and turn-end anticipation. In both cases, turns eliciting specific responses like questions accelerate anticipation. We also show remarkable cross-modality predictive capacity: non-signers anticipate sign turn-ends surprisingly well. Finally, we show that despite non-signers’ ability to intuitively predict signed turn-ends, early native signers do it much better by using their access to linguistic signals (here, question markers). As shown in prior work, question formation facilitates prediction, and age of sign language acquisition affects accuracy. The study thus sheds light on the kind of features that may facilitate turn-taking universally, and those that are language-specific.

    Additional information

    public summary
  • Casillas, M., Brown, P., & Levinson, S. C. (2021). Early language experience in a Papuan community. Journal of Child Language, 48(4), 792-814. doi:10.1017/S0305000920000549.

    Abstract

    The rate at which young children are directly spoken to varies due to many factors, including (a) caregiver ideas about children as conversational partners and (b) the organization of everyday life. Prior work suggests cross-cultural variation in rates of child-directed speech is due to the former factor, but has been fraught with confounds in comparing postindustrial and subsistence farming communities. We investigate the daylong language environments of children (0;0–3;0) on Rossel Island, Papua New Guinea, a small-scale traditional community where prior ethnographic study demonstrated contingency-seeking child interaction styles. In fact, children were infrequently directly addressed and linguistic input rate was primarily affected by situational factors, though children’s vocalization maturity showed no developmental delay. We compare the input characteristics between this community and a Tseltal Mayan one in which near-parallel methods produced comparable results, then briefly discuss the models and mechanisms for learning best supported by our findings.
  • Cychosz, M., Cristia, A., Bergelson, E., Casillas, M., Baudet, G., Warlaumont, A. S., Scaff, C., Yankowitz, L., & Seidl, A. (2021). Vocal development in a large‐scale crosslinguistic corpus. Developmental Science, 24(5): e13090. doi:10.1111/desc.13090.

    Abstract

    This study evaluates whether early vocalizations develop in similar ways in children across diverse cultural contexts. We analyze data from daylong audio recordings of 49 children (1–36 months) from five different language/cultural backgrounds. Citizen scientists annotated these recordings to determine if child vocalizations contained canonical transitions or not (e.g., “ba” vs. “ee”). Results revealed that the proportion of clips reported to contain canonical transitions increased with age. Furthermore, this proportion exceeded 0.15 by around 7 months, replicating and extending previous findings on canonical vocalization development but using data from the natural environments of a culturally and linguistically diverse sample. This work explores how crowdsourcing can be used to annotate corpora, helping establish developmental milestones relevant to multiple languages and cultures. Lower inter‐annotator reliability on the crowdsourcing platform, relative to more traditional in‐lab expert annotators, means that a larger number of unique annotators and/or annotations are required, and that crowdsourcing may not be a suitable method for more fine‐grained annotation decisions. Audio clips used for this project are compiled into a large‐scale infant vocalization corpus that is available for other researchers to use in future work.

    Additional information

    supporting information audio data
  • Frost, R. L. A., & Casillas, M. (2021). Investigating statistical learning of nonadjacent dependencies: Running statistical learning tasks in non-WEIRD populations. In SAGE Research Methods Cases. doi:10.4135/9781529759181.

    Abstract

    Language acquisition is complex. However, one thing that has been suggested to help learning is the way that information is distributed throughout language; co-occurrences among particular items (e.g., syllables and words) have been shown to help learners discover the words that a language contains and figure out how those words are used. Humans’ ability to draw on this information—“statistical learning”—has been demonstrated across a broad range of studies. However, evidence from non-WEIRD (Western, Educated, Industrialized, Rich, and Democratic) societies is critically lacking, which limits theorizing on the universality of this skill. We extended work on statistical language learning to a new, non-WEIRD linguistic population: speakers of Yélî Dnye, who live on a remote island off mainland Papua New Guinea (Rossel Island). We performed a replication of an existing statistical learning study, training adults on an artificial language with statistically defined words, then examining what they had learnt using a two-alternative forced-choice test. Crucially, we implemented several key amendments to the original study to ensure the replication was suitable for remote field-site testing with speakers of Yélî Dnye. We made critical changes to the stimuli and materials (to test speakers of Yélî Dnye, rather than English), the instructions (we re-worked these significantly, and added practice tasks to optimize participants’ understanding), and the study format (shifting from a lab-based to a portable tablet-based setup). We discuss the requirement for acute sensitivity to linguistic, cultural, and environmental factors when adapting studies to test new populations.

  • Räsänen, O., Seshadri, S., Lavechin, M., Cristia, A., & Casillas, M. (2021). ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Behavior Research Methods, 53, 818-835. doi:10.3758/s13428-020-01460-x.

    Abstract

    Recordings captured by wearable microphones are a standard method for investigating young children’s language environments. A key measure to quantify from such data is the amount of speech present in children’s home environments. To this end, the LENA recorder and software—a popular system for measuring linguistic input—estimates the number of adult words that children may hear over the course of a recording. However, word count estimation is challenging to do in a language-independent manner; the relationship between observable acoustic patterns and language-specific lexical entities is far from uniform across human languages. In this paper, we ask whether some alternative linguistic units, namely phone(me)s or syllables, could be measured instead of, or in parallel with, words in order to achieve improved cross-linguistic applicability and comparability of an automated system for measuring child language input. We discuss the advantages and disadvantages of measuring different units from theoretical and technical points of view. We also investigate the practical applicability of measuring such units using a novel system called Automatic LInguistic unit Count Estimator (ALICE) together with audio from seven child-centered daylong audio corpora from diverse cultural and linguistic environments. We show that language-independent measurement of phoneme counts is somewhat more accurate than syllables or words, but all three are highly correlated with human annotations on the same data. We share an open-source implementation of ALICE for use by the language research community, allowing automatic phoneme, syllable, and word count estimation from child-centered audio recordings.
  • Casillas, M., Brown, P., & Levinson, S. C. (2020). Early language experience in a Tseltal Mayan village. Child Development, 91(5), 1819-1835. doi:10.1111/cdev.13349.

    Abstract

    Daylong at-home audio recordings from 10 Tseltal Mayan children (0;2–3;0; Southern Mexico) were analyzed for how often children engaged in verbal interaction with others and whether their speech environment changed with age, time of day, household size, and number of speakers present. Children were infrequently directly spoken to, with most directed speech coming from adults, and no increase with age. Most directed speech came in the mornings, and interactional peaks contained nearly four times the baseline rate of directed speech. Coarse indicators of children's language development (babbling, first words, first word combinations) suggest that Tseltal children manage to extract the linguistic information they need despite minimal directed speech. Multiple proposals for how they might do so are discussed.

    Additional information

    Tseltal-CLE-SuppMat.pdf
  • Casillas, M., & Hilbrink, E. (2020). Communicative act development. In K. P. Schneider, & E. Ifantidou (Eds.), Developmental and Clinical Pragmatics (pp. 61-88). Berlin: De Gruyter Mouton.

    Abstract

    How do children learn to map linguistic forms onto their intended meanings? This chapter begins with an introduction to some theoretical and analytical tools used to study communicative acts. It then turns to communicative act development in spoken and signed language acquisition, including both the early scaffolding and production of communicative acts (both non-verbal and verbal) as well as their later links to linguistic development and Theory of Mind. The chapter wraps up by linking research on communicative act development to the acquisition of conversational skills, cross-linguistic and individual differences in communicative experience during development, and human evolution. Along the way, it also poses a few open questions for future research in this domain.
  • Cychosz, M., Romeo, R., Soderstrom, M., Scaff, C., Ganek, H., Cristia, A., Casillas, M., De Barbaro, K., Bang, J. Y., & Weisleder, A. (2020). Longform recordings of everyday life: Ethics for best practices. Behavior Research Methods, 52, 1951-1969. doi:10.3758/s13428-020-01365-9.

    Abstract

    Recent advances in large-scale data storage and processing offer unprecedented opportunities for behavioral scientists to collect and analyze naturalistic data, including from under-represented groups. Audio data, particularly real-world audio recordings, are of particular interest to behavioral scientists because they provide high-fidelity access to subtle aspects of daily life and social interactions. However, these methodological advances pose novel risks to research participants and communities. In this article, we outline the benefits and challenges associated with collecting, analyzing, and sharing multi-hour audio recording data. Guided by the principles of autonomy, privacy, beneficence, and justice, we propose a set of ethical guidelines for the use of longform audio recordings in behavioral research. This article is also accompanied by an Open Science Framework Ethics Repository that includes informed consent resources such as frequent participant concerns and sample consent forms.
  • MacDonald, K., Räsänen, O., Casillas, M., & Warlaumont, A. S. (2020). Measuring prosodic predictability in children’s home language environments. In S. Denison, M. Mack, Y. Xu, & B. C. Armstrong (Eds.), Proceedings of the 42nd Annual Virtual Meeting of the Cognitive Science Society (CogSci 2020) (pp. 695-701). Montreal, QB: Cognitive Science Society.

    Abstract

    Children learn language from the speech in their home environment. Recent work shows that more infant-directed speech (IDS) leads to stronger lexical development. But what makes IDS a particularly useful learning signal? Here, we expand on an attention-based account first proposed by Räsänen et al. (2018): that prosodic modifications make IDS less predictable, and thus more interesting. First, we reproduce the critical finding from Räsänen et al.: that lab-recorded IDS pitch is less predictable compared to adult-directed speech (ADS). Next, we show that this result generalizes to the home language environment, finding that IDS in daylong recordings is also less predictable than ADS but that this pattern is much less robust than for IDS recorded in the lab. These results link experimental work on attention and prosodic modifications of IDS to real-world language-learning environments, highlighting some challenges of scaling up analyses of IDS to larger datasets that better capture children’s actual input.
  • Bergelson*, E., Casillas*, M., Soderstrom, M., Seidl, A., Warlaumont, A. S., & Amatuni, A. (2019). What Do North American Babies Hear? A large-scale cross-corpus analysis. Developmental Science, 22(1): e12724. doi:10.1111/desc.12724.

    Abstract

    (* indicates joint first authorship.) A range of demographic variables influence how much speech young children hear. However, because studies have used vastly different sampling methods, quantitative comparison of interlocking demographic effects has been nearly impossible, across or within studies. We harnessed a unique collection of existing naturalistic, day-long recordings from 61 homes across four North American cities to examine language input as a function of age, gender, and maternal education. We analyzed adult speech heard by 3- to 20-month-olds who wore audio recorders for an entire day. We annotated speaker gender and speech register (child-directed or adult-directed) for 10,861 utterances from female and male adults in these recordings. Examining age, gender, and maternal education collectively in this ecologically-valid dataset, we find several key results. First, the speaker gender imbalance in the input is striking: children heard 2–3x more speech from females than males. Second, children in higher-maternal-education homes heard more child-directed speech than those in lower-maternal education homes. Finally, our analyses revealed a previously unreported effect: the proportion of child-directed speech in the input increases with age, due to a decrease in adult-directed speech with age. This large-scale analysis is an important step forward in collectively examining demographic variables that influence early development, made possible by pooled, comparable, day-long recordings of children's language environments. The audio recordings, annotations, and annotation software are readily available for re-use and re-analysis by other researchers.

    Additional information

    desc12724-sup-0001-supinfo.pdf
  • Casillas, M., & Cristia, A. (2019). A step-by-step guide to collecting and analyzing long-format speech environment (LFSE) recordings. Collabra, 5(1): 24. doi:10.1525/collabra.209.

    Abstract

    Recent years have seen rapid technological development of devices that can record communicative behavior as participants go about daily life. This paper is intended as an end-to-end methodological guidebook for potential users of these technologies, including researchers who want to study children’s or adults’ communicative behavior in everyday contexts. We explain how long-format speech environment (LFSE) recordings provide a unique view on language use and how they can be used to complement other measures at the individual and group level. We aim to help potential users of these technologies make informed decisions regarding research design, hardware, software, and archiving. We also provide information regarding ethics and implementation, issues that are difficult to navigate for those new to this technology, and on which little or no resources are available. This guidebook offers a concise summary of information for new users and points to sources of more detailed information for more advanced users. Links to discussion groups and community-augmented databases are also provided to help readers stay up-to-date on the latest developments.
  • Casillas, M., Rafiee, A., & Majid, A. (2019). Iranian herbalists, but not cooks, are better at naming odors than laypeople. Cognitive Science, 43(6): e12763. doi:10.1111/cogs.12763.

    Abstract

    Odor naming is enhanced in communities where communication about odors is a central part of daily life (e.g., wine experts, flavorists, and some hunter‐gatherer groups). In this study, we investigated how expert knowledge and daily experience affect the ability to name odors in a group of experts that has not previously been investigated in this context—Iranian herbalists; also called attars—as well as cooks and laypeople. We assessed naming accuracy and consistency for 16 herb and spice odors, collected judgments of odor perception, and evaluated participants' odor meta‐awareness. Participants' responses were overall more consistent and accurate for more frequent and familiar odors. Moreover, attars were more accurate than both cooks and laypeople at naming odors, although cooks did not perform significantly better than laypeople. Attars' perceptual ratings of odors and their overall odor meta‐awareness suggest they are also more attuned to odors than the other two groups. To conclude, Iranian attars—but not cooks—are better odor namers than laypeople. They also have greater meta‐awareness and differential perceptual responses to odors. These findings further highlight the critical role that expertise and type of experience have on olfactory functions.

    Additional information

    Supplementary Materials
  • Räsänen, O., Seshadri, S., Karadayi, J., Riebling, E., Bunce, J., Cristia, A., Metze, F., Casillas, M., Rosemberg, C., Bergelson, E., & Soderstrom, M. (2019). Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech. Speech Communication, 113, 63-80. doi:10.1016/j.specom.2019.08.005.

    Abstract

    Automatic word count estimation (WCE) from audio recordings can be used to quantify the amount of verbal communication in a recording environment. One key application of WCE is to measure language input heard by infants and toddlers in their natural environments, as captured by daylong recordings from microphones worn by the infants. Although WCE is nearly trivial for high-quality signals in high-resource languages, daylong recordings are substantially more challenging due to the unconstrained acoustic environments and the presence of near- and far-field speech. Moreover, many use cases of interest involve languages for which reliable ASR systems or even well-defined lexicons are not available. A good WCE system should also perform similarly for low- and high-resource languages in order to enable unbiased comparisons across different cultures and environments. Unfortunately, the current state-of-the-art solution, the LENA system, is based on proprietary software and has only been optimized for American English, limiting its applicability. In this paper, we build on existing work on WCE and present the steps we have taken towards a freely available system for WCE that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data. Our system is based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts (and a number of other acoustic features) to the corresponding word count estimates. We evaluate our system on samples from daylong infant recordings from six different corpora consisting of several languages and socioeconomic environments, all manually annotated with the same protocol to allow direct comparison. We compare a number of alternative techniques for the two key components in our system: speech activity detection and automatic syllabification of speech. As a result, we show that our system can reach relatively consistent WCE accuracy across multiple corpora and languages (with some limitations). In addition, the system outperforms LENA on three of the four corpora consisting of different varieties of English. We also demonstrate how an automatic neural network-based syllabifier, when trained on multiple languages, generalizes well to novel languages beyond the training data, outperforming two previously proposed unsupervised syllabifiers as a feature extractor for WCE.
  • Bögels, S., Casillas, M., & Levinson, S. C. (2018). Planning versus comprehension in turn-taking: Fast responders show reduced anticipatory processing of the question. Neuropsychologia, 109, 295-310. doi:10.1016/j.neuropsychologia.2017.12.028.

    Abstract

    Rapid response latencies in conversation suggest that responders start planning before the ongoing turn is finished. Indeed, an earlier EEG study suggests that listeners start planning their responses to questions as soon as they can (Bögels, S., Magyari, L., & Levinson, S. C. (2015). Neural signatures of response planning occur midway through an incoming question in conversation. Scientific Reports, 5, 12881). The present study aimed to (1) replicate this early planning effect and (2) investigate whether such early response planning incurs a cost on participants’ concurrent comprehension of the ongoing turn. During the experiment participants answered questions from a confederate partner. To address aim (1), the questions were designed such that response planning could start either early or late in the turn. Our results largely replicate Bögels et al. (2015) showing a large positive ERP effect and an oscillatory alpha/beta reduction right after participants could have first started planning their verbal response, again suggesting an early start of response planning. To address aim (2), the confederate's questions also contained either an expected word or an unexpected one to elicit a differential N400 effect, either before or after the start of response planning. We hypothesized an attenuated N400 effect after response planning had started. In contrast, the N400 effects before and after planning did not differ. There was, however, a positive correlation between participants' response time and their N400 effect size after planning had started; quick responders showed a smaller N400 effect, suggesting reduced attention to comprehension and possibly reduced anticipatory processing. We conclude that early response planning can indeed impact comprehension processing.

    Additional information

    mmc1.pdf
  • Cristia, A., Ganesh, S., Casillas, M., & Ganapathy, S. (2018). Talker diarization in the wild: The case of child-centered daylong audio-recordings. In Proceedings of Interspeech 2018 (pp. 2583-2587). doi:10.21437/Interspeech.2018-2078.

    Abstract

    Speaker diarization (answering 'who spoke when') is a widely researched subject within speech technology. Numerous experiments have been run on datasets built from broadcast news, meeting data, and call centers—the task sometimes appears close to being solved. Much less work has begun to tackle the hardest diarization task of all: spontaneous conversations in real-world settings. Such diarization would be particularly useful for studies of language acquisition, where researchers investigate the speech children produce and hear in their daily lives. In this paper, we study audio gathered with a recorder worn by small children as they went about their normal days. As a result, each child was exposed to different acoustic environments with a multitude of background noises and a varying number of adults and peers. The inconsistency of speech and noise within and across samples poses a challenging task for speaker diarization systems, which we tackled via retraining and data augmentation techniques. We further studied sources of structured variation across raw audio files, including the impact of speaker type distribution, proportion of speech from children, and child age on diarization performance. We discuss the extent to which these findings might generalize to other samples of speech in the wild.
  • Räsänen, O., Seshadri, S., & Casillas, M. (2018). Comparison of syllabification algorithms and training strategies for robust word count estimation across different languages and recording conditions. In Proceedings of Interspeech 2018 (pp. 1200-1204). doi:10.21437/Interspeech.2018-1047.

    Abstract

    Word count estimation (WCE) from audio recordings has a number of applications, including quantifying the amount of speech that language-learning infants hear in their natural environments, as captured by daylong recordings made with devices worn by infants. To be applicable in a wide range of scenarios and also low-resource domains, WCE tools should be extremely robust against varying signal conditions and require minimal access to labeled training data in the target domain. For this purpose, earlier work has used automatic syllabification of speech, followed by a least-squares-mapping of syllables to word counts. This paper compares a number of previously proposed syllabifiers in the WCE task, including a supervised bi-directional long short-term memory (BLSTM) network that is trained on a language for which high quality syllable annotations are available (a “high resource language”), and reports how the alternative methods compare on different languages and signal conditions. We also explore additive noise and varying-channel data augmentation strategies for BLSTM training, and show how they improve performance in both matching and mismatching languages. Intriguingly, we also find that even though the BLSTM works on languages beyond its training data, the unsupervised algorithms can still outperform it in challenging signal conditions on novel languages.
  • Casillas, M., Bergelson, E., Warlaumont, A. S., Cristia, A., Soderstrom, M., VanDam, M., & Sloetjes, H. (2017). A New Workflow for Semi-automatized Annotations: Tests with Long-Form Naturalistic Recordings of Childrens Language Environments. In Proceedings of Interspeech 2017 (pp. 2098-2102). doi:10.21437/Interspeech.2017-1418.

    Abstract

    Interoperable annotation formats are fundamental to the utility, expansion, and sustainability of collective data repositories. In language development research, shared annotation schemes have been critical to facilitating the transition from raw acoustic data to searchable, structured corpora. Current schemes typically require comprehensive and manual annotation of utterance boundaries and orthographic speech content, with an additional, optional range of tags of interest. These schemes have been enormously successful for datasets on the scale of dozens of recording hours but are untenable for long-format recording corpora, which routinely contain hundreds to thousands of audio hours. Long-format corpora would benefit greatly from (semi-)automated analyses, both on the earliest steps of annotation—voice activity detection, utterance segmentation, and speaker diarization—as well as later steps—e.g., classification-based codes such as child-vs-adult-directed speech, and speech recognition to produce phonetic/orthographic representations. We present an annotation workflow specifically designed for long-format corpora which can be tailored by individual researchers and which interfaces with the current dominant scheme for short-format recordings. The workflow allows semi-automated annotation and analyses at higher linguistic levels. We give one example of how the workflow has been successfully implemented in a large cross-database project.
  • Casillas, M., & Frank, M. C. (2017). The development of children's ability to track and predict turn structure in conversation. Journal of Memory and Language, 92, 234-253. doi:10.1016/j.jml.2016.06.013.

    Abstract

    Children begin developing turn-taking skills in infancy but take several years to fluidly integrate their growing knowledge of language into their turn-taking behavior. In two eye-tracking experiments, we measured children’s anticipatory gaze to upcoming responders while controlling linguistic cues to turn structure. In Experiment 1, we showed English and non-English conversations to English-speaking adults and children. In Experiment 2, we phonetically controlled lexicosyntactic and prosodic cues in English-only speech. Children spontaneously made anticipatory gaze switches by age two and continued improving through age six. In both experiments, children and adults made more anticipatory switches after hearing questions. Consistent with prior findings on adult turn prediction, prosodic information alone did not increase children’s anticipatory gaze shifts. But, unlike prior work with adults, lexical information alone was not sufficient either—children’s performance was best overall with lexicosyntax and prosody together. Our findings support an account in which turn tracking and turn prediction emerge in infancy and then gradually become integrated with children’s online linguistic processing.
  • Casillas, M., Amatuni, A., Seidl, A., Soderstrom, M., Warlaumont, A., & Bergelson, E. (2017). What do Babies hear? Analyses of Child- and Adult-Directed Speech. In Proceedings of Interspeech 2017 (pp. 2093-2097). doi:10.21437/Interspeech.2017-1409.

    Abstract

    Child-directed speech is argued to facilitate language development, and is found cross-linguistically and cross-culturally to varying degrees. However, previous research has generally focused on short samples of child-caregiver interaction, often in the lab or with experimenters present. We test the generalizability of this phenomenon with an initial descriptive analysis of the speech heard by young children in a large, unique collection of naturalistic, daylong home recordings. Trained annotators coded automatically-detected adult speech 'utterances' from 61 homes across 4 North American cities, gathered from children (age 2-24 months) wearing audio recorders during a typical day. Coders marked the speaker gender (male/female) and intended addressee (child/adult), yielding 10,886 addressee and gender tags from 2,523 minutes of audio (cf. HB-CHAAC Interspeech ComParE challenge; Schuller et al., in press). Automated speaker-diarization (LENA) incorrectly gender-tagged 30% of male adult utterances, compared to manually-coded consensus. Furthermore, we find effects of SES and gender on child-directed and overall speech, increasing child-directed speech with child age, and interactions of speaker gender, child gender, and child age: female caretakers increased their child-directed speech more with age than male caretakers did, but only for male infants. Implications for language acquisition and existing classification algorithms are discussed.
  • Schuller, B., Steidl, S., Batliner, A., Bergelson, E., Krajewski, J., Janott, C., Amatuni, A., Casillas, M., Seidl, A., Soderstrom, M., Warlaumont, A. S., Hidalgo, G., Schnieder, S., Heiser, C., Hohenhorst, W., Herzog, M., Schmitt, M., Qian, K., Zhang, Y., Trigeorgis, G., Tzirakis, P., & Zafeiriou, S. (2017). The INTERSPEECH 2017 computational paralinguistics challenge: Addressee, cold & snoring. In Proceedings of Interspeech 2017 (pp. 3442-3446). doi:10.21437/Interspeech.2017-43.

    Abstract

    The INTERSPEECH 2017 Computational Paralinguistics Challenge addresses three different problems for the first time in research competition under well-defined conditions: In the Addressee sub-challenge, it has to be determined whether speech produced by an adult is directed towards another adult or towards a child; in the Cold sub-challenge, speech under cold has to be told apart from ‘healthy’ speech; and in the Snoring sub-challenge, four different types of snoring have to be classified. In this paper, we describe these sub-challenges, their conditions, and the baseline feature extraction and classifiers, which include data-learnt feature representations by end-to-end learning with convolutional and recurrent neural networks, and bag-of-audiowords for the first time in the challenge series.
  • Casillas, M., Bobb, S. C., & Clark, E. V. (2016). Turn taking, timing, and planning in early language acquisition. Journal of Child Language, 43, 1310-1337. doi:10.1017/S0305000915000689.

    Abstract

    Young children answer questions with longer delays than adults do, and they don't reach typical adult response times until several years later. We hypothesized that this prolonged pattern of delay in children's timing results from competing demands: to give an answer, children must understand a question while simultaneously planning and initiating their response. Even as children get older and more efficient in this process, the demands on them increase because their verbal responses become more complex. We analyzed conversational question-answer sequences between caregivers and their children from ages 1;8 to 3;5, finding that children (1) initiate simple answers more quickly than complex ones, (2) initiate simple answers quickly from an early age, and (3) initiate complex answers more quickly as they grow older. Our results suggest that children aim to respond quickly from the start, improving on earlier-acquired answer types while they begin to practice later-acquired, slower ones.

    Additional information

    S0305000915000689sup001.docx
  • Clark, E. V., & Casillas, M. (2016). First language acquisition. In K. Allen (Ed.), The Routledge Handbook of Linguistics (pp. 311-328). New York: Routledge.
  • Holler, J., Kendrick, K. H., Casillas, M., & Levinson, S. C. (Eds.). (2016). Turn-Taking in Human Communicative Interaction. Lausanne: Frontiers Media. doi:10.3389/978-2-88919-825-2.

    Abstract

    The core use of language is in face-to-face conversation. This is characterized by rapid turn-taking. This turn-taking poses a number of central puzzles for the psychology of language.

    Consider, for example, that in large corpora the gap between turns is on the order of 100 to 300 ms, but the latencies involved in language production require minimally between 600 ms (for a single word) and 1500 ms (for a simple sentence). This implies that participants in conversation are predicting the ends of the incoming turn and preparing in advance. But how is this done? What aspects of this prediction are done when? What happens when the prediction is wrong? What stops participants coming in too early? If the system is running on prediction, why is there consistently a mode of 100 to 300 ms in response time?

    The timing puzzle raises further puzzles: it seems that comprehension must run parallel with the preparation for production, but it has been presumed that there are strict cognitive limitations on more than one central process running at a time. How is this bottleneck overcome? Far from being 'easy' as some psychologists have suggested, conversation may be one of the most demanding cognitive tasks in our everyday lives. Further questions naturally arise: how do children learn to master this demanding task, and what is the developmental trajectory in this domain?

    Research shows that aspects of turn-taking such as its timing are remarkably stable across languages and cultures, but the word order of languages varies enormously. How then does prediction of the incoming turn work when the verb (often the informational nugget in a clause) is at the end? Conversely, how can production work fast enough in languages that have the verb at the beginning, thereby requiring early planning of the whole clause? What happens when one changes modality, as in sign languages -- with the loss of channel constraints is turn-taking much freer? And what about face-to-face communication amongst hearing individuals -- do gestures, gaze, and other body behaviors facilitate turn-taking? One can also ask the phylogenetic question: how did such a system evolve? There seem to be parallels (analogies) in duetting bird species, and in a variety of monkey species, but there is little evidence of anything like this among the great apes.

    All this constitutes a neglected set of problems at the heart of the psychology of language and of the language sciences. This research topic welcomes contributions from right across the board, for example from psycholinguists, developmental psychologists, students of dialogue and conversation analysis, linguists interested in the use of language, phoneticians, corpus analysts and comparative ethologists or psychologists. We welcome contributions of all sorts, for example original research papers, opinion pieces, and reviews of work in subfields that may not be fully understood in other subfields.
  • Casillas, M., De Vos, C., Crasborn, O., & Levinson, S. C. (2015). The perception of stroke-to-stroke turn boundaries in signed conversation. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings, & P. R. Maglio (Eds.), Proceedings of the 37th Annual Meeting of the Cognitive Science Society (CogSci 2015) (pp. 315-320). Austin, TX: Cognitive Science Society.

    Abstract

    Speaker transitions in conversation are often brief, with minimal vocal overlap. Signed languages appear to defy this pattern with frequent, long spans of simultaneous signing. But recent evidence suggests that turn boundaries in signed language may only include the content-bearing parts of the turn (from the first stroke to the last), and not all turn-related movement (from first preparation to final retraction). We tested whether signers were able to anticipate “stroke-to-stroke” turn boundaries with only minimal conversational context. We found that, indeed, signers anticipated turn boundaries at the ends of turn-final strokes. Signers often responded early, especially when the turn was long or contained multiple possible end points. Early responses for long turns were especially apparent for interrogatives—long interrogative turns showed much greater anticipation compared to short ones.
  • Holler, J., Kendrick, K. H., Casillas, M., & Levinson, S. C. (2015). Editorial: Turn-taking in human communicative interaction. Frontiers in Psychology, 6: 1919. doi:10.3389/fpsyg.2015.01919.
  • Lammertink, I., Casillas, M., Benders, T., Post, B., & Fikkert, P. (2015). Dutch and English toddlers' use of linguistic cues in predicting upcoming turn transitions. Frontiers in Psychology, 6: 495. doi:10.3389/fpsyg.2015.00495.
  • Arnon, I., Casillas, M., Kurumada, C., & Estigarribia, B. (Eds.). (2014). Language in interaction: Studies in honor of Eve V. Clark. Amsterdam: Benjamins.

    Abstract

    Understanding how communicative goals impact and drive the learning process has been a long-standing issue in the field of language acquisition. Recent years have seen renewed interest in the social and pragmatic aspects of language learning: the way interaction shapes what and how children learn. In this volume, we bring together researchers working on interaction in different domains to present a cohesive overview of ongoing interactional research. The studies address the diversity of the environments children learn in; the role of para-linguistic information; the pragmatic forces driving language learning; and the way communicative pressures impact language use and change. Using observational, empirical and computational findings, this volume highlights the effect of interpersonal communication on what children hear and what they learn. This anthology is inspired by and dedicated to Prof. Eve V. Clark – a pioneer in all matters related to language acquisition – and a major force in establishing interaction and communication as crucial aspects of language learning.
  • Casillas, M. (2014). Taking the floor on time: Delay and deferral in children’s turn taking. In I. Arnon, M. Casillas, C. Kurumada, & B. Estigarribia (Eds.), Language in Interaction: Studies in honor of Eve V. Clark (pp. 101-114). Amsterdam: Benjamins.

    Abstract

    A key part of learning to speak with others is figuring out when to start talking and how to hold the floor in conversation. For young children, the challenge of planning a linguistic response can slow down their response latencies, making misunderstanding, repair, and loss of the floor more likely. Like adults, children can mitigate their delays by using fillers (e.g., uh and um) at the start of their turns. In this chapter I analyze the onset and development of fillers in five children’s spontaneous speech from ages 1;6–3;6. My findings suggest that children start using fillers by 2;0, and use them to effectively mitigate delay in making a response.
  • Casillas, M. (2014). Turn-taking. In D. Matthews (Ed.), Pragmatic development in first language acquisition (pp. 53-70). Amsterdam: Benjamins.

    Abstract

    Conversation is a structured, joint action for which children need to learn a specialized set of skills and conventions. Because conversation is a primary source of linguistic input, we can better grasp how children become active agents in their own linguistic development by studying their acquisition of conversational skills. In this chapter I review research on children’s turn-taking. This fundamental skill of human interaction allows children to gain feedback, make clarifications, and test hypotheses at every stage of development. I broadly review children’s conversational experiences, the types of turn-based contingency they must acquire, how they ask and answer questions, and when they manage to make timely responses.
  • Casillas, M., & Frank, M. C. (2013). The development of predictive processes in children’s discourse understanding. In M. Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth (Eds.), Proceedings of the 35th Annual Meeting of the Cognitive Science Society (pp. 299-304). Austin, TX: Cognitive Science Society.

    Abstract

    We investigate children’s online predictive processing as it occurs naturally, in conversation. We showed 1–7 year-olds short videos of improvised conversation between puppets, controlling for available linguistic information through phonetic manipulation. Even one- and two-year-old children made accurate and spontaneous predictions about when a turn-switch would occur: they gazed at the upcoming speaker before they heard a response begin. This predictive skill relies on both lexical and prosodic information together, and is not tied to either type of information alone. We suggest that children integrate prosodic, lexical, and visual information to effectively predict upcoming linguistic material in conversation.
  • Sumner, M., Kurumada, C., Gafter, R., & Casillas, M. (2013). Phonetic variation and the recognition of words with pronunciation variants. In M. Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth (Eds.), Proceedings of the 35th Annual Meeting of the Cognitive Science Society (CogSci 2013) (pp. 3486-3492). Austin, TX: Cognitive Science Society.
  • Casillas, M., & Frank, M. C. (2012). Cues to turn boundary prediction in adults and preschoolers. In S. Brown-Schmidt, J. Ginzburg, & S. Larsson (Eds.), Proceedings of SemDial 2012 (SeineDial): The 16th Workshop on the Semantics and Pragmatics of Dialogue (pp. 61-69). Paris: Université Paris-Diderot.

    Abstract

    Conversational turns often proceed with very brief pauses between speakers. In order to maintain “no gap, no overlap” turn-taking, we must be able to anticipate when an ongoing utterance will end, tracking the current speaker for upcoming points of potential floor exchange. The precise set of cues that listeners use for turn-end boundary anticipation is not yet established. We used an eye-tracking paradigm to measure adults’ and children’s online turn processing as they watched videos of conversations in their native language (English) and a range of other languages they did not speak. Both adults and children anticipated speaker transitions effectively. In addition, we observed evidence of turn-boundary anticipation for questions even in languages that were unknown to participants, suggesting that listeners’ success in turn-end anticipation does not rely solely on lexical information.
  • Casillas, M., & Amaral, P. (2011). Learning cues to category membership: Patterns in children’s acquisition of hedges. In C. Cathcart, I.-H. Chen, G. Finley, S. Kang, C. S. Sandy, & E. Stickles (Eds.), Proceedings of the Berkeley Linguistics Society 37th Annual Meeting (pp. 33-45). Linguistic Society of America, eLanguage.

    Abstract

    When we think of children acquiring language, we often think of their acquisition of linguistic structure as separate from their acquisition of knowledge about the world. But it is clear that in the process of learning about language, children consult what they know about the world; and that in learning about the world, children use linguistic cues to discover how items are related to one another. This interaction between the acquisition of linguistic structure and the acquisition of category structure is especially clear in word learning.
