New tool automatically checks accuracy of speech transcriptions

11 March 2021
Speech researchers are often interested in intelligibility. This is typically measured by playing utterances to volunteers and asking them to type out what they hear. However, manually scoring those transcripts is a time-consuming task. Hans Rutger Bosker from the Max Planck Institute for Psycholinguistics has created a freely available tool for automatically checking speech transcripts. The tool is based on ‘fuzzy’ string matching, which measures how closely the spelling of a typed word matches that of the target word.

Do elderly people perceive speech differently from younger people? How well do people with cochlear implants perceive speech? How much do we understand from a target talker if there is a competing talker speaking in the background? Such questions are central to speech research and are usually tested using a similar type of experiment. Often, volunteers are presented with spoken utterances and then asked to type out what they hear. Researchers then measure the intelligibility of a given utterance by calculating the proportion of words correctly typed out from the target sentence. For instance, if the utterance is “The big blue house is for sale” and someone only types out “The house is for sale” (perhaps because the utterance was mixed with noise making it hard to hear), this would receive a score of 5 out of 7 = 71% correct. “Yet manually scoring these transcripts is a very time-consuming, resource-intensive, difficult, and boring task”, says Hans Rutger Bosker. Would a fast and cheap automatic tool based on ‘fuzzy’ string matching perform this complex task as well as humans?
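The proportion-correct score in the example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `proportion_correct` is hypothetical, and it assumes that a target word counts as correct when it appears anywhere in the typed response, ignoring capitalisation. Human scorers may apply different or more lenient rules.

```python
from collections import Counter


def proportion_correct(target: str, response: str) -> float:
    """Share of target words that also occur in the typed response.

    Assumption: exact spelling match, case-insensitive, and each typed
    word can be credited to at most one target word.
    """
    target_words = target.lower().split()
    available = Counter(response.lower().split())
    hits = 0
    for word in target_words:
        if available[word] > 0:
            available[word] -= 1  # use up one typed copy of this word
            hits += 1
    return hits / len(target_words)


score = proportion_correct("The big blue house is for sale",
                           "The house is for sale")
print(f"{score:.0%}")  # 5 of 7 target words found: 71%
```

Note that an exact-match rule like this gives no credit for a near miss such as a typo, which is the gap that fuzzy string matching is meant to fill.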

Token Sort Ratio

Fuzzy string matching measures how closely the typed-out words resemble the target words. Bosker applied this technique to speech transcripts containing over 50,000 words that had already been scored by humans. The fuzzy matching metric that most closely resembled the human scores was the ‘Token Sort Ratio’ (TSR), which showed the highest correlation with them (r = 0.94). Moreover, the TSR score also correlated strongly with acoustic markers of intelligibility. “Recordings which contained particular acoustic markers that we know make the speech more intelligible also showed high TSR scores”, explains Bosker. But while the human scorers had taken over 40 hours to score the entire dataset, the TSR was calculated in a matter of seconds.
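To give a rough idea of how a token sort ratio works, the sketch below splits both sentences into words, sorts the words alphabetically, and then compares the two sorted strings character by character. It is not Bosker's tool: the function name is hypothetical, and the similarity measure from Python's standard `difflib` module is used here as a stand-in, so the scores may differ from those of the published implementation.

```python
from difflib import SequenceMatcher


def token_sort_ratio(target: str, response: str) -> float:
    """Similarity score from 0 to 100 that ignores word order.

    Assumption: similarity is taken from difflib.SequenceMatcher,
    which may not match the metric used in the published tool.
    """
    def normalise(text: str) -> str:
        # Lowercase, split into words, sort them, and rejoin.
        return " ".join(sorted(text.lower().split()))

    a, b = normalise(target), normalise(response)
    return 100 * SequenceMatcher(None, a, b, autojunk=False).ratio()


target = "The big blue house is for sale"
print(token_sort_ratio(target, "The house is for sale"))           # words missing
print(token_sort_ratio(target, "The big blue hause is for sale"))  # one typo
print(token_sort_ratio(target, "sale for is house blue big the"))  # reordered
```

Because the words are sorted before comparison, a response with the right words in the wrong order still receives the maximum score, and a small spelling error lowers the score only slightly instead of marking the whole word as wrong.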

“The TSR score provides a practical, reliable, and valid tool, opening up new opportunities for large-scale speech research”, says Bosker. The TSR matches the spelling of words, which means that it can be used for any language. Bosker has made his tool publicly available for everyone to use. “The TSR makes a fine addition to the speech researcher’s toolbox, forming an important alternative for slow, expensive, and variable human-generated scores”, he concludes.

According to Bosker, the tool could also be used in the future for speech training and clinical applications. For instance, it offers the possibility of giving real-time feedback on listener performance during a hearing assessment.

“Instead of having to wait for an experimenter to score the performance, the computer can now score the typed out responses automatically. It may even adjust the test accordingly, for instance by making the next sentence harder to hear if performance on the previous sentence was high.”
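The kind of adjustment described in the quote above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the function name, the idea of controlling difficulty through the signal-to-noise ratio (SNR), the threshold of 80, and the step size of 2 dB are not taken from Bosker's tool. Making the next sentence easier after a low score is also an added assumption.

```python
def next_snr_db(current_snr_db: float, score: float,
                threshold: float = 80.0, step_db: float = 2.0) -> float:
    """Pick the noise level for the next sentence from the last score.

    Hypothetical rule: a score (0 to 100) at or above the threshold
    lowers the SNR, making the next sentence harder to hear; a lower
    score raises the SNR, making it easier.
    """
    if score >= threshold:
        return current_snr_db - step_db
    return current_snr_db + step_db


snr = 0.0
for score in [95.0, 90.0, 60.0]:  # made-up scores for three sentences
    snr = next_snr_db(snr, score)
    print(f"score {score:.0f} -> next sentence at {snr:+.0f} dB SNR")
```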

