Computational speech recognition: a complex, resource-intensive task with great potential
With the introduction of Siri, the personal assistant on Apple’s iPhone, speech recognition technology went mainstream. The ease with which the calm electronic voice attempts to answer her owner’s every question in no way reflects the complexity of the technology or the scientific advances behind it. What appears to be a simple interaction lasting a few seconds is actually a multistep process of transmission, encoding, relay, statistical modeling and interpretation.
Scores of researchers have worked for years to develop and improve speech recognition models, including Yunxin Zhao, a computer science professor in the University of Missouri’s College of Engineering.
Potential applications are endless for robust speech recognition software. Any and all processes that can benefit from hands-free communication are up for grabs.
The current speech recognition research of Zhao’s group focuses on robust modeling of the statistical distributions of phonetic sound data, known as acoustic modeling, and on the integration of powerful word prediction models, known as language models, to improve recognition accuracy. Both projects have been funded by the National Science Foundation (NSF). The latter is a recent collaborative effort with Shaojun Wang, a faculty researcher at Wright State University in Ohio. The computationally demanding research aims to construct large-scale distributed language models that incorporate human speech variables such as vocabulary, sentence structure and meaning, and to use these models for speech recognition and machine translation.
“Speech itself has a great deal of variability. There are many nuances depending on age, where the speaker is from and who they are talking to, as well as the acoustic environment in which they are speaking,” Zhao said. “Success of the acoustic modeling research relies on this data — the way someone talks and the words they use. It also is dependent on the models that are used to describe distributions of the data and training methods that are employed to learn the models from the data.”
“The syntactic nature of language can be used to predict words,” Zhao said of the language modeling task. “If you know the topic it’s easier to predict what words the speaker will use.
“It is a complex task requiring lots of resources,” said Zhao, adding that the project’s success relies on large amounts of textual data, just as acoustic modeling relies on large amounts of voice data.
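The idea of predicting words from context can be illustrated with a toy bigram model, which estimates how likely each word is given only the word before it. This is a minimal sketch and not the large-scale distributed language model Zhao’s group is building; the function names and the three-sentence corpus are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is an assumed start-of-sentence marker.
        words = ["<s>"] + sentence.lower().split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def next_word_probabilities(counts, prev):
    """Return P(word | prev) as relative frequencies (no smoothing)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.items()}


# Hypothetical toy corpus; a real model needs very large amounts of text.
toy_corpus = [
    "the doctor sees the patient",
    "the doctor reads the chart",
    "the patient calls the doctor",
]
model = train_bigram_model(toy_corpus)
probs = next_word_probabilities(model, "the")
best_guess = max(probs, key=probs.get)
```

Even in this tiny example the counts favor one continuation over the others, which hints at why real language models need so much text and so many parameters: every additional word of context multiplies the number of combinations to be estimated.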
The sheer size of the model is a challenge because it has a huge number of parameters. Zhao said incorporating the new language model into speech recognition is more difficult both algorithmically and computationally. Even the computing power of the Ohio Supercomputer Center is not sufficient to handle it.
In related research into the robustness of speech recognition, Zhao’s group is working to enhance speech corrupted by noise and to separate the mixed speech of multiple talkers. The result is a new approach to speech enhancement.
Referring to the scientific community’s understanding of the two basic measurements of speech, phase and magnitude, Zhao said that for a long time researchers thought people heard only the magnitude, so the phase was ignored.
“It was believed phase could not be enhanced. But we have come up with a method to enhance it,” she said, adding that phase information is also useful for separating overlapping voices when people are conversing in an environment like a social event.
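For readers unfamiliar with the two measurements, the sketch below splits a short signal into magnitude and phase using a discrete Fourier transform, then rebuilds it with and without the phase. It only illustrates what the two quantities are and is not the enhancement method developed by Zhao’s group; the 16-sample test tone and the helper names `dft` and `idft` are assumptions made for the example.

```python
import cmath
import math


def dft(samples):
    """Plain discrete Fourier transform of a list of real samples."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform; returns the real part of each output sample."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
        for t in range(n)
    ]


# Assumed toy frame: one tone (2 cycles per frame) shifted by pi/4.
n = 16
frame = [math.cos(2 * math.pi * 2 * t / n + math.pi / 4) for t in range(n)]

spectrum = dft(frame)
magnitude = [abs(c) for c in spectrum]        # how strong each frequency is
phase = [cmath.phase(c) for c in spectrum]    # where each frequency sits in time

# Magnitude and phase together recover the original frame.
rebuilt = idft([cmath.rect(m, p) for m, p in zip(magnitude, phase)])

# Keeping the magnitude but zeroing the phase yields a different waveform.
zero_phase = idft([complex(m, 0.0) for m in magnitude])
```

Here `rebuilt` matches `frame`, while `zero_phase` has the same frequency content but a shifted waveform, which shows that the phase carries information the magnitude alone does not.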
Discovering fruitful applications of speech recognition technology is also an important aspect of research. Zhao was the principal investigator for a National Institutes of Health-funded telemedicine research project.
“We developed a system so that when a doctor and patient are teleconferencing, the doctor’s speech is turned into captions by our speech recognition system,” Zhao said. “This project still impacts our work and we plan to continue along those lines in the future.”
“At the bottom level, our work is machine learning, pattern recognition and signal processing,” said Zhao.
Beyond improving speech recognition techniques, her group’s research can be extended to solve problems in other application domains to benefit society.