Looking back on the development of speech recognition technology is like watching a child grow up, progressing from the baby-talk level of recognizing single syllables, to building a vocabulary of thousands of words, to answering questions with quick, witty replies, as Apple’s supersmart virtual assistant Siri does.
Listening to Siri, with its slightly snarky sense of humor, made us wonder how far speech recognition has come over the years. Here’s a look at the developments in past decades that have made it possible for people to control devices using only their voice.
1950s and 1960s: Baby Talk
The first speech recognition systems could understand only digits. (Given the complexity of human language, it makes sense that inventors and engineers first focused on numbers.) In 1952, Bell Laboratories designed the “Audrey” system, which recognized digits spoken by a single voice. Ten years later, IBM demonstrated its “Shoebox” machine at the 1962 World’s Fair; it could understand 16 words spoken in English.
Labs in the United States, Japan, England, and the Soviet Union developed other hardware dedicated to recognizing spoken sounds, expanding speech recognition technology to support four vowels and nine consonants.
These first efforts may not sound like much, but they were an impressive start, especially when you consider how primitive computers themselves were at the time.
1970s: Speech Recognition Takes Off
Speech recognition technology made major strides in the 1970s, thanks to interest and funding from the U.S. Department of Defense. The DoD’s DARPA Speech Understanding Research (SUR) program, from 1971 to 1976, was one of the largest of its kind in the history of speech recognition, and among other things it was responsible for Carnegie Mellon’s “Harpy” speech-understanding system. Harpy could understand 1011 words, approximately the vocabulary of an average three-year-old.
Harpy was significant because it introduced a more efficient search approach, called beam search, to “prove the finite-state network of possible sentences,” according to Readings in Speech Recognition by Alex Waibel and Kai-Fu Lee. (The story of speech recognition is very much tied to advances in search methodology and technology, as Google’s entrance into speech recognition on mobile devices proved just a few years ago.)
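To get a feel for what beam search does, here is a minimal Python sketch. It is not Harpy’s actual implementation: the tiny word network, its probabilities, and the names NETWORK and beam_search are all made up for illustration, and a real recognizer would score each step against the incoming audio rather than use fixed numbers. The idea it shows is simply that, at every step through the network, only the few most promising partial sentences are kept and the rest are thrown away.

```python
import math

# Hypothetical toy network (an assumption, not Harpy's data):
# each state maps to a list of (next_state, word, probability).
NETWORK = {
    "START": [("S1", "call", 0.6), ("S1", "tall", 0.4)],
    "S1": [("S2", "home", 0.7), ("S2", "hum", 0.3)],
    "S2": [("END", "now", 0.9), ("END", "no", 0.1)],
}


def beam_search(network, start="START", end="END", beam_width=2):
    """Return (words, probability) of the best sentence found, or None."""
    # Each hypothesis is (log_probability, state, words_so_far).
    beam = [(0.0, start, [])]
    finished = []
    while beam:
        candidates = []
        for log_p, state, words in beam:
            for next_state, word, prob in network.get(state, []):
                hyp = (log_p + math.log(prob), next_state, words + [word])
                if next_state == end:
                    finished.append(hyp)
                else:
                    candidates.append(hyp)
        # Prune: keep only the beam_width best partial sentences.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]
    if not finished:
        return None
    best = max(finished, key=lambda h: h[0])
    return best[2], math.exp(best[0])


if __name__ == "__main__":
    print(beam_search(NETWORK, beam_width=2))
```

A narrower beam is faster but can discard a sentence that would have turned out best later on; a wider beam is safer but does more work. That trade-off is what makes the approach practical for large networks.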
The ’70s also marked a few other important milestones in speech recognition technology, including the founding of the first commercial speech recognition company, Threshold Technology, as well as Bell Laboratories’ introduction of a system that could interpret multiple people’s voices.
1980s: Speech Recognition Turns Toward Prediction
Over the next decade, thanks to new approaches to understanding what people say, speech recognition vocabulary jumped from a few hundred words to several thousand words, and had the potential to recognize an unlimited number of words. One major reason was a new statistical method known as the hidden Markov model (HMM).
Rather than simply using templates for words and looking for sound patterns, HMM considered the probability of unknown sounds’ being words. This foundation would be in place for the next two decades (see Automatic Speech Recognition - A Brief History of the Technology Development by B.H. Juang and Lawrence R. Rabiner).
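The probabilistic idea can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the two word models (“yes” and “no”), their states, the sound labels such as "N" and "OH", and all of the probabilities are invented, and real systems of the era worked on acoustic features rather than tidy symbols. The sketch uses the standard forward algorithm to compute how likely a sequence of sounds is under each word’s model, then picks the most likely word.

```python
def forward_likelihood(observations, start_p, trans_p, emit_p):
    """Return P(observations | model) using the forward algorithm."""
    states = list(start_p)
    # Probability of starting in each state and emitting the first sound.
    alpha = {
        s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states
    }
    for obs in observations[1:]:
        # Sum over every way of arriving in state s, then emit the sound.
        alpha = {
            s: sum(alpha[prev] * trans_p[prev].get(s, 0.0) for prev in states)
            * emit_p[s].get(obs, 0.0)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical word models: (start, transition, emission) probabilities.
WORD_MODELS = {
    "yes": (
        {"y": 1.0, "es": 0.0},
        {"y": {"y": 0.5, "es": 0.5}, "es": {"es": 1.0}},
        {"y": {"Y": 0.8, "N": 0.2}, "es": {"EH": 0.7, "OH": 0.3}},
    ),
    "no": (
        {"n": 1.0, "o": 0.0},
        {"n": {"n": 0.5, "o": 0.5}, "o": {"o": 1.0}},
        {"n": {"N": 0.8, "Y": 0.2}, "o": {"OH": 0.7, "EH": 0.3}},
    ),
}


def recognize(observations, word_models):
    """Return (best_word, scores) for a sequence of sound labels."""
    scores = {
        word: forward_likelihood(observations, *model)
        for word, model in word_models.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(recognize(["N", "OH", "OH"], WORD_MODELS))
```

Because every word gets a score rather than a yes-or-no template match, a slightly unusual pronunciation can still be recognized as long as its word remains the most probable candidate.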
Equipped with this expanded vocabulary, speech recognition started to work its way into commercial applications for business and specialized industry (for instance, medical use). It even entered the home, in the form of Worlds of Wonder’s Julie doll (1987), which children could train to respond to their voice. (“Finally, the doll that understands you.”)
However, whether speech recognition software at the time could recognize 1000 words, as the 1985 Kurzweil speech-to-text program did, or whether it could support a 5000-word vocabulary, as IBM’s system did, a significant hurdle remained: These programs took discrete dictation, so you had … to … pause … after … each … and … every … word.
1990s: Automatic Speech Recognition Comes to the Masses
In the ’90s, computers with faster processors finally arrived, and speech recognition software became viable for ordinary people.
In 1990, Dragon launched the first consumer speech recognition product, Dragon Dictate, for an incredible price of $9000. Seven years later, the much-improved Dragon NaturallySpeaking arrived. The application recognized continuous speech, so you could speak, well, naturally, at about 100 words per minute. However, you had to train the program for 45 minutes, and it was still expensive at $695.
The advent of the first voice portal, VAL from BellSouth, was in 1996; VAL was a dial-in interactive voice recognition system that was supposed to give you information based on what you said on the phone. VAL paved the way for all the inaccurate voice-activated menus that would plague callers for the next 15 years and beyond.
2000s: Speech Recognition Plateaus, Until Google Comes Along
By 2001, computer speech recognition had topped out at 80 percent accuracy, and, near the end of the decade, the technology’s progress seemed to be stalled. Recognition systems did well when the language universe was limited, but they were still “guessing,” with the assistance of statistical models, among similar-sounding words, and the known language universe continued to grow as the Internet grew.
Did you know speech recognition and voice commands were built into Windows Vista and Mac OS X? Many computer users weren’t aware that those features existed. Windows Speech Recognition and OS X’s voice commands were interesting, but not as accurate or as easy to use as a plain old keyboard and mouse.
Speech recognition technology development began to edge back into the forefront with one major event: the arrival of the Google Voice Search app for the iPhone. The impact of Google’s app is significant for two reasons. First, cell phones and other mobile devices are ideal vehicles for speech recognition, as the desire to replace their tiny on-screen keyboards serves as an incentive to develop better, alternative input methods. Second, Google had the ability to offload the processing for its app to its cloud data centers, harnessing all that computing power to perform the large-scale data analysis necessary to make matches between the user’s words and the enormous number of human-speech examples it gathered.
In short, the bottleneck with speech recognition has always been the availability of data, and the ability to process it efficiently. Google’s app adds, to its analysis, the data from billions of search queries, to better predict what you’re probably saying.
In 2010, Google added “personalized recognition” to Voice Search on Android phones, so that the software could record users’ voice searches and produce a more accurate speech model. The company also added Voice Search to its Chrome browser in mid-2011. Remember how we started with 10 to 100 words, and then graduated to a few thousand? Google’s English Voice Search system now incorporates 230 billion words from actual user queries.
And now along comes Siri. Like Google’s Voice Search, Siri relies on cloud-based processing. It draws on what it knows about you to generate a contextual reply, and it responds to your voice input with personality. (As my PCWorld colleague David Daw points out: “It’s not just fun but funny. When you ask Siri the meaning of life, it tells you ‘42’ or ‘All evidence to date points to chocolate.’ If you tell it you want to hide a body, it helpfully volunteers nearby dumps and metal foundries.”)
Speech recognition has gone from utility to entertainment. The child seems all grown up.
The Future: Accurate, Ubiquitous Speech
The explosion of voice recognition apps indicates that speech recognition’s time has come, and that you can expect plenty more apps in the future. These apps will not only let you control your PC by voice or convert voice to text; they’ll also support multiple languages, offer assorted speaker voices for you to choose from, and integrate into every part of your mobile devices (that is, they’ll overcome Siri’s shortcomings).
The quality of speech recognition apps will improve, too. For instance, Sensory’s Trulyhandsfree Voice Control can hear and understand you, even in noisy environments.
As everyone starts becoming more comfortable speaking aloud to their mobile gadgets, speech recognition technology will likely spill over into other types of devices. It isn’t hard to imagine a near future when we’ll be commanding our coffee makers, talking to our printers, and telling the lights to turn themselves off.