Voice recognition software has a long history, but until recently it was significantly flawed. Now mobile phones and improved software are changing the world of speech recognition.
When we were kids, my friends and I used to play a game where we fantasized about which technologies from Star Trek were most likely to be real-world inventions within our lifetimes. The transporter and warp drive — not likely. But the communicator, the voice-commanded computer and the universal translator — very likely.
When speech recognition arrived on the computer desktop, it seemed like a great idea — but for most people, it wasn’t a replacement for the keyboard and mouse. Now speech recognition technology is being put to use in a whole new environment: phones. And its presence there is further driving its use and development in directions it might never have headed on the desktop.
Speech recognition first appeared as a primitive technology in the 1950s, as little more than a curiosity. In the early 1960s, IBM’s Shoebox device could recognize 16 spoken words and could respond to simple mathematical requests, such as “three plus four total.”
DragonDictate by Dragon Systems was probably the first speech-recognition program for the PC, released in the early 1980s for DOS computers. It could recognize only individual words, spoken one at a time. It evolved over time into the product Dragon NaturallySpeaking (now in Version 11 and owned by Nuance Communications), which can transcribe text spoken in a normal conversational voice and at a normal speed.
Speech recognition on the desktop had two big limitations. First, in order for the program to work with a high degree of accuracy, it had to be trained to recognize the speech patterns of the user. Windows Vista’s and Windows 7’s native speech-to-text technology, and third-party products like Dragon NaturallySpeaking, still require a user-training period to be useful.
The second limitation was the prevalence of the keyboard. Most people were already in the habit of typing, not talking, and so speech control faced the same uphill battle for adoption as the Dvorak keyboard layout. Why learn to use Dvorak when plain old QWERTY was readily available and worked fine?
Abhi Rele, senior product manager of Microsoft’s TellMe team, a group responsible for developing speech recognition technologies for multiple environments, concurs on this point: “In the desktop environment, users have easy access to other interaction modalities — namely, keyboard and mouse — and therefore the use of speech is primarily targeted towards speech enthusiasts.”
What speech-controlled computing needed for broader adoption was two things — better out-of-the-box usage and a venue where speech was already king, so to speak. One such venue has been on the rise for a long time: mobile phones.
Matt Revis, vice president of product management and marketing at Nuance, explains the differences between the desktop and mobile environments like this: “The desktop is a stationary environment focused entirely on desktop use cases, and so speech for the desktop follows that task flow: supporting office apps, Web browsing, communications, etc. In mobile, speech is more directed to supporting a variety of lifestyle scenarios: professionals on the go, out-and-about fun, hands-free [calling] and so on.”
Gartner analyst Tuong Nguyen agrees that voice makes more sense in a mobile context. “From a usage perspective,” he says, “the value of voice recognition on a handheld device is much greater. It adds a user-friendly, intuitive method of input.”
This is certainly true, Nguyen adds, if the alternative to speaking a simple declarative statement is to dig down through a slew of menus or struggle with tiny on-screen keyboards: “With the growing adoption of touch-only devices (no physical keys), voice recognition is used to enhance data entry/input. It also supports hands-free requirements or legislation.”
Making it work
Speech recognition works by making statistical models of spoken language. “To recognize spoken words,” says Google product manager Amir Mane, “we compare the input speech to a statistical model of the language and try to find the closest match — the system’s best guess at what the user said.”
Statistical models of a language require a great deal of storage to be practical. “[They] must cover all of the fundamental sounds of the language (phonemes), all of the words, and all of the different ways that the words can be strung together in the spoken language,” Mane says. On top of that, there are accents, variations in sex and age, regional pronunciations, word choices (“soda” vs. “cola” vs. “pop”) and so on.
Mane notes that Google Voice Search’s statistical model requires three elements: acoustic models, language models and a lexicon. “An acoustic model is created by taking audio recordings of speech and the transcriptions of what was said, and using the two to create a representation of the phones — the basic components of all words in a given language,” he says.
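To make the role of the lexicon concrete, here is a minimal Python sketch of matching a sequence of heard phones against stored pronunciations. It is a toy under loud assumptions: the phone symbols, the three-word lexicon and the closest_word helper are all invented for illustration, and plain sequence similarity stands in for the probabilistic acoustic scoring that a production recognizer such as Google’s would use.

```python
from difflib import SequenceMatcher

# Toy lexicon: word -> phone sequence. The symbols are simplified and assumed,
# not taken from any real recognizer.
LEXICON = {
    "soda": ["S", "OW", "D", "AH"],
    "cola": ["K", "OW", "L", "AH"],
    "pop": ["P", "AA", "P"],
}


def closest_word(heard_phones, lexicon):
    """Return the lexicon word whose pronunciation best matches the heard phones."""
    def similarity(word):
        # ratio() is 1.0 for identical sequences and 0.0 for nothing in common.
        return SequenceMatcher(None, heard_phones, lexicon[word]).ratio()

    return max(lexicon, key=similarity)


# A slightly misheard "soda": the final vowel came out differently.
heard = ["S", "OW", "D", "ER"]
print(closest_word(heard, LEXICON))
```

Even with one phone wrong, the stored pronunciation of “soda” remains the closest match, which is the “best guess” behavior Mane describes.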
The language model involves figuring out what words are likely to follow other words, and using that as a way to improve recognition accuracy. “The word ‘empire’ will be followed by the words ‘state’ or ‘strike’ [as in The Empire Strikes Back] more often than it is followed by the words ‘diverse’ or ‘guava,’ ” Mane explains. Collecting data from the field helps continuously improve the language model and the lexicon.
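The idea of counting which words tend to follow which can be sketched in a few lines of Python. This is a simple bigram model with add-one smoothing, chosen for illustration; the source does not say which technique Google uses, and the four-sentence corpus and function names below are invented.

```python
from collections import Counter, defaultdict

# Toy training text, invented for illustration. A real system learns from
# vastly more data collected in the field.
CORPUS = [
    "the empire state building",
    "the empire strikes back",
    "the empire state of mind",
    "a diverse empire",
]


def train_bigrams(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(nxt | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + alpha) / (total + alpha * vocab_size)


counts = train_bigrams(CORPUS)
vocab = {word for sentence in CORPUS for word in sentence.split()}

for candidate in ("state", "strikes", "guava"):
    p = next_word_probability(counts, "empire", candidate, len(vocab))
    print(f"P({candidate!r} | 'empire') = {p:.3f}")
```

In this toy corpus “state” scores highest after “empire,” “strikes” comes next and “guava” trails, so a recognizer unsure of what it heard can lean toward the likelier continuation.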
Google isn’t the only company crowdsourcing its recognition data. Speech-recognition app Vlingo puts cookies on users’ phones to continuously build speech models based on users’ own feedback, combined with models based on similar speakers.
On mobile devices
Because mobile devices have typically sported only a fraction of the storage and processing power of a desktop computer, speech processing has taken a while to appear on phones in anything more than a rudimentary form.
The Springer Handbook of Speech Processing describes how phones in the early 2000s, despite their constraints, could be programmed to recognize voices for dialing digit-by-digit, and to some extent to recognize names. The main issue was memory, so most of these phones could recognize only up to 10 or so names at a time. But another problem cited by the authors was relatively little usage of the feature, possibly due to poor marketing on the part of handset makers.
As memory and processing power increased, so did the recognition ability of the average phone. The Samsung SCH-p-207, released in 2005 for $99, added speech-to-text dictation as well as voice-activated dialing. The current generation of smartphones, with memory that runs into the hundreds of megabytes as well as gigabytes of Flash-based storage, is much less constrained.
Another key advance has been the speed of the network. The rising tide of faster wireless networks has raised a great many boats, including the most recent generation of speech-processing technologies, by making it possible to offload the work onto a remote server.
Amir Mane, product manager for Google Voice Search, explains how this has helped Google’s voice apps. “Since all the heavy lifting in terms of processing is done in the network [by Google's servers],” he says, “we were less susceptible to the limitations on the computing power of the handheld device.”
The current state of the art for voice recognition in phones lends itself to a lot more than just voice dialing.
Voice-activated functions actually do include voice dialing, one of the first such features to appear on phones. Even many basic, low-end cell phones have this today; my Nokia flip phone, vintage 2007 or so, had it — although its recognition was a little dodgy for some of the more unusual names in its phone book.
Gartner’s Nguyen notes that the newer breed of voice functions is more open-ended. “Instead of programming specific voice commands to functions,” he explains, “the application recognizes the speech and executes the appropriate action. Higher-end, more robust devices have made using these applications more viable.” In other words, instead of only being able to use the phrase “call 888-555-1212” to bring up a phone number, users can say “dial Mom” or “phone my mother.”
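A rough Python sketch shows what mapping several phrasings onto a single action might look like once the words have been recognized. The verb list, the contact aliases and the parse_call_command function are hypothetical, and real assistants rely on far richer language understanding than this lookup.

```python
import re

# Assumed synonyms for "place a call" and hypothetical address-book aliases.
CALL_VERBS = {"call", "dial", "phone", "ring"}
CONTACT_ALIASES = {"mom": "Mom", "my mother": "Mom"}


def parse_call_command(utterance):
    """Return ('call', target) if the text looks like a call request, else None."""
    words = utterance.lower().strip().split(maxsplit=1)
    if len(words) < 2 or words[0] not in CALL_VERBS:
        return None
    target = words[1]
    if re.fullmatch(r"[\d\- ]+", target):
        return ("call", target)  # a spoken phone number is passed through as-is
    # Known aliases resolve to a contact name; anything else is kept verbatim.
    return ("call", CONTACT_ALIASES.get(target, target))


for phrase in ("call 888-555-1212", "dial Mom", "phone my mother", "play music"):
    print(phrase, "->", parse_call_command(phrase))
```

Both “dial Mom” and “phone my mother” resolve to the same contact, while an unrelated phrase is rejected rather than forced into a call.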
This makes voice-driven apps such as Google Voice Search more practical. For example, if you say “Tron Legacy movie times,” you’re taken to a page that lists screenings by ZIP code or location — the app not only recognizes the context of the phrase, but can pull information from both your phone (your current location) and the Web (screening times).
The app also has enough savvy about English to automatically make certain distinctions without training. If I say “Mötley Crüe band,” the program gets it right — it even uses the band’s idiosyncratic spelling in the search term itself, although it leaves out the umlauts. Search on “Motley’s Crew,” and you get the comic strip.
That said, the limits of Google’s voice recognition become evident the further you stray from mainstream English. Foreign names are just about hopeless. Another consistent problem for speech recognition apps is the presence of ambient noise, which affects mobile users more often than desktop users. Nuance’s Revis cites “high-recognition accuracy in noisy outdoor environments” as an ongoing issue.
Dictation has come a long way since that 2005 Samsung phone. The iPhone’s Dragon Dictation app, which is powered by Dragon NaturallySpeaking, allows the user to dictate everything from memos and e-mails to Twitter updates. Dragon for Email offers similar capabilities for the BlackBerry.
For Android phones, Nuance offers FlexT9, which combines Dragon Dictation features with three types of touch-based input. There’s also the Handcent SMS app, which integrates with Android’s native speech-recognition technology to help you send text messages by voice.
Translation has been available text-to-text for years now (for example, via the well-known Babel Fish Web site). Translate-as-you-speak isn’t quite here, but it’s come a lot closer. For example, Jibbigo for the iPhone translates words, phrases and reasonably simple sentences, allowing two parties to speak alternately.
Ask almost anyone involved in engineering speech technologies what the next big step is, and they will typically give you one answer: natural-language processing.
Revis describes this as “systems that understand what you mean, not just what you say — ‘conversational’ interaction models where users speak what they want without any constraints on how they say it.” He gives as examples commands or requests for information such as “Where can I find a Nikon camera for under $100?” or “Text Jenny that I’m going to be 20 minutes late” or “Make reservations for three people at Morton’s tonight.”
“Offering natural-language processing in a spoken dialog is a double challenge,” Google’s Mane says. “First you have to recognize the words, then you have to extract the meaning.” The first part is becoming easier, but the second is still deeply elusive: Meaning is contextual and slippery, and not always successfully parsed by humans, either.
Microsoft’s Rele thinks that the additional services provided by a phone (such as a compass or GPS) can augment the usefulness of natural-language processing. So you could, he says, “plan dinner and a movie for two people by breaking down the task to use data from various sources, such as calendars, restaurant ratings, movie reviews and location.”
In addition, the phone’s services can be used to provide context for speech. “Spoken input from the user, along with intelligence gained from other modalities and sensors about users and their surroundings, can provide richer and more relevant results,” Rele says. If you’ve just used Foursquare to check in at a restaurant, for instance, the bias for ambiguous voice commands could be tilted toward things like dining out, booking reservations, getting a cab and so on.
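The tilting described here can be pictured as rescoring the recognizer’s candidate transcripts. The Python sketch below is only an illustration: the candidate phrases, topic labels, scores and the 1.5 boost are made up, and nothing in the source says Microsoft or anyone else does it this way.

```python
def rerank(hypotheses, context_boosts):
    """Sort (text, topic, score) hypotheses by score times any boost for the topic."""
    rescored = [
        (text, score * context_boosts.get(topic, 1.0))
        for text, topic, score in hypotheses
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical recognizer output for one ambiguous utterance.
hypotheses = [
    ("book a table", "dining", 0.40),
    ("book a cable", "shopping", 0.45),
]

# Assumed boost applied after the user checks in at a restaurant.
restaurant_context = {"dining": 1.5}

best_without_context = rerank(hypotheses, {})[0][0]
best_with_context = rerank(hypotheses, restaurant_context)[0][0]
print(best_without_context)
print(best_with_context)
```

Without context, the slightly higher-scoring “book a cable” wins. With the restaurant check-in taken into account, “book a table” moves to the top.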
The multiplatform app Vlingo, which bills itself as a “virtual assistant,” already offers some functionality along these lines. It plugs into services like OpenTable and Fandango to accomplish much of what it offers: making restaurant reservations, booking movie tickets, and so on.
Games are another area where Nguyen expects voice recognition to play a growing role. “[Speech] can be used in gaming to add a different dimension to game play,” he says. So, for example, you could deliver orders Capt. Kirk-style to starships, or interrogate suspects in a mystery.
Is it you?
Another feature that is already being implemented is automatically tailoring recognition to the individual user. This is a hands-off version of the voice training required by desktop voice recognition.
For example, the latest iteration of Google Voice Search has an opt-in function that allows a custom speech profile to be built over time for the user. “When a user ‘opts in’ to use personalized recognition,” Mane explains, “we keep a link between them and their utterances, which allows us to build the first rudimentary, personalized recognition models.”
Personalized recognition isn’t meant to be a silver bullet, though — merely a transitional step toward making voice recognition more seamless. “We view personalized recognition not as a single solution, but as a series of innovations that are going to come,” says Mane, who also believes that future improvements of this type “may require more active involvement of our users.”
Cell phones have been a remarkable incubator and driver for many technologies, both hardware- and software-based. So far, adding speech to the mix has resulted in only incremental improvements, such as the fine out-of-the-box performance of Google’s voice apps.
But those improvements are gradually paving the way to more important advances, and mobile technology provides a whole new arena in which those technologies can be combined. The next step may not be a phone that understands everything you say, but one that understands enough to be a lot more useful.