Engineers have been working on voice recognition for over 40 years, and it has always been a bit sci-fi – but has its time finally come?
Around 90 per cent of face-to-face communication is commonly believed to consist of body language, something you only notice when you have to rely on the phone or email instead. This explains why, as soon as the telephone was invented, the protocol of the 'telephone voice' was adopted. But a telephone conversation is still preferable to an email if you really want to convey the meaning of a message. There are subtle gradations in how we say things in business and in our personal lives.
Thus it has always been a challenge for computers to recognise the human voice and the meaning in a sentence. The pioneering Dragon Speech Recognition software was invented in 1975 and first appeared on PCs for Microsoft DOS a few years later. The technology has advanced steadily in the intervening period and the software has changed ownership several times, most recently acquired by Nuance.
"Voice recognition technology is an aggressive environment," explains John West, solutions architect for Nuance at the company's Cambridge R&D centre. West explains that the technology is being widely deployed in new generations of smart TVs, smart phones and cars.
With the vast increase in computing power, and the relatively recent growth of mobile communication, there has been renewed interest in voice and speech recognition technology.
Technology giant Google sees voice recognition as an important component of its trademark search-engine technology. Google and Nuance have grown to dominate the development of the technology, in part by acquiring other companies that research and develop new voice-recognition technologies.
Google's voice-search products are designed almost wholly to work with Google's own software, whereas Nuance's Dragon software is a third-party voice-recognition engine that it sells to a number of different companies. In fact, it is hard to escape their technology.
Voice recognition everywhere
Voice recognition is an important component of both Android and iOS operating systems. Apple purchased an intelligence agent called Siri, which uses Nuance's voice-recognition technology in tandem with Wolfram Alpha and other syntactical algorithms.
"We started working with Siri when it was an independent company. Siri is an intelligence agent and therefore voice recognition forms only part of the proposition," explains West. "The company uses a range of proprietary and open-source computation knowledge processors.
Google Voice Search is currently integrated into the Google Chrome browser and the Google Chrome OS and, strategically, is an important part of Google's Android OS. Since October 2012, Google Voice Search has also been available on Apple mobile devices and can deliver results faster because the voice is processed on the client side and the result is then sent to Google's servers.
The technology is also built into most in-vehicle voice and communication systems. Carmaker Ford uses Nuance voice-recognition technology in its entertainment and communications 'Sync' system. It is also integrated into GM's MyLink system. Other manufacturers have also created in-vehicle voice and communication systems whose voice-recognition component is predominantly supplied by Nuance.
Yet Ford has been stung by criticism of its user interface. Castigated by third-party quality reports because of the difficulty of using its touch-screen multimedia system, the manufacturer has reintroduced tuning and volume knobs for the radio as it redesigns existing models, Raj Nair, Ford's global product-development chief, told the Wall Street Journal.
Ford was first with the integration of voice recognition and touch screens in a meaningful way in its vehicles. But it has faced continued complaints about voice-activated controls.
Contact centres have been using voice recognition systems for over a decade. However, until recently their use has been limited. But with customer satisfaction being high on the agenda for many companies, improving the ability for their customers to reach the right person as quickly as possible on the phone is a priority.
A survey carried out by Call Centre magazine suggested that customers are increasingly giving the technology a thumbs down. According to the survey, only 18 per cent of contact centres plan to implement speech recognition systems. The big problem seems to be the overall accuracy – particularly with regional accents.
But some companies such as Aviva, according to West, have managed to collapse the options and hierarchies facing customers who want to reach the right person at the right time, which means that customers are able to focus their enquiry more quickly.
"With Aviva, you may ring up and say 'I want to pay my credit card bill'. We map that to an action. There could be as many as 5,000 possible actions before we present that to a human," explains West. "But we've tried to go further. We've tried to automate haggling applications."
Haggling would require a great deal of interpreting the meaning of what the customer is saying. Nuance is one of a number of companies also looking into improving knowledge engines to improve accuracy. But that often means picking up on human errors such as consumers mistaking direct debits for standing orders.
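The utterance-to-action mapping West describes for contact centres can be illustrated with a minimal sketch. It assumes the caller's speech has already been transcribed to text, and it uses simple keyword overlap in place of whatever language understanding a production system would use. The action names, keyword sets and the `route` function are hypothetical and are not taken from Nuance's or Aviva's systems.

```python
import re

# Hypothetical actions and keywords, invented for illustration only.
ACTIONS = {
    "pay_credit_card_bill": {"pay", "credit", "card", "bill"},
    "report_lost_card": {"lost", "stolen", "card"},
    "check_balance": {"balance", "account"},
}


def tokenize(utterance):
    """Lower-case the transcribed utterance and split it into a set of words."""
    return set(re.findall(r"[a-z']+", utterance.lower()))


def route(utterance, min_overlap=2):
    """Pick the action whose keywords overlap most with the utterance.

    If no action shares at least `min_overlap` words with the utterance,
    the call is handed to a human agent.
    """
    words = tokenize(utterance)
    best_action, best_score = None, 0
    for action, keywords in ACTIONS.items():
        score = len(words & keywords)
        if score > best_score:
            best_action, best_score = action, score
    if best_score < min_overlap:
        return "transfer_to_human"
    return best_action


print(route("I want to pay my credit card bill"))
```

A real deployment with thousands of possible actions would need far more than keyword matching, but the shape is the same: score the utterance against each candidate action and fall back to a person when nothing matches well enough.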
Smart TVs are becoming increasingly complicated to operate. Despite this, we still use the same type of remote control technology that was invented over 40 years ago. The industry has been looking for other types of human interface. There has been some dalliance with gesture control, but it appears that Samsung, LG and Panasonic are opting for voice control as the standard interface. For example, through enhanced voice control capabilities, the Samsung F8000 Smart TV is now able to understand more than 300 commands with much better language recognition rates.
Television is another aggressive environment, points out West, who suggests that because Nuance's technology is being integrated by leading TV players such as Samsung, LG and Panasonic, voice recognition is about to become commonplace.
Coping with accents
Understanding the voice of an individual as he or she is speaking a complex series of words which often run into each other is a challenge at the best of times. However, it seems inevitable that despite (or indeed because of) globalisation, the fragmenting and dialecticisation of the world's dominant languages is becoming ever more – ahem – pronounced.
For instance, the typical London accent from 20 years ago is vastly different from that of a 21st-century Londoner growing up today. Linguistic influences are far more global and multicultural these days. This is just one example and every region and city in the English-speaking world has a similar story to tell, innit.
Other languages also have their regional accents and dialects. Whether you're French, Japanese, Chinese or Russian, how these mother tongues are spoken within the vast regions will differ significantly.
Furthermore, if you speak English with a Ukrainian accent, Hindi with an Australian accent or Swedish with an American accent, you can imagine how complicated an algorithm would have to be to sort out all these variations of language.
With increasing processing power, it is possible to create far superior algorithms to cope with such a complicated world. But it is not necessary for all that information to be available in one massive software package sitting on an individual's hard-drive.
It makes great sense to have all this information available in the cloud. Thus when you use Siri or Google Voice Search, the voice-recognition engine doesn't sit on your device hogging valuable processor and storage resources, but runs, to a degree, on cloud-based infrastructure.
Indeed, Nuance is starting to roll out a way for its customers to upload their own personal voice profiles into the cloud and to access the profile on numerous Dragon software programs working on different platforms.
But there are times when you do require voice recognition to be hard-wired into the device. Wolfson Microelectronics, for example, has created a relatively simple voice command-and-control capability on its audio hub chipsets. Such a small engine does not require a large processor or significant storage resources, as its primary use would be to turn on a mobile device, after which a cloud-based solution, such as Siri, would take over.
Voice recognition is increasingly being incorporated into a number of mobile devices. To use Siri on an iOS device, or Google Voice on an Android device, you still have to physically turn on the voice-recognition engine. It's not possible to have voice recognition running constantly in the background because it will drain the battery.
Wolfson has developed its own voice-recognition engine on its chipset which can be configured to listen for just a few commands – such as 'turn on sat-nav' – which would then turn on Siri or Google Voice. It would be particularly useful as a two-factor security measure as you could train it to recognise the voice of one or more individuals. Wolfson technology is in Samsung and a host of other devices.
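The hand-off described here, where a tiny on-device engine listens for a few commands and then wakes a full cloud-based recogniser, can be sketched as a simple two-state gate. This is an illustrative sketch only: phrases arrive as already-decoded strings rather than audio, and the `WakeWordGate` class and the `cloud_engine` callback are hypothetical names, not part of any Wolfson, Apple or Google API.

```python
class WakeWordGate:
    """Stay asleep until a known wake phrase is heard, then forward one request."""

    def __init__(self, wake_phrases, cloud_engine):
        # The small vocabulary the low-power engine is configured to spot.
        self.wake_phrases = {p.lower() for p in wake_phrases}
        # Callable standing in for the full cloud-based recogniser (assumption).
        self.cloud_engine = cloud_engine
        self.awake = False

    def hear(self, phrase):
        phrase = phrase.strip().lower()
        if not self.awake:
            if phrase in self.wake_phrases:
                self.awake = True
                return "woke"
            # Anything else is dropped without waking the expensive engine.
            return "ignored"
        # Awake: pass the request on, then go back to low-power listening.
        result = self.cloud_engine(phrase)
        self.awake = False
        return result


gate = WakeWordGate(["turn on sat-nav"], lambda text: "cloud:" + text)
print(gate.hear("what's the weather"))
print(gate.hear("Turn on sat-nav"))
print(gate.hear("navigate home"))
```

The design point is that only the cheap set-membership check runs all the time; the costly recogniser is invoked once per wake-up, which is what keeps the battery cost low.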
The operating environment is an important factor not to be overlooked. Plantronics, the professional headset manufacturer, has been looking at this issue for a number of years.
According to Steve Marks, principal engineer at Plantronics, the physical design of a headset is the first step. However, this is only relevant where the user is expected to use a headset.
"The growth of MEMS [Microelectro-mechanical systems] based microphones which are far smaller than electret condenser-based microphones will allow far more mics being deployed in an environment allowing for more sophisticated noise cancellation," said Marks.
Having been involved in the professional headset market since the get-go, and as a close partner of Nuance, which packages Plantronics headsets with its Dragon Dictate and Dragon NaturallySpeaking products, the company has been able to produce headsets specifically designed for voice-recognition applications.
For example, VoxEnable is a Smart App from Expressware that brings contextual intelligence to Nuance's Dragon NaturallySpeaking speech-recognition software, using the accelerometer technology in Plantronics' intelligent Voyager PRO UC SmartSet to deliver better voice recognition.
Simply putting the SmartSet on automatically activates speech recognition. When the SmartSet is taken off, speech recognition is put to sleep.
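The wear-triggered behaviour can be pictured as a small piece of event-handling logic. The sketch below is an assumption about the general pattern, not the VoxEnable or Plantronics implementation: the `SpeechSession` class and its `on_wear_state` method are hypothetical, and the headset's sensor reading is reduced to a single worn or not-worn flag.

```python
class SpeechSession:
    """Toggle speech recognition from a headset's worn / not-worn signal."""

    def __init__(self):
        self.listening = False
        self.log = []  # record of state changes, for illustration

    def on_wear_state(self, worn):
        # Only act on a change of state, so repeated sensor readings
        # do not restart or re-suspend the recogniser.
        if worn and not self.listening:
            self.listening = True
            self.log.append("activated")
        elif not worn and self.listening:
            self.listening = False
            self.log.append("sleeping")


session = SpeechSession()
for reading in [True, True, False, True]:
    session.on_wear_state(reading)
print(session.log)
```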
Voice-recognition technology's usefulness can only increase as computing power improves. The main challenge will be to ensure that it is available to all, whatever their dialect or language.