A close up of a human eye

Advances in eye tracking and speech synthesis

Eye tracking and speech synthesis are now able to give voice to those with even severely limited movement.

In 'Diamonds are Forever' (1971) James Bond's nemesis Blofeld uses an electronic gadget to synthesise casino owner Willard Whyte's voice to fool our favourite spy. Fast-forward to today and we have technologies that can not only synthesise any voice or accent but can generate speech from someone glancing at text on a screen. For Tony Nicklinson, suffering from locked-in syndrome after a brain stem stroke in 2005, communication software The Grid 2 linked to an eye-tracker enabled him to argue eloquently for doctors to end his life, until his natural death in August 2012 shortly after losing his High Court appeal.

Nicklinson used a southern English voice by Acapela called 'Graham' that comes with The Grid 2. 'Graham' is a modern unit-selection text-to-speech voice, built from strung-together snippets of real recorded speech that capture the changes in intonation and frequency spectrum that make each human voice unique and expressive.

Edinburgh-based CereProc is another company whose high quality British regional voices (including Scottish, Irish, and Black Country) show how far speech synthesis has progressed since we first heard the flat robotic tones of physicist Professor Stephen Hawking, whose voice harks back to an earlier technology based on mathematical models of the human vocal tract.

Eye'm in control

However, authentic synthetic speech is merely the headline technology in breaking down the human isolation of severe disability. The decreasing cost of eye-tracking interfaces over recent years has arguably been more important. Stephen Murray, a professional BMX rider before he was paralysed in 2007, still has his voice but describes his eye-control system from Swedish firm Tobii Technology as like an antidepressant that has put him back in control of his life.

Eye-tracking systems use tiny infrared-sensitive video cameras positioned below a screen to let users of assistive communication software such as The Grid 2 generate speech, control their environment (lights, call bells, television and so on), use email, the web and social media, and even work full-time using only eye movements.

A technique called Pupil Centre Corneal Reflection (PCCR) captures fixations, the instants when the eye pauses on a specific area of the screen, and tracks the rapid saccades between them. PCCR works by filming (at 30 to 60 frames per second) the reflections from an infrared LED light source on the cornea (the transparent front of the eye) and in the pupil. Image-processing algorithms estimate the position of the eye and the point of gaze by analysing the vectors between the pupil centre and the corneal reflections. Bright-pupil eye tracking, where the infrared LED is placed close to the optical axis of the camera (causing the pupil to appear lit up), is the most widely used form of lighting.
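The vector-analysis step above can be illustrated with a toy calculation. This is a minimal sketch, not any vendor's algorithm: it assumes the pupil-centre-to-glint vector has already been extracted from each video frame, and that a short calibration (the user looks at known screen points) lets us fit an affine map from those vectors to screen coordinates. All numeric values are invented.

```python
# Sketch of PCCR-style gaze estimation: map the pupil-glint vector to a
# screen coordinate with an affine fit learned during calibration.
import numpy as np

def fit_calibration(vectors, screen_points):
    """Least-squares affine map: screen = [vx, vy, 1] @ A."""
    X = np.column_stack([vectors, np.ones(len(vectors))])  # shape (n, 3)
    A, *_ = np.linalg.lstsq(X, screen_points, rcond=None)  # shape (3, 2)
    return A

def gaze_point(A, vector):
    """Estimate the on-screen gaze coordinate from one pupil-glint vector."""
    return np.array([*vector, 1.0]) @ A

# Toy calibration: the user fixates the four corners of a 1920x1080 screen
# while the tracker records the corresponding (made-up) pupil-glint vectors.
calib_vectors = np.array([[-0.2, -0.1], [0.2, -0.1], [-0.2, 0.1], [0.2, 0.1]])
calib_targets = np.array([[0, 0], [1920, 0], [0, 1080], [1920, 1080]])
A = fit_calibration(calib_vectors, calib_targets)
print(gaze_point(A, [0.0, 0.0]))  # roughly the screen centre, [960, 540]
```

Real systems refine this with higher-order polynomial mappings and per-eye models, but the core idea, calibrated regression from reflection vectors to gaze coordinates, is the same.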

Some systems are so accurate that babies can use them. Among the youngest users of the Eyegaze Edge made by LC Technologies (an American company that built its first systems in 1986) is a 13-month-old baby girl with spinal muscular atrophy. "She is smart, understands cause and effect, and is able to run picture-based programs in The Grid 2," says Nancy Cleveland, the company's medical director.

A tiny red dot serves as a screen cursor (in effect the x-y coordinate of the gaze point) to show where the eye is pointing among the onscreen pictures of keys and symbols. These graphics make an audible click and flash when they activate, based on a gaze time that can be set for each user. The shortest activation gaze time is around one-fifth of a second.
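The dwell-based activation just described can be sketched in a few lines. This is a hypothetical illustration, assuming a 60Hz stream of gaze samples where each sample names the key under the cursor (or None for empty screen); a key activates once the gaze rests on it for the configured gaze time, 0.2 seconds here, the shortest setting mentioned above.

```python
# Sketch of dwell-time ("gaze-time") key activation from a gaze stream.
def dwell_select(samples, dwell_time=0.2, sample_rate=60):
    """Return the first key the gaze rests on for dwell_time seconds.

    samples: sequence of key names (or None) hit by successive gaze samples.
    """
    needed = int(dwell_time * sample_rate)  # consecutive samples required
    current, count = None, 0
    for key in samples:
        if key is not None and key == current:
            count += 1
            if count >= needed:
                return key  # dwell threshold reached: activate this key
        else:
            # Gaze moved to a new key (or off the keys): restart the count.
            current, count = key, (1 if key is not None else 0)
    return None

# A brief flick across "G" does not activate; 12 consecutive samples
# (0.2 s at 60 Hz) resting on "H" does.
stream = ["G"] * 5 + [None] + ["H"] * 12
print(dwell_select(stream))  # -> H
```

Resetting the counter whenever the gaze leaves a key is what prevents accidental activations as the eye saccades across the keyboard.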

Eyegaze Edge systems can work with only one eye, giving the user freedom of head position, which means babies and adults who have to lie on one side with their head at an angle can use them. Before her, LC Technologies' youngest user was an 18-month-old boy who couldn't move or speak and was on a ventilator. "He figured out the system in no time at all and is now three years old and uses the system every day," says Cleveland. "A lot of what he understood at first was cause and effect. For example, looking at a cell with a picture of a lion would play a video of a lion. Now there is serious effort being made to teach him to use symbols to communicate."

Algorithms that interpret eye movements are the main patented IP behind these systems, most of which run on PC hardware with relatively low-cost software programs like The Grid 2. But eye-tracking systems still cost several thousand pounds because of the high cost of the camera hardware.

Mass market eye-gaze interaction

Tobii Technology started life in 2001 with eye-tracking systems for studying human behaviour and human-computer interaction. Tobii is now a leading seller of eye-controlled all-in-one computers for people with disabilities. Its recent projects include a concept eye-controlled laptop built by Lenovo; field tests of an eye-tracking system for driver drowsiness detection in cars; an Asteroids arcade game that works with both eye and head movements; a prototype eye-controlled television made by Haier; and, most recently, a concept tablet with embedded eye tracking developed with NTT Docomo that points the way forward.

In March 2012 the company received $21m from Intel Capital towards bringing its technology to the mass market. "Computer-peripheral eye-trackers used in assistive communication cost around £4,000. To bring that price down further you need consumer volumes," says Sara Hyléen, Tobii's marketing director.

As part of its strategy to bring gaze interaction into the mainstream, Tobii now offers a 3W single-board eye-tracking camera component that can be integrated into any product. It includes system-independent processing and measures 200 x 25 x 15mm.

Back to the voice of the future

Off-the-peg text-to-speech engines are generally bundled into the communications software and so are not costly. But the future takes us back to Willard Whyte. Today's version of 'voice transformation' means capturing a small speech sample to quickly produce a custom voice. "The goal over the next three years is to be able to produce any voice in this way," says Acapela's chief technology officer Fabrice Malfrère.

Voice transformation uses hidden Markov models that 'learn' from a small database of information relating to linguistics and prosody (the music of speech), rather like the databases behind today's unit-selection voices. From this material the system generates parameters that drive a mathematical speech model (a vocoder) to create speech.
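The statistical-parametric idea can be shown with a deliberately toy example, not a real TTS system: each phone-sized model state stores learned mean acoustic parameters (here just pitch and energy, both invented values), and synthesis strings the state means together into a smoothed trajectory that a vocoder would then turn into a waveform.

```python
# Toy illustration of parameter generation in statistical-parametric
# (HMM-style) synthesis: concatenate per-state mean parameters, then
# smooth them into a frame-by-frame trajectory for a vocoder.
import numpy as np

STATE_MEANS = {            # hypothetical learned per-phone parameters
    "h":  (120.0, 0.3),    # (pitch in Hz, relative energy)
    "ai": (140.0, 0.8),
}

def parameter_trajectory(phones, frames_per_state=5):
    """Repeat each state's mean for its duration, then moving-average smooth."""
    raw = np.array([STATE_MEANS[p]
                    for p in phones
                    for _ in range(frames_per_state)])
    kernel = np.ones(3) / 3.0  # crude stand-in for trajectory smoothing
    return np.column_stack([np.convolve(raw[:, i], kernel, mode="same")
                            for i in range(raw.shape[1])])

traj = parameter_trajectory(["h", "ai"])
print(traj.shape)  # (10, 2): ten frames of (pitch, energy) parameters
```

The appeal of this approach for voice transformation is that the model is a compact set of statistics rather than hours of recordings, so adapting it towards a new speaker needs only a small speech sample.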

Eventually anyone may be able to connect to a website, record a hundred or so sentences and automatically get a synthetic version of their voice. R&D systems already exist, but for the moment they require more recordings to produce commercially usable speech synthesis. Malfrère sees the technique as a quick, cheap way to add unique voices to all kinds of products as part of a brand identity, whether car GPS systems or voices that read the newspaper on your smartphone.

"Improving the quality of long pieces of text is the next challenge," says Malfrére. Building a text-to-speech synthesiser that could read a book (or this article) in a natural way is a task related to the computer understanding of meaning which means using elements of language-context analysis, text pattern recognition, sentiment- and humour-analysis.

Cloud computing would be one way to handle the complex processing, says Malfrère, allowing owners of smartphones, tablets and e-books to access reading services on demand from a smart server.

Perhaps the population of ageing and increasingly infirm baby boomers who enjoyed James Bond's gadgetry the first time round will be equally appreciative of its modern successors.
