Speech processing research increases intelligibility
Signal-processing and speech sciences are being jointly developed to improve the intelligibility of speech.
"You what…?" are the last words the police want to hear over their radios when in hot pursuit with sirens blaring. The same goes in court if a jury can't understand the recording of a critical 999 call made outside a noisy nightclub. In both cases, it would be tempting to reach for help from a speech-enhancement algorithm to separate the message from the medium.
And yet research by the Centre for Law Enforcement Audio Research (CLEAR) has shown that most speech-enhancement techniques improve sound quality at the expense of intelligibility, particularly when the signal-to-noise ratio (SNR) is very low. Closely related as they are, speech quality and intelligibility are not identical.
"You can find speech signals that an average human listener would judge as good quality but they wouldn't be able to understand all the words," explains Patrick Naylor, reader in speech and audio signal processing at Imperial College London and a member of CLEAR. "Equally, it is sometimes possible to pick out all the words in a poor-quality signal that is full of background noise."
A great deal of the work on the intelligibility of speech has been done to improve the effectiveness of hearing aids and telecoms channels, where the SNR is typically better than 10 to 20dB. But in law-enforcement situations, the audio quality can be far worse, with SNRs of 0dB or lower quite common. And intelligibility is far more important than sound quality. If you cannot make out all the words in a recorded phone call or police interview tape, it loses its value as evidence. Similarly, those using communications links in noisy environments, such as police drivers, put themselves and others at risk if they misunderstand what they are hearing or react slowly because they are trying to pick the signal out of the noise.
The CLEAR project was set up in 2007 to address these problems, with Home Office funding until 2012 for the team of four scientists from Imperial College's communications and signal processing group, and two from UCL's speech, hearing and phonetic sciences group. By pooling their understanding of both human perception and signal processing, the team hopes to pin down the factors that affect intelligibility and then to develop speech-enhancement algorithms designed to suit the human auditory system.
For the last couple of years, Mike Brookes, director of CLEAR and a reader in signal processing at Imperial, has been working with colleagues to systematically measure the effect of various speech-enhancement algorithms on noisy and distorted speech. They have been doing this using formal subjective tests, in which people repeat aloud sentences that they hear buried in a noisy signal played through headphones. The number of correct keywords in their responses is taken as a measure of intelligibility. Each signal-processing algorithm is characterised in terms of whether it improves the number of words the listeners get right. The team has also been measuring how noise affects task-based performance. The idea is to find out whether processed speech is less tiring to listen to than the noisy version, by measuring the time taken to complete a standard task.
To ensure the tests are statistically meaningful, sound samples have to be designed carefully, to ensure listeners can't get them right just by guessing from the context. So "The king wore a golden crown" isn't a good test sentence, whereas "John and Paul were talking about the poodle" is. Learning effects must also be taken into account because listeners may get more words right simply because they've heard the sentence before.
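The keyword-scoring measure described above can be sketched in a few lines. This is a hypothetical illustration of the scoring step only (the function name, keyword list and matching rules are my own assumptions, not CLEAR's actual test protocol):

```python
def keyword_score(response, keywords):
    """Fraction of target keywords a listener repeated correctly."""
    heard = {w.strip(".,").lower() for w in response.split()}
    hits = sum(1 for k in keywords if k.lower() in heard)
    return hits / len(keywords)

# Hypothetical low-predictability test sentence: a listener who mishears
# 'poodle' as 'noodle' still scores on the other three keywords.
keywords = ["john", "paul", "talking", "poodle"]
print(keyword_score("John and Paul were talking about the noodle", keywords))  # 0.75
```

Averaging such scores over many listeners and sentences gives the intelligibility figure used to characterise each enhancement algorithm.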
Shaping the signal spectrum
Results so far have been very illuminating. Most digital speech-enhancement techniques work by either removing background noise or shaping the noise spectrum of the signal. One of the most popular noise-reduction methods involves estimating the power spectrum of the noise, at times and frequencies where speech is absent, and removing it from the speech signal. But in very noisy situations, parts of the power spectrum most vital to the intelligibility of speech are also subtracted.
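The subtraction step can be sketched as follows. This is a deliberately naive, textbook form of magnitude spectral subtraction (frame size, windowing and the zero-clamp are illustrative choices, not a description of any specific product), but it shows exactly where low-energy speech can be lost: any bin whose power falls below the noise estimate is clamped to zero.

```python
import numpy as np

def spectral_subtract(noisy, noise_psd, frame=256, hop=128):
    """Naive magnitude spectral subtraction (illustrative only).

    noise_psd: noise power per FFT bin, estimated from speech-free frames.
    """
    window = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        seg = noisy[start:start + frame] * window
        spec = np.fft.rfft(seg)
        power = np.abs(spec) ** 2
        # Subtract the noise estimate and clamp at zero. This clamping
        # is where weak consonant energy gets discarded at low SNR.
        clean_power = np.maximum(power - noise_psd, 0.0)
        gain = np.sqrt(clean_power / np.maximum(power, 1e-12))
        out[start:start + frame] += np.fft.irfft(gain * spec) * window
    return out
```

At 0dB SNR the noise estimate is comparable to the speech power in many bins, so consonants, which carry little energy, are among the first casualties.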
"Most of the energy in speech is in the vowels, whereas consonants are more likely to resemble noise," explains Gaston Hilkhusyen, one of the two speech experts from UCL in the CLEAR team. As a result, 'B's and 'D's can sound very similar, for instance, making it easy to confuse 'dead' with 'bed' or 'Deb' - with potentially unfortunate results.
It also turns out that, in the task-based tests, the brain has to work just as hard to process noise-reduced speech as speech in noise. Another member of the CLEAR team, UCL's Mark Huckvale, has recently published the results of experiments to measure how quickly people can recognise numbers in noise. It is still not clear why noise reduction doesn't help, and the issue requires further investigation, but Huckvale thinks that perhaps noise reduction doesn't restore the quiet sounds at the start of words that provide useful cues, or that it adds processing artefacts which are as distracting as the original noise.
Using this growing understanding of how noise and distortion degrade intelligibility in speech, Brookes and colleagues are developing new kinds of speech-cleaning algorithms designed to work with signals degraded by large amounts of background interference.
One of the important features for intelligibility is retaining the steepness of the hills and valleys of the time-domain envelope of speech, which represent attack and decay. The effect of adding noise is to distort the envelope by filling in the valleys. When you then apply noise suppression, the envelope is distorted in an unpredictable way, says Brookes.
"You might get an apparently improved envelope - so it sounds better - but you can't necessarily make out the words," he says.
One promising technique the team has been working on is to suppress noise using binary frequency masking in the time-frequency domain. The noisy signal can be represented graphically with a spectrogram (see panel, right), in which the energy in each frequency range (coloured from blue to red) is plotted as a function of time to show power envelopes for each frequency band.
Splitting signals into multiple frequency bands in this way shows up which parts of a noisy signal are speech, because speech harmonics (vibrations at frequencies that are multiples of the fundamental) that vary with the pitch of the voice appear approximately as evenly spaced horizontal lines. Superimposed on these stripy sections are high-energy red areas, representing the resonances of the speaker's vocal tract. These patterns are rarely seen from noise sources.
By digitally masking all the parts of the spectrogram where the noise is louder than the speech, and then resynthesising speech from the remainder, the result sounds clear.
"You are selecting time-frequency elements to be 'on' or 'off', according to whether they are speech or not," explains Naylor.
An ideal algorithm would be able to identify reliably the characteristics of speech in any signal, and then mask out anything that didn't follow the pattern - but we are not there yet. Nevertheless the CLEAR team is already obtaining worthwhile results for some types of noise, such as mains hum.
The time-frequency binary masking method also has the advantage that, although it is suitable for cleaning recorded speech, it will be usable in quasi-real time, with a delay of less than a second.
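The mask-building step can be sketched as below. Note that this computes the "ideal" oracle mask, which assumes the clean speech and noise spectrograms are known separately; the hard part, as the article notes, is estimating which time-frequency cells are speech-dominated from the noisy mixture alone. The function name and 0dB threshold are illustrative assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_spec, noise_spec, threshold_db=0.0):
    """Keep time-frequency cells where the local SNR exceeds a threshold.

    speech_spec, noise_spec: complex or magnitude STFT arrays of the same
    shape (frequency bins x time frames). Returns a 0/1 mask.
    """
    snr_db = 10 * np.log10(
        (np.abs(speech_spec) ** 2 + 1e-12) / (np.abs(noise_spec) ** 2 + 1e-12)
    )
    return (snr_db > threshold_db).astype(float)

# Applying the mask to the noisy spectrogram zeroes the noise-dominated
# cells; resynthesis (inverse STFT) then yields the cleaned signal:
#   cleaned_spec = mask * noisy_spec
```

Multiplying each cell by 0 or 1 is what Naylor means by selecting time-frequency elements to be 'on' or 'off'.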
The group has also developed a way of automatically modifying the power spectrum of a speech signal to maximise its intelligibility, rather as one would use a graphic equaliser on a hi-fi to pick out one instrument in an orchestra. Capturing a signal in a real acoustic environment with a microphone and storing or transmitting it alters the signal in ways that can decrease its quality and/or intelligibility. If you can find a good way of estimating the distortion added by this process, you can use this knowledge to readjust the power spectrum to clean up the speech. Fortunately, there is an existing tool that may fit the job.
The long-term average speech spectrum (LTASS) is a plot of speech loudness as a function of frequency, and is the traditional starting point for studying speech perception under adverse conditions and the basis for prescriptive fittings of hearing aids. Brookes and colleagues' new estimation technique uses LTASS as a building block in the design of a suitable equaliser.
"The equalising technique requires us to estimate the frequency response of the channel, which arises from a combination of the room acoustics, microphone properties and, if relevant, tape recorder characteristics. It is not normally possible to measure the frequency response of the channel directly, so the LTASS approach allows us to estimate it from the available signal," Brookes explains.
CLEAR's grant runs to 2012 but, says Brookes, the idea is for the group to become a self-funding centre of excellence that can handle everything from evaluating audio equipment such as interview recorders and communications devices all the way to licensing specially developed speech-cleaning algorithms.
While the group is not directly concerned with legal traceability, anything that is to be used in court will need a well-documented record of what has been done to the sound, to show that artefacts have not been introduced, in much the same way as one must detail how a digital image has been enhanced for evidentiary purposes.
It's interesting to speculate whether improvements in speech intelligibility and cleaning might influence the current discussions in the UK on admitting intercepted communications in court. Recordings from free-standing listening devices are currently admissible, as are recordings where one of the speakers is an undercover police officer. The Chilcot report of January 2008 concluded that, in certain circumstances, intercepts of land lines and mobile phones could also be allowed.
Implementing these findings currently depends on a number of rather detailed safeguards being met, which include the potential cost of transcribing phone conversations. If more sensitive evidence of this sort is to enter court, the expertise being developed by CLEAR could well become critical.