Dapper robot about to play a grand piano

Microsoft AI taught to sing in three languages

Image credit: Dreamstime

Researchers from Microsoft Research Asia and Zhejiang University are developing a singing voice synthesis tool trained on music scraped from the internet. The tool can “sing” in Mandarin, Cantonese and English.

Singing voices have far more complex patterns of tone and rhythm than the ordinary spoken voice, meaning that singing synthesis tools must account for the pitch and duration of individual units of sound (phonemes). This is a considerable challenge, made harder by the scarcity of publicly available singing training datasets.

To train their tool, DeepSinger, the researchers scraped tens of thousands of songs from various music websites.

The songs were heavily processed to put them in a useful format for training. They were first filtered for length (between one and five minutes long) and adequate vocal quality, cutting the dataset down to 92 hours of songs performed by 89 singers in three languages.
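As a rough illustration of the length filter described above, the following minimal Python sketch keeps only tracks between one and five minutes long. The `Song` class, its field names and the sample tracks are hypothetical and are not taken from the researchers' code; the vocal-quality check is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Song:
    """Hypothetical record for one scraped track."""
    title: str
    duration_s: float  # track length in seconds


# Bounds taken from the article: between one and five minutes.
MIN_S = 60.0
MAX_S = 300.0


def filter_by_length(songs, min_s=MIN_S, max_s=MAX_S):
    """Return only the songs whose duration falls within [min_s, max_s]."""
    return [s for s in songs if min_s <= s.duration_s <= max_s]


# Made-up example data: one track too short, one in range, one too long.
songs = [Song("a", 45.0), Song("b", 180.0), Song("c", 420.0)]
kept = filter_by_length(songs)
```

Here only the three-minute track survives the filter.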

The songs were separated into vocal and instrumental tracks using the popular open-source tool Spleeter, then the vocal tracks were further broken down into individual sentences and phonemes. Crucially, the researchers designed a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in the lyrics, meaning that they did not have the extremely time-consuming job of manually labelling every phoneme in the 92 hours of songs.
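To show what “extracting the duration of each phoneme” amounts to once an alignment exists, here is a minimal Python sketch. It assumes the alignment has already produced a start and end time for every phoneme; the function name, the tuple layout and the sample values are illustrative assumptions, and the sketch does not reproduce the researchers' alignment model itself.

```python
def phoneme_durations(boundaries):
    """Turn (phoneme, start_s, end_s) tuples into (phoneme, duration_s) pairs.

    `boundaries` is assumed to come from some lyrics-to-singing alignment
    step; how those boundaries are found is outside this sketch.
    """
    durations = []
    for phoneme, start_s, end_s in boundaries:
        if end_s < start_s:
            raise ValueError(f"phoneme {phoneme!r} ends before it starts")
        durations.append((phoneme, end_s - start_s))
    return durations


# Made-up alignment for one sung word; times are in seconds.
aligned = [
    ("HH", 0.00, 0.12),
    ("EH", 0.12, 0.50),
    ("L", 0.50, 0.62),
    ("OW", 0.62, 1.40),
]
labels = phoneme_durations(aligned)
```

Each duration is simply the gap between a phoneme's boundaries, which is the label that would otherwise have to be marked by hand.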

The filtered, labelled data were used to train a feed-forward neural network based on a Microsoft text-to-speech tool (FastSpeech), which is also in development.

The result is a tool which can generate a synthesised singing voice with high pitch accuracy – above 85 per cent for all languages, compared with 95 per cent for human singers – and good “voice naturalness” across Mandarin, Cantonese and English, as judged by native speakers.
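The article does not say how pitch accuracy was measured. One plausible way to compute such a figure, sketched below in Python, is to count the share of audio frames in which the synthesised pitch lies within a small tolerance of a reference pitch. The 50-cent tolerance, the function name and the sample frequencies are assumptions made for illustration and may differ from the metric used in the paper.

```python
import math


def pitch_accuracy(reference_hz, synthesised_hz, tolerance_cents=50.0):
    """Fraction of voiced frames whose pitch is within `tolerance_cents`.

    Frames where either pitch is zero or negative are treated as unvoiced
    and skipped. The tolerance is an assumed value, not one from the paper.
    """
    if len(reference_hz) != len(synthesised_hz):
        raise ValueError("pitch tracks must have the same number of frames")
    matched = 0
    voiced = 0
    for ref, syn in zip(reference_hz, synthesised_hz):
        if ref <= 0 or syn <= 0:
            continue  # skip unvoiced frames
        voiced += 1
        # Pitch distance in cents: 100 cents is one semitone.
        cents = abs(1200.0 * math.log2(syn / ref))
        if cents <= tolerance_cents:
            matched += 1
    return matched / voiced if voiced else 0.0


# Made-up pitch tracks in Hz: the second frame is about a semitone sharp.
reference = [220.0, 220.0, 440.0, 440.0]
synthesised = [220.0, 233.08, 440.0, 445.0]
accuracy = pitch_accuracy(reference, synthesised)
```

In this toy example three of the four frames fall within the tolerance, giving an accuracy of 0.75.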

While the technology remains in development, it could have a range of applications in the music industry in the future. For instance, it could be employed in production of synthesised music, like Yamaha’s Vocaloid singing synthesis software, or used to create audio “deepfakes” to stand in for musicians and eliminate the need for pick-up sessions after recording. Musical deepfakes have become a controversial subject, with performer and producer Jay-Z recently making copyright claims against a YouTube channel which featured AI-generated content mimicking his voice for satirical purposes.

Samples of audio generated by DeepSinger can be listened to here. The paper describing the process (which has not yet been peer reviewed and published) is available here [PDF].
