WaveNet
Introduced in 2016, WaveNet was one of the first AI models to generate natural-sounding speech. Since then, it has inspired research, products, and applications at Google — and beyond.
The challenge
For decades, computer scientists tried to reproduce the nuances of the human voice to make computer-generated voices sound more natural.
Most text-to-speech systems relied on “concatenative synthesis” — a painstaking process of cutting voice recordings into phonetic sounds and recombining them to form new words and sentences — or on DSP (digital signal processing) algorithms known as “vocoders”.
The resulting voices often sounded mechanical and contained artifacts such as glitches, buzzes and whistles. Making changes required entirely new recordings — an expensive and time-consuming process.
WaveNet took a different approach to audio generation, using a neural network to model and predict individual audio samples. This allowed WaveNet to produce high-fidelity synthetic audio, letting people interact more naturally with their digital products.
Learning from human speech
WaveNet is a generative model trained on human speech samples. It creates speech waveforms by predicting which sound is most likely to follow the ones before it, building audio one sample at a time, with up to 24,000 samples per second of sound.
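To make that idea concrete, here is a minimal sketch of sample-by-sample autoregressive generation. The function predict_next_sample_probs is a hypothetical stand-in for the trained network (WaveNet itself uses stacks of dilated causal convolutions over quantized audio); only the shape of the sampling loop reflects the description above.

```python
import numpy as np

def predict_next_sample_probs(history: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the trained model: returns a probability
    distribution over 256 possible quantized amplitude values, conditioned
    on all previously generated samples."""
    logits = np.random.randn(256)      # placeholder for a real forward pass
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(seconds: float, sample_rate: int = 24_000) -> np.ndarray:
    """Autoregressive loop: each new sample is drawn from a distribution
    conditioned on everything generated so far, one sample at a time."""
    samples = []
    for _ in range(int(seconds * sample_rate)):
        probs = predict_next_sample_probs(np.array(samples))
        samples.append(int(np.random.choice(256, p=probs)))
    return np.array(samples)

audio = generate(0.01)  # even 10 ms of audio means 240 sequential model calls
```

The strictly sequential loop is what made early WaveNet slow: each of the 24,000 samples per second requires its own pass through the model.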
The model incorporates natural-sounding elements, such as lip-smacking and breathing patterns, and captures vital layers of communication like intonation, accent, and emotion — bringing a richness and depth to computer-generated voices.
For example, when we first introduced WaveNet, we created American English and Mandarin Chinese voices that narrowed the gap between human and computer-generated voices by 50%.
Rapid advances
Early versions of WaveNet were slow to run, taking hours to generate just one second of audio.
Using a technique called distillation — transferring knowledge from a larger model to a smaller one — we reengineered WaveNet to run 1,000 times faster than our research prototype, creating one second of speech in just 50 milliseconds.
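As a rough illustration of the distillation principle only: a small, fast “student” model is trained so that its per-sample output distribution matches that of the slow autoregressive “teacher”. The names below are hypothetical and nothing is actually trained here; the production system, published as Parallel WaveNet, used a more involved probability density distillation objective.

```python
import numpy as np

def kl_divergence(teacher_probs: np.ndarray, student_probs: np.ndarray) -> float:
    """KL(teacher || student): the kind of mismatch term a student model
    minimizes during distillation to mimic the teacher's predictions."""
    eps = 1e-12  # guard against log(0)
    return float(np.sum(teacher_probs * np.log((teacher_probs + eps) /
                                               (student_probs + eps))))

# Toy distributions over the 256 quantized amplitude values for one sample.
teacher = np.full(256, 1 / 256)              # stand-in teacher output
student = np.random.dirichlet(np.ones(256))  # stand-in student output
loss = kl_divergence(teacher, student)       # driven toward 0 during training
```

Once trained, the student no longer needs the teacher's expensive sequential loop, which is what enables the large speedup.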
In parallel, we also developed WaveRNN — a simpler, faster, and more computationally efficient model that could run on devices, like mobile phones, rather than in a data center.
The power of voice
Both WaveNet and WaveRNN became crucial components of many of Google’s best-known services, such as the Google Assistant, Maps Navigation, Voice Search, and Cloud Text-to-Speech.
They also helped inspire entirely new product experiences. For example, an extension known as WaveNetEQ helped improve the quality of calls for Duo, Google’s video-calling app.
But perhaps one of its most profound impacts was helping people living with progressive neurological diseases like ALS (amyotrophic lateral sclerosis) regain their voice.
In 2014, former NFL linebacker Tim Shaw’s voice began to deteriorate due to ALS. To help, Google’s Project Euphonia developed a service to better understand Shaw’s impaired speech.
WaveRNN was combined with other speech technologies and a dataset of archival media interviews to create a natural-sounding version of Shaw’s voice, helping him speak again.
Widespread legacy
WaveNet demonstrated an entirely new approach to voice synthesis that helped people regain their voices, translate content across multiple languages, create custom audio content, and much more.
Its emergence also unlocked new research approaches and technologies for generating natural-sounding voices.
Today, thanks to WaveNet, there is a new generation of voice synthesis products that continue its legacy and help billions of people around the world overcome barriers in communication, culture, and commerce.