Using WaveNet technology to reunite speech-impaired users with their original voices
This post details a recent project we undertook with Google and ALS campaigner Tim Shaw, as part of Google’s Euphonia project. We demonstrate an early proof of concept of how text-to-speech technologies can synthesise a high-quality, natural-sounding voice using minimal recorded speech data.
As a teenager, Tim Shaw put everything he had into football practice: his dream was to join the NFL. After playing for Penn State in college, his ambitions were finally realised: the Carolina Panthers drafted him at age 23, and he went on to play for the Chicago Bears and Tennessee Titans, where he broke records as a linebacker. After six years in the NFL, on the cusp of greatness, his performance began to falter. He couldn’t tackle like he once had; his arms slid off the pull-up bar. At home, he dropped bags of groceries, and his legs began to buckle underneath him. In 2013, Tim was cut from the Titans, but he resolved to make it onto another team. Tim practiced harder than ever, yet his performance continued to decline. Five months later, he finally discovered the reason: he was diagnosed with amyotrophic lateral sclerosis (ALS, commonly known as Lou Gehrig’s disease). In ALS, the neurons that control a person’s voluntary muscles die, eventually leading to a total loss of control over one’s body. ALS has no known cause, and, as of today, has no cure.
Today, Tim is a powerful advocate for ALS research. Earlier this year, he published a letter to his younger self advising acceptance: “otherwise, you’ll grieve yourself to death.” Now a wheelchair user, he lives under the constant care of his parents. People with ALS have trouble moving, and the disease makes speaking, swallowing, and even breathing on their own difficult and then impossible. Not being able to communicate can be one of the hardest aspects of the disease for people with ALS and their families. As Tim put it: “It’s beyond frustrating not to be able to express what’s going on in my mind. I’m smarter than ever but I just can’t get it out.”
Losing one’s voice can be socially devastating. Today, the main option available to people to preserve their voice is message banking, wherein people with ALS can digitally record and store personally meaningful phrases using their natural inflection and intonation. Message banking is a source of great comfort for people with ALS and their families, helping to preserve a core part of their identity, their voice, through a deeply challenging time. But message banking lacks flexibility, resulting in a static dataset of phrases. Imagine being told you will never be able to speak again. Now imagine that you were given the chance to preserve your voice by recording as much of it as possible. How would you decide what to record? How would you capture what you most want to be able to say in the future? Would it be a meaningful story, a favorite phrase or a simple “I love you”? The process can be time-consuming and emotionally draining, especially as someone’s voice degrades. And people who aren’t able to record phrases in time are left to choose a generic computer-synthesized voice that lacks the same power of connection as their own.
Building more natural-sounding voice technologies
At DeepMind, we’ve been collaborating with Google and people like Tim Shaw to help develop technologies that can make it easier for people with speech difficulties to communicate. The challenge is twofold. Firstly, we must have technology that can recognise the speech of people with non-standard pronunciation, something Google AI has been researching through Project Euphonia. Secondly, we’d ideally like people to be able to communicate using their original voice. Stephen Hawking, who also suffered from ALS, communicated with a famously unnatural-sounding text-to-speech synthesiser. Thus, the second challenge is customising text-to-speech technology to the user’s natural speaking voice.
Creating natural-sounding speech is considered a “grand challenge” in the field of AI. With WaveNet and Tacotron, we’ve seen tremendous breakthroughs in the quality of text-to-speech systems. However, whilst it is possible to create natural-sounding voices that sound like specific people in certain contexts, as we demonstrated in collaboration with John Legend last year, developing synthetic voices requires many hours of studio recording time with a very specific script. That is a luxury that many people with ALS simply don’t have. Creating machine learning models that require less training data is an active area of research at DeepMind, and is crucial for use cases such as this, where we need to recreate a voice with just a handful of audio recordings. We’ve helped do this by harnessing our WaveNet work and the novel approaches demonstrated in our paper, Sample Efficient Adaptive Text-to-Speech (TTS), where we showed that it’s possible to create a high-quality voice using small amounts of speech data.
Which brings us back to Tim. Tim and his family were instrumental in our recent research. Our goal was to provide Tim and his family with an opportunity to hear his original speaking voice again. Thanks to Tim’s time in the media spotlight, which resulted in about thirty minutes of high-quality audio recordings, we were able to apply the methods from WaveNet and our Sample Efficient Adaptive TTS paper to recreate his former voice.
Following a six-month effort, Google’s AI team visited Tim and his family to show him the results of their work. The meeting was captured for the new YouTube Originals learning series, “The Age of A.I.” hosted by Robert Downey Jr. Tim and his family were able to hear his old voice for the first time in years, as the model – trained on Tim’s NFL audio recordings – read out the letter he’d recently written to his younger self.
“I don’t remember that voice,” Tim remarked. His father responded, “We do.” Later, Tim recounted: “It has been so long since I’ve sounded like that, I feel like a new person. I felt like a missing part was put back in place. It’s amazing. I’m just thankful that there are people in this world that will push the envelope to help other people.”
How the technology works
To understand how the technology works, it’s important to first understand WaveNet. WaveNet is a generative model trained on many hours of speech and text data from diverse speakers. It can then be fed arbitrary new text to be synthesized into a natural-sounding spoken sentence.
Last year, in our Sample Efficient Adaptive Text-to-Speech paper, we showed that it’s possible to train a new voice with minutes, rather than hours, of voice recordings through a process called fine-tuning. This involves first training a large WaveNet model on up to thousands of speakers, which takes a few days, until it can produce the basics of natural-sounding speech. Then, we take the small corpus of data for the target speaker and intelligently adapt the model, adjusting the weights so that we can create a single model that matches the target speaker. The concept of fine-tuning is similar to how people learn. For example, if you are attempting to learn calculus, you should first understand the foundations of basic algebra, and then apply these simpler concepts to help solve more complex equations.
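To make the two-stage idea concrete, here is a deliberately tiny sketch of fine-tuning. It is not the WaveNet training procedure: the "model" is a two-parameter line, the "speakers" are synthetic, and every name and number below (the slopes, offsets, step counts and learning rate) is an assumption chosen for illustration. It only shows the pattern described above: train on many speakers first, then adapt the same weights with a few steps on a handful of examples from the target speaker.

```python
import random

def predict(weights, x):
    w, b = weights
    return w * x + b

def mse(weights, data):
    """Mean squared error of the toy model on (x, y) pairs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def gradient_descent(weights, data, steps, lr):
    """Full-batch gradient descent on mean squared error; returns new weights."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def make_speaker_data(rng, slope, offset, n):
    """Synthetic 'recordings' for one hypothetical speaker: a noisy line."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, slope * x + offset + rng.gauss(0.0, 0.05)))
    return data

rng = random.Random(0)  # fixed seed so the run is repeatable

# Stage 1: "pretrain" on plenty of data pooled from many speakers.
multi_speaker = []
for _ in range(100):
    slope = 2.0 + rng.gauss(0.0, 0.3)
    offset = 1.0 + rng.gauss(0.0, 0.3)
    multi_speaker.extend(make_speaker_data(rng, slope, offset, 10))
pretrained = gradient_descent((0.0, 0.0), multi_speaker, steps=300, lr=0.1)

# Stage 2: the target speaker has only a handful of examples.
target_slope, target_offset = 2.4, 0.7
few_recordings = make_speaker_data(rng, target_slope, target_offset, 8)
held_out = make_speaker_data(rng, target_slope, target_offset, 200)

# Same small budget of update steps, two different starting points.
fine_tuned = gradient_descent(pretrained, few_recordings, steps=20, lr=0.1)
from_scratch = gradient_descent((0.0, 0.0), few_recordings, steps=20, lr=0.1)

loss_pretrained = mse(pretrained, held_out)
loss_fine_tuned = mse(fine_tuned, held_out)
loss_from_scratch = mse(from_scratch, held_out)

print("pretrained only :", round(loss_pretrained, 4))
print("fine-tuned      :", round(loss_fine_tuned, 4))
print("from scratch    :", round(loss_from_scratch, 4))
```

In this toy setting, the fine-tuned weights fit the target speaker better than either the unadapted pretrained weights or weights trained from scratch with the same small budget, because they start close to a good solution. A real speech model has vastly more parameters and a far richer notion of "speaker", so this should be read as an analogy rather than a measurement.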
Taking the research a step further with WaveRNN and Tacotron
After this publication, we continued to iterate on our models. First, we migrated from WaveNet to WaveRNN, which is a more efficient text-to-speech model co-developed by Google AI and DeepMind. WaveNet requires a second distillation step to speed it up enough to serve requests in real time, which makes fine-tuning more challenging. WaveRNN, on the other hand, does not require a second training step and can synthesize speech much faster than a WaveNet model that has not been distilled.
In addition to speeding up the models by switching to WaveRNN, we collaborated with Google AI to improve the quality of the models. Google AI researchers demonstrated that a similar fine-tuning approach could be applied to the related Google Tacotron model, which we use in conjunction with WaveRNN to synthesise realistic voices. By combining these technologies, trained on audio clips of Tim Shaw from his NFL days, we were able to generate an authentic-sounding voice that resembles how Tim sounded before his speech degraded. While the voice is not yet perfect (it lacks the expressiveness, quirks, and controllability of a real voice), we’re excited that the combination of WaveRNN and Tacotron may help people like Tim preserve an important part of their identity, and we would like to one day integrate it into speech-generation devices.
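For readers who want a picture of how two such models fit together, the sketch below shows only the shape of a two-stage pipeline. It assumes the common arrangement in which a Tacotron-style model turns text into intermediate acoustic features and a WaveRNN-style model turns those features into audio samples. Both stages here are hypothetical placeholders (a letter-to-pitch lookup and a sine-wave generator), not the real models, and the constants and names are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16_000     # assumed sample rate, for illustration only
SAMPLES_PER_FRAME = 200  # assumed frame length (12.5 ms at 16 kHz)

@dataclass
class Frame:
    """One step of intermediate acoustic features (stand-in for a spectrogram frame)."""
    pitch_hz: float
    energy: float

def toy_feature_predictor(text: str) -> List[Frame]:
    """Placeholder for the Tacotron-style stage: text in, acoustic frames out."""
    frames = []
    for ch in text.lower():
        if ch.isalpha():
            # Hypothetical mapping: each letter gets its own pitch.
            pitch = 120.0 + 5.0 * (ord(ch) - ord("a"))
            frames.append(Frame(pitch_hz=pitch, energy=1.0))
        else:
            # Spaces and punctuation become silent frames.
            frames.append(Frame(pitch_hz=0.0, energy=0.0))
    return frames

def toy_vocoder(frames: List[Frame]) -> List[float]:
    """Placeholder for the WaveRNN-style stage: frames in, waveform samples out."""
    samples = []
    phase = 0.0
    for frame in frames:
        for _ in range(SAMPLES_PER_FRAME):
            # Advance the phase one sample at a time so the tone stays continuous.
            phase += 2.0 * math.pi * frame.pitch_hz / SAMPLE_RATE
            samples.append(frame.energy * math.sin(phase))
    return samples

def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> acoustic features -> audio samples."""
    return toy_vocoder(toy_feature_predictor(text))

frames = toy_feature_predictor("I love you")
audio = synthesize("I love you")
print(len(frames), "frames ->", len(audio), "samples")
```

The point of the split is that each stage can be adapted to a new speaker separately, which matches the description above of applying a similar fine-tuning approach to both models.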
We’re honored to have briefly reunited Tim with his voice. At this stage, it’s too early to know where our research will take us, but we are looking at ways to combine the Euphonia speech recognition systems with the speech synthesis technology so that people like Tim can more easily communicate. We hope that our research can eventually be shared more widely with those who need it most in order to communicate with their loved ones. There are thousands of people in the world whom this work might one day benefit. As Tim wrote in his letter to his younger self, what matters, in the end, is “the relationships and the people you have in your life who love you and care about you.”