DeepMind wowed the research community several years ago by defeating grandmasters in the ancient game of Go, and more recently saw its self-taught agents thrash pros in the video game StarCraft II. Now, the UK-based AI company has delivered another impressive innovation, this time in text-to-speech (TTS).
TTS systems take natural-language text as input and produce synthetic human-like speech as output. Typical synthesis pipelines are complex, comprising multiple processing stages such as text normalisation, aligned linguistic featurisation, mel-spectrogram synthesis and raw audio waveform synthesis.
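To make that staged structure concrete, here is a minimal Python sketch of such a conventional pipeline. Every function here (normalise_text, extract_linguistic_features, features_to_mel, mel_to_waveform) is a hypothetical placeholder standing in for a separately trained model component; it is not the interface of any particular TTS system.

```python
import numpy as np

def normalise_text(text: str) -> str:
    """Text normalisation: expand abbreviations, numbers, etc. into spoken-form words."""
    return text.lower().replace("dr.", "doctor")

def extract_linguistic_features(text: str) -> np.ndarray:
    """Aligned linguistic featurisation: map normalised text to per-unit feature vectors.
    Real systems need annotated alignments and phoneme labels here; this placeholder
    just one-hot encodes characters."""
    return np.eye(256)[[ord(c) % 256 for c in text]]

def features_to_mel(features: np.ndarray) -> np.ndarray:
    """Acoustic model: linguistic features -> mel-spectrogram frames.
    Stands in for a network trained against ground-truth spectrograms."""
    projection = np.random.default_rng(0).normal(size=(256, 80))
    return features @ projection

def mel_to_waveform(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Vocoder: mel-spectrogram -> raw audio samples.
    Stands in for a neural vocoder trained on ground-truth audio."""
    return np.repeat(mel.mean(axis=1), hop)

def tts(text: str) -> np.ndarray:
    # Each stage is built and supervised separately, so errors compound down the chain.
    features = extract_linguistic_features(normalise_text(text))
    return mel_to_waveform(features_to_mel(features))

waveform = tts("Dr. Smith lives at 221B Baker Street.")
print(waveform.shape)  # (num_samples,)
```

In deployed systems each of these placeholders is a sizeable model in its own right, and each needs its own aligned training targets, which is exactly the supervision cost described next.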
Although contemporary TTS systems, such as those powering digital assistants like Siri, boast high-fidelity speech synthesis and wide real-world deployment, even the best of them still have drawbacks. Each stage requires expensive “ground truth” annotations to supervise its outputs, and the systems cannot be trained directly from characters or phonemes as input to synthesise speech in the end-to-end manner increasingly favoured in other machine learning domains.
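For contrast, a hedged sketch of what that end-to-end interface looks like: a single learnable mapping from characters (or phonemes) straight to a waveform, supervised only by paired text and audio. The EndToEndTTS class and its training_step method below are illustrative assumptions, not DeepMind's architecture.

```python
import numpy as np

class EndToEndTTS:
    """Hypothetical end-to-end model: characters in, raw waveform out."""

    def __init__(self, seed: int = 0):
        self.rng = np.random.default_rng(seed)

    def __call__(self, text: str, samples_per_char: int = 600) -> np.ndarray:
        # A real model would be one differentiable network; this stub
        # simply emits noise of a plausible length for illustration.
        return self.rng.normal(size=len(text) * samples_per_char)

    def training_step(self, text: str, target_audio: np.ndarray) -> float:
        # The whole pipeline is trained against the raw audio alone,
        # with no per-stage ground-truth spectrograms or alignments.
        predicted = self(text)
        n = min(len(predicted), len(target_audio))
        return float(np.mean((predicted[:n] - target_audio[:n]) ** 2))

model = EndToEndTTS()
loss = model.training_step("hello world", np.zeros(8000))
print(loss)
```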