Human Speech A.I. Cloned From 5 Second Samples

Not long till the robots take over - listen (12/02/20)

Speech synthesis has been a thing for a while, and we’ve probably all heard it evolve from the Stephen Hawking robo-synth voice to more complex and convincing examples.

In a link sent in by John P Shea, there are examples of entire synthetic vocal characters built from just a 5 second human speech sample and applied to written text in native and other languages.

The paper is listed on arXiv, the preprint server hosted by Cornell University, under Computation and Language, and is the work of a team of smart folks.

“We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples.”
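To make the three stages in that description a bit more concrete, here is a toy Python sketch of how the data moves through such a pipeline. It is not the authors' code and contains no neural networks: the function names, `EMBED_DIM`, `N_MELS`, `HOP` and the arithmetic inside each stage are all made up for illustration. The only things taken from the description are the shapes of the hand-offs: a few seconds of audio become a fixed-size speaker vector, text plus that vector become spectrogram-like frames, and the frames become waveform samples.

```python
import math
import random

# All sizes below are hypothetical and kept tiny for illustration.
EMBED_DIM = 8   # length of the fixed-size speaker embedding
N_MELS = 4      # channels per spectrogram-like frame
HOP = 5         # waveform samples generated per frame


def speaker_encoder(reference_audio):
    """Stand-in for stage (1): variable-length audio -> fixed-size unit vector."""
    sums = [0.0] * EMBED_DIM
    for i, sample in enumerate(reference_audio):
        sums[i % EMBED_DIM] += sample
    norm = math.sqrt(sum(v * v for v in sums)) or 1.0
    return [v / norm for v in sums]


def synthesizer(text, speaker_embedding):
    """Stand-in for stage (2): text, conditioned on the embedding -> frames."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0  # toy "content" value for this character
        frames.append([base + speaker_embedding[m] for m in range(N_MELS)])
    return frames


def vocoder(frames):
    """Stand-in for stage (3): frames -> time-domain samples."""
    waveform = []
    for frame in frames:
        level = sum(frame) / len(frame)
        waveform.extend(level * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return waveform


def clone_voice(reference_audio, text):
    """Chain the three stages: reference audio + text -> waveform."""
    embedding = speaker_encoder(reference_audio)
    return vocoder(synthesizer(text, embedding))


# Usage: 5 seconds of fake "speech" at an assumed 16 kHz sample rate.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(16000 * 5)]
audio = clone_voice(reference, "hello")
print(len(speaker_encoder(reference)), len(audio))
```

The point of the split is that the speaker vector has the same size no matter how long the reference clip is, which is what would let a system like this take on a voice it never saw in training without retraining the other two stages.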

The core technology is based on Google’s Tacotron 2 end-to-end speech synthesis.

The intonation and nuance are remarkable, and the system uses something called a neural vocoder (the WaveNet-based component in the description above) to turn the spectrogram into audio.

Speech from thousands of speakers was used to train the technology, but it is unclear how long it takes to generate the synthesized voices, i.e. whether it is close to real-time or requires significant computational resources and time.

But the results are very impressive, and when you check the voices being used to speak in non-native languages (called Cross Language Voice Cloning, with varying degrees of accent control) it starts to get pretty mind blowing. I wonder if this technology can be applied to singing? Imagine the mischief that could be caused.

It's worth checking out the links below to explore the nuance and audio examples of this stuff.

But perhaps the best of it is when it goes wrong:

