With the popularity of so many streaming platforms, content is becoming increasingly diverse. More and more people are watching foreign-language shows such as “Money Heist” and “Dark”, which are both well made and available worldwide. However, many of us prefer to watch shows in a language we understand, and sometimes subtitles are not enough. That said, dubbing foreign shows into another language is time-consuming and expensive for production companies, which is the primary reason many shows never get dubbed into other languages. Well, Amazon researchers may have a solution to this problem.
In a paper published on the pre-print server arXiv.org, Amazon researchers proposed and tested a new “speech-to-speech” technology. It uses AI to convert the original speech into translated speech and then refines that translated speech to make it sound more human-like. This is a first step towards a simpler and much cheaper way of dubbing shows and movies.
How It Works
This “speech-to-speech” technology is much more complicated than it sounds. Translating original speech into speech in another language is a challenging task for computers: it is not a matter of translating directly from the audio alone, as several steps are involved.
The automated dubbing process essentially comprises three steps. First, the original speech is converted into text. Second, the text is translated into the desired language. Finally, new speech is synthesized from the translated text.
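The three stages above can be sketched as a simple pipeline. This is a toy illustration, not Amazon's implementation: each stage function here is a hypothetical stub standing in for real ASR, machine-translation, and TTS components.

```python
# Toy sketch of the three-stage automated dubbing pipeline.
# All three stage functions are stand-ins for real models.

def recognize_speech(audio: bytes) -> str:
    """Stage 1: speech-to-text (stubbed with a fixed transcript)."""
    return "hello world"

def translate_text(text: str, target_lang: str) -> str:
    """Stage 2: machine translation (stubbed with a tiny dictionary)."""
    toy_dictionary = {"hello": "ciao", "world": "mondo"}
    return " ".join(toy_dictionary.get(word, word) for word in text.split())

def synthesize_speech(text: str) -> bytes:
    """Stage 3: text-to-speech (stubbed as encoded bytes)."""
    return text.encode("utf-8")

def dub(audio: bytes, target_lang: str = "it") -> bytes:
    """Chain the three stages: audio in, dubbed audio out."""
    transcript = recognize_speech(audio)
    translation = translate_text(transcript, target_lang)
    return synthesize_speech(translation)

print(dub(b"\x00", "it"))  # b'ciao mondo'
```

In a real system each stub would be a trained model, but the overall shape, a chain of three transformations, is the same.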
Generating the new speech from the translated text brings its own complications. The translated speech should match the speed and emotion of the original. It should also preserve the original background sounds, and the reverberation of the original recording must be handled as well.
To make this complicated process work, the Amazon researchers trained their speech-to-speech system on more than 150 million English-Italian phrase pairs, teaching it to set the speed of each translated speech segment so that it matches the speed of the original. This step ensures that the pauses and breaks in the translated speech line up with those in the original.
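The speed-matching idea can be illustrated with a small calculation. This is a hypothetical simplification of the paper's approach: for each speech segment, compute a stretch factor so the synthesized translation fills the same time slot as the original, leaving the pauses between segments intact.

```python
# Hypothetical illustration of segment-level speed matching: each translated
# segment is stretched or compressed so it occupies the same time slot as the
# corresponding segment of the original speech.

def stretch_factors(original_durations, translated_durations):
    """Per-segment duration factor for the translated speech.

    A factor below 1.0 means the raw synthesized segment is too long and
    must be played faster; above 1.0 means it must be slowed down.
    """
    return [orig / trans
            for orig, trans in zip(original_durations, translated_durations)]

# The original segment takes 2.0 s but its raw translation takes 2.5 s,
# so the translation is compressed to 0.8 of its duration (1.25x speed).
print(stretch_factors([2.0, 1.5], [2.5, 1.5]))  # [0.8, 1.0]
```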
A model in the text-to-speech phase was trained on 47 hours of speech recordings. It generates a context sequence from the text, which is fed into a pre-trained vocoder that converts the sequence into a speech waveform.
The technology can also extract background sounds from the original audio and mix them into the translated audio to make it more similar to the original. Lastly, a separate step called re-reverberation adds the reverberation of the original audio to the translated one.
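The final rendering step can be sketched in a few lines. This is a deliberately naive version, assuming the background track has already been separated: the dubbed speech is summed with the extracted background, and a convolution with a room impulse response stands in for the re-reverberation step.

```python
# Simplified sketch of the final audio-rendering step: mix the dubbed speech
# with the background sounds extracted from the original, then re-apply the
# original room's reverberation via convolution with an impulse response.

def mix(speech, background):
    """Sample-wise sum of dubbed speech and extracted background sounds."""
    return [s + b for s, b in zip(speech, background)]

def apply_reverb(samples, impulse_response):
    """Naive convolution with a room impulse response (re-reverberation)."""
    out = [0.0] * (len(samples) + len(impulse_response) - 1)
    for i, sample in enumerate(samples):
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out

# Toy signals: three speech samples, a constant background, a short reverb tail.
dubbed = mix([1.0, 0.0, 0.5], [0.1, 0.1, 0.1])
rendered = apply_reverb(dubbed, [1.0, 0.3])
print(rendered)
```

A production system would do this on real waveforms with a measured or estimated impulse response; the structure of the computation, mix then re-reverberate, is what the sketch shows.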
Will It Be Useful?
The process is certainly a complicated one, but the researchers wrote that their future work will be devoted to improving automatic dubbing. If it matures, it could eliminate the need for voice actors when dubbing a show or film into another language, making dubbing faster and much cheaper. That, in turn, would let production houses deliver more shows and films to viewers and make their catalogues far more diverse.