AI researchers at Google and the University College London have detailed an AI model that can control speech characteristics like pitch, emotion and speaking rate with just 30 minutes of data. Their paper, which has been published by the International Conference on Learning Representations (ICLR), details how the researchers trained the AI system for 300,000 steps across 32 of Google’s custom-designed tensor processing units (TPUs).
According to the study, using just 30 minutes of labeled data enabled the AI algorithm to have a ‘significant degree’ of control over speech rate, valence, and arousal. The researchers further said that the new system can produce visual representations of frequencies called spectrograms by training a second model, such as DeepMind’s WaveNet, to act as a vocoder – a voice codec that analyzes and synthesizes voice data.
What’s really interesting is that the new AI model seems to address a critical limitation of an earlier study that investigated the use of ‘style tokens’, which represented different categories of emotion, to control speech effects. While that model achieved good results with only 5 percent of labeled data, it wasn’t able to satisfactorily modify speech samples that used different tones, stress, intonations and rhythms while conveying the same emotion.
The labeled data set included a total of around 45 hours of audio, including 72,405 recordings of 5-second each from 40 English speakers. The speakers were all trained voice actors who read pre-written texts with varying levels of valence (emotions like sadness or happiness) and arousal (excitement or energy). The researchers then used those recordings to obtain six ‘affective states’ that were then modeled and used as labels for the AI algorithm to train on.
While the researchers admit that new AI model can make it easier for unscrupulous parties to spread misinformation or commit fraud, they also claim that the benefits in this case far outweighs the possible risks because the study can eventually improve human-computer interfaces significantly.