Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
This work is about building an AI vocoder that is able to synthesize believable singing
from MIDI and lyrics as inputs.
But first, what is a vocoder?
It works kinda like this.
Fellow Scholars who are fans of Jean-Michel Jarre's music are likely very familiar with
this effect, I've put a link to an example song in the video description.
Make sure to leave a comment with your favorite songs with vocoders so I and other Fellow
Scholars can also nerd out on them.
And now, about the MIDI and lyrics inputs.
The lyrics part is a simple text file containing the words that this synthesized voice should
sing, and the MIDI part is data that describes the pitch, length, and velocity of each
note.
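For the more code-minded Fellow Scholars, here is a minimal sketch in Python of what these two inputs might look like; the names and values are purely illustrative and not from the paper.

    # A minimal, illustrative representation of the two inputs (not from the paper).
    from dataclasses import dataclass

    @dataclass
    class Note:
        pitch: int       # MIDI note number, e.g. 60 is middle C
        start: float     # onset time in seconds
        length: float    # duration in seconds
        velocity: int    # loudness, 0-127

    lyrics = "twin-kle twin-kle lit-tle star"   # contents of the lyrics text file
    score = [
        Note(pitch=60, start=0.0, length=0.5, velocity=90),   # "twin-"
        Note(pitch=60, start=0.5, length=0.5, velocity=90),   # "kle"
        Note(pitch=67, start=1.0, length=0.5, velocity=95),   # "twin-"
        Note(pitch=67, start=1.5, length=0.5, velocity=95),   # "kle"
    ]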
With a little simplification, we could say that the score is given as an input, and the
algorithm has to output the singing voice.
We will talk about the algorithm in a moment, but for now, let's listen to it.
Wow.
So this is a vocoder.
This means it separates the pitch and timbre components of the voice; therefore, the waveforms
are not generated directly, which is a key difference from Google DeepMind's WaveNet.
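To make this pitch-timbre separation more concrete, here is a small sketch using the open-source WORLD vocoder through the pyworld package; this is not the exact vocoder from the paper, but the analysis-synthesis principle is the same, and "voice.wav" is just a placeholder file name.

    # Decompose a voice recording into pitch, timbre and breathiness, then resynthesize.
    # pyworld expects mono float64 audio.
    import numpy as np
    import pyworld as pw
    import soundfile as sf

    x, fs = sf.read("voice.wav")
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pw.dio(x, fs)              # rough pitch contour (fundamental frequency)
    f0 = pw.stonemask(x, f0, t, fs)    # refined pitch
    sp = pw.cheaptrick(x, f0, t, fs)   # spectral envelope, i.e., the timbre
    ap = pw.d4c(x, f0, t, fs)          # aperiodicity, i.e., the breathiness

    # A model can predict these compact frame-level features instead of raw
    # waveform samples, and the vocoder turns them back into audio:
    y = pw.synthesize(f0, sp, ap, fs)
    sf.write("resynthesized.wav", y, fs)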
This leads to two big advantages: One, the generation times are quite favorable.
And by favorable, I guess you're hoping for real time.
Well, hold on to your papers, because it is not merely real time, it is 10 to 15 times faster than real time!
And two, this way, the algorithm will only need a modest amount of training data to function
well.
Here, you can see the input phonemes that make up the syllables of the lyrics, each
typically corresponding to one note.
This is then connected to a modified WaveNet architecture that uses 2-by-1 dilated convolutions.
This means that the dilation factor is doubled in each layer, thereby introducing an exponential
growth in the receptive field of the model.
This helps us keep the parameter count down, which enables training on small datasets.
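For those who would like to see what such a stack looks like in code, here is a minimal sketch in PyTorch; this is illustrative only and not the authors' implementation.

    # A stack of causal 2-by-1 dilated convolutions; the dilation factor doubles
    # in each layer, so the receptive field grows exponentially with depth.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DilatedStack(nn.Module):
        def __init__(self, channels=64, layers=8):
            super().__init__()
            self.convs = nn.ModuleList([
                nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
                for i in range(layers)
            ])

        def forward(self, x):
            for conv in self.convs:
                pad = conv.dilation[0]            # left-pad to stay causal
                x = torch.relu(conv(F.pad(x, (pad, 0))))
            return x

    net = DilatedStack()
    x = torch.randn(1, 64, 512)                   # (batch, channels, time steps)
    print(net(x).shape)                           # torch.Size([1, 64, 512])
    # Receptive field: 1 + 1 + 2 + 4 + ... + 128 = 256 time steps from 8 layers.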
As validation, mean opinion scores were recorded. In a previous episode, we discussed
that this is a number that describes how well a sound sample would pass as genuine human speech
or singing.
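As a quick refresher, the mean opinion score is simply the average of many listener ratings on a 1-to-5 scale; here is a toy example with made-up numbers.

    # Toy example: listeners rate a sample from 1 (bad) to 5 (excellent),
    # and the mean opinion score is the average. These ratings are made up.
    ratings = [4, 5, 3, 4, 4, 5, 4]
    mos = sum(ratings) / len(ratings)
    print(f"MOS = {mos:.2f}")   # MOS = 4.14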
The test showed that this new method is well ahead of the competition, scoring approximately midway
between the previous works and the reference recordings of real singing.
There are plenty of other tests in the paper; this is just one of many, so make sure to
have a look.
This is an important stepping stone towards synthesizing singing that is highly usable
in digital media, with generation that is faster than real time.
Creating a MIDI input is a piece of cake with a MIDI master keyboard, or we can even draw
the notes by hand in many digital audio workstation programs.
After that, writing the lyrics is as simple as it gets and doesn't need any additional
software.
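And for Fellow Scholars who prefer code over a piano roll, a melody like this can even be written programmatically; here is a small sketch with the mido package, where the file name is just a placeholder.

    # Write a four-note melody to a MIDI file with mido.
    import mido

    mid = mido.MidiFile()                  # default: 480 ticks per beat
    track = mido.MidiTrack()
    mid.tracks.append(track)

    for pitch in (60, 62, 64, 65):         # C4, D4, E4, F4
        track.append(mido.Message('note_on', note=pitch, velocity=90, time=0))
        track.append(mido.Message('note_off', note=pitch, velocity=0, time=480))

    mid.save('melody.mid')                 # placeholder file name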
Tools like this are going to make this process accessible to everyone.
Loving it.
If you would like to help us create more elaborate videos, please consider supporting us on Patreon.
We also support one-time payments through cryptos like Bitcoin, Ethereum and Litecoin.
Everything is available in the video description.
Thanks for watching and for your generous support, and I'll see you next time!