WaveNet: A Generative Model for Raw Audio
van den Oord et al.
This paper presents a generative, probabilistic, autoregressive model based on PixelCNN.
- it can generate raw speech signals that human listeners rate as natural
- it uses dilated causal convolutions to handle long-range temporal dependencies
Because it is a generative model, it factorizes the joint probability of a waveform as p(x) = Π_t p(x_t | x_1, …, x_{t-1}), so each audio sample is conditioned on all previous time steps.
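This factorization can be sketched in a few lines: the log-probability of the whole sequence is the sum of per-step conditional log-probabilities. The toy uniform conditional below is my own stand-in for the network's 256-way softmax output, not anything from the paper.

```python
import numpy as np

def log_joint(x, cond_logprob):
    """Autoregressive factorization: log p(x) = sum_t log p(x_t | x_1..x_{t-1})."""
    return sum(cond_logprob(x[:t], x[t]) for t in range(len(x)))

# Toy conditional: uniform over 256 quantized amplitude levels,
# standing in for the model's softmax (an assumption, not WaveNet itself).
def uniform_cond(history, x_t):
    return -np.log(256.0)

x = np.array([3, 200, 17, 88])
print(log_joint(x, uniform_cond))  # = 4 * -log(256)
```

Any conditional distribution plugged in this way gives a valid joint distribution, which is what makes the model tractable to train with plain maximum likelihood.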
It is based on causal convolution: the prediction at step t depends only on earlier time steps, so the temporal ordering cannot be violated. Because there are no recurrent connections, training on very long sequences is typically faster than with an RNN. The convolutions are also dilated: the filter skips inputs with a fixed step, which lets the network operate on a coarser scale than a normal convolution and grows the receptive field exponentially with depth.
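A minimal NumPy sketch of a dilated causal convolution (my own illustration, not the paper's code): the input is left-padded with zeros so that the output at time t only ever reads x[t], x[t-d], x[t-2d], …, never future samples.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """1-D causal convolution with dilation.

    Output at time t depends only on x[t], x[t-d], x[t-2d], ... ,
    so temporal order is preserved and output length == input length.
    """
    k = len(w)
    pad = dilation * (k - 1)          # left-pad with zeros (causality)
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
w = np.array([1.0, 1.0])              # kernel size 2, as in WaveNet
print(dilated_causal_conv1d(x, w, dilation=1))  # x[t] + x[t-1]
print(dilated_causal_conv1d(x, w, dilation=2))  # x[t] + x[t-2]
```

Stacking such layers with dilations 1, 2, 4, 8, … is how the paper gets a large receptive field with few layers.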
Experiments for this paper
Multi-speaker speech generation
This is similar to generative models of language or images: samples seem realistic at first glance (here, at first listen), but are clearly unnatural upon closer inspection. A single WaveNet was able to model speech from any of the speakers by conditioning it on a one-hot encoding of the speaker ID.
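This kind of global conditioning can be sketched as adding a learned, speaker-dependent bias to a layer's activations. The shapes and the placement of the bias here are my simplification; in the paper the conditioning term enters inside each gated activation unit.

```python
import numpy as np

num_speakers, channels, T = 4, 8, 16
rng = np.random.default_rng(0)

V = rng.normal(size=(channels, num_speakers))  # learned projection (random here)
activations = rng.normal(size=(channels, T))   # stand-in for a layer's output

def condition_on_speaker(acts, speaker_id):
    """Add the speaker embedding V @ h (h = one-hot speaker ID) at every time step."""
    h = np.zeros(num_speakers)
    h[speaker_id] = 1.0
    return acts + (V @ h)[:, None]             # broadcast the bias over time

out = condition_on_speaker(activations, speaker_id=2)
print(out.shape)  # (8, 16)
```

Because the one-hot vector just selects a column of V, this is equivalent to a per-speaker learned bias shared across all time steps, which is why one network can serve many speakers.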
Text to Speech
In mean opinion score tests it works better than the LSTM-based baseline… (Though I have to question: what about other systems? What about nsync? Is it better?)
Interestingly, this paper coincides with an idea I had a few days ago (or maybe that just reflects my lack of knowledge of prior work).
One idea: could using a GAN improve the results?