WaveNet: a generative model for raw audio

Posted by Kaiyuan Chen on December 3, 2017

WaveNet: A Generative Model for Raw Audio

Van den Oord et al.

This paper presents a generative, probabilistic, autoregressive model of raw audio, built on the PixelCNN architecture. Two highlights:

  • it can generate raw speech waveforms that human listeners rate as subjectively natural
  • it uses dilated causal convolutions to capture long-range temporal dependencies


Because it is a generative model, it factorizes the joint probability of a waveform as a product of conditionals, p(x) = ∏_t p(x_t | x_1, …, x_{t-1}), so each audio sample is conditioned on all previous time steps.
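This factorization can be made concrete with a toy sketch (the `toy_model` predictor below is entirely hypothetical, standing in for WaveNet's network): the log-probability of a sequence is just the sum of per-step conditional log-probabilities.

```python
import numpy as np

# Toy illustration of the autoregressive factorization
#   p(x) = prod_t p(x_t | x_1, ..., x_{t-1}).
# toy_model is a hypothetical stand-in for the network: it maps the
# history of samples to a categorical distribution over 4 levels.

def toy_model(history):
    # Uniform with no context; otherwise biased toward repeating
    # the most recent sample (an assumed, made-up rule).
    probs = np.full(4, 0.25)
    if len(history) > 0:
        probs = np.full(4, 0.1)
        probs[history[-1]] = 0.7
    return probs

def sequence_log_prob(x):
    # Sum one conditional log-probability per time step.
    logp = 0.0
    for t in range(len(x)):
        probs = toy_model(x[:t])
        logp += np.log(probs[x[t]])
    return logp

print(sequence_log_prob([2, 2, 1, 1]))  # log(0.25 * 0.7 * 0.1 * 0.7) ≈ -4.40
```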


It is based on causal convolution: the prediction at time t can only depend on samples before t, so the temporal ordering is never violated. Because there is no recurrence, training on very long sequences is typically faster than with an RNN. The convolutions are also dilated: the filter is applied over an input area wider than its length by skipping inputs with a fixed step, which lets the network operate on a coarser scale than a normal convolution and grow its receptive field exponentially as dilation doubles across layers.
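A minimal numpy sketch of one dilated causal convolution layer (shapes and the left-zero-padding choice are assumptions for illustration, not the paper's exact implementation): the output at time t mixes x[t], x[t-d], x[t-2d], …, and never looks at the future.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    # Causal: y[t] depends only on x[t], x[t-d], x[t-2d], ...
    # Left-pad with zeros so the output keeps the input's length.
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            # w[0] taps the current sample, w[i] taps i*dilation steps back.
            y[t] += w[i] * xp[t + pad - i * dilation]
    return y

x = np.array([1., 2., 3., 4.])
print(dilated_causal_conv1d(x, np.array([1., 1.]), dilation=1))  # [1. 3. 5. 7.]
print(dilated_causal_conv1d(x, np.array([1., 1.]), dilation=2))  # [1. 2. 4. 6.]
```

Stacking such layers with dilations 1, 2, 4, 8, … is what gives WaveNet a receptive field that grows exponentially with depth at linear parameter cost.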

Experiments for this paper

Multi-speaker speech generation

This is similar to generative models of language or images, where samples seem realistic at first glance but are clearly unnatural upon closer inspection. A single WaveNet was able to model speech from any of the speakers by conditioning it on a one-hot encoding of the speaker ID.
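The speaker conditioning can be sketched as "global conditioning": a one-hot speaker vector h is projected by learned matrices and added inside the gated activation, z = tanh(W_f·x + V_f·h) ⊙ σ(W_g·x + V_g·h). The shapes and random weights below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, channels = 4, 8

def one_hot(speaker_id, n):
    # One-hot encoding of the speaker ID.
    h = np.zeros(n)
    h[speaker_id] = 1.0
    return h

def gated_activation(x, h, Wf, Wg, Vf, Vg):
    # x: (channels,) layer input at one time step; h: (n_speakers,) one-hot.
    # The speaker term V·h biases both the filter and the gate.
    f = np.tanh(Wf @ x + Vf @ h)
    g = 1.0 / (1.0 + np.exp(-(Wg @ x + Vg @ h)))  # sigmoid gate
    return f * g

# Hypothetical randomly initialized weights, just to run the sketch.
Wf, Wg = (rng.standard_normal((channels, channels)) for _ in range(2))
Vf, Vg = (rng.standard_normal((channels, n_speakers)) for _ in range(2))
z = gated_activation(rng.standard_normal(channels), one_hot(1, n_speakers), Wf, Wg, Vf, Vg)
print(z.shape)  # (8,)
```

Because h is constant over the whole utterance, the same network renders the same text in different voices just by swapping the speaker ID.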

Text to Speech

In the subjective listening tests it outperforms an LSTM-based baseline… (Though I have to ask: what about other methods? What about nsync? Is it better than those?)


Interestingly, this paper coincides with an idea I had a few days ago (or maybe that just reflects my lack of knowledge).

One idea: could using a GAN improve the results?