Dropout A Simple Way to Prevent Neural Networks from Overfitting
by Srivastava et al.
The key idea of dropout is to randomly drop units (along with their connections) from the neural network during training. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weight.
For larger neural network, we cannot have separately trained net, so we have individual models different from each other and make neural network model different.
Motivation: natural selection & role of sex in evolution
It can be seen as a stochastic regularization techinque, with removing certain nodes at random.
For any layer l, r(l) is a vector of independent Bernoulli random variables each of which has probability p of being 1. This vector is sampled and multiplied element-wise to create a thin layer as input of next layer.
(PS: Bernoulli is 1 if p, 0 if 1-p)
Error rate greatly decreases when use dropout(from 1.6 -> 1.35) by direct comparison on MNIST, but I do notice they tend to have a larger test-training ratio than what we normally train, like for MNIST it is 6:1 and for some dataset, it even has 1:1.
It does increase the training time(around 2x to 3x), and the gradient being compute is not gradient for final architecture.
Authors leave practical guide for training dropout network. A good dropout net should have n/p units, where n is number of layers and p is probability of retaining a unit.
We should also use greater learning rate because gradient tend to cancel each other when dropout introduces a lot of noise.
The dropout rate tend to in range of 0.5 to 0.8, typically 0.8.