## CH11 Training Deep Neural Nets

a complex layered NN have the following problems

- vanishing gradient problem
- training extremely slow
- overfitting because a lot of parameters

### Vanishing gradient

Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. Because variance of the output of each layer is greater than variance of input, if going forward, the variance keep increasing until activation function saturates at the top layers.

As a result, if the function saturates at 0/1, derivative is close to 0 for sigmoid, then backpropagation nearly = 0

#### Initialization

we need the variance of the outputs of each layer to be equal to the variance of its inputs,2 and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction

Xavier initialization Normal distribution with 0 and std dev

the usage of fully connected can do this by default

```
he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = fully_connected(X, n_hidden1, weights_initializer=he_init, scope="h1")
```

#### Non-saturating Activation function

ReLU has a problem: dead ReLU that always output 0 so use a leaky RELU or ELU(exponential linear unit) activation function

#### Batch normalization

To guarantee vanishing/exploding gradient do not come back, we have batch normalization. We can do BN by adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer

This is to solve a problem that distribution of each layer’s inputs change in training, and parameter will also change //I want to call it for today because I am having my history final paper

We can call it by

```
import tensorflow as tf
from tensorflow.contrib.layers import batch_norm
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
is_training = tf.placeholder(tf.bool, shape=(), name='is_training')
bn_params = {
'is_training': is_training,
'decay': 0.99,
'updates_collections': None
}
hidden1 = fully_connected(X, n_hidden1, scope="hidden1", normalizer_fn=batch_norm, normalizer_params=bn_params)
```