L2 regularization: add (lambda/(2*m)) * ||W||_F^2 (squared Frobenius norm) to the cost
L1 regularization: add (lambda/(2*m)) * ||W||_1 (sum of absolute weight values) to the cost
The regularization term must also be added to the partial derivative dW (dW += (lambda/m)*W); the update stays W = W - alpha*dW
Intuition for the parameter lambda: lambda goes high -> weights go low -> makes the NN more linear
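A minimal numpy sketch of how the L2 term enters the cost and the gradient; the names (cross_entropy_cost, lambd, dW) are illustrative assumptions, not from these notes.

```python
import numpy as np

def l2_regularized_cost_and_grad(W, dW, cross_entropy_cost, lambd, m):
    """Add the L2 penalty (lambda/(2m)) * ||W||_F^2 to the cost and
    its derivative (lambda/m) * W to the gradient."""
    cost = cross_entropy_cost + (lambd / (2 * m)) * np.sum(np.square(W))
    dW_reg = dW + (lambd / m) * W          # regularized gradient
    return cost, dW_reg

# The parameter update itself is unchanged: W = W - alpha * dW_reg
```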
Dropout
Randomly drop (zero out) a fraction of neurons on each training iteration, so the network cannot rely on any single unit
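A minimal sketch of inverted dropout applied to one layer's activations; A and keep_prob are illustrative names.

```python
import numpy as np

def inverted_dropout(A, keep_prob=0.8):
    """Zero out units with probability 1 - keep_prob and rescale the
    survivors so the expected activation value stays the same."""
    mask = np.random.rand(*A.shape) < keep_prob
    A = A * mask
    A = A / keep_prob   # "inverted" scaling, so nothing changes at test time
    return A, mask      # reuse mask in backprop to drop the same units in dA
```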
Data Augmentation
Add more training data by distorting existing examples (e.g. flips, crops, small rotations)
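A minimal numpy sketch of two cheap distortions for image data; the (H, W, C) shape and names are illustrative assumptions.

```python
def augment(image):
    """Return simple distorted copies of one image of shape (H, W, C)."""
    flipped = image[:, ::-1, :]            # horizontal flip
    h, w, _ = image.shape
    cropped = image[2:h - 2, 2:w - 2, :]   # small central crop (resize back in practice)
    return flipped, cropped
```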
Early Stopping
Stop training early, at the point where the dev error is at its minimum (before it starts rising again)
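A minimal sketch of an early-stopping loop; the callables (train_one_epoch, dev_error, get_params, set_params) and the patience value are assumed, illustrative pieces.

```python
def train_with_early_stopping(train_one_epoch, dev_error, get_params, set_params,
                              epochs=100, patience=5):
    """Keep the parameters from the epoch with the lowest dev error and stop
    once the dev error has not improved for `patience` epochs."""
    best_err, best_params, wait = float("inf"), None, 0
    for epoch in range(epochs):
        train_one_epoch()
        err = dev_error()
        if err < best_err:
            best_err, best_params, wait = err, get_params(), 0
        else:
            wait += 1
            if wait >= patience:   # dev error stopped improving
                break
    set_params(best_params)
    return best_err
```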
Optimization
Problem: data not normalized -> slower training process
Normalize data to have mean = 0 and std = 1
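A minimal numpy sketch of normalizing training data to zero mean and unit variance; the same mu and sigma should then be reused on the dev/test sets (names are illustrative).

```python
import numpy as np

def normalize(X):
    """X has shape (n_features, m_examples); normalize each feature."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.std(X, axis=1, keepdims=True)
    X_norm = (X - mu) / (sigma + 1e-8)   # epsilon guards against zero variance
    return X_norm, mu, sigma
```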
Vanishing / exploding gradients
In deep networks, gradients can become too small or too large as they propagate through the layers
Gradient Checking
Approximate the gradient numerically from the cost evaluated at theta increased and decreased by a small value epsilon, and compare it with the analytic gradient from backprop
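A minimal sketch of a centered-difference gradient check for a single scalar parameter; J and the tolerance interpretation are illustrative assumptions.

```python
def gradient_check(J, theta, analytic_grad, epsilon=1e-7):
    """Compare the two-sided numerical gradient with the analytic one."""
    grad_approx = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)
    diff = abs(grad_approx - analytic_grad) / (abs(grad_approx) + abs(analytic_grad) + 1e-12)
    return grad_approx, diff   # a diff around 1e-7 or smaller usually means backprop is correct
```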
Optimization Algorithms
Mini-batch gradient descent
Split the input and output (X, Y) data into small slices / batches, and compute the cost and gradient on only one batch at a time
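A minimal numpy sketch of splitting (X, Y) into shuffled mini-batches; the (n_features, m_examples) shape convention and names are assumptions.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """X: (n_x, m), Y: (1, m). Return a list of (X_batch, Y_batch) pairs."""
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)          # shuffle the examples
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, batch_size):
        batches.append((X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size]))
    return batches
```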
Choosing Batch Size
Small set (m <= 2000) -> batch gradient descent
Larger set -> batch size of 64, 128, 256 or 512
Make sure the mini-batch fits in CPU/GPU memory
Exponentially weighted averages
Running average computed as v(t) = beta * v(t-1) + (1 - beta) * theta(t), which averages roughly over the last 1/(1 - beta) values
Bias Correction
Corrects the starting values of the exp. weighted average using the formula: v(t) = v(t) / (1 - beta^t)
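A minimal sketch of an exponentially weighted average with bias correction; the input sequence `values` is illustrative.

```python
def ewa_with_bias_correction(values, beta=0.9):
    """Return the bias-corrected exponentially weighted averages of a sequence."""
    v, corrected = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # bias correction matters for small t
    return corrected
```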
Gradient Descent with Momentum
Aim: accelerate the horizontal component of gradient descent to converge faster towards the solution. Based on the same formula as the exp. weighted averages, just applied to the gradients (dW, db) instead of theta
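A minimal sketch of the momentum update for one parameter matrix; the variable names follow the course-style notation as an assumption.

```python
def momentum_update(W, dW, v_dW, alpha=0.01, beta=0.9):
    """v_dW is the exponentially weighted average of past gradients."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - alpha * v_dW
    return W, v_dW
```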
RMSprop
Aim: slow down (damp) the oscillating vertical component of gradient descent and speed up the horizontal component, by dividing each gradient by the root of its exponentially weighted average of squares
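A minimal sketch of an RMSprop update for one parameter matrix; the default beta of 0.9 is an illustrative choice.

```python
import numpy as np

def rmsprop_update(W, dW, s_dW, alpha=0.001, beta=0.9, epsilon=1e-8):
    """s_dW is the exponentially weighted average of squared gradients."""
    s_dW = beta * s_dW + (1 - beta) * np.square(dW)
    W = W - alpha * dW / (np.sqrt(s_dW) + epsilon)
    return W, s_dW
```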
Adam
Combination of RMSprop and gradient descent with momentum
Hyperparameter choice: alpha needs to be tuned; beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
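A minimal sketch of an Adam update for one parameter matrix, using the default hyperparameters listed above; variable names are illustrative.

```python
import numpy as np

def adam_update(W, dW, v_dW, s_dW, t, alpha=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Momentum term v_dW plus RMSprop term s_dW, both bias-corrected at step t."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW                # momentum part
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)     # RMSprop part
    v_corr = v_dW / (1 - beta1 ** t)
    s_corr = s_dW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + epsilon)
    return W, v_dW, s_dW
```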
Learning Rate Decay
A method to lower the learning rate as training gets closer to the minimum.
Many formulas exist; the most common is: alpha = alpha_0 / (1 + decay_rate * epoch_num)
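A minimal sketch of that decay schedule; the example values in the comment are illustrative.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha_0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0 = 0.2, decay_rate = 1.0 -> alpha = 0.1 at epoch 1, ~0.067 at epoch 2, ...
```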
Tuning the algorithm's hyperparameters
Priorities (color-coded in the original diagram): darkest = most important, lightest = least important, white = left fixed
Try random values: don't use a grid
Coarse-to-fine search: zoom in on the promising region and sample more densely there
Choose the scale for random sampling carefully, e.g. for alpha use a logarithmic scale
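A minimal sketch of sampling alpha uniformly on a log scale; the 1e-4 to 1 range is an illustrative assumption.

```python
import numpy as np

def sample_learning_rate(low_exp=-4, high_exp=0):
    """Sample alpha uniformly on a log scale between 10**low_exp and 10**high_exp."""
    r = np.random.uniform(low_exp, high_exp)
    return 10 ** r
```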
Batch Normalization
Idea: normalize each layer's input (Z, not A) of the neural network
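A minimal numpy sketch of the batch-norm forward step on a layer's pre-activations Z; gamma and beta are the learnable scale/shift parameters, and the shapes are assumptions.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Z: (n_units, m). Normalize each unit over the mini-batch, then scale and shift."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)
    return gamma * Z_norm + beta   # Z_tilde, fed into the activation function
```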
Weights/parameters initialization
Zeros? NO!
Annotations:
Zero initialization makes all neurons of the network compute the same thing and behave linearly, which defeats the purpose of having a neural network.
Bad - fails to break symmetry -> the cost does not decrease
Random Init
Annotations:
- Initializing weights to very large random values does not work well.
- Initializing with small random values does better.
Good - breaks symmetry
Bad - large weights -> exploding gradients
He Init - the best!
Annotations:
Scale random weights by sqrt(2./layers_dims[l-1])
Good - ensures faster learning
Works well with ReLU activations
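A minimal numpy sketch of He initialization using the layers_dims convention from the note above; the small-random alternative in the trailing comment is shown only for contrast.

```python
import numpy as np

def initialize_he(layers_dims):
    """He initialization: random weights scaled by sqrt(2 / fan_in), biases zero."""
    params = {}
    for l in range(1, len(layers_dims)):
        params["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                * np.sqrt(2.0 / layers_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return params

# Plain small-random init would instead use: np.random.randn(...) * 0.01
```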
Dataset Split
Data > 1M: 98% train, 1% dev, 1% test
Small data: 60% train, 20% dev, 20% test
The train set may come from a different distribution than the dev/test sets; the dev and test sets themselves should come from the same distribution
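A minimal numpy sketch of splitting m examples into train/dev/test by the large-data ratios above; names and shapes are illustrative assumptions.

```python
import numpy as np

def split_dataset(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """X: (n_x, m), Y: (1, m). Shuffle the examples, then slice into train/dev/test."""
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))
```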