Introduction to Deep Learning: Lecture 1

December 22, 2020

Course and lecture info

Intro

Deep Learning: Extract patterns from data using neural networks.

Traditionally, patterns were found through hand-engineered features, which is not scalable in practice. Deep learning has resurged now because of big data, hardware, and software.

  • Stochastic Gradient Descent (1952)
  • Perceptron: Learnable Weights (1958)
  • Backpropagation: Multi-layer Perceptron (1986)
  • Deep Convolutional NN: Digit Recognition (1995)

The perceptron: Forward Propagation

1  - w0 --.   (bias)
x1 - w1 --.
          .---> sum -> non-linear activation function -> output
x2 - w2 --.
xm - wm --.
\hat{y} = g\left(w_0 + \sum_{i=1}^m x_i w_i\right) = g(w_0 + X^TW) \\ \text{where } X = \begin{bmatrix} x_1\\ \vdots\\ x_m \end{bmatrix} \text{ and } W = \begin{bmatrix} w_1\\ \vdots\\ w_m \end{bmatrix}

Common Activation Functions

These are non-linear functions; the sigmoid, for example, squashes its input into the range 0 to 1 so the output can be interpreted as a probability. The purpose is to introduce non-linearities into the network, which allow us to approximate arbitrarily complex functions.

Sigmoid function:

\begin{aligned} g(z) &= \sigma(z) = \frac 1 {1+e^{-z}} \\ g'(z) &= g(z)(1-g(z)) \end{aligned}

Hyperbolic Tangent:

\begin{aligned} g(z) &= \frac {e^z - e^{-z}} {e^z + e^{-z}} \\ g'(z) &= 1 - g(z)^2 \end{aligned}

Rectified Linear Unit (ReLU):

\begin{aligned} g(z) &= \max(0, z) \\ g'(z) &= \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases} \end{aligned}
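
A minimal sketch (assuming TensorFlow 2.x) of applying each activation element-wise to a sample tensor; the input values are arbitrary:

import tensorflow as tf

z = tf.constant([-1.0, 0.0, 2.0])

# each activation maps z element-wise through a non-linearity
print(tf.math.sigmoid(z))  # squashed into (0, 1)
print(tf.math.tanh(z))     # squashed into (-1, 1)
print(tf.nn.relu(z))       # negative values clipped to 0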

Example

We have w_0 = 1 and W = \begin{bmatrix} 3\\ -2 \end{bmatrix}.

\begin{aligned} \hat{y} &= g(w_0 + X^TW) \\ &= g\left(1+\begin{bmatrix} x_1\\ x_2 \end{bmatrix}^T\begin{bmatrix} 3\\ -2 \end{bmatrix}\right)\\ &= g(1+3x_1-2x_2) \end{aligned}

The argument of g is a line in the 2-D input space.

Test with X = \begin{bmatrix} -1\\ 2 \end{bmatrix}.

\begin{aligned} \hat{y} &= g(1 + (3 \cdot -1) - (2 \cdot 2)) \\ &= g(-6) \approx 0.002 \end{aligned}

From the result:

z < 0 \Rightarrow \hat{y} < 0.5 \\ z > 0 \Rightarrow \hat{y} > 0.5

\hat{y} = g(1+3x_1-2x_2) is a simple formula. In real networks, the learned function is hard to interpret because of the size of the inputs and outputs.
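
As a check on the arithmetic above, here is a minimal sketch (assuming TensorFlow 2.x) of the same forward pass with a sigmoid activation:

import tensorflow as tf

w0 = 1.0
W = tf.constant([[3.0], [-2.0]])  # weights from the example
X = tf.constant([[-1.0], [2.0]])  # test input

# z = w0 + X^T W = 1 + (3)(-1) + (-2)(2) = -6
z = w0 + tf.matmul(tf.transpose(X), W)
print(tf.math.sigmoid(z))  # ~0.002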

Building Neural Networks with Perceptrons

\hat{y} = g(w_0 + X^TW)

Simplified:

z = w_0 + \sum_{j=1}^m x_j w_j

Multi-output perceptron, like y_1 = g(z_1), y_2 = g(z_2):

z_i = w_{0,i} + \sum_{j=1}^m x_j w_{j,i}

Because all inputs are densely connected to all outputs, these layers are called Dense layers.

import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
  def __init__(self, input_dim, output_dim):
    super(MyDenseLayer, self).__init__()

    # Initialize weights and bias
    self.W = self.add_weight(shape=[input_dim, output_dim])
    self.b = self.add_weight(shape=[1, output_dim])

  def call(self, inputs):
    # Forward propagate the inputs
    z = tf.matmul(inputs, self.W) + self.b

    # Feed through a non-linear activation
    output = tf.math.sigmoid(z)

    return output
import tensorflow as tf

layer = tf.keras.layers.Dense(units=2)

Single Layer Neural Network

"Hidden" layers are underlying between the input and the final output.

\begin{aligned} z_i &= w_{0,i}^{(1)} + \sum_{j=1}^m x_j w_{j,i}^{(1)} \\ \hat{y}_i &= g\left(w_{0,i}^{(2)} + \sum_{j=1}^{d_1} g(z_j) w_{j,i}^{(2)}\right) \end{aligned}
import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Dense(n),
  tf.keras.layers.Dense(2)
])

Deep Neural Network

Stack many hidden layers in the network.

z_{k,i} = w_{0,i}^{(k)} + \sum_{j=1}^{n_{k-1}} g(z_{k-1,j}) w_{j,i}^{(k)}
import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Dense(n_1),
  tf.keras.layers.Dense(n_2),
  # ...
  tf.keras.layers.Dense(2)
])

Applying Neural Networks

Example: "Will I pass this class?"

A simple two feature model.

x_1 = \text{Number of lectures you attend} \\ x_2 = \text{Hours spent on the final project}

Need to train the network first.
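
A minimal sketch of such a two-feature classifier; the hidden-layer size and activations here are assumptions for illustration, not from the lecture:

import tensorflow as tf

# two inputs: lectures attended, hours spent on the final project
model = tf.keras.Sequential([
  tf.keras.layers.Dense(3, activation="relu", input_shape=(2,)),  # hidden layer size is an assumption
  tf.keras.layers.Dense(1, activation="sigmoid")                  # probability of passing
])

# untrained prediction for a hypothetical student: 4 lectures, 5 hours
print(model(tf.constant([[4.0, 5.0]])))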

Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions.

L(f(x^{(i)};W), y^{(i)})

Training adjusts the weights so that predictions move closer to the true answers.

Empirical Loss

The empirical loss measures the total loss over our entire dataset: the average of all individual losses.

J(W) = \frac 1 n \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)})
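
A minimal sketch of this averaging; the per-example loss values are hypothetical:

import tensorflow as tf

# hypothetical per-example losses L(f(x^(i); W), y^(i)) for n = 3 examples
per_example_loss = tf.constant([0.1, 0.8, 0.05])

# empirical loss J(W): the mean over the dataset
print(tf.reduce_mean(per_example_loss))  # ~0.3167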

Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1. Introduced by Claude Shannon.

\begin{aligned} J(W) &= -\frac 1 n \sum_{i=1}^n \left[ y^{(i)}\log(f(x^{(i)};W)) + (1-y^{(i)})\log(1-f(x^{(i)};W)) \right] \\ \text{Predicted}&: f(x^{(i)};W)\\ \text{Actual}&: y^{(i)} \end{aligned}
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted))

This compares how different these two distributions are.

Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers rather than true/false values.

J(W) = \frac 1 n \sum_{i=1}^n (y^{(i)} - f(x^{(i)};W))^2
loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))

Training Neural Networks

Loss Optimization

We want to find the network weights that achieve the lowest loss.

\begin{aligned} W^* &= \underset{W}{\operatorname{argmin}}\ \frac 1 n \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)}) \\ &= \underset{W}{\operatorname{argmin}}\ J(W) \\ W &= \{W^{(0)},W^{(1)},\dots\} \end{aligned}

We find each of the weights in W as follows:

  1. Pick a random initial point (w_0, w_1).
  2. Compute the gradient \frac {\partial J(W)} {\partial W}, which points in the direction of steepest ascent.
  3. Take a small step in the opposite direction of the gradient.
  4. Repeat until convergence to a local minimum.

Gradient Descent

Algorithm:

  1. Initialize weights randomly, W \sim \mathcal{N}(0, \sigma^2)
  2. Loop until convergence:
    1. Compute the gradient \frac {\partial J(W)} {\partial W}
    2. Update the weights, W \gets W - \eta \frac {\partial J(W)} {\partial W} (\eta: learning rate, how large a step to take each iteration)
  3. Return the weights
import tensorflow as tf

lr = 0.01  # learning rate (example value)
weights = tf.Variable(tf.random.normal(shape=[1]))

while True:
  with tf.GradientTape() as g:
    loss = compute_loss(weights)

  # Backpropagation
  gradient = g.gradient(loss, weights)

  # Gradient descent update
  weights = weights - lr * gradient

Computing Gradients: Backpropagation

The gradient shows how a small change in one weight (e.g. w_2) affects the final loss J(W).

We use the chain rule here.

\begin{aligned} \frac {\partial J(W)} {\partial w_2} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial w_2} \\ \frac {\partial J(W)} {\partial w_1} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial w_1} \\ &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial z_1} \cdot \frac {\partial z_1} {\partial w_1} \end{aligned}

Repeat this for every weight in the network, using gradients from later layers. Most frameworks compute this under the hood.
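
A minimal sketch of the same idea with tf.GradientTape, which records the forward pass and applies the chain rule automatically; the scalar values and the squared-error loss are assumptions for illustration:

import tensorflow as tf

w1 = tf.Variable(0.5)
w2 = tf.Variable(-0.3)
x, y = 2.0, 1.0  # hypothetical input and target

with tf.GradientTape() as tape:
  z1 = w1 * x                       # hidden pre-activation
  y_hat = tf.math.sigmoid(w2 * z1)  # output
  loss = (y - y_hat) ** 2           # hypothetical squared-error loss

# dJ/dw1 and dJ/dw2 via backpropagation (the chain rule)
print(tape.gradient(loss, [w1, w2]))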

Neural Networks in Practice: Optimization

Training neural networks is difficult.

Loss functions can be difficult to optimize. Optimization is done through gradient descent: W \gets W - \eta \frac {\partial J(W)} {\partial W}. How do we choose the learning rate \eta?

  • Small learning rates: converges slowly and gets stuck in false local minima
  • Large learning rates: overshoot, become unstable and diverge
  • Stable learning rates: converge smoothly and avoid local minima

Approaches

  1. Try many different learning rates. Or,
  2. Adaptive learning rate that adapts to the landscape
    • no longer fixed, many algorithms

Adaptive algorithms

  • SGD tf.keras.optimizers.SGD
  • Adam tf.keras.optimizers.Adam
  • Adadelta tf.keras.optimizers.Adadelta
  • Adagrad tf.keras.optimizers.Adagrad
  • RMSProp tf.keras.optimizers.RMSprop
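
Each of these is available as a Keras optimizer class and accepts an explicit learning rate; the value below is just an example:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # example learning rate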

Ref.:

  • Kiefer & Wolfowitz. "Stochastic Estimation of the Maximum of a Regression Function." 1952.
  • Kingma et al. "Adam: A Method for Stochastic Optimization." 2014.
  • Zeiler et al. "ADADELTA: An Adaptive Learning Rate Method." 2012.
  • Duchi et al. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." 2011.
import tensorflow as tf
model = tf.keras.Sequential([...])

# pick your favorite optimizer
optimizer = tf.keras.optimizers.SGD()

while True:
  with tf.GradientTape() as tape:
    # forward pass through the network
    prediction = model(x)

    # compute the loss
    loss = compute_loss(y, prediction)

  # update the weights using the gradient
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))

Neural Networks in Practice: Mini-batches

Computing the partial derivatives over the entire dataset is expensive. Instead, pick a single data point i and compute the gradient there.

Picking a single data point is easy to compute, but the gradient is very noisy (stochastic):

  1. Initialize weights randomly, W \sim \mathcal{N}(0, \sigma^2)
  2. Loop until convergence:
    1. Pick a single data point i
    2. Compute the gradient \frac {\partial J_i(W)} {\partial W}
    3. Update the weights, W \gets W - \eta \frac {\partial J(W)} {\partial W}
  3. Return the weights

Using a mini-batch of points is fast to compute and gives a much better estimate of the true gradient (a code sketch follows the list below).

  1. Initialize weights randomly, W \sim \mathcal{N}(0, \sigma^2)
  2. Loop until convergence:
    1. Pick a batch of B data points
    2. Compute the gradient \frac {\partial J(W)} {\partial W} = \frac 1 B \sum_{k=1}^B \frac {\partial J_k(W)} {\partial W}
    3. Update the weights, W \gets W - \eta \frac {\partial J(W)} {\partial W}
  3. Return the weights
  • More accurate estimation of gradient
    • Smoother convergence
    • Allows for larger learning rates
  • Mini-batches lead to fast training
    • Can parallelize computation
    • Achieve significant speed increases on GPUs
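
A minimal sketch of mini-batch training with tf.data; the batch size and the x_train, y_train, model, compute_loss, and optimizer names are assumptions carried over from the earlier sketches:

import tensorflow as tf

# shuffle the dataset and split it into mini-batches of B = 32 points
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(buffer_size=1024)
           .batch(32))

for x_batch, y_batch in dataset:
  with tf.GradientTape() as tape:
    prediction = model(x_batch)
    loss = compute_loss(y_batch, prediction)

  # gradient averaged over the mini-batch
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))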

Neural Networks in Practice: Overfitting

Overfitting is a general problem in machine learning.

  • Underfitting: Model does not have capacity to fully learn the data
  • Ideal fit
  • Overfitting: Too complex, extra parameters, does not generalize well

Regularization

A technique that constrains our optimization problem to discourage complex models. It improves the generalization of our model on unseen data.

Regularization 1: Dropout

  • During training, randomly set some activations to 0
  • Typically 'drop' 50% of activations in layer
  • Forces network to not rely on any 1 node
  • Repeat every iteration
  • Build more robust representation of its prediction
  • Generalize better to new test data
tf.keras.layers.Dropout(rate=0.5)

Regularization 2: Early Stopping

  • Stop training before we have a chance to overfit.
  • Find the point where the loss on the testing data starts to diverge from the training loss; that point lies between underfitting and overfitting (see the sketch below).
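
With Keras's built-in training loop, early stopping is available as a callback; a minimal sketch, where the patience value and the model/x_train/y_train names are assumptions:

import tensorflow as tf

# stop once the validation loss has not improved for 5 epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

# assumes model is compiled and x_train, y_train are defined
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])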

Summary

  • The Perceptron
    • Structural building blocks
    • Nonlinear activation functions
  • Neural Networks
    • Stacking Perceptrons to form neural networks
    • Optimization through backpropagation
  • Training in Practice
    • Adaptive learning
    • Batching
    • Regularization