Introduction to Deep Learning: Lecture 1

December 22, 2020

Course and lecture info

Intro

Deep Learning: Extract patterns from data using neural networks.

Traditionally, patterns were found through hand-engineered features, which is not scalable in practice. Deep learning has resurged now because of big data, hardware, and software.

  • Stochastic Gradient Descent (1952)
  • Perceptron: Learnable Weights (1958)
  • Backpropagation: Multi-layer Perceptron (1986)
  • Deep Convolutional NN: Digit Recognition (1995)

The perceptron: Forward Propagation

1  - w0 --.   (bias)
x1 - w1 --.
          .---> sum -> non-linear activation function -> output
x2 - w2 --.
xm - wm --.
\hat{y} = g\left(w_0 + \sum_{i=1}^m x_i w_i\right) = g(w_0 + X^TW) \\ \text{where } X = \begin{bmatrix} x_1\\ \vdots\\ x_m \end{bmatrix} \text{ and } W = \begin{bmatrix} w_1\\ \vdots\\ w_m \end{bmatrix}

Common Activation Functions

These are non-linear functions; the sigmoid, for example, squashes its input into the range 0 to 1 so the output can be interpreted as a probability. The purpose is to introduce non-linearities into the network, which allow us to approximate arbitrarily complex functions.

Sigmoid function:

\begin{aligned} g(z) &= \sigma(z) = \frac 1 {1+e^{-z}} \\ g'(z) &= g(z)(1-g(z)) \end{aligned}

Hyperbolic Tangent:

\begin{aligned} g(z) &= \frac {e^z - e^{-z}} {e^z + e^{-z}} \\ g'(z) &= 1 - g(z)^2 \end{aligned}

Rectified Linear Unit (ReLU):

\begin{aligned} g(z) &= \max(0, z) \\ g'(z) &= \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases} \end{aligned}
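
A minimal sketch (assuming TensorFlow 2.x) of applying each activation element-wise to a sample tensor; the input values are arbitrary:

import tensorflow as tf

z = tf.constant([-1.0, 0.0, 2.0])

# each activation maps z element-wise through a non-linearity
print(tf.math.sigmoid(z))  # squashed into (0, 1)
print(tf.math.tanh(z))     # squashed into (-1, 1)
print(tf.nn.relu(z))       # negative values clipped to 0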

Example

We have w_0 = 1 and W = \begin{bmatrix} 3\\ -2 \end{bmatrix}.

\begin{aligned} \hat{y} &= g(w_0 + X^TW) \\ &= g\left(1+\begin{bmatrix} x_1\\ x_2 \end{bmatrix}^T\begin{bmatrix} 3\\ -2 \end{bmatrix}\right)\\ &= g(1+3x_1-2x_2) \end{aligned}

The argument of g is a line in the 2-D input space.

Test with X = \begin{bmatrix} -1\\ 2 \end{bmatrix}.

\begin{aligned} \hat{y} &= g(1 + (3 \cdot -1) - (2 \cdot 2)) \\ &= g(-6) \approx 0.002 \end{aligned}

From the result:

z < 0 \Rightarrow \hat{y} < 0.5 \\ z > 0 \Rightarrow \hat{y} > 0.5

\hat{y} = g(1+3x_1-2x_2) is a simple formula. In real networks, the learned function is hard to interpret because of the size of the inputs and outputs.
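
As a check on the arithmetic above, here is a minimal sketch (assuming TensorFlow 2.x) of the same forward pass with a sigmoid activation:

import tensorflow as tf

w0 = 1.0
W = tf.constant([[3.0], [-2.0]])  # weights from the example
X = tf.constant([[-1.0], [2.0]])  # test input

# z = w0 + X^T W = 1 + (3)(-1) + (-2)(2) = -6
z = w0 + tf.matmul(tf.transpose(X), W)
print(tf.math.sigmoid(z))  # ~0.002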

Building Neural Networks with Perceptrons

\hat{y} = g(w_0 + X^TW)

Simplified:

z = w_0 + \sum_{j=1}^m x_j w_j

Multi-output perceptron, like y_1 = g(z_1), y_2 = g(z_2):

z_i = w_{0,i} + \sum_{j=1}^m x_j w_{j,i}

Because all inputs are densely connected to all outputs, these layers are called Dense layers.

import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
  def __init__(self, input_dim, output_dim):
    super(MyDenseLayer, self).__init__()

    # Initialize weights and bias
    self.W = self.add_weight(shape=[input_dim, output_dim])
    self.b = self.add_weight(shape=[1, output_dim])

  def call(self, inputs):
    # Forward propagate the inputs
    z = tf.matmul(inputs, self.W) + self.b

    # Feed through a non-linear activation
    output = tf.math.sigmoid(z)

    return output
import tensorflow as tf

layer = tf.keras.layers.Dense(units=2)

Single Layer Neural Network

"Hidden" layers are underlying between the input and the final output.

\begin{aligned} z_i &= w_{0,i}^{(1)} + \sum_{j=1}^m x_j w_{j,i}^{(1)} \\ \hat{y}_i &= g\left(w_{0,i}^{(2)} + \sum_{j=1}^{d_1} g(z_j) w_{j,i}^{(2)}\right) \end{aligned}
import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Dense(n),
  tf.keras.layers.Dense(2)
])

Deep Neural Network

Stack many hidden layers in the network.

z_{k,i} = w_{0,i}^{(k)} + \sum_{j=1}^{n_{k-1}} g(z_{k-1,j}) w_{j,i}^{(k)}
import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Dense(n_1),
  tf.keras.layers.Dense(n_2),
  # ...
  tf.keras.layers.Dense(2)
])

Applying Neural Networks

Example: "Will I pass this class?"

A simple two feature model.

x_1 = \text{Number of lectures you attend} \\ x_2 = \text{Hours spent on the final project}

Need to train the network first.
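
A minimal sketch of such a two-feature classifier; the hidden-layer size and activations here are assumptions for illustration, not from the lecture:

import tensorflow as tf

# two inputs: lectures attended, hours spent on the final project
model = tf.keras.Sequential([
  tf.keras.layers.Dense(3, activation="relu", input_shape=(2,)),  # hidden layer size is an assumption
  tf.keras.layers.Dense(1, activation="sigmoid")                  # probability of passing
])

# untrained prediction for a hypothetical student: 4 lectures, 5 hours
print(model(tf.constant([[4.0, 5.0]])))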

Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions.

L(f(x^{(i)};W), y^{(i)})

Training adjusts the weights so that predictions move closer to the true answers.

Empirical Loss

The empirical loss measures the total loss over our entire dataset: the average of all individual losses.

J(W) = \frac 1 n \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)})
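
A minimal sketch of this averaging; the per-example loss values are hypothetical:

import tensorflow as tf

# hypothetical per-example losses L(f(x^(i); W), y^(i)) for n = 3 examples
per_example_loss = tf.constant([0.1, 0.8, 0.05])

# empirical loss J(W): the mean over the dataset
print(tf.reduce_mean(per_example_loss))  # ~0.3167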

Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1. Introduced by Claude Shannon.

\begin{aligned} J(W) &= -\frac 1 n \sum_{i=1}^n \left[ y^{(i)}\log(f(x^{(i)};W)) + (1-y^{(i)})\log(1-f(x^{(i)};W)) \right] \\ \text{Predicted}&: f(x^{(i)};W)\\ \text{Actual}&: y^{(i)} \end{aligned}
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted))

This compares how different these two distributions are.

Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers rather than true/false values.

J(W) = \frac 1 n \sum_{i=1}^n (y^{(i)} - f(x^{(i)};W))^2
loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))

Training Neural Networks

Loss Optimization

We want to find the network weights that achieve the lowest loss.

\begin{aligned} W^* &= \underset{W}{\operatorname{argmin}}\ \frac 1 n \sum_{i=1}^n L(f(x^{(i)};W), y^{(i)}) \\ &= \underset{W}{\operatorname{argmin}}\ J(W) \\ W &= \{W^{(0)},W^{(1)},\dots\} \end{aligned}

We find each of the weights in W as follows:

  1. Pick a random initial point (w_0, w_1).
  2. Compute the gradient \frac {\partial J(W)} {\partial W}, which points in the direction of steepest ascent.
  3. Take a small step in the opposite direction of the gradient.
  4. Repeat until convergence to a local minimum.

Gradient Descent

Algorithm:

  1. Initialize weights randomly, W \sim \mathcal{N}(0, \sigma^2)
  2. Loop until convergence:
    1. Compute the gradient \frac {\partial J(W)} {\partial W}
    2. Update the weights, W \gets W - \eta \frac {\partial J(W)} {\partial W} (\eta: learning rate, how large a step to take each iteration)
  3. Return the weights
import tensorflow as tf

lr = 0.01  # learning rate (example value)
weights = tf.Variable(tf.random.normal(shape=[1]))

while True:
  with tf.GradientTape() as g:
    loss = compute_loss(weights)

  # Backpropagation
  gradient = g.gradient(loss, weights)

  # Gradient descent update
  weights = weights - lr * gradient

Computing Gradients: Backpropagation

The gradient shows how a small change in one weight (e.g. w_2) affects the final loss J(W).

We use the chain rule here.

\begin{aligned} \frac {\partial J(W)} {\partial w_2} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial w_2} \\ \frac {\partial J(W)} {\partial w_1} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial w_1} \\ &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial z_1} \cdot \frac {\partial z_1} {\partial w_1} \end{aligned}

Repeat this for every weight in the network, using gradients from later layers. Most frameworks compute this under the hood.
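
A minimal sketch of the same idea with tf.GradientTape, which records the forward pass and applies the chain rule automatically; the scalar values and the squared-error loss are assumptions for illustration:

import tensorflow as tf

w1 = tf.Variable(0.5)
w2 = tf.Variable(-0.3)
x, y = 2.0, 1.0  # hypothetical input and target

with tf.GradientTape() as tape:
  z1 = w1 * x                       # hidden pre-activation
  y_hat = tf.math.sigmoid(w2 * z1)  # output
  loss = (y - y_hat) ** 2           # hypothetical squared-error loss

# dJ/dw1 and dJ/dw2 via backpropagation (the chain rule)
print(tape.gradient(loss, [w1, w2]))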

Neural Networks in Practice: Optimization

Training neural networks is difficult.

Loss functions can be difficult to optimize. Optimization is done through gradient descent: W \gets W - \eta \frac {\partial J(W)} {\partial W}. How do we choose the learning rate \eta?

  • Small learning rates: converges slowly and gets stuck in false local minima
  • Large learning rates: overshoot, become unstable and diverge
  • Stable learning rates: converge smoothly and avoid local minima

Approaches

  1. Try many different learning rates. Or,
  2. Adaptive learning rate that adapts to the landscape
    • no longer fixed, many algorithms

Adaptive algorithms

  • SGD tf.keras.optimizers.SGD
  • Adam tf.keras.optimizers.Adam
  • Adadelta tf.keras.optimizers.Adadelta
  • Adagrad tf.keras.optimizers.Adagrad
  • RMSProp tf.keras.optimizers.RMSprop
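
Each of these is available as a Keras optimizer class and accepts an explicit learning rate; the value below is just an example:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # example learning rate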

Ref.:

  • Kiefer & Wolfowitz. "Stochastic Estimation of the Maximum of a Regression Function." 1952.
  • Kingma et al. "Adam: A Method for Stochastic Optimization." 2014.
  • Zeiler et al. "ADADELTA: An Adaptive Learning Rate Method." 2012.
  • Duchi et al. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." 2011.
import tensorflow as tf
model = tf.keras.Sequential([...])

# pick your favorite optimizer
optimizer = tf.keras.optimizers.SGD()

while True:
  with tf.GradientTape() as tape:
    # forward pass through the network
    prediction = model(x)

    # compute the loss
    loss = compute_loss(y, prediction)

  # update the weights using the gradient
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))

Neural Networks in Practice: Mini-batches

Computing the partial derivatives over the entire dataset is expensive. Instead, pick a single data point i and compute the gradient there.

Picking a single data point is easy to compute, but the gradient is very noisy (stochastic):

  1. Initialize weights randomly, W \sim \mathcal{N}(0, \sigma^2)
  2. Loop until convergence:
    1. Pick a single data point i
    2. Compute the gradient \frac {\partial J_i(W)} {\partial W}
    3. Update the weights, W \gets W - \eta \frac {\partial J(W)} {\partial W}
  3. Return the weights

Using a mini-batch of points is fast to compute and gives a much better estimate of the true gradient (a code sketch follows the list below).

  1. Initialize weights randomly, W \sim \mathcal{N}(0, \sigma^2)
  2. Loop until convergence:
    1. Pick a batch of B data points
    2. Compute the gradient \frac {\partial J(W)} {\partial W} = \frac 1 B \sum_{k=1}^B \frac {\partial J_k(W)} {\partial W}
    3. Update the weights, W \gets W - \eta \frac {\partial J(W)} {\partial W}
  3. Return the weights
  • More accurate estimation of gradient
    • Smoother convergence
    • Allows for larger learning rates
  • Mini-batches lead to fast training
    • Can parallelize computation
    • Achieve significant speed increases on GPUs
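
A minimal sketch of mini-batch training with tf.data; the batch size and the x_train, y_train, model, compute_loss, and optimizer names are assumptions carried over from the earlier sketches:

import tensorflow as tf

# shuffle the dataset and split it into mini-batches of B = 32 points
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(buffer_size=1024)
           .batch(32))

for x_batch, y_batch in dataset:
  with tf.GradientTape() as tape:
    prediction = model(x_batch)
    loss = compute_loss(y_batch, prediction)

  # gradient averaged over the mini-batch
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))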

Neural Networks in Practice: Overfitting

Overfitting is a general problem in machine learning.

  • Underfitting: Model does not have capacity to fully learn the data
  • Ideal fit
  • Overfitting: Too complex, extra parameters, does not generalize well

Regularization

A technique that constrains our optimization problem to discourage complex models. It improves the generalization of our model on unseen data.

Regularization 1: Dropout

  • During training, randomly set some activations to 0
  • Typically 'drop' 50% of activations in layer
  • Forces network to not rely on any 1 node
  • Repeat every iteration
  • Build more robust representation of its prediction
  • Generalize better to new test data
tf.keras.layers.Dropout(rate=0.5)

Regularization 2: Early Stopping

  • Stop training before we have a chance to overfit.
  • Find the point where the loss on the testing data starts to diverge from the training loss; that point lies between underfitting and overfitting (see the sketch below).
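
With Keras's built-in training loop, early stopping is available as a callback; a minimal sketch, where the patience value and the model/x_train/y_train names are assumptions:

import tensorflow as tf

# stop once the validation loss has not improved for 5 epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

# assumes model is compiled and x_train, y_train are defined
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])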

Summary

  • The Perceptron
    • Structural building blocks
    • Nonlinear activation functions
  • Neural Networks
    • Stacking Perceptrons to form neural networks
    • Optimization through backpropagation
  • Training in Practice
    • Adaptive learning
    • Batching
    • Regularization