Course and lecture info

Intro

Deep learning: extracting patterns from data using neural networks.

Traditionally, patterns are captured with hand-engineered features, which is not scalable in practice. Deep learning has resurged now because of big data, hardware (GPUs), and software.

The perceptron: Forward Propagation

1  - w0 --.   (bias)
x1 - w1 --.
          .---> sum -> non-linear activation function -> output
x2 - w2 --.
xm - wm --.

$$ \hat{y} = g\left(w_0 + \sum_{i=1}^m x_iw_i\right) = g(w_0 + X^TW) \quad \text{where } X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \text{ and } W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} $$

Common Activation Functions

Activation functions are non-linear; the sigmoid, for example, squashes its input into a probability between 0 and 1. Their purpose is to introduce non-linearities into the network, which allow us to approximate arbitrarily complex functions.

Sigmoid function:

$$ \begin{aligned} g(z) &= \sigma(z) = \frac 1 {1+e^{-z}} \\ g'(z) &= g(z)(1-g(z)) \end{aligned} $$

Hyperbolic Tangent:

$$ \begin{aligned} g(z) &= \frac {e^z - e^{-z}} {e^z + e^{-z}} \\ g'(z) &= 1 - g(z)^2 \end{aligned} $$

Rectified Linear Unit (ReLU):

$$ \begin{aligned} g(z) &= \max(0, z) \\ g'(z) &= \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases} \end{aligned} $$
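All three are available directly in TensorFlow; a minimal sketch evaluating them on a few sample inputs (the values are arbitrary):

import tensorflow as tf

z = tf.constant([-1.0, 0.0, 0.5, 2.0])  # arbitrary sample inputs

print(tf.math.sigmoid(z))  # squashes into (0, 1)
print(tf.math.tanh(z))     # squashes into (-1, 1)
print(tf.nn.relu(z))       # zeroes out negative inputs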

Example

We have $w_0=1$ and $W=\begin{bmatrix} 3 \\ -2 \end{bmatrix}$.

$$ \begin{aligned} \hat{y} &= g(w_0 + X^TW) \\ &= g\left(1+\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T\begin{bmatrix} 3 \\ -2 \end{bmatrix}\right) \\ &= g(1+3x_1-2x_2) \end{aligned} $$

The decision boundary $1+3x_1-2x_2 = 0$ is a line in the 2-D input space.

Test with $X = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$.

$$ \begin{aligned} \hat{y} &= g(1 + (3 \cdot -1) - (2 \cdot 2)) \\ &= g(-6) \approx 0.002 \end{aligned} $$

For the sigmoid output,

$$ z < 0 \implies \hat{y} < 0.5, \qquad z > 0 \implies \hat{y} > 0.5 $$
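A quick numerical check of this example in TensorFlow (a minimal sketch; the weights and test input are the ones above):

import tensorflow as tf

w0 = 1.0
W = tf.constant([[3.0], [-2.0]])  # weight vector from the example
X = tf.constant([[-1.0], [2.0]])  # test input

z = w0 + tf.matmul(tf.transpose(X), W)  # 1 + 3*(-1) - 2*2 = -6
y_hat = tf.math.sigmoid(z)              # ≈ 0.002
print(y_hat)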

$\hat{y}= g(1+3x_1-2x_2)$ is a straightforward formula, but in real networks the learned function is hard to interpret because of the number of inputs and weights.

Building Neural Networks with Perceptrons

$\hat{y} = g(w_0 + X^TW)$

Simplified, writing $z$ for the weighted sum before the activation:

$$ z = w_0 + \sum_{j=1}^m x_jw_j $$

Multi-output perceptron, e.g. $y_1 = g(z_1)$ and $y_2 = g(z_2)$, where:

$$ z_i = w_{0,i} + \sum_{\substack{j=1}}^mx_jw_{j,i} $$

Because all inputs are densely connected to all outputs, these layers are called Dense layers.

import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
  def __init__(self, input_dim, output_dim):
    super(MyDenseLayer, self).__init__()

    # Initialize weights and bias
    self.W = self.add_weight(shape=[input_dim, output_dim])
    self.b = self.add_weight(shape=[1, output_dim])

  def call(self, inputs):
    # Forward propagate the inputs: z = xW + b
    z = tf.matmul(inputs, self.W) + self.b

    # Feed through a non-linear activation
    output = tf.math.sigmoid(z)

    return output
TensorFlow already provides this as a built-in layer:

import tensorflow as tf

layer = tf.keras.layers.Dense(units=2)

Single Layer Neural Network

"Hidden" layers are underlying between the input and the final output.

$$ \begin{aligned} z_i &= w_{0,i}^{(1)} + \sum_{j=1}^m x_jw_{j,i}^{(1)} \\ \hat{y}_i &= g\left(w_{0,i}^{(2)} + \sum_{j=1}^{d_1}g(z_j)w_{j,i}^{(2)}\right) \end{aligned} $$

import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Dense(n),  # hidden layer with n units
  tf.keras.layers.Dense(2)   # output layer with 2 units
])

Deep Neural Network

Stack many hidden layers in the network:

$$ z_{k,i} = w_{0,i}^{(k)} + \sum_{\substack{j=1}}^{n_{k-1}}g(z_{k-1,j})w_{j,i}^{(k)} $$

import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Dense(n_1),
  tf.keras.layers.Dense(n_2),
  # ...
  tf.keras.layers.Dense(2)
])

Applying Neural Networks

Example: "Will I pass this class?"

A simple two-feature model:

$$ \begin{aligned} x_1 &= \text{number of lectures you attend} \\ x_2 &= \text{hours spent on the final project} \end{aligned} $$
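A minimal sketch of such a network, assuming a small hidden layer and a single sigmoid output for the probability of passing (the layer sizes and activations are illustrative choices, not from the lecture):

import tensorflow as tf

# 2 features: lectures attended, hours spent on the final project
model = tf.keras.Sequential([
  tf.keras.layers.Dense(3, activation="relu"),    # hidden layer (size chosen arbitrarily)
  tf.keras.layers.Dense(1, activation="sigmoid")  # probability of passing
])

prediction = model(tf.constant([[4.0, 5.0]]))  # e.g. attended 4 lectures, spent 5 hours
print(prediction)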

An untrained network gives meaningless predictions, so we need to train it first.

Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions.

$$ L(f(x^{(i)};W),y^{(i)}) $$

Training adjusts the weights so the predictions move closer to the true answers.

Empirical Loss

The empirical loss measures the total loss over our entire dataset: the average of all individual losses.

$$ J(W) = \frac 1 n \sum_{\substack{i=1}}^n L(f(x^{(i)};W),y^{(i)}) $$
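As a concrete illustration, the empirical loss is just the mean of the per-example losses; a minimal sketch with made-up labels and predictions, using squared error as the per-example loss:

import tensorflow as tf

y_true = tf.constant([1.0, 0.0, 1.0])  # made-up labels y^(i)
y_pred = tf.constant([0.8, 0.3, 0.6])  # made-up predictions f(x^(i); W)

per_example_loss = tf.square(y_true - y_pred)      # L(f(x^(i); W), y^(i))
empirical_loss = tf.reduce_mean(per_example_loss)  # J(W)
print(empirical_loss)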

Binary Cross Entropy Loss

Cross-entropy loss can be used with models that output a probability between 0 and 1. Cross entropy was introduced by Claude Shannon in information theory.

$$ \begin{aligned} J(W) &= -\frac 1 n \sum_{i=1}^n y^{(i)}\log(f(x^{(i)};W))+(1-y^{(i)})\log(1-f(x^{(i)};W)) \\ \text{Predicted}&: f(x^{(i)};W) \\ \text{Actual}&: y^{(i)} \end{aligned} $$

# binary cross-entropy between the true labels and the predicted probabilities
loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y, predicted))

It measures how different these two distributions are.

Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers rather than a class probability.

$$ J(W) = \frac 1 n \sum_{\substack{i=1}}^n (y^{(i)} - f(x^{(i)};W))^2 $$

loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))

Training Neural Networks

Loss Optimization

We want to find the network weights that achieve the lowest loss.

$$ \begin{aligned} W^* &= \underset{W}{\arg\min}\ \frac 1 n \sum_{i=1}^n L(f(x^{(i)};W),y^{(i)}) \\ &= \underset{W}{\arg\min}\ J(W) \\ W &= \{W^{(0)},W^{(1)},\dots\} \end{aligned} $$

To find these weights:

  1. Pick a random place as the initial $(w_0, w_1)$.
  2. Compute the gradient $\frac {\partial J(W)} {\partial W}$, which points in the direction of steepest ascent.
  3. Take a small step in the opposite direction of the gradient.
  4. Repeat until convergence to a local minimum.

Gradient Descent

Algorithm:

  1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
  2. Loop until convergence:
    1. Compute the gradient $\frac {\partial J(W)} {\partial W}$
    2. Update the weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$ ($\eta$: learning rate, how large a step to take each iteration)
  3. Return weights
import tensorflow as tf

# a single weight as a placeholder; real networks have one entry per parameter
weights = tf.Variable(tf.random.normal(shape=[1]))

lr = 0.01  # learning rate (illustrative value)

while True:  # loop until convergence
  with tf.GradientTape() as g:
    loss = compute_loss(weights)  # placeholder loss function

  gradient = g.gradient(loss, weights)  # backpropagation
  weights.assign_sub(lr * gradient)     # W <- W - lr * dJ/dW

Computing Gradients: Backpropagation

The gradient tells us how a small change in one weight (e.g. $w_2$) affects the final loss $J(W)$.

Apply the chain rule:

$$ \begin{aligned} \frac {\partial J(W)} {\partial W_2} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial W_2} \\ \frac {\partial J(W)} {\partial W_1} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial W_1} \\ &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial z_1} \cdot \frac {\partial z_1} {\partial W_1} \end{aligned} $$

Repeat this for every weight in the network, reusing gradients from later layers. Most frameworks compute this for us under the hood.
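A minimal sketch of letting TensorFlow apply the chain rule on a tiny one-hidden-unit network (the weight values and training example are arbitrary):

import tensorflow as tf

w1 = tf.Variable(0.5)   # input -> hidden weight (arbitrary value)
w2 = tf.Variable(-1.0)  # hidden -> output weight (arbitrary value)
x, y = 2.0, 1.0         # a single made-up training example

with tf.GradientTape() as tape:
  z1 = w1 * x
  y_hat = tf.math.sigmoid(w2 * tf.math.sigmoid(z1))
  loss = tf.square(y - y_hat)

# dJ/dw1 and dJ/dw2, computed automatically via the chain rule
grads = tape.gradient(loss, [w1, w2])
print(grads)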

Neural Networks in Practice: Optimization

Training neural networks is difficult.

Loss functions can be difficult to optimize. Optimization is done through gradient descent, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$; how do we decide the learning rate $\eta$?

Approaches

  1. Try many different learning rates. Or,
  2. Use an adaptive learning rate that adapts to the loss landscape
    • no longer fixed; many algorithms exist

Adaptive algorithms

Ref.:
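TensorFlow ships several of these optimizers; a minimal sketch of instantiating them (the learning rates are illustrative):

import tensorflow as tf

# fixed learning rate
sgd = tf.keras.optimizers.SGD(learning_rate=1e-2)

# adaptive learning-rate algorithms
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)
adadelta = tf.keras.optimizers.Adadelta()
adagrad = tf.keras.optimizers.Adagrad()
rmsprop = tf.keras.optimizers.RMSprop()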

import tensorflow as tf
model = tf.keras.Sequential([...])

# pick your favorite optimizer
optimizer = tf.keras.optimizers.SGD()

while True:
  with tf.GradientTape() as tape:
    # forward pass through the network (inside the tape so gradients can flow)
    prediction = model(x)

    # compute the loss
    loss = compute_loss(y, prediction)

  # update the weights using the gradient
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))

Neural Networks in Practice: Mini-batches

Computing the gradient over every data point at each step is expensive.

Stochastic gradient descent: pick a single data point $i$ and compute the gradient there. Easy to compute but very noisy (stochastic):

  1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
  2. Loop until convergence:
    1. Pick a single data point $i$
    2. Compute the gradient $\frac {\partial J_i(W)} {\partial W}$
    3. Update the weights, $W \gets W - \eta \frac {\partial J_i(W)} {\partial W}$
  3. Return weights

Mini-batch gradient descent: pick a batch of points. Still fast to compute and a much better estimate of the true gradient (a sketch follows the list below):

  1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
  2. Loop until convergence:
    1. Pick a batch of $B$ data points
    2. Compute the gradient $\frac {\partial J(W)} {\partial W} = \frac 1 B \sum_{k=1}^B \frac {\partial J_k(W)} {\partial W}$
    3. Update the weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$
  3. Return weights
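A minimal self-contained sketch of the mini-batch loop using tf.data (the toy data, model, and loss are made up for illustration):

import tensorflow as tf

# toy regression data, just to demonstrate the batching loop
x_train = tf.random.normal([256, 2])
y_train = tf.random.normal([256, 1])

B = 32  # batch size (illustrative)
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(256).batch(B)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2)

for x_batch, y_batch in dataset:
  with tf.GradientTape() as tape:
    prediction = model(x_batch)
    # mean squared error averaged over the B points in the batch
    loss = tf.reduce_mean(tf.square(y_batch - prediction))

  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))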

Neural Networks in Practice: Overfitting

Overfitting is a general problem in machine learning.

Regularization

A technique that constrains our optimization problem to discourage complex models and improve the generalization of our model on unseen data.

Regularization 1: Dropout

During training, randomly set some of a layer's activations to zero (typically 50%), so the network cannot rely on any single node.

tf.keras.layers.Dropout(rate=0.5)
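In practice the dropout layer sits between dense layers; a minimal sketch (the layer sizes are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Dense(64, activation="relu"),
  tf.keras.layers.Dropout(rate=0.5),  # randomly drop 50% of activations during training
  tf.keras.layers.Dense(64, activation="relu"),
  tf.keras.layers.Dropout(rate=0.5),
  tf.keras.layers.Dense(2)
])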

Regularization 2: Early Stopping

Stop training before the model has a chance to overfit, i.e. when the loss on a held-out validation set starts to increase.
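In Keras this is typically done with a callback that watches the validation loss (a minimal sketch; the monitored metric and patience are common choices, not from the notes):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",  # stop when validation loss stops improving
    patience=5           # allow a few epochs without improvement first
)

# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])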

Summary

The lecture covered the perceptron, stacking perceptrons into dense layers and deep networks, loss functions (cross entropy, mean squared error), training with gradient descent and backpropagation, and practical concerns: learning rates and adaptive optimizers, mini-batching, and regularization with dropout and early stopping.
