December 22, 2020

# Intro

Deep Learning: Extract patterns from data using neural networks.

Traditionally, finding patterns meant hand-engineering features, which is not scalable in practice. Deep learning has resurged now because of big data, hardware, and software.

• Perceptron: Learnable Weights (1958)
• Backpropagation: Multi-layer Perceptron (1986)
• Deep Convolutional NN: Digit Recognition (1995)

# The perceptron: Forward Propagation

```
1 (bias) -- w0 --+
x1 ------- w1 ---+
x2 ------- w2 ---+--> sum --> non-linear activation function --> output ŷ
 ⋮               |
xm ------- wm ---+
```
$\hat{y} = g(w_0 + \sum_{\substack{i=1}}^mx_iw_i) \\ = g(w_0 + X^TW) \\ \\ \text{where: }X = \begin{bmatrix} x_1\\ \vdots\\ x_m\\ \end{bmatrix} \text{and }W = \begin{bmatrix} w_1\\ \vdots\\ w_m\\ \end{bmatrix}$

## Common Activation Functions

Activation functions are non-linear; the sigmoid, for example, squashes its input into a value between 0 and 1 that can be read as a probability. The purpose is to introduce non-linearities into the network, and non-linearities allow us to approximate arbitrarily complex functions.

Sigmoid function:

\begin{aligned} g(z) &= \sigma(z) = \frac 1 {1+e^{-z}} \\ g'(z) &= g(z)(1-g(z)) \end{aligned}

Hyperbolic Tangent:

\begin{aligned} g(z) &= \frac {e^z - e^{-z}} {e^z + e^{-z}} \\ g'(z) &= 1 - g(z)^2 \end{aligned}

Rectified Linear Unit (RELU):

\begin{aligned} g(z) &= \text{max}(0, z) \\ g'(z) &= \begin{cases} 1,& z > 0\\ 0,& \text{otherwise} \end{cases} \end{aligned}
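A minimal sketch of these activations and their hand-written derivatives in TensorFlow (the sample values in z are arbitrary):

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 2.0])

# Sigmoid: g(z) = 1 / (1 + e^-z), g'(z) = g(z)(1 - g(z))
sig = tf.math.sigmoid(z)
sig_grad = sig * (1 - sig)

# Hyperbolic tangent: g'(z) = 1 - g(z)^2
tanh = tf.math.tanh(z)
tanh_grad = 1 - tanh ** 2

# ReLU: g(z) = max(0, z), g'(z) = 1 if z > 0 else 0
relu = tf.nn.relu(z)
relu_grad = tf.cast(z > 0, tf.float32)
```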

## Example

We have $w_0=1$ and $W=\begin{bmatrix} 3\\ -2 \end{bmatrix}$.

\begin{aligned} \hat{y} &= g(w_0 + X^TW) \\ &= g\left(1+\begin{bmatrix} x_1\\ x_2 \end{bmatrix}^T\begin{bmatrix} 3\\ -2 \end{bmatrix}\right)\\ &= g(1+3x_1-2x_2) \end{aligned}

The argument $1+3x_1-2x_2 = 0$ defines a line in the 2-D input space: the decision boundary.

Test with $X = \begin{bmatrix} -1\\ 2 \end{bmatrix}$.

\begin{aligned} \hat{y} &= g (1 + (3 * -1) - (2 * 2))\\ &= g(-6) \approx 0.002 \end{aligned}

Relative to that line, the sign of $z$ determines the output:

$z < 0 \Rightarrow \hat{y} < 0.5 \\ z > 0 \Rightarrow \hat{y} > 0.5$

$\hat{y}= g(1+3x_1-2x_2)$ is a simple formula that we can visualize directly. In real networks the learned function is hard to interpret because of the number of inputs and weights.
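One way to check the example above numerically, using tf.math.sigmoid as the activation $g$ (a small sketch):

```python
import tensorflow as tf

w0 = 1.0
W = tf.constant([[3.0], [-2.0]])
X = tf.constant([[-1.0], [2.0]])

z = w0 + tf.matmul(tf.transpose(X), W)  # 1 + 3*(-1) - 2*2 = -6
y_hat = tf.math.sigmoid(z)              # ~0.002
```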

# Building Neural Networks with Perceptrons

$\hat{y} = g(w_0 + X^TW)$

Simplified:

$z = w_0 + \sum_{\substack{j=1}}^mx_jw_j$

Multi-output perceptron, with outputs $\hat{y}_1 = g(z_1)$ and $\hat{y}_2 = g(z_2)$; each output has its own set of weights:

$z_i = w_{0,i} + \sum_{\substack{j=1}}^mx_jw_{j,i}$

Because all inputs are densely connected to all outputs, these layers are called Dense layers.

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super(MyDenseLayer, self).__init__()

        # Initialize weights and bias
        self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        # Forward propagate the inputs
        z = tf.matmul(inputs, self.W) + self.b

        # Feed through a non-linear activation
        output = tf.math.sigmoid(z)

        return output
```
```python
import tensorflow as tf

# The same densely connected layer, using the built-in Keras implementation
layer = tf.keras.layers.Dense(units=2)
```

## Single Layer Neural Network

"Hidden" layers are underlying between the input and the final output.

\begin{aligned} z_i &= w_{0,i}^{(1)} + \sum_{\substack{j=1}}^mx_jw_{j,i}^{(1)} \\ \hat{y}_i &= g\left(w_{0,i}^{(2)} + \sum_{\substack{j=1}}^{d_1}g(z_j)w_{j,i}^{(2)}\right) \end{aligned}
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n),  # hidden layer with n units
    tf.keras.layers.Dense(2)   # output layer
])
```

## Deep Neural Network

Stack many such layers to form a deep network.

$z_{k,i} = w_{0,i}^{(k)} + \sum_{\substack{j=1}}^{n_{k-1}}g(z_{k-1,j})w_{j,i}^{(k)}$
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_1),
    tf.keras.layers.Dense(n_2),
    # ...
    tf.keras.layers.Dense(2)
])
```

# Applying Neural Networks

## Example: "Will I pass this class?"

A simple two-feature model.

$x_1 = \text{Number of lectures you attend} \\ x_2 = \text{Hours spent on the final project}$

We need to train the network on data before it can make this prediction; a model sketch follows below.
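A minimal sketch of such a two-feature classifier (the hidden size of 3, the ReLU activation, and the single sigmoid output are assumptions for illustration):

```python
import tensorflow as tf

# x1 = number of lectures attended, x2 = hours spent on the final project
model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation="relu", input_shape=(2,)),  # hidden layer (size assumed)
    tf.keras.layers.Dense(1, activation="sigmoid"),                 # output: probability of passing
])

# Untrained prediction for a hypothetical student: 4 lectures, 5 hours
print(model(tf.constant([[4.0, 5.0]])))
```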

## Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions.

$L(f(x^{(i)};W),y^{(i)})$

## Empirical Loss

The empirical loss measures the total loss over our entire dataset. It is the average of all individual losses.

$J(W) = \frac 1 n \sum_{\substack{i=1}}^n L(f(x^{(i)};W),y^{(i)})$
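In code, the empirical loss is simply the mean of the per-example losses; a sketch with placeholder values:

```python
import tensorflow as tf

# One loss value L(f(x_i; W), y_i) per training example (placeholder numbers)
per_example_loss = tf.constant([0.2, 0.7, 0.1, 0.4])

# J(W): average over the entire dataset
empirical_loss = tf.reduce_mean(per_example_loss)
```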

## Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1. The cross entropy concept comes from Claude Shannon's information theory.

\begin{aligned} J(W) &= -\frac 1 n \sum_{\substack{i=1}}^n \left[ y^{(i)}\log(f(x^{(i)};W))+(1-y^{(i)})\log(1-f(x^{(i)};W)) \right]\\ \text{Predicted}&: f(x^{(i)};W)\\ \text{Actual}&: y^{(i)} \end{aligned}
```python
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted))
```

It compares how different these two distributions are.
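For strictly binary labels, Keras also provides a dedicated binary cross entropy loss; a roughly equivalent call (assuming y holds 0/1 labels and predicted holds probabilities):

```python
loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y, predicted))
```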

## Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers: the target is a real-valued number rather than true or false.

$J(W) = \frac 1 n \sum_{\substack{i=1}}^n (y^{(i)} - f(x^{(i)};W))^2$
```python
loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))
```
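The same quantity via the Keras losses API (a sketch, equivalent to the line above for 1-D targets):

```python
loss = tf.keras.losses.MeanSquaredError()(y, predicted)
```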

# Training Neural Networks

## Loss Optimization

We want to find the network weights that achieve the lowest loss.

\begin{aligned} W^* &= \underset{W}{\arg\min} \frac 1 n \sum_{\substack{i=1}}^n L(f(x^{(i)};W),y^{(i)})\\ &= \underset{W}{\arg\min}\ J(W)\\ W &= \{W^{(0)},W^{(1)},\dots\} \end{aligned}

We need to find each of the weights in $W$.

1. Pick a random initial point $(w_0, w_1)$.
2. Compute the gradient $\frac {\partial J(W)} {\partial W}$, which points in the direction of steepest ascent.
3. Take a small step in the opposite direction of the gradient.
4. Repeat until convergence to a local minimum.

## Gradient Descent

Algorithm:

1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
   1. Compute gradient $\frac {\partial J(W)} {\partial W}$
   2. Update weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$ ($\eta$: learning rate, how big a step to take on each iteration)
3. Return weights
```python
import tensorflow as tf

lr = 0.01  # learning rate (eta)
weights = tf.Variable(tf.random.normal([1]))

while True:  # loop until convergence
    with tf.GradientTape() as tape:
        loss = compute_loss(weights)

    # W <- W - eta * dJ(W)/dW
    gradient = tape.gradient(loss, weights)
    weights.assign_sub(lr * gradient)
```

The gradient shows how a small change in one weight (e.g., $w_2$) affects the final loss $J(W)$.

We use the chain rule here.

\begin{aligned} \frac {\partial J(W)} {\partial w_2} &= \frac {\partial J(W)} {\partial \hat{y}} * \frac {\partial \hat{y}} {\partial w_2} \\ \frac {\partial J(W)} {\partial w_1} &= \frac {\partial J(W)} {\partial \hat{y}} * \frac {\partial \hat{y}} {\partial w_1} \\ &= \frac {\partial J(W)} {\partial \hat{y}} * \frac {\partial \hat{y}} {\partial z_1} * \frac {\partial z_1} {\partial w_1} \end{aligned}

Repeat this for every weight in the network, reusing gradients from later layers. Most frameworks calculate these gradients automatically under the hood.
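For example, TensorFlow's tf.GradientTape records the forward pass and then applies the chain rule backwards automatically; a minimal sketch on a toy one-weight model (the input 1.5 and target 1.0 are arbitrary):

```python
import tensorflow as tf

w = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y_hat = tf.math.sigmoid(w * 1.5)  # toy forward pass
    loss = (1.0 - y_hat) ** 2         # toy loss J(W)

# dJ/dw, computed by backpropagating through the recorded operations
grad = tape.gradient(loss, w)
```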

# Neural Networks in Practice: Optimization

Training neural networks is difficult.

Loss functions can be difficult to optimize. Optimization happens through gradient descent: $W \gets W - \eta \frac {\partial J(W)} {\partial W}$. How do we choose the learning rate $\eta$?

• Small learning rates: converges slowly and gets stuck in false local minima
• Large learning rates: overshoot, become unstable and diverge
• Stable learning rates: converge smoothly and avoid local minima

## Approaches

1. Try many different learning rates and see what works best. Or,
2. Use an adaptive learning rate that is no longer fixed; there are many such algorithms:

• SGD tf.keras.optimizers.SGD
• Adam tf.keras.optimizers.Adam
• Adadelta tf.keras.optimizers.Adadelta
• Adagrad tf.keras.optimizers.Adagrad
• RMSProp tf.keras.optimizers.RMSprop

Ref.:

• Kiefer & Wolfowitz. "Stochastic Estimation of the Maximum of a regression Function." 1952.
• Kingma et al. "Adam: A Method for Stochastic Optimization." 2014.
• Duchi et al. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." 2011.
```python
import tensorflow as tf

model = tf.keras.Sequential([...])
optimizer = tf.keras.optimizers.SGD()

while True:  # loop until convergence
    with tf.GradientTape() as tape:
        # forward pass through the network
        prediction = model(x)

        # compute the loss
        loss = compute_loss(y, prediction)

    # update the weights using the gradient
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```
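Equivalently, Keras can run this training loop for you with compile and fit; a sketch (the mean-squared-error loss and 10 epochs are placeholder choices):

```python
model.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse")
model.fit(x, y, epochs=10)
```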

# Neural Networks in Practice: Mini-batches

Computing the gradient over every data point is expensive. Instead, pick a single data point $i$ and compute the gradient on it alone.

A single data point is easy to compute with but very noisy (stochastic gradient descent):

1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
   1. Pick single data point $i$
   2. Compute gradient $\frac {\partial J_i(W)} {\partial W}$
   3. Update weights, $W \gets W - \eta \frac {\partial J_i(W)} {\partial W}$
3. Return weights

A mini-batch of points is fast to compute and gives a much better estimate of the true gradient (a tf.data sketch follows the list below):

1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
   1. Pick batch of $B$ data points
   2. Compute gradient $\frac {\partial J(W)} {\partial W} = \frac 1 B \sum_{k=1}^B \frac {\partial J_k(W)} {\partial W}$
   3. Update weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$
3. Return weights
• More accurate estimation of the gradient
  • Smoother convergence
  • Allows for larger learning rates
• Mini-batches lead to fast training
  • Can parallelize computation and achieve significant speed increases on GPUs
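A minimal sketch of mini-batching with tf.data, reusing the training loop above (the batch size of 32 and the shuffle buffer are assumptions):

```python
import tensorflow as tf

# x, y: the full training set
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1000).batch(32)

for x_batch, y_batch in dataset:
    with tf.GradientTape() as tape:
        loss = compute_loss(y_batch, model(x_batch))

    # gradient averaged over the B points in the batch
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```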

# Neural Networks in Practice: Overfitting

Overfitting is a general problem in machine learning.

• Underfitting: Model does not have capacity to fully learn the data
• Ideal fit
• Overfitting: Too complex, extra parameters, does not generalize well

## Regularization

A technique that constrains our optimization problem to discourage complex models. It improves generalization of our model on unseen data.

### Regularization 1: Dropout

• During training, randomly set some activations to 0
• Typically 'drop' 50% of activations in layer
• Forces network to not rely on any 1 node
• Repeat every iteration
• Build more robust representation of its prediction
• Generalize better to new test data
```python
tf.keras.layers.Dropout(rate=0.5)
```
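In a full model, the dropout layer simply sits between dense layers; a sketch (the layer sizes are assumptions):

```python
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),  # randomly zero 50% of activations during training
    tf.keras.layers.Dense(2),
])
```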

### Regularization 2: Early Stopping

• Stop training before we have a chance to overfit.
• Find the point where the loss on the test (validation) data starts to diverge from the training loss; that point sits between underfitting and overfitting (see the sketch below).
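In Keras this can be automated with the EarlyStopping callback, which watches the validation loss and stops training when it stops improving; a sketch (the patience value and validation split are assumptions):

```python
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```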

# Summary

• The Perceptron
  • Structural building blocks
  • Nonlinear activation functions
• Neural Networks
  • Stacking Perceptrons to form neural networks
  • Optimization through backpropagation
• Training in Practice