Intro

Deep Learning: Extract patterns from data using neural network.

Finding patterns are usually hand engineered features. Not scalable in practice. Deep learning resurgenced now because of Big Data, hardware, software.

• Perceptron: Learanable Weights (1958)
• Backpropagation: Multi-layer Perceptron (1986)
• Deep Convolutional NN: Digit Recognition (1995)

The perceptron: Forward Propagation

1  - w0 --.   (bias)
x1 - w1 --.
.---> sum -> non-linear activation function -> output
x2 - w2 --.
xm - wm --.


$$\hat{y} = g(w_0 + \sum_{\substack{i=1}}^mx_iw_i) \ = g(w_0 + X^TW) \ \ \text{where: }X = \begin{bmatrix} x_1\ \vdots\ x_m\ \end{bmatrix} \text{and }W = \begin{bmatrix} w_1\ \vdots\ w_m\ \end{bmatrix}$$

Common Activation Functions

Probability distribution between 0 and 1. Non-linear functions. The purpose is to introduce non-linearities into the network. Non-linearities allow us to approximate arbitrarily complex functions.

Sigmoid function:

\begin{aligned} g(z) &= \sigma(z) = \frac 1 {1+e^{-z}} \ g'(z) &= g(z)(1-g(z)) \end{aligned}

Hyperbolic Tangent:

\begin{aligned} g(z) &= \frac {e^z - e^{-z}} {e^z + e^{-z}} \ g'(z) &= 1 - g(z)^2 \end{aligned}

Rectified Linear Unit (RELU):

\begin{aligned} g(z) &= \text{max}(0, z) \ g'(z) &= \begin{cases} 1,& z > 0\ 0,& \text{otherwise} \end{cases} \end{aligned}

Example

We have $w_0=1$ and $W=\begin{bmatrix} 3\ -2 \end{bmatrix}$.

\begin{aligned} \hat{y} &= g(w_0 + X^TW) \ &= g\left(1+\begin{bmatrix} x_1\ x_2 \end{bmatrix}^T\begin{bmatrix} 3\ -2 \end{bmatrix}\right)\ &= g(1+3x_1-2x_2) \end{aligned}

Which is a 2-D line.

Test with $X = \begin{bmatrix} -1\ 2 \end{bmatrix}$.

\begin{aligned} \hat{y} &= g (1 + (3 * -1) - (2 * 2))\ &= g(-6) \approx 0.002 \end{aligned}

In the result,

$$z < 0, y < 0.5 \ z > 0, y > 0.5$$

$\hat{y}= g(1+3x_1-2x_2)$ is a straightforward formula. In reality, the formula is hard to identify due to the size of the input and result.

Building Neural Networks with Perceptrons

$\hat{y} = g(w_0 + X^TW)$

Simplified:

$$z = w_0 + \sum_{\substack{j=1}}^mx_jw_i$$

Multi Output Perceptron like $y_1 = g(z_1), y_2 = g(z_2)$:

$$z_i = w_{0,i} + \sum_{\substack{j=1}}^mx_jw_{j,i}$$

Because all inputs are densely connected to all outputs, these layers are called Dense layers.

class MyDenseLayer(tf.keras.layers.Layer):
def __init__(self, input_dim, output_dim):
super(MyDenseLayer, self).__init__()

# Initialize weights and bias

def call(self, inputs):
# Forward propagate the inputs
z = tf.matmul(inputs, self.W) + self.b

# Feed through a non-linear activation
output = tf.math.sigmoid(z)

return output

import tensorflow as tf

layer = tf.keras.layers.Dense(units=2)


Single Layer Neural Network

"Hidden" layers are underlying between the input and the final output.

\begin{aligned} z_i &= w_{0,i}^{(1)} + \sum_{\substack{j=1}}^mx_jw_{j,i}^{(1)} \ \hat{y}i &= g\left(w{0,i}^{(2)} + \sum_{\substack{j=1}}^{d_1}z_jw_{j,i}^{(2)}\right) \end{aligned}

import tensorflow as tf

model = tf.keras.Sequential([
tf.keras.layers.Dense(n),
tf.keras.layers.Dense(2)
])


Deep Neural Network

Make many layers in the network.

$$z_{k,i} = w_{0,i}^{(k)} + \sum_{\substack{j=1}}^{n_{k-1}}g(z_{k-1,j})w_{j,i}^{(k)}$$

import tensorflow as tf

model = tf.keras.Sequential([
tf.keras.layers.Dense(n_1),
tf.keras.layers.Dense(n_2),
# ...
tf.keras.layers.Dense(2)
])


Applying Neural Networks

Example: "Will I pass this class?"

A simple two feature model.

$$x_1 = \text{Number for lectures you attend} \ x_2 = \text{Hours spent on the final project}$$

Need to train the network first.

Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions.

$$L(f(x^{(i)};W),y^{(i)})$$

Empirical Loss

The empirical loss measures the total loss over out entire dataset. Average of all individual losses.

$$J(W) = \frac 1 n \sum_{\substack{i=1}}^n L(f(x^{(i)};W),y^{(i)})$$

Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1. Introduced by Claude Shannon.

\begin{aligned} J(W) &= \frac 1 n \sum_{\substack{i=1}}^n y^{(i)}\log(f(x^{(i)};W))+(1-y^{(1)})\log(1-f(x^{(i)};W))\ \text{Predicted}&: f(x^{(i)};W)\ \text{Actual}&: y^{(i)} \end{aligned}

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, predicted))


Compares how different these two distributions.

Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers. Possible numbers rather than true or false.

$$J(W) = \frac 1 n \sum_{\substack{i=1}}^n (y^{(i)} - f(x^{(i)};W))^2$$

loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))


Training Neural Networks

Loss Optimization

We want to find the network weights that achieve the lowest loss.

\begin{aligned} W^* &= \argmin(w) \frac 1 n \sum_{\substack{i=1}}^n L(f(x^{(i)};W),y^{(i)})\ &= \argmin(w) J(W)\ W &= {W^{(0)},W^{(1)},\dots} \end{aligned}

Find the each of the weights W.

1. Pick random place as an initial $(w_0, w_1)$.
2. Compute gradient. $\frac {\partial J(W)} {\partial W}$ to find maximum ascent.
3. Take small step in opposite direction of gradient
4. Repeat until convergence to local minimum

Graident Descent

Algorithm:

1. Initialize weights randomly $~N(0, \sigma^2)$
2. Loop until convergence:
1. Compute gradient. $\frac {\partial J(W)} {\partial W}$
2. Update weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$ ($\eta$: learning rate, how much of a step to repeat each iteration)
3. Return weights
import tensorflow as tf
weights = tf.Variable([tf.random.normal()])

while True:
loss = compute_loss(weights)

weights = weights - lr * gradient


Gradients shows how does a small change in one wieght (ex. $w_2$) affect the final loss $J(W)$.

Use chain rule here.

\begin{aligned} \frac {\partial J(W)} {\partial W_2} &= \frac {\partial J(W)} {\partial \hat{y}} * \frac {\partial \hat{y}} {\partial W_2} \ \frac {\partial J(W)} {\partial W_1} &= \frac {\partial J(W)} {\partial \hat{y}} * \frac {\partial \hat{y}} {\partial W_1} \ \frac {\partial J(W)} {\partial W_1} &= \frac {\partial J(W)} {\partial \hat{y}} * \frac {\partial z_1} {\partial W_1} * \frac {\partial z_1} {\partial W_1} \end{aligned}

Repeat this for every weight in the network using gradients from later layers. The most frameworks provide the function to calculate this under the hood.

Neural Networks in Practice: Optimization

Training neural networks is difficult.

Loss functions can be difficult to optimize. Optimization through gradient descent: $W \gets W - \eta \frac {\partial J(W)} {\partial W}$. How to decide the learning rate $\eta$?

• Small learning rates: converges slowly and gets stuck in false local minima
• Large learning rates: overshoot, become unstable and diverge
• Stable learning rates: converge smoothly and avoid local minima

Approaches

1. Try many different learning rate. Or,
• no longer fixed, many algorithms

• SGD tf.keras.optimizers.SGD
• Adam tf.keras.optimizers.Adam
• Adadelta tf.keras.optimizers.Adadelta
• Adagrad tf.keras.optimizers.Adagrad
• RMSProp tf.keras.optimizers.RMSProp

Ref.:

• Kiefer & Wolfowitz. "Stochastic Estimation of the Maximum of a regression Function." 1952.
• Kingma et al. "Adam: A Method for Stochastic Optimization." 2014.
• Duchi et al. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." 2011.
import tensorflow as tf
model = tf.keras.Sequential([...])

optimizer = tf.keras.optimizer.SGD()

while True:
# forward pass through the network
prediction = model(x)

# compute the loss
loss = compute_loss(y, prediction)

# update the weights using the gradient


Neural Networks in Practice: Mini-batches

Calculating every partials are expensive tasks. Pick single data point $i$ and compute the gradient.

Pick single data point. Easy to compute but very noisy (stochastic):

1. Initialize weights randomly $~N(0, \sigma^2)$
2. Loop until convergence:
1. Pick single data point $i$
2. Compute gradient. $\frac {\partial J_i(W)} {\partial W}$
3. Update weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$
3. Return weights

Mini batch of points. Fast to compute and a much better estimate of the true gradient.

1. Initialize weights randomly $~N(0, \sigma^2)$
2. Loop until convergence:
1. Pick batch of $B$ data points
2. Compute gradient. $\frac {\partial J(W)} {\partial W} = \frac 1 B \sum_{k=1}^B \frac {\partial J_k(W)} {\partial W}$
3. Update weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$
3. Return weights
• More accurate estimation of gradient
• Smoother convergence
• Allows for larger learning rates
• Mini-batches lead to fast training
• Can parallelize computation
• achieve significant speed increases on GPU's

Neural Networks in Practice: Overfitting

Overfitting is a general problem in machine learning.

• Underfitting: Model does not have capacity to fully learn the data
• Ideal fit
• Overfitting: Too complex, extra parameters, does not generalize well

Regularization

Technique that constrains our optimization problem to discourage complex models. Improve generalization of of our model on unseen data.

Regularization 1: Dropout

• During training, randomly set some activations to 0
• Typically 'drop' 50% of activations in layer
• Forces network to not rely on any 1 node
• Repeat every iteration
• Build more robust representation of its prediction
• Generalize better to new test data
tf.keras.layers.Dropout(p=0.5)


Regularization 2: Early Stopping

• Stop training before we have a chance to overfit.
• Find the inflection point that diverge the loss from the testing data. The point will be between underfitting and overfitting.

Summary

• The Perceptron
• Structural building blocks
• Nonlinear activation functions
• Neural Networks
• Stacking Perceptrons to form neural networks
• Optimization through backpropagation
• Training in Practice