Course and lecture info
Intro
Deep Learning: Extract patterns from data using neural networks.
Traditionally, finding patterns relied on hand-engineered features, which does not scale in practice. Deep learning is resurging now because of big data, better hardware, and better software.
- Stochastic Gradient Descent (1952)
- Perceptron: Learnable Weights (1958)
- Backpropagation: Multi-layer Perceptron (1986)
- Deep Convolutional NN: Digit Recognition (1995)
The perceptron: Forward Propagation
1 - w0 --. (bias)
x1 - w1 --.
.---> sum -> non-linear activation function -> output
x2 - w2 --.
xm - wm --.
$$ \hat{y} = g\left(w_0 + \sum_{i=1}^m x_i w_i\right) = g(w_0 + X^TW), \quad \text{where } X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \text{ and } W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} $$
Common Activation Functions
Activation functions are non-linear functions. Their purpose is to introduce non-linearities into the network, which lets it approximate arbitrarily complex functions. The sigmoid, for example, maps any real input to a value between 0 and 1, so its output can be read as a probability.
Sigmoid function:
$$ \begin{aligned} g(z) &= \sigma(z) = \frac{1}{1+e^{-z}} \\ g'(z) &= g(z)(1-g(z)) \end{aligned} $$
Hyperbolic Tangent:
$$ \begin{aligned} g(z) &= \frac{e^z - e^{-z}}{e^z + e^{-z}} \\ g'(z) &= 1 - g(z)^2 \end{aligned} $$
Rectified Linear Unit (ReLU):
$$ \begin{aligned} g(z) &= \max(0, z) \\ g'(z) &= \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases} \end{aligned} $$
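The three activations and their derivatives can be checked numerically. The sketch below is plain Python rather than TensorFlow, and the helper names (`sigmoid_prime`, `numeric_derivative`, `errors`) are illustrative choices, not part of the lecture. It compares each closed-form derivative against a central finite difference at one sample point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # g'(z) = g(z)(1 - g(z))

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2  # g'(z) = 1 - g(z)^2

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def numeric_derivative(f, z, h=1e-6):
    # Central finite difference, used only as a sanity check
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = 0.5  # sample point (assumed; any point away from ReLU's kink at 0 works)
errors = {
    "sigmoid": abs(sigmoid_prime(z) - numeric_derivative(sigmoid, z)),
    "tanh": abs(tanh_prime(z) - numeric_derivative(math.tanh, z)),
    "relu": abs(relu_prime(z) - numeric_derivative(relu, z)),
}
print(errors)
```

All three errors should be tiny, which suggests the derivative formulas above agree with the functions they belong to.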
Example
We have $w_0=1$ and $W=\begin{bmatrix} 3 \\ -2 \end{bmatrix}$.
$$ \begin{aligned} \hat{y} &= g(w_0 + X^TW) \\ &= g\left(1+\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T\begin{bmatrix} 3 \\ -2 \end{bmatrix}\right) \\ &= g(1+3x_1-2x_2) \end{aligned} $$
Setting the argument $1+3x_1-2x_2$ to zero gives a line in the 2-D input space; this line is the decision boundary.
Test with $X = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$.
$$ \begin{aligned} \hat{y} &= g(1 + (3 \cdot (-1)) - (2 \cdot 2)) \\ &= g(-6) \approx 0.002 \end{aligned} $$
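With the sigmoid, the sign of $z$ decides which side of 0.5 the output falls on:
$$ \begin{cases} \hat{y} < 0.5, & z < 0 \\ \hat{y} > 0.5, & z > 0 \end{cases} $$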
$\hat{y}= g(1+3x_1-2x_2)$ is easy to read and visualize because it has only two inputs. In real problems the inputs have far more dimensions, so the decision boundary cannot be identified this easily.
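The worked example can be reproduced in a few lines. This is a minimal sketch in plain Python, assuming the sigmoid as $g$; the function name `perceptron` is an illustrative choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, w0):
    # z = w0 + sum_i x_i * w_i, then y_hat = g(z)
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return z, sigmoid(z)

# Values from the example: w0 = 1, W = [3, -2], X = [-1, 2]
z, y_hat = perceptron([-1.0, 2.0], [3.0, -2.0], 1.0)
print(z, round(y_hat, 3))

# A point on the line 1 + 3*x1 - 2*x2 = 0 sits exactly on the boundary
z_boundary, y_boundary = perceptron([1.0, 2.0], [3.0, -2.0], 1.0)
print(z_boundary, y_boundary)
```

The first call gives $z = -6$ and an output well below 0.5; the second gives $z = 0$ and an output of exactly 0.5.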
Building Neural Networks with Perceptrons
$\hat{y} = g(w_0 + X^TW)$
Simplified:
$$ z = w_0 + \sum_{j=1}^m x_j w_j $$
Multi-output perceptron, with $y_1 = g(z_1), y_2 = g(z_2)$:
$$ z_i = w_{0,i} + \sum_{\substack{j=1}}^mx_jw_{j,i} $$
Because all inputs are densely connected to all outputs, these layers are called Dense layers.
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # Initialize weights and bias
        self.W = self.add_weight(shape=[input_dim, output_dim])
        self.b = self.add_weight(shape=[1, output_dim])

    def call(self, inputs):
        # Forward propagate the inputs
        z = tf.matmul(inputs, self.W) + self.b
        # Feed through a non-linear activation
        output = tf.math.sigmoid(z)
        return output

# Keras ships a ready-made dense layer
layer = tf.keras.layers.Dense(units=2)
Single Layer Neural Network
"Hidden" layers sit between the input and the final output.
$$ \begin{aligned} z_i &= w_{0,i}^{(1)} + \sum_{j=1}^m x_j w_{j,i}^{(1)} \\ \hat{y}_i &= g\left(w_{0,i}^{(2)} + \sum_{j=1}^{d_1} g(z_j) w_{j,i}^{(2)}\right) \end{aligned} $$
import tensorflow as tf

n = 8  # number of hidden units (illustrative value)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n),
    tf.keras.layers.Dense(2)
])
Deep Neural Network
Stack many hidden layers to build a deep network.
$$ z_{k,i} = w_{0,i}^{(k)} + \sum_{\substack{j=1}}^{n_{k-1}}g(z_{k-1,j})w_{j,i}^{(k)} $$
import tensorflow as tf

n_1, n_2 = 16, 8  # hidden-layer widths (illustrative values)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_1),
    tf.keras.layers.Dense(n_2),
    # ...
    tf.keras.layers.Dense(2)
])
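To make the layer recurrence concrete, here is a sketch of the forward pass written out by hand in plain Python. The weight values and the helper names `dense` and `forward` are made up for illustration, and the sigmoid is assumed as $g$ at every layer, including the output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation=sigmoid):
    # weights[j][i] connects input j to output i, i.e. w_{j,i};
    # biases[i] plays the role of w_{0,i}
    outputs = []
    for i in range(len(biases)):
        z_i = biases[i] + sum(inputs[j] * weights[j][i] for j in range(len(inputs)))
        outputs.append(activation(z_i))
    return outputs

def forward(x, layers):
    # Feed the activations of each layer into the next one
    a = x
    for weights, biases in layers:
        a = dense(a, weights, biases)
    return a

# Illustrative network: 2 inputs -> 3 hidden units -> 2 outputs
layers = [
    ([[0.5, -0.5, 0.1], [0.3, 0.8, -0.2]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.2]], [0.0, 0.0]),
]
y_hat = forward([1.0, 2.0], layers)
print(y_hat)
```

Adding more `(weights, biases)` pairs to `layers` deepens the network without changing `forward`.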
Applying Neural Networks
Example: "Will I pass this class?"
A simple two-feature model.
$$ \begin{aligned} x_1 &= \text{Number of lectures you attend} \\ x_2 &= \text{Hours spent on the final project} \end{aligned} $$
The network must be trained before its predictions are meaningful.
Quantifying Loss
The loss of our network measures the cost incurred from incorrect predictions.
$$ L(f(x^{(i)};W),y^{(i)}) $$
Training adjusts the weights so that the predictions move closer to the true answers.
Empirical Loss
The empirical loss measures the total loss over our entire dataset. It is the average of all individual losses.
$$ J(W) = \frac 1 n \sum_{\substack{i=1}}^n L(f(x^{(i)};W),y^{(i)}) $$
Binary Cross Entropy Loss
Cross entropy loss can be used with models that output a probability between 0 and 1. It builds on the notion of entropy introduced by Claude Shannon.
$$ \begin{aligned} J(W) &= -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)}\log(f(x^{(i)};W)) + (1-y^{(i)})\log(1-f(x^{(i)};W)) \right] \\ \text{Predicted}&: f(x^{(i)};W) \\ \text{Actual}&: y^{(i)} \end{aligned} $$
loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y, predicted))  # predicted holds probabilities, not logits
It measures how different the predicted and actual distributions are.
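A hand-written version shows what the loss computes. This sketch is plain Python and uses the usual convention of a leading minus sign, so the loss is non-negative and smaller is better. The clipping constant `eps` and the sample labels and predictions are assumptions added for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 to avoid log(0)
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

labels = [1, 0, 1]
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8])  # confident and right
bad = binary_cross_entropy(labels, [0.1, 0.9, 0.2])   # confident and wrong
print(good, bad)
```

Predictions that agree with the labels yield a much lower loss than confident wrong ones.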
Mean Squared Error Loss
Mean squared error loss can be used with regression models that output continuous real numbers, that is, numeric values rather than true or false.
$$ J(W) = \frac 1 n \sum_{\substack{i=1}}^n (y^{(i)} - f(x^{(i)};W))^2 $$
loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))
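The same computation without TensorFlow, as a small plain-Python sketch; the sample targets and predictions are invented for illustration.

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Squared errors: 0.25, 0.0, 4.0
mse = mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])
print(mse)
```

The single large error (2.0 vs 4.0) dominates the result, which is typical of squared-error losses.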
Training Neural Networks
Loss Optimization
We want to find the network weights that achieve the lowest loss.
$$ \begin{aligned} W^* &= \underset{W}{\operatorname{argmin}}\, \frac{1}{n} \sum_{i=1}^n L(f(x^{(i)};W),y^{(i)}) \\ &= \underset{W}{\operatorname{argmin}}\, J(W) \\ W &= \{W^{(0)},W^{(1)},\dots\} \end{aligned} $$
Find each of the weights in $W$.
- Pick a random starting point $(w_0, w_1)$.
- Compute the gradient $\frac {\partial J(W)} {\partial W}$, which points in the direction of steepest ascent.
- Take a small step in the opposite direction of the gradient.
- Repeat until convergence to a local minimum.
Gradient Descent
Algorithm:
- Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
- Loop until convergence:
  - Compute the gradient, $\frac {\partial J(W)} {\partial W}$
  - Update the weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$ ($\eta$: learning rate, how large a step to take each iteration)
- Return weights
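import tensorflow as tf

lr = 0.01  # learning rate (illustrative value)
weights = tf.Variable(tf.random.normal(shape=[2]))  # random initial weights

while True:  # loop until convergence
    with tf.GradientTape() as g:
        loss = compute_loss(weights)  # compute_loss stands in for the model's loss
    gradient = g.gradient(loss, weights)  # Backpropagation
    weights.assign_sub(lr * gradient)  # W <- W - lr * dJ/dW, updated in place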
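A version that actually terminates needs a concrete loss and a stopping test. The sketch below is plain Python and assumes a toy quadratic loss with its minimum at $(3, -1)$; the loss, the tolerance `tol`, and the function names are illustrative, not from the lecture.

```python
import random

def loss(w):
    # Toy loss with a single minimum at w = (3, -1)
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(lr=0.1, tol=1e-8, max_steps=10_000, seed=0):
    rng = random.Random(seed)
    # Initialize weights randomly ~ N(0, 1)
    w = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    for step in range(max_steps):
        g = gradient(w)
        # Treat a vanishing gradient as convergence
        if sum(gi * gi for gi in g) < tol:
            break
        # Step against the gradient
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w, step

w_star, steps = gradient_descent()
print(w_star, steps, loss(w_star))
```

On this loss each step shrinks the distance to the minimum by a constant factor, so the loop stops long before `max_steps`.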
Computing Gradients: Backpropagation
The gradient shows how a small change in one weight (e.g. $w_2$) affects the final loss $J(W)$.
Apply the chain rule.
$$ \begin{aligned} \frac{\partial J(W)}{\partial w_2} &= \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2} \\ \frac{\partial J(W)}{\partial w_1} &= \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_1} \\ &= \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} \end{aligned} $$
Repeat this for every weight in the network, using gradients from later layers. Most frameworks compute this automatically under the hood.
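The chain rule can be checked on a tiny network by comparing the hand-derived gradients with finite differences. The sketch below assumes a specific toy model not given in the notes: $z_1 = w_1 x$, $\hat{y} = w_2\,\sigma(z_1)$, and a squared-error loss $J = (\hat{y} - y)^2$. All names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    z1 = w1 * x
    y_hat = w2 * sigmoid(z1)
    return z1, y_hat

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    z1, y_hat = forward(w1, w2, x)
    s = sigmoid(z1)
    dJ_dyhat = 2.0 * (y_hat - y)            # dJ/dy_hat
    dJ_dw2 = dJ_dyhat * s                   # dJ/dy_hat * dy_hat/dw2
    dyhat_dz1 = w2 * s * (1.0 - s)          # dy_hat/dz1
    dJ_dw1 = dJ_dyhat * dyhat_dz1 * x       # ... * dz1/dw1
    return dJ_dw1, dJ_dw2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
dJ_dw1, dJ_dw2 = backprop(w1, w2, x, y)

# Finite-difference check of both gradients
h = 1e-6
num_dw1 = (loss(w1 + h, w2, x, y) - loss(w1 - h, w2, x, y)) / (2.0 * h)
num_dw2 = (loss(w1, w2 + h, x, y) - loss(w1, w2 - h, x, y)) / (2.0 * h)
print(dJ_dw1, num_dw1)
print(dJ_dw2, num_dw2)
```

The analytic and numeric values should agree closely. Note that `dJ_dyhat` is computed once and reused for both weights, which is the reuse of gradients from later layers.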
Neural Networks in Practice: Optimization
Training neural networks is difficult.
Loss functions can be difficult to optimize. Optimization through gradient descent: $W \gets W - \eta \frac {\partial J(W)} {\partial W}$. How should we choose the learning rate $\eta$?
- Small learning rates: converge slowly and get stuck in false local minima
- Large learning rates: overshoot, become unstable and diverge
- Stable learning rates: converge smoothly and avoid local minima
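The effect of the step size is easy to see on a toy problem. This sketch runs gradient descent on $J(w) = w^2$, an assumed example with a single minimum, so it illustrates slow convergence and divergence but not local minima. The three rates are illustrative choices.

```python
def run_gradient_descent(lr, w0=5.0, steps=50):
    # Minimize J(w) = w^2, whose gradient is 2w
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

final = {lr: run_gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}
for lr, w in final.items():
    print(f"lr={lr}: final w = {w:.4g}")
```

After 50 steps, the smallest rate has barely moved from the start, the middle rate is close to the minimum at 0, and the largest rate has overshot further on every step and grown without bound.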
Approaches
- Try many different learning rates, or
- Use an adaptive learning rate that adapts to the loss landscape
  - No longer fixed; many algorithms exist
Adaptive algorithm
- SGD: tf.keras.optimizers.SGD
- Adam: tf.keras.optimizers.Adam
- Adadelta: tf.keras.optimizers.Adadelta
- Adagrad: tf.keras.optimizers.Adagrad
- RMSProp: tf.keras.optimizers.RMSprop
Ref.:
- Kiefer & Wolfowitz. "Stochastic Estimation of the Maximum of a Regression Function." 1952.
- Kingma & Ba. "Adam: A Method for Stochastic Optimization." 2014.
- Zeiler. "ADADELTA: An Adaptive Learning Rate Method." 2012.
- Duchi et al. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." 2011.
import tensorflow as tf

model = tf.keras.Sequential([...])

# pick your favorite optimizer
optimizer = tf.keras.optimizers.SGD()

while True:  # loop forever
    with tf.GradientTape() as tape:
        # forward pass through the network (inside the tape so it is recorded)
        prediction = model(x)
        # compute the loss
        loss = compute_loss(y, prediction)

    # update the weights using the gradient
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Neural Networks in Practice: Mini-batches
Computing the gradient over every data point is expensive. Instead, pick a single data point $i$ and compute its gradient.
Single data point: easy to compute but very noisy (stochastic).
- Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
- Loop until convergence:
  - Pick single data point $i$
  - Compute the gradient, $\frac {\partial J_i(W)} {\partial W}$
  - Update the weights, $W \gets W - \eta \frac {\partial J_i(W)} {\partial W}$
- Return weights
Mini-batch of points: fast to compute and a much better estimate of the true gradient.
- Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
- Loop until convergence:
  - Pick batch of $B$ data points
  - Compute the gradient, $\frac {\partial J(W)} {\partial W} = \frac 1 B \sum_{k=1}^B \frac {\partial J_k(W)} {\partial W}$
  - Update the weights, $W \gets W - \eta \frac {\partial J(W)} {\partial W}$
- Return weights
- More accurate estimation of gradient
  - Smoother convergence
  - Allows for larger learning rates
- Mini-batches lead to fast training
  - Can parallelize computation
  - Achieve significant speed increases on GPUs
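The mini-batch loop can be sketched end to end on a small regression problem. Everything specific here is an assumption for illustration: the model $f(x) = wx + b$, the noise-free data generated from $y = 2x + 1$, the squared-error loss, and the hyperparameters.

```python
import random

def minibatch_sgd(xs, ys, batch_size=8, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    # Initialize weights randomly ~ N(0, 1)
    w, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Pick batches of B data points
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for k in batch:
                err = (w * xs[k] + b) - ys[k]
                grad_w += 2.0 * err * xs[k]
                grad_b += 2.0 * err
            # Average the per-example gradients over the batch, then step
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

xs = [i / 10.0 for i in range(-20, 21)]   # 41 points in [-2, 2]
ys = [2.0 * x + 1.0 for x in xs]          # targets from y = 2x + 1
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

Setting `batch_size=1` turns this into the single-point variant, and `batch_size=len(xs)` recovers full-batch gradient descent.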
Neural Networks in Practice: Overfitting
Overfitting is a general problem in machine learning.
- Underfitting: Model does not have capacity to fully learn the data
- Ideal fit
- Overfitting: Too complex, extra parameters, does not generalize well
Regularization
Regularization is a technique that constrains our optimization problem to discourage complex models. It improves the generalization of our model on unseen data.
Regularization 1: Dropout
- During training, randomly set some activations to 0
- Typically 'drop' 50% of activations in layer
- Forces network to not rely on any 1 node
- Repeat every iteration
- Build more robust representation of its prediction
- Generalize better to new test data
tf.keras.layers.Dropout(rate=0.5)
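What the layer does can be sketched in plain Python. This version uses "inverted" dropout, which rescales the surviving activations by $1/(1 - \text{rate})$ during training so their expected value is unchanged; the Keras layer applies the same scaling. The function name and the sample activations are illustrative.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=random):
    # At inference time, dropout is a no-op
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    # Zero each activation with probability `rate`; rescale the survivors
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
print(dropout(hidden, training=False))
```

Each call draws a fresh random mask, which is what "repeat every iteration" refers to.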
Regularization 2: Early Stopping
- Stop training before we have a chance to overfit.
- Find the point where the loss on the testing data starts to rise and diverge from the training loss. This point lies between underfitting and overfitting.
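One common way to locate that point is to track the held-out loss and stop once it has not improved for a few epochs. The sketch below assumes a `patience` parameter and a made-up list of per-epoch losses; neither comes from the lecture.

```python
def early_stopping_epoch(val_losses, patience=2):
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it (a real loop would also save the weights)
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training
            break
    return best_epoch, best_loss

# Illustrative held-out losses: falling, then rising as overfitting sets in
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, best_loss = early_stopping_epoch(val_losses)
print(best_epoch, best_loss)
```

Here training would stop at epoch 5 and keep the weights from epoch 3, where the held-out loss was lowest.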
Summary
- The Perceptron
  - Structural building blocks
  - Nonlinear activation functions
- Neural Networks
  - Stacking Perceptrons to form neural networks
  - Optimization through backpropagation
- Training in Practice
  - Adaptive learning
  - Batching
  - Regularization