# Course and lecture info

# Intro

**Deep Learning**: Extract patterns from data using neural networks.

Traditionally, patterns were found with hand-engineered features, which is not scalable in practice. Deep learning has resurged now because of big data, hardware, and software.

- Stochastic Gradient Descent (1952)
- Perceptron: Learnable Weights (1958)
- Backpropagation: Multi-layer Perceptron (1986)
- Deep Convolutional NN: Digit Recognition (1995)

# The perceptron: Forward Propagation

```
1  -- w0 --.   (bias)
x1 -- w1 --+
x2 -- w2 --+---> sum ---> non-linear activation function ---> output
...        |
xm -- wm --'
```

$$ \hat{y} = g\left(w_0 + \sum_{i=1}^m x_i w_i\right) = g(w_0 + X^TW) \quad \text{where } X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \text{ and } W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} $$
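A minimal sketch of this forward pass in TensorFlow, assuming $g$ is the sigmoid (the function name `perceptron` is ours, for illustration):

```
import tensorflow as tf

def perceptron(X, W, w0):
    # z = w0 + X^T W (dot product of inputs and weights, plus bias)
    z = w0 + tf.tensordot(X, W, axes=1)
    # feed through the non-linear activation g
    return tf.math.sigmoid(z)
```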

## Common Activation Functions

Activation functions are non-linear; the sigmoid, for example, squashes its input into a probability-like value between 0 and 1. The purpose is to introduce **non-linearities** into the network. Non-linearities allow us to approximate arbitrarily complex functions.

Sigmoid function:

$$ \begin{aligned} g(z) &= \sigma(z) = \frac{1}{1+e^{-z}} \\ g'(z) &= g(z)(1-g(z)) \end{aligned} $$

Hyperbolic Tangent:

$$ \begin{aligned} g(z) &= \frac{e^z - e^{-z}}{e^z + e^{-z}} \\ g'(z) &= 1 - g(z)^2 \end{aligned} $$

Rectified Linear Unit (ReLU):

$$ \begin{aligned} g(z) &= \max(0, z) \\ g'(z) &= \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases} \end{aligned} $$
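All three activations are available in TensorFlow; a quick check applying them elementwise to a sample tensor:

```
import tensorflow as tf

z = tf.constant([-1.0, 0.0, 1.0])
print(tf.math.sigmoid(z))  # ~[0.269, 0.5, 0.731]
print(tf.math.tanh(z))     # ~[-0.762, 0.0, 0.762]
print(tf.nn.relu(z))       # [0.0, 0.0, 1.0]
```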

## Example

We have $w_0=1$ and $W=\begin{bmatrix} 3 \\ -2 \end{bmatrix}$.

$$ \begin{aligned} \hat{y} &= g(w_0 + X^TW) \\ &= g\left(1+\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T\begin{bmatrix} 3 \\ -2 \end{bmatrix}\right) \\ &= g(1+3x_1-2x_2) \end{aligned} $$

The argument $1+3x_1-2x_2 = 0$ defines a line in the 2-D input space: the decision boundary.

Test with $X = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$.

$$ \begin{aligned} \hat{y} &= g(1 + (3 \cdot -1) - (2 \cdot 2)) \\ &= g(-6) \approx 0.002 \end{aligned} $$

With a sigmoid activation, the sign of $z$ determines which side of $0.5$ the output falls on:

$$ z < 0 \implies \hat{y} < 0.5, \qquad z > 0 \implies \hat{y} > 0.5 $$

$\hat{y}= g(1+3x_1-2x_2)$ is a formula we can visualize directly. In real networks, the decision boundary is hard to interpret because of the dimensionality of the inputs and outputs.
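To check the arithmetic numerically (a quick sketch, sigmoid assumed as $g$):

```
import tensorflow as tf

z = 1 + 3 * (-1) - 2 * 2          # = -6
y_hat = tf.math.sigmoid(float(z))
print(y_hat.numpy())              # ~0.0025
```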

# Building Neural Networks with Perceptrons

$\hat{y} = g(w_0 + X^TW)$

Simplified:

$$ z = w_0 + \sum_{j=1}^m x_j w_j $$

A multi-output perceptron has one such $z$ per output, e.g. $\hat{y}_1 = g(z_1)$ and $\hat{y}_2 = g(z_2)$:

$$ z_i = w_{0,i} + \sum_{j=1}^m x_j w_{j,i} $$

Because all inputs are densely connected to all outputs, these layers are called **Dense** layers.

```
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super(MyDenseLayer, self).__init__()
        # Initialize weights and bias
        self.W = self.add_weight(shape=[input_dim, output_dim])
        self.b = self.add_weight(shape=[1, output_dim])

    def call(self, inputs):
        # Forward propagate the inputs
        z = tf.matmul(inputs, self.W) + self.b
        # Feed through a non-linear activation
        output = tf.math.sigmoid(z)
        return output
```

```
import tensorflow as tf
layer = tf.keras.layers.Dense(units=2)
```
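Either version can then be called directly on a batch of inputs; a sketch with illustrative shapes:

```
x = tf.random.normal(shape=(4, 3))  # batch of 4 examples, 3 features each
y = layer(x)                        # output shape: (4, 2)
```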

## Single Layer Neural Network

"Hidden" layers are underlying between the input and the final output.

$$
\begin{aligned}
z_i &= w_{0,i}^{(1)} + \sum_{j=1}^m x_j w_{j,i}^{(1)} \\
\hat{y}_i &= g\left(w_{0,i}^{(2)} + \sum_{j=1}^{d_1} g(z_j) w_{j,i}^{(2)}\right)
\end{aligned}
$$

```
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n),
    tf.keras.layers.Dense(2)
])
```

## Deep Neural Network

Stack many of these layers to form a deep network.

$$ z_{k,i} = w_{0,i}^{(k)} + \sum_{j=1}^{n_{k-1}} g(z_{k-1,j}) w_{j,i}^{(k)} $$

```
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_1),
    tf.keras.layers.Dense(n_2),
    # ...
    tf.keras.layers.Dense(2)
])
```

# Applying Neural Networks

## Example: "Will I pass this class?"

A simple two-feature model.

$$ \begin{aligned} x_1 &= \text{Number of lectures you attend} \\ x_2 &= \text{Hours spent on the final project} \end{aligned} $$

We need to train the network first.

## Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions.

$$ L(f(x^{(i)};W),y^{(i)}) $$

Training adjusts the weights so the predictions move closer to the true answers.

## Empirical Loss

The empirical loss measures the total loss over our entire dataset: the average of all individual losses.

$$ J(W) = \frac{1}{n} \sum_{i=1}^n L(f(x^{(i)};W),y^{(i)}) $$
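As a sketch, the empirical loss is just a `reduce_mean` over the per-example losses (the values here are made up):

```
import tensorflow as tf

# per-example losses L(f(x_i; W), y_i), assumed already computed
per_example_losses = tf.constant([0.2, 0.5, 0.1, 0.7])
J = tf.reduce_mean(per_example_losses)  # average over the dataset
```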

## Binary Cross Entropy Loss

**Cross entropy loss** can be used with models that output a probability between 0 and 1. Introduced by Claude Shannon.

$$ \begin{aligned} J(W) &= -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)}\log(f(x^{(i)};W))+(1-y^{(i)})\log(1-f(x^{(i)};W)) \right] \\ \text{Predicted} &: f(x^{(i)};W) \\ \text{Actual} &: y^{(i)} \end{aligned} $$

```
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted))
```

Cross entropy compares how different these two distributions are.
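For binary labels specifically, Keras also ships a dedicated helper; a sketch (assuming `y` holds 0/1 labels and `predicted` holds probabilities):

```
loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y, predicted))
```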

## Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers, rather than a true/false probability.

$$ J(W) = \frac{1}{n} \sum_{i=1}^n (y^{(i)} - f(x^{(i)};W))^2 $$

```
loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))
```

# Training Neural Networks

## Loss Optimization

We want to find the network weights that achieve the lowest loss.

$$ \begin{aligned} W^* &= \underset{W}{\operatorname{argmin}} \; \frac{1}{n} \sum_{i=1}^n L(f(x^{(i)};W),y^{(i)}) \\ &= \underset{W}{\operatorname{argmin}} \; J(W) \\ W &= \{W^{(0)},W^{(1)},\dots\} \end{aligned} $$

Gradient descent finds each of the weights $W$:

- Pick a random initial point $(w_0, w_1)$.
- Compute the gradient $\frac {\partial J(W)} {\partial W}$, which points in the direction of maximum ascent.
- Take a small step in the opposite direction of the gradient.
- Repeat until convergence to a local minimum.

## Gradient Descent

Algorithm:

- Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
- Loop until convergence:
  - Compute gradient $\frac {\partial J(W)} {\partial W}$
  - Update weights: $W \gets W - \eta \frac {\partial J(W)} {\partial W}$ ($\eta$: learning rate, how large a step to take each iteration)
- Return weights

```
import tensorflow as tf

# Initialize weights randomly (num_weights is a placeholder)
weights = tf.Variable(tf.random.normal(shape=(num_weights,)))
lr = 0.01  # learning rate

while True:  # loop until convergence
    with tf.GradientTape() as g:
        loss = compute_loss(weights)
    gradient = g.gradient(loss, weights)  # backpropagation
    weights.assign_sub(lr * gradient)     # update weights
```

## Computing Gradients: Backpropagation

The gradient shows how a small change in one weight (e.g. $w_2$) affects the final loss $J(W)$.

Use chain rule here.

$$ \begin{aligned} \frac {\partial J(W)} {\partial w_2} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial w_2} \\ \frac {\partial J(W)} {\partial w_1} &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial w_1} \\ &= \frac {\partial J(W)} {\partial \hat{y}} \cdot \frac {\partial \hat{y}} {\partial z_1} \cdot \frac {\partial z_1} {\partial w_1} \end{aligned} $$

Repeat this for every weight in the network, reusing gradients from later layers. Most frameworks compute these gradients automatically under the hood.
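TensorFlow's `GradientTape` performs exactly this chain-rule computation. A minimal sketch on a toy network with one hidden unit (the values and the sigmoid activations are our assumptions):

```
import tensorflow as tf

x = tf.constant(1.0)
y = tf.constant(0.0)
w1 = tf.Variable(0.5)
w2 = tf.Variable(-0.3)

with tf.GradientTape() as tape:
    z1 = w1 * x                               # hidden pre-activation
    y_hat = tf.math.sigmoid(w2 * tf.math.sigmoid(z1))
    loss = tf.square(y - y_hat)               # J(W)
# dJ/dw1 and dJ/dw2, computed via the chain rule automatically
dw1, dw2 = tape.gradient(loss, [w1, w2])
```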

# Neural Networks in Practice: Optimization

Training neural networks is difficult.

Loss functions can be difficult to optimize. Gradient descent updates the weights as $W \gets W - \eta \frac {\partial J(W)} {\partial W}$. How do we choose the learning rate $\eta$?

- Small learning rates: converges slowly and gets stuck in false local minima
- Large learning rates: overshoot, become unstable and diverge
- Stable learning rates: converge smoothly and avoid local minima

## Approaches

- Try many different learning rates. Or,
- Use an adaptive learning rate that adapts to the loss landscape
  - no longer fixed; many algorithms exist

## Adaptive algorithms

- SGD: `tf.keras.optimizers.SGD`
- Adam: `tf.keras.optimizers.Adam`
- Adadelta: `tf.keras.optimizers.Adadelta`
- Adagrad: `tf.keras.optimizers.Adagrad`
- RMSProp: `tf.keras.optimizers.RMSprop`

Ref.:

- Kiefer & Wolfowitz. "Stochastic Estimation of the Maximum of a Regression Function." 1952.
- Kingma & Ba. "Adam: A Method for Stochastic Optimization." 2014.
- Zeiler. "ADADELTA: An Adaptive Learning Rate Method." 2012.
- Duchi et al. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." 2011.

```
import tensorflow as tf

model = tf.keras.Sequential([...])
# pick your favorite optimizer
optimizer = tf.keras.optimizers.SGD()

while True:  # loop over the data
    with tf.GradientTape() as tape:
        # forward pass through the network
        prediction = model(x)
        # compute the loss
        loss = compute_loss(y, prediction)
    # update the weights using the gradient
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

# Neural Networks in Practice: Mini-batches

Computing the gradient over every data point is expensive. Instead, pick a single data point $i$ and compute the gradient there.

Picking a single data point is easy to compute but *very noisy (stochastic)*:

- Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
- Loop until convergence:
  - Pick a single data point $i$
  - Compute gradient $\frac {\partial J_i(W)} {\partial W}$
  - Update weights: $W \gets W - \eta \frac {\partial J_i(W)} {\partial W}$
- Return weights

A mini-batch of points is still fast to compute and gives a much better estimate of the true gradient; see the sketch after the list below.

- Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
- Loop until convergence:
  - Pick a batch of $B$ data points
  - Compute gradient $\frac {\partial J(W)} {\partial W} = \frac 1 B \sum_{k=1}^B \frac {\partial J_k(W)} {\partial W}$
  - Update weights: $W \gets W - \eta \frac {\partial J(W)} {\partial W}$
- Return weights
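A sketch of one mini-batch update in TensorFlow (`xs`, `ys`, `model`, `optimizer`, and `compute_loss` are placeholders carried over from the earlier snippets):

```
import tensorflow as tf

B = 32  # batch size
# pick a random batch of B data points
idx = tf.random.uniform((B,), maxval=len(xs), dtype=tf.int32)
x_batch, y_batch = tf.gather(xs, idx), tf.gather(ys, idx)
with tf.GradientTape() as tape:
    loss = compute_loss(y_batch, model(x_batch))  # averaged over the batch
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```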

- More accurate estimation of the gradient
  - Smoother convergence
  - Allows for larger learning rates
- Mini-batches lead to fast training
  - Can parallelize computation
  - Achieve significant speedups on GPUs

# Neural Networks in Practice: Overfitting

Overfitting is a general problem in machine learning.

- Underfitting: Model does not have capacity to fully learn the data
- Ideal fit
- Overfitting: Too complex, extra parameters, does not generalize well

## Regularization

A technique that constrains our optimization problem to discourage complex models and improve generalization of our model to unseen data.

### Regularization 1: Dropout

- During training, randomly set some activations to 0
  - Typically 'drop' 50% of activations in a layer
  - Forces the network to not rely on any one node
- Repeat every iteration
  - Builds a more robust representation for its predictions
  - Generalizes better to new test data

```
tf.keras.layers.Dropout(rate=0.5)
```
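In context, dropout is placed between dense layers and is only active during training; a sketch:

```
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),  # randomly zero 50% of activations
    tf.keras.layers.Dense(2)
])
```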

### Regularization 2: Early Stopping

- Stop training before we have a chance to overfit.
- Monitor the loss on held-out test data and find the point where it starts to diverge from the training loss; that inflection point sits between underfitting and overfitting (see the sketch below).
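Keras provides a callback for this; a sketch (assuming validation data `x_val`, `y_val` is held out so `val_loss` can be monitored):

```
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          callbacks=[early_stop])
```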

# Summary

- The Perceptron
  - Structural building blocks
  - Nonlinear activation functions
- Neural Networks
  - Stacking perceptrons to form neural networks
  - Optimization through backpropagation
- Training in Practice
  - Adaptive learning
  - Batching
  - Regularization