# Course and lecture info

# Deep Sequence Modeling

- Feed-forward models in the previous lectures
- Sequential processing of data

Predict the movement using previous data which is sequencial. e.g. stock prices, EEG signals

# A Sequence Modeling Problem: Predict the Next Word

"This morning I took my cat for a [predict]."

**Problem**: Feed-forward network can only take a fixed length input. The model needs to handle variable length inputs.**Solution**: Uses a fixed window. Use only certain length of the words. e.g. last 2 words.

One-hot feature encoding: tells us what each word is.

```
[10000 01000] -> [ ]
for a prediction
```

## Problem 1: Can't Model long term dependencies

"France is where I grew up, but I now live in Boston. I speak fluent [ ]."

We need information from the distance past to accurately predict the correct word. Not only past few words.

## Idea 2: Use Entire Sequence as Set of Counts

Use a bag of words.

```
"This morning I took my cat for a" -> [0100100...00110001] -> prediction
```

**Problem!** Counts Don't Preserve Order

These two sentences are exact same representations but the semantic meanings are opposite.

- The food was good, not bad at all.
- The food was bad, not good at all.

## Idea 3: Use a Really Big Fixed Window

```
[10000 00001 00100 01000 00010 ...] -> prediction
this morning took the cat
```

**Problem!** Each of these inputs has a separate parameter. The meaning is same but the parameter can be separated. This we learn about the sequence won't transfer if the appear elsewhere in the sequence.

```
[10000 00001 00100 01000 00010 ...]
this morning took the cat
[00100 01000 00010 10000 00001 ...]
took the cat this morning
```

## Design Criteria

To model sequences, we need to:

- Handle
**variable-length**sequences - Track
**long-term**dependencies - Maintain information about
**order** **Share parameters**across the sequence

Use Recurrent Neural Networks (RNNs) as an approach to sequence modeling problems in this lecture.

# Recurrent Neural Networks (RNNs)

Comparsion RNN with Standard Feed-Forward Neural Network

```
One to One -- "Vanilla" neural network
Many to One -- Sentiment Classification
Many to Many -- Music Generation
# and many others
```

```
# RNN
+-------------+
| |
+---------+----------+ |
x_t +---->+ RNN | |
| Recurrent cell +------> \hat{y}_t
+---------+----------+ |
^ |
+-------------+ h_t
```

Apply a **recurrence relation** at every time step to process a sequence:

$$ \begin{aligned} h_t &= f_w(h_{t-1}, x_t) \ h_t&: \text{cell state} \ f_w&: \text{function parameterized by W} \ h_{t-1}&: \text{old state} \ x_t&: \text{input vector at time step t} \ \end{aligned} $$

Note: the same function and set of parameters are used at every time step.

## RNN Intuition

```
my_rnn = RNN()
hidden_state = [0, 0, 0, 0]
sentence = ["I", "love", "recurrrent", "neural"]
for word in sentence:
prediction, hidden_state = my_rnn(word, hidden_state)
next_word_prediction = prediction
# >>> "networks!"
```

## RNN State Update and Output

$$
\begin{aligned}
\text{Input Vector: }& x_t \
\text{Update Hidden State: }& h_t = \tanh(W_{hh}^Th_{t-1} + W_{xh}^Tx_{t}) \
\text{Output Vector: }& \hat{y}*t = W*{hy}^th_t \
\end{aligned}
$$

Application of a weighted matrix and non-linearity. Two params of the previous result and produce the output.

## Computational Graph Across Time

Re-use the same weight matrices at every time step. Forward pass to the final result over time. Compute a loss at each time step and this computation of the loss will then complete forward propagation through the network.

(Li/Johnson/Yeung C231n.)

Total loss is a sum the losses of individual contributions over time. (time component involved.)

```
class MyRNNCell(tf.keras.layers.Layer):
def __init__(self, rnn_units, input_dim, output_dim):
super(MyRNNCell, self).__init__()
# Initialize weight matrices
self.W_xh = self.add_weight([rnn_units, input_dim])
self.W_hh = self.add_weight([rnn_units, rnn_units])
self.W_hy = self.add_weight([output_dim, rnn_units])
# Initialize hidden state to zeros
self.h = tf.zeros([rnn_units, 1])
def call(self, x):
# Update the hidden state
self.h = tf.math.tanh(self.W_hh * self.h * self.W_xh * x)
# Compute the output
output = self.W_hy * self.h
# Return the current output and hidden state
return output, self.h
# tf.keras.layers.SimpleRNN(rnn_units)
```

# Backpropagation Through Time (BPTT)

## Backpropagation

- Take the derivative (gradient) of the loss with respect to each parameter
- Shift parameters in order to minimize loss

Backpropagation in RNN,

- Errors are back propagated at each individual time step
- Finally across all time steps (Backpropagation through time)

## Standard RNN Gradient Flow

Computing the gradient wrt $h_0$ involves many factors for $W_{hh} + \text{repeated gradient computation}$.

- $\text{Many values} > 1$: exploding gradients.

=> Sln: Gradient clipping to scale big gradients. - $\text{Many values} < 1$: Vanishing gardients.

=> Sln:- Activation function
- Weight initialization
- Network architecture

## The problem of long-term dependencies

**Vanishing gradients**, why matter?

- Multiply many
**small numbers**together. - Errors due to further back time steps have smaller and smaller gradients.
- Bias parameters to capture short-term dependencies.

When more context requires, RNN cannot connect the dot between the context because of vanishing gradients problem.

## Trick 1: Activation Functions

Using ReLU prevent $f'$ from shirinking the gradient when $x > 0$. By choosing more contrast activation function, the network can prevent vanishing gradient problem. (Compare ReLU derivative, tanh, and sigmoid.)

## Trick 2: Parameter Initialization

Initialize **weights** to identity matrix. Initialize **biases** to zero. This helps prevent the weigths from shirinking to zero.

$$ I_n = \begin{pmatrix} 1 & 0 & 0 & \dots & 0 \ 0 & 1 & 0 & \dots & 0 \ 0 & 0 & 1 & \dots & 0 \ \vdots & \vdots & \vdots & \ddots & \vdots \ 0 & 0 & 0 & \dots & 1 \end{pmatrix} $$

## Solution 3: Gated Cells

Idea: use a more **complex recurrent unit with gates** to control what information is passed through. e.g. LSTM, GRU, etc.

**Long Short Term Memory (LSTMs)** networks rely on a gated cell to track information throughtout many time steps. Well-suited for learning long-term dependencies to overcome this vanishing $\nabla$ problem.

# Long Short Term Memory (LSTM) Networks

- In a standard RNN, repeating modules contain a
**simple computation node**. - LSTM modules contain
**computational blocks**that**control information flow**. LSTM cells are able to track information throughout many timesteps.

```
tf.keras.layers.LSTM(num_units)
```

(Hochreiter & Schmidhuber, Neural Computation 1997.)

Information is **added** or **removed** through structures called **gates** $\sigma$. Gats optionally let information through, for example via a sigmoid neural net layer and pointwise multiplication.

"How much of the information through the gate should be retained?" => Selectively update their internal state, generate an output => gating/regulating the flow of information effectively

Ref: Olah, "Understanding LSTMs."

- Forget: LSTM's forget irrelevant parts of the previous state $f_t$
- Store: LSTMs store relevant new information into the cells state $i_t$
- Update: LSTMs selectively update cell state values $c_{t-1} \xrightarrow{f_t, i_t} c_{t}$
- Output: The output gate controls what information is sent to the next time step. $o_t$

## LSTM Gradient Flow

Internal cell state $C$ allows for the interrupted flow to find the gradients through time. "a highway of cell states" and also alleviate and mitigate the vanishing gradient problem.

While training, identify what are those bits of prior history that carry more meaning that are important to predict the next word and discard what is not relevant to.

## Key Concepts

- Maintain a separate cell state from what is outputted
- Use gates to control the flow of information
- Forget gate gets rid of irrelevant information
- Store relevant information from current input
- Selectively update cell state
- Output gate returns a filtered version of the cell state

- Backpropagation through time with uninterrupted gradient flow

# RNN Applications

## Example Task: Music Generation

- Input: sheet music
- Output: next character in sheet music

## Example Task: Sentiment Classification

- Input: sequence of words
- Output: probability of having positive sentiment

```
loss = tf.nn.softmax_cross_entropy_with_logits(y, predicted)
```

## Example Task: Machine Translation

Encode original language sentence into a state vector and decoder takes encoded language. **Encoding bottleneck**: need to be a single vector.

Sln: **Attention Mechanisms** is neural networks provide **learnable memory access** from the original sentence. (Extends the scope of state so that decorder also can access the encoder state in all time.)

When the network learns this waiting, its placing its attention in the original sentences so that effectively capture the memory of the important information.

## Examples

- Trajectory Prediction: Self-Driving Cars: encounters a cyclist cuts the lane so that slow down.
- Environmental Modeling: predict the future behavior of weather

# Summary

- RNNs are well suited for sequence modeling tasks
- Model sequences via a recurrence relation
- Training RNNs with backpropagation through time
- Gated cells like LSTMs let us model long-term dependencies
- Models for music generation, classification, machine translation, and more

Check the lab session 1.