RNN Foundations
What is an RNN? How does NLP work?
from fastai.text.all import *
Credit Where Credit is Due
The concepts and techniques covered in this post are explained in much greater detail in Jeremy Howard and Sylvain Gugger's book. If you like this post, you should buy the book, as you'll probably like it even more!
https://www.amazon.com/gp/product/1492045527/ref=ppx_yo_dt_b_asin_image_o08_s00?ie=UTF8&psc=1
path = untar_data(URLs.HUMAN_NUMBERS)
lines = L()
with open(path/"train.txt") as f: lines += L(*f.readlines())
with open(path/"valid.txt") as f: lines += L(*f.readlines())
lines
Tokenization
What is Tokenization? Tokenization is about breaking text into 'tokens' of language that carry meaning. A word could be a token, as it has meaning. A piece of punctuation could be a token, as it has meaning. A word being in all capital letters could be marked with a token. A portion of a word could be a token (e.g. 'dis'), because a word beginning with 'dis' has meaning. There are many, many ways to tokenize; for this post I will use the simplest approach: splitting on spaces so that each word becomes a token.
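To make this concrete, here is a purely illustrative sketch of two ways the same string could be tokenized (the second split is hypothetical, loosely in the style of subword tokenizers and fastai's xxup marker for all-caps words):

s = "She disliked IT."
s.split(' ')                                 # word tokens: ['She', 'disliked', 'IT.']
['she', 'dis', 'liked', 'xxup', 'it', '.']   # hypothetical subword + special-token split

Now let's apply the simple space-based split to our data. First we join all the lines into one long string, with ' . ' marking the line boundaries: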
txt = ' . '.join([l.strip() for l in lines])
tokens = L(*txt.split(' ')); tokens
Numericalization
Now that things are split into tokens, we need to start thinking about how to feed them to a Neural Network. Neural Networks rely on multiplication and addition, and we can't do that with a word. Somehow we need to convert these tokens to numbers, and that is what Numericalization is all about. We will do this in a few steps:
- Get a unique list of all tokens (v)
- Assign a number to each token (vocab)
- Replace tokens with numbers (nums)
v = tokens.unique()
# Assign a number to each token (vocab)
vocab = {v:i for i,v in enumerate(v)};
# We can lookup the number associated with a token like this
vocab['fifty']
nums = L(vocab[tok] for tok in tokens); nums
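As a quick sanity check, we can map the first few numbers back to their tokens with v; it should read back the start of the text ('one . two . three ...').

# Just for illustration: translate the first few numbers back into tokens
' '.join(v[i] for i in nums[:10])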
Sequence Definition
Now that we have tokens in the form of numbers, we need to create our inputs and outputs to the model. For this we need to organize our data into dependent and independent variables. Let's use the preceding 3 words to predict the next word. Below, we see the same thing in 2 ways: once with tokens and once with numbers. They are the same thing, just with the tokens translated to numbers using the vocab above.
sl = 3
# For example, we will use the tokens 'one','.', and 'two' to predict '.'
L((tokens[i:i+sl], tokens[i+sl]) for i in range(0,len(tokens)-sl-1,sl))
seqs = L((tensor(nums[i:i+sl]), nums[i+sl]) for i in range(0,len(nums)-sl-1,sl)); seqs
bs = 128
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut],seqs[cut:],bs=bs, shuffle=False)
# An equivalent pair of DataLoaders built by hand
dls2 = DataLoader(seqs[:cut],bs=bs, shuffle=False)
dls3 = DataLoader(seqs[cut:],bs=bs, shuffle=False)
dls4 = DataLoaders(dls2,dls3)
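Before modelling, it's worth peeking at one batch to confirm the shapes: with bs=128 and sl=3, x should be 128x3 and y should have 128 targets.

# Inspect one training batch
x,y = dls.one_batch()
x.shape, y.shape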
Before building a model, let's establish a baseline: how often would we be right if we always predicted the most common token in the validation set?

# Count how often each token appears as a target in the validation set
n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
# The most common token, and the accuracy we'd get by always predicting it
idx = torch.argmax(counts)
idx, v[idx.item()], counts[idx].item()/n
RNN Number 1
Code
We are going to make the simplest RNN we can. Here's a quick explanation of the code below.
for i in range(sl):
Because we are feeding in a number of tokens determined by our sequence length, sl, which was defined as 3, we will have 3 steps, 1 per token.
h = h + self.i_h(x[:,i])
For each input token we run our input-to-hidden function. Indexing into the embedding matrix grabs the row that corresponds to the token, and we add it to the hidden state. All this is doing is adding the embedding for that particular token.
h = F.relu(self.h_h(h))
We then run our hidden-to-hidden function (h_h), which is a linear layer (y = wx + b), and take the ReLU of the result, which just replaces any negative values with 0.
return self.h_o(h)
We then run our hidden-to-output function (h_o), which is another linear layer, but this one outputs the prediction of which word comes next. Naturally, its output is the size of our vocabulary.
Wrap all that in a class and it looks like this:
class LM1(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden,vocab_sz)
def forward(self, x):
h = 0
for i in range(sl):
h = h + self.i_h(x[:,i])
h = F.relu(self.h_h(h))
return self.h_o(h)
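Before training, a quick shape check (assuming bs=128 and a vocab of 30 tokens): a batch of 3-token sequences should come out as one score per vocab item for each sequence.

# Untrained model, just checking shapes: (bs, sl) in -> (bs, vocab_sz) out
x,y = dls.one_batch()
LM1(len(vocab), 64)(x).shape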
Now we can run it below and see that we get almost 50% accuracy before we overfit, which is great considering the most common token only appears 15% of the time.
learn = Learner(dls, LM1(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
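We can also poke at the trained model directly. The snippet below is just an illustration (it pulls the model onto the CPU so we don't have to worry about devices): which token does the model predict after 'one . two'?

# Illustrative only: predict the token that follows 'one . two'
x = tensor([[vocab['one'], vocab['.'], vocab['two']]])
with torch.no_grad(): preds = learn.model.cpu()(x)
v[preds.argmax(dim=1).item()]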
Embeddings
Let's start with our input-to-hidden step (i_h). Our embedding matrix has 64 weights (n_hidden) for each token in our vocabulary. So it looks like this:
$\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64-weights} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}30-tokens$
Now all the embedding layer does is grab the correct rows. For the first word in each sequence we take its index and look it up in the embedding matrix, so that 1 index location turns into 64 weights.
$\underbrace{ \begin{bmatrix} \cdots \\ \cdots \\ \cdots \\ \cdots \\ \cdots \\ \cdots \\ \end{bmatrix}}_{\displaystyle token-idx} \left.\vphantom{\begin{bmatrix} \cdots \\ \cdots \\ \cdots \\ \cdots \\ \cdots \\ \cdots \\ \end{bmatrix}}\right\}128-bs$ $==$ lookup in embedding matrix $==>$ $\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}128$
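One way to see that this lookup is still just matrix math: indexing into an embedding is equivalent to multiplying a one-hot vector by the embedding matrix. A minimal check, assuming 30 tokens and 64 weights:

# Embedding lookup == one-hot vector @ embedding matrix
emb = nn.Embedding(30, 64)
idx = tensor([5])
one_hot = F.one_hot(idx, 30).float()
torch.allclose(emb(idx), one_hot @ emb.weight)   # True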
Hidden Linear Layer
Next, we have our hidden-to-hidden step (h_h). We have our 128x64 matrix from the embedding lookup, and we need to put it through a linear layer.
$\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64-weights} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}128-bs$ $\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}64$ $+$ $\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64-bias} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}1$ $=$ $\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64-weights} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}128-bs$ ===ReLu - Replace all negatives with 0 ===> $\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64-weights} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}128-bs$
And we do the above for however long our sequence is, in our case 3. So for each token we repeat these steps. We start with a hidden state of 0 on the first loop, and on each subsequent loop we add onto it.
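In code, that whole picture is just a linear layer followed by a ReLU. A small shape-only sketch, assuming bs=128 and n_hidden=64:

# Shape check for the hidden-to-hidden step
h = torch.zeros(128, 64)     # stand-in for the running hidden state
h_h = nn.Linear(64, 64)      # 64x64 weights plus a bias of 64
F.relu(h_h(h)).shape         # torch.Size([128, 64])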
Output Linear Layer
We ended with a 128x64 matrix, which isn't exactly what we want. We have 30 words in our vocab, so we want to know which one of the 30 is most likely. Specifically, for each of the 128 items in our batch we want 30 scores (1 for each word in the vocab). So we do a similar step to our hidden linear layer, but adjust the number of weights so we end up with a matrix of the appropriate size.
$\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 64-weights} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}128-bs$ $\underbrace{ \begin{bmatrix} \cdots & \cdots\\ \cdots & \cdots\\ \cdots & \cdots\\ \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 30} \left.\vphantom{\begin{bmatrix} \cdots & \cdots\\ \cdots & \cdots\\ \cdots & \cdots\\ \cdots & \cdots\\ \end{bmatrix}}\right\}64$ $+$ $\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 30-bias} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}1$ $=$ $\underbrace{ \begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}_{\displaystyle 30-preds} \left.\vphantom{\begin{bmatrix} \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \cdots & \cdots & \cdots & \cdots\\ \end{bmatrix}}\right\}128-bs$
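The same idea as a shape-only sketch, with the linear layer now mapping the 64 hidden activations to one score per token in the vocab:

# Shape check for the hidden-to-output step
h = torch.zeros(128, 64)
h_o = nn.Linear(64, 30)      # 64 -> 30, one score per vocab item
h_o(h).shape                 # torch.Size([128, 30])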
RNN Number 2
Now that we have a simple model, how do we improve it? There are many steps that need to be taken to get to a cutting-edge model. We'll make one improvement here and leave the rest for future blog posts.
One thing that was a bit odd is that in the training loop we reset the hidden state back to 0 every time. That is, we would loop through each of the 3 tokens, output our predictions for those, update the weights, then reset the hidden state for a new set of 3. This isn't really how language works. Language has a pattern and a sequence to it. The further back you go the less important it is, but even things said a couple of minutes ago could still matter. Could you imagine holding a conversation if you could only remember and respond based on the last 3 words?
So let's fix this problem. We will move the h = 0 up into the initialization of the class and never reset it back to 0; instead, we will keep adding to it continuously. We will only backpropagate through the current sequence's calculations (if we kept the full history, by the 1000th batch we would be backpropagating through far too many operations to compute). Cutting off that history is called "detaching" the hidden state. Ultimately we are left with the same model, but it has a memory of previous sequences beyond the one we are processing! Let's see if it makes things better.
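First, a tiny illustration of what detach does: it keeps the values of a tensor but cuts it off from the computational graph, so gradients stop flowing back through it.

# detach() keeps the values but drops the gradient history
a = torch.ones(3, requires_grad=True)
b = (a * 3).detach()
b, b.requires_grad           # values kept, requires_grad is now False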
class LM2(Module):
def __init__(self, vocab_sz, n_hidden):
self.i_h = nn.Embedding(vocab_sz, n_hidden)
self.h_h = nn.Linear(n_hidden, n_hidden)
self.h_o = nn.Linear(n_hidden,vocab_sz)
self.h = 0
def forward(self, x):
for i in range(3):
self.h = self.h + self.i_h(x[:,i])
self.h = F.relu(self.h_h(self.h))
out = self.h_o(self.h)
self.h = self.h.detach()
return out
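A rough check (shapes assumed from the DataLoaders above) that the hidden state really does stick around between forward passes:

# The hidden state persists on the module between calls
lm2 = LM2(len(vocab), 64)
x,y = dls.one_batch()
_ = lm2(x)
lm2.h.shape                  # torch.Size([128, 64]), carried into the next batch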
To do this we need to take care that our data is seen in the appropriate order, so let's do a few transformations to make that work.
m = len(seqs)//bs
m,bs,len(seqs)
def group_chunks(ds, bs):
m = len(ds) // bs
new_ds = L()
for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
return new_ds
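To see what that reordering does, here's a toy illustration with ten fake items and bs=2: sample j of each batch is the continuation of sample j from the previous batch.

# Toy example: items get interleaved so each batch position reads a contiguous stream
group_chunks(L(range(10)), 2)
# -> [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]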
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
group_chunks(seqs[:cut], bs),
group_chunks(seqs[cut:], bs),
bs=bs, drop_last=True, shuffle=False)
learn = Learner(dls, LM2(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(10, 3e-3)
And we are up from about 50% accuracy to about 60%!