
Language Models [01] - Intro and theory

If there is one single topic I could recommend to every person learning NLP, it would be language models.

The chances that you will need to implement an LM in your work are minimal, but the concepts you can learn along the way are priceless. Many NLP topics - sequence-to-sequence models, machine translation, summarization, or even the famous transformer models - are built on top of language models. Of course, they also have some direct applications.

In this series, we are going to build a language model using PyTorch. We will start with the required theory. Next, we will examine the data, build the model, and evaluate it. In the end, we will have some fun with the trained model!

Series

This post is part of a longer series about causal language modeling. The aim is to give you all the required theoretical background and to help you understand the topic by actually implementing a language model. We will start with the simplest possible model and gradually expand it.

Table of contents

  1. Introduction and theory (you are here)
  2. Dataset exploratory analysis
  3. Tokenizer
  4. Training the model
  5. Evaluation of language model
  6. Experiments with the model
  7. Exercises for you

Let’s get started!

Theory

Language models can be divided into smaller groups. The two basic ones are:

  • causal language models - also called autoregressive
  • masked language models - also called autoencoding

In this series, we will work with the first type - from now on, by language model I mean a causal one. Masked LMs are a separate topic ;)

We can also divide them by type of tokens:

  • word-level
  • subword-level
  • character-level

Each of these types makes a different assumption about what a token is. In this series, we will build a character-level model. The main reason is much faster training - it will let you learn and experiment faster. After completing this tutorial, you will be able to create a word-level model on your own, as the mechanics are the same.
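To make the difference concrete, here is a tiny sketch in plain Python (the sentence and the whitespace splitting are just illustrative assumptions, not the tokenizer we will actually build):

```python
sentence = "I like NLP"

# Word-level: split on whitespace - every word is a token.
word_tokens = sentence.split()   # ['I', 'like', 'NLP']

# Character-level: every single character (including spaces) is a token.
char_tokens = list(sentence)     # ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'N', 'L', 'P']

# Subword-level tokenizers (e.g. BPE) sit in between, but they have to be
# trained on data, so we only mention them here.
print(word_tokens)
print(char_tokens)
```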

Definition

A causal language model is a function that predicts the next token for a given sequence. Sounds simple, huh? Let’s make it a little bit more formal.

Tip
If you are afraid of formal definitions, you should not be - they really help in advanced scenarios.

Let’s start with the sentence - it is a sequence of tokens from the vocabulary. The first step in language modeling is to define the vocabulary \(V\) - simply the list of allowed tokens plus some special ones. We already said that tokens can be words, subwords, or characters. We denote a sentence as a sequence of tokens:

$$ S = w_0, w_1, w_2, \ldots, w_{n-1}, \quad w_i \in V $$

The model takes a sequence of tokens and predicts the next one. In other words, it predicts a probability for every token in the vocabulary - just like in a regular classification problem. We denote it as follows:

$$ P(w_n | w_0, w_1, w_2, \ldots, w_{n-2}, w_{n-1}) $$

For example, if we ask the model “What is the next token for the sequence ‘I like ’?” we might get:

  1. to (0.11)
  2. good (0.05)
  3. it (0.03)
  4. Chopin (0.01)

Over the whole vocabulary, all of them sum up to 1 - so they form a proper probability distribution.

Using the formal notation, we can write:

  1. \( P(\text{to} | \text{I}, \text{like}) = 0.11 \)
  2. \( P(\text{good} | \text{I}, \text{like}) = 0.05 \)
  3. \( P(\text{it} | \text{I}, \text{like}) = 0.03 \)
  4. \( P(\text{Chopin} | \text{I}, \text{like}) = 0.01 \)
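Since the prediction is just a classification over the vocabulary, we can illustrate it with a softmax in PyTorch. A minimal sketch - the logits below are made-up numbers, not real model outputs:

```python
import torch

# Hypothetical raw scores (logits) for a toy 5-token vocabulary.
logits = torch.tensor([2.1, 0.3, -1.0, 0.7, -0.5])

# Softmax turns them into a proper probability distribution over the vocabulary.
probs = torch.softmax(logits, dim=0)

print(probs)        # roughly tensor([0.6531, 0.1080, 0.0294, 0.1611, 0.0485])
print(probs.sum())  # tensor(1.) - the probabilities always sum to 1
```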

We are good to go, but there are a few important corner cases.

Special tokens

What if we have a sentence that should not have a continuation? For example: “The sky is blue.” (I even added a full stop to indicate it).

$$ P(w_5 | \text{The}, \text{sky}, \text{is}, \text{blue}, \text{.}) = ? $$

We need a special token to mark the end of the sentence. It is usually denoted as </s>, <eos>, [EOS], or similar.

  1. \( P(\text{</s>} | \text{The}, \text{sky}, \text{is}, \text{blue}, \text{.}) = 0.8 \)
  2. \( P(\text{because} | \text{The}, \text{sky}, \text{is}, \text{blue}, \text{.}) = 0.02 \)
  3. \( P(\text{lyrics} | \text{The}, \text{sky}, \text{is}, \text{blue}, \text{.}) = 0.01 \)
  4. all other tokens (sharing the remaining probability mass)

But what about the beginning? What if we ask the model to predict the first token of a new sequence? Something like:

$$ P(w_0 | \varnothing) $$

We introduce another special token for it - <s>/<sos>/[BOS] - the start/beginning of sequence. So our equation becomes:

$$ P(w_0 | \text{<s>} ) $$

The introduction of start/end tokens means that we need to add them to every sentence we feed to the model.

$$ S = \text{<s>}, w_1, w_2, w_3, \ldots, w_{n-1}, \text{</s>} $$

For example, the sentence

The sky is blue .

becomes

<s> The sky is blue . </s>

There can be more special tokens - like <oov>/<unk> for out-of-vocabulary tokens or <pad> for length padding. We will introduce them later when needed.
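Here is a minimal sketch of that preprocessing step (the helper name and the whitespace tokenization are assumptions made for illustration; our character-level model will do the same thing with characters):

```python
BOS, EOS = "<s>", "</s>"

def add_special_tokens(tokens):
    """Wrap a tokenized sentence with start/end-of-sequence markers."""
    return [BOS] + tokens + [EOS]

sentence = "The sky is blue .".split()
print(add_special_tokens(sentence))
# ['<s>', 'The', 'sky', 'is', 'blue', '.', '</s>']
```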

Sentence probability

It turns out that we can use the same model to compute the likelihood of a whole sentence. Some examples:

  • \(P(\text{Order a taxi please!}) = 0.000000037 \)
  • \(P(\text{The Earth is flat}) = 0.00000000012 \)
  • \(P(\text{like pen green bottle}) = 0.0000000000042 \)
  • \(P(\text{xaaaaaz grrr ee fruu@@}) = 0.00000000000000000000001 \)

As you can see, sentences that sound reasonable get higher scores than nonsensical ones. Still, the absolute numbers are very low. That is because the probability mass has to cover all possible sentences - and there are a lot of them.

Advanced

To build a better intuition, consider the following language used by aliens:

  1. It has 2 000 words (there are about 170 000 words in English)
  2. Sentences can be at most seven words long

In such a scenario, the number of possible sentences is:

$$ |V| = 2000 $$ $$ 1 \le N \le 7 $$ $$ |S| = 2000 + 2000^2 + 2000^3 + \ldots + 2000^7 $$

It gives 128 064 032 016 008 004 002 000 possible sentences! Assuming each of them is equally likely, the probability of any single one is about 0.0000000000000000000000078 - even for such a primitive language.
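You can verify both numbers with a couple of lines of Python:

```python
V = 2000  # vocabulary size of the alien language

# Sentences of length 1 to 7 built from V tokens.
num_sentences = sum(V**n for n in range(1, 8))

print(num_sentences)      # 128064032016008004002000
print(1 / num_sentences)  # ~7.8e-24, assuming every sentence is equally likely
```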

Fortunately, we are usually more interested in relative likelihoods. For example:

$$ P(\text{“I have a car”}) > P(\text{“I has a car”}) $$

The model says that “have” is more likely than “has” in this sentence.

But how do we compute those? The answer is the chain rule. Imagine we have a sentence:

$$ S_0 = \text{<s>}, w_1, w_2, w_3, \text{</s>} $$

To compute the probability of the whole sentence, we start with the question: “what is the probability that the sentence begins with \(w_1\)?”

$$ P(w_1| \text{<s>})$$

Next, we compute how likely it is that \(\text{<s>}, w_1\) is followed by \(w_2\) and multiply it with the previous score.

$$ P(w_1| \text{<s>}) \cdot P(w_2 | \text{<s>}, w_1 )$$

Continuing with the next token:

$$ P(w_1| \text{<s>}) \cdot P(w_2 | \text{<s>}, w_1 ) \cdot P(w_3 | \text{<s>}, w_1, w_2 ) $$

And so on, until the end:

$$ P(w_1| \text{<s>}) \cdot P(w_2 | \text{<s>}, w_1 ) \cdot P(w_3 | \text{<s>}, w_1, w_2 ) \cdot P(\text{</s>} | \text{<s>}, w_1, w_2, w_3 ) $$

In the general case, it is defined as:

$$ P(S) = \prod_{i=1}^{n} P(w_i | w_0, w_1, \ldots, w_{i-1}) $$

where \(w_0 = \text{<s>}\) and \(w_n = \text{</s>}\).
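In code, the chain rule is just a loop that multiplies the model’s next-token probabilities step by step. The sketch below assumes a hypothetical next_token_prob(context, token) function standing in for the model we will train later:

```python
def sentence_prob(tokens, next_token_prob):
    """Probability of a whole sentence via the chain rule.

    `tokens` should already include the <s> and </s> special tokens;
    `next_token_prob(context, token)` is assumed to return P(token | context).
    """
    prob = 1.0
    for i in range(1, len(tokens)):
        context, token = tokens[:i], tokens[i]
        prob *= next_token_prob(context, token)  # one factor per token
    return prob

# Usage with a dummy "model" that gives every continuation probability 0.5:
def dummy(context, token):
    return 0.5

print(sentence_prob(["<s>", "I", "like", "NLP", "</s>"], dummy))  # 0.5 ** 4 = 0.0625
```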

Question
At each step we multiply numbers from the interval $[0, 1]$, so the value gets smaller and smaller, approaching zero. Can you see an issue with that?

Evaluation

Ok, but how do we know if the model is good? Maybe we could just treat it as a multi-class classification problem?

For language models, there is a dedicated metric called perplexity. It is strongly related to information entropy. We will explore it in detail later; for now, we simplify things.

Intuitively, perplexity tells us “how many tokens, on average, the model considers as candidates for the next one”. That means lower is better (just like loss). The lower bound is 1 (the model cannot consider fewer than one token), while there is no upper bound. That may sound useless (how do we know we are doing well if the result can be arbitrarily high?), but recall that metrics like MAE also have no maximum value. Perplexity is also affected by vocabulary size and token distribution, so we cannot make a fair comparison between models that use different tokenization methods.

Given that, is it useful at all? My answer is yes, but not as much as accuracy, F1, or similar metrics. For now, it is enough to know that we should aim for a lower PPL during model training.
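In practice, we will get perplexity almost for free: it is the exponential of the average per-token cross-entropy loss that PyTorch already computes for us. A minimal sketch with random placeholder data:

```python
import torch
import torch.nn.functional as F

vocab_size, num_tokens = 50, 8
logits = torch.randn(num_tokens, vocab_size)           # fake model outputs
targets = torch.randint(0, vocab_size, (num_tokens,))  # fake "true" next tokens

cross_entropy = F.cross_entropy(logits, targets)  # average negative log-likelihood
perplexity = torch.exp(cross_entropy)             # lower is better, minimum is 1

print(perplexity)
```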

Summary

Ok, that was quite a lot of theory. Now we are prepared for implementing the model!

Info
If you liked this post, consider joining the newsletter.