Language Models [02] - EDA

We start the implementation with exploratory dataset analysis and a project template.

Abstract

This post is part of a series:

  1. Introduction and theory
  2. Dataset exploratory analysis (you are here)
  3. Tokenizer
  4. Training the model
  5. Evaluation of language model
  6. Experiments with the model
  7. Exercises for you

Project template

If you want to code along (which I strongly recommend), you can use any project template you like. The template I will use is as follows:

[Figure: project directory structure (02_structure.png)]
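In case the image does not render for you, here is a rough sketch of the same layout, reconstructed from the description below (the root directory name is just the project name):

char-lm/
├── char_lm/
├── data/
│   ├── raw/
│   ├── interim/
│   └── dataset/
├── models/
├── notebooks/
├── scripts/
├── tests/
├── .pre-commit-config.yaml
├── poetry.lock
└── pyproject.toml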

Going top-down:

  • char_lm - Directory with the model’s package
  • data - All data. It contains multiple subdirectories:
    • raw - Data as downloaded/scraped/extracted. Read only, no manual manipulations allowed!
    • interim - Intermediate formats - after preprocessing etc.
    • dataset - Datasets after preparation, ready to be used by models.
  • models - Storage for trained models
  • notebooks - Jupyter notebooks used for exploration/rough experimentation. No other use allowed!
  • scripts - Utility scripts that are not part of the model package. The distinction between scripts and char_lm is pretty arbitrary
  • tests - Unit tests. Yes, we are going to write them.
  • .pre-commit-config.yaml - I use pre-commit to help me keep up code quality. Optional
  • poetry.lock/pyproject.toml - I use poetry for installing packages. I encourage you to give it a try, but it is fine to stick with conda/pip

You can download the complete project (once it is finished, the link will appear here).

Dataset

As I stated in the first part, we will start with a character-level language model and switch to subwords later. For now, a suitable dataset is US Census: City and Place Names. Download and extract it to data/raw/us-city-place-names.csv. Spend a few minutes familiarizing yourself with the data, and then we will move on to exploration.

Tip
Try to perform exploratory analysis on your own and compare your conclusions with mine!

The complete notebook is available on GitHub: https://github.com/artofai/char-lm/blob/master/notebooks/cities-eda.ipynb.

EDA

The first issue is encoding. If we try to load the file with default pd.read_csv settings, it will throw a Unicode exception. Unfortunately, there is no information about the encoding on the dataset page, so I did a small investigation, and it looks like ISO-8859-1 (or Latin-1 - it’s the same thing) is the correct one.

Note
If you would like to read more about dealing with encoding issues, leave a comment.
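If you want to make a similar guess yourself, a library like chardet can help. A minimal sketch (chardet is an extra dependency used only for this check, not something this project requires):

import chardet

# Read a chunk of the raw file as bytes and let chardet guess the encoding
with open('../data/raw/us-city-place-names.csv', 'rb') as f:
    raw = f.read(100_000)

print(chardet.detect(raw))  # dict with the guessed 'encoding' and a 'confidence' score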
import pandas as pd

df = pd.read_csv('../data/raw/us-city-place-names.csv', encoding='ISO-8859-1')
df.sample(10)
       state_id      state_name                    city
40583        45           Texas                  Leakey
46096        50   West Virginia                     Pax
36000        39            Ohio  Jackson Center village
25542        30         Montana             Hot Springs
29099        34  North Carolina                 Lowland
43780        48           Texas                  Burton
24541        29        Missouri              Ellisville
48214        55       Wisconsin        Endeavor village
6706         12         Florida           Neptune Beach
25650        31      New Jersey                 Beverly

There are three columns, and we are interested only in city. We could also utilize the information about the state, but let’s leave that for another time.

There are no nulls (yay), and after deduplicating city, we have about 25000 examples - decent.

df.isnull().sum()
state_id      0
state_name    0
city          0
dtype: int64
df_deduplicated = df.drop_duplicates('city')
len(df_deduplicated)
24583

Next, let’s analyze the lengths of the city names (in characters).

import seaborn as sns

cities = df_deduplicated['city']
lens = [len(s) for s in cities]
sns.distplot(lens, rug=True)  # in newer seaborn versions: sns.histplot(lens, kde=True)

The distribution looks pretty nice - roughly a skewed normal with a bit of a tail. Let’s take a look at the longest examples:

cities.sort_values(key=lambda s: s.str.len())[-20:]
12555                                South Chicago Heights village
31898                                Village of the Branch village
29685                                Peapack and Gladstone borough
48228                               Fontana-on-Geneva Lake village
17285                               Lexington-Fayette urban county
25161                              Village of Four Seasons village
4160                               El Paso de Robles (Paso Robles)
22113                              Village of Grosse Pointe Shores
20047                             Chevy Chase Section Five village
20048                            Chevy Chase Section Three village
7642                             Webster County unified government
30225                           Los Ranchos de Albuquerque village
7293                         Echols County consolidated government
16237                  Greeley County unified government (balance)
7327                  Georgetown-Quitman County unified government
7254               Cusseta-Chattahoochee County unified government
7151             Athens-Clarke County unified government (balance)
42659         Nashville-Davidson metropolitan government (balance)
17293       Louisville/Jefferson County metro government (balance)
7155     Augusta-Richmond County consolidated government (balance)
Name: city, dtype: object

I don’t live in the US, so the names with (balance) are a mystery to me. I will remove them later.

Next, let’s take a look at the character distribution:

from collections import Counter
from itertools import chain
char_freq = Counter(chain.from_iterable(cities))
sorted(char_freq.items(), key=lambda x: x[1])
[('ñ', 3),
 ('/', 3),
 ('X', 4),
 ('(', 15),
 (')', 15),
 ("'", 27),
 ('j', 52),
 ('Z', 61),
 ('Q', 68),
 ('-', 84),
 ('.', 108),
 ('Y', 134),
 ('q', 153),
 ('U', 168),
 ('z', 295),
 ('x', 364),
 ('J', 406),
 ('I', 424),
 ('K', 653),
 ('V', 741),
 ('O', 762),
 ('T', 1042),
 ('D', 1069),
 ('E', 1110),
 ('N', 1134),
 ('A', 1233),
 ('F', 1300),
 ('f', 1362),
 ('G', 1444),
 ('R', 1551),
 ('W', 1714),
 ('L', 2002),
 ('H', 2075),
 ('P', 2166),
 ('p', 2257),
 ('M', 2564),
 ('B', 2647),
 ('w', 2708),
 ('S', 3132),
 ('m', 3292),
 ('k', 3484),
 ('C', 3518),
 ('b', 3550),
 ('y', 3772),
 ('c', 3874),
 ('d', 5645),
 ('h', 5900),
 ('u', 6361),
 ('v', 6445),
 ('g', 8222),
 ('s', 9229),
 ('t', 11389),
 (' ', 12847),
 ('n', 15172),
 ('r', 15726),
 ('i', 16546),
 ('o', 17874),
 ('a', 21525),
 ('l', 22549),
 ('e', 25270)]

Looks good, except for a handful of rare characters: ñ, /, and the parentheses.
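As a small aside (not part of the original notebook), the keys of char_freq are effectively the character vocabulary of the corpus, which will become relevant when we build the tokenizer:

# The distinct characters form the vocabulary of a character-level model
vocab = sorted(char_freq)
print(len(vocab))
print(''.join(vocab))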

Removing outliers

According to our analysis, we are going to remove entries containing any of ()ñ/.

idx_to_drop = cities.str.contains(r'ñ|\/|\(|\)')
print(idx_to_drop.sum())
print(cities[idx_to_drop])
cleaned_cities = cities[~idx_to_drop]
cleaned_cities
3299                                             Fredonia (Biscoe)
4160                               El Paso de Robles (Paso Robles)
4235                                          La Cañada Flintridge
4398                                    San Buenaventura (Ventura)
4874                                                    Cañon City
5051                                           Raymer (New Raymer)
5130                                             Milford (balance)
7151             Athens-Clarke County unified government (balance)
7155     Augusta-Richmond County consolidated government (balance)
13788                                       Indianapolis (balance)
16237                  Greeley County unified government (balance)
17293       Louisville/Jefferson County metro government (balance)
25497                                   Butte-Silver Bow (balance)
30199                                                     Española
40771                                       Naval Air Station/ Jrb
42575                                  Hartsville/Trousdale County
42659         Nashville-Davidson metropolitan government (balance)
47809                                    Addison (Webster Springs)
47820                                      Bath (Berkeley Springs)
48039                                         Womelsdorf (Coalton)
Name: city, dtype: object

It is always good to do a sanity check: how much data will be removed, and which entries exactly. This operation removes 20 examples, and the removal list looks fine to me.

At this point, I stop the EDA. cleaned_cities now holds the list of cities we will use for modeling.

Build dataset script

We could, in theory, save the dataset directly from the exploratory notebook, but that would be bad practice - data engineering and exploration shouldn’t be mixed. Because of this, let’s create a data preprocessing script in scripts/build_dataset.py.

Our knowledge from the EDA can be transferred into a function. We already know what kind of preprocessing we want to apply, so the implementation is straightforward:

def preprocess_dataset(df: pd.DataFrame) -> List[str]:
    """Preprocesses the cities dataset by removing entries with invalid characters

    Args:
        df (pd.DataFrame): Source dataframe, requires column "city"

    Returns:
        List[str]: A list of cleaned cities
    """
    df = df.drop_duplicates("city")
    logger.info(f"Rows after deduplication: {len(df)}")

    cities = df["city"]
    idx_to_drop = cities.str.contains(r"ñ|\/|\(|\)")
    cities = cities[~idx_to_drop]
    logger.info(f"Dropped {idx_to_drop.sum()} outliers")

    return list(cities)

The second piece is to take the cleaned cities, split them into train/val/test sets, and save them into separate files. I also save the entire dataset at this step - maybe it will be helpful later. To make the process deterministic, a random state can be provided.

Tip
The ability to reproduce your results is important. Consider saving the seed value everywhere you use a random generator.
def split_and_save(cities: List[str], out_dir: Path, random_state: int) -> None:
    """Splits dataset into train/val/test and saves to text files

    Args:
        cities (List[str]): List of cleaned cities
        out_dir (Path): Output directory. Will be created if not exists
        random_state (int): Random state for splitting
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    logger.info(f"Saving results to {out_dir}")

    with (out_dir / "full.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(cities))

    keep_set, test_set = train_test_split(
        cities, test_size=0.2, random_state=random_state
    )

    train_set, val_set = train_test_split(
        keep_set, test_size=0.2, random_state=random_state + 1
    )

    with (out_dir / "train.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(train_set))
    with (out_dir / "val.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(val_set))
    with (out_dir / "test.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(test_set))
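Note that test_size=0.2 is applied twice: 20% of the data goes to the test set, then 20% of the remainder goes to validation, which works out to roughly a 64/16/20 train/val/test split. A quick back-of-the-envelope check (the counts are approximate; exact sizes depend on rounding inside train_test_split):

n = 24563                          # cleaned cities: 24583 deduplicated minus 20 outliers
n_test = round(n * 0.2)            # ~4913
n_val = round((n - n_test) * 0.2)  # ~3930
n_train = n - n_test - n_val       # ~15720
print(n_train, n_val, n_test)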

The building blocks are ready. We can create a simple command-line interface using click. We want to pass the raw input file, the output directory, the input file’s encoding, and a random state.

@click.command()
@click.option("--input-path", "-i", required=True)
@click.option("--output-dir", "-o", required=True)
@click.option("--encoding", default="ISO-8859-1")
@click.option("--random-state", type=int, default=42)
def main(input_path: str, output_dir: str, encoding: str, random_state: int = 42):
    logging.basicConfig(level=logging.INFO)

    input_path = Path(input_path)
    output_dir = Path(output_dir)
    logger.info(f"Reading file {input_path} with encoding {encoding}")

    df = pd.read_csv(input_path, encoding=encoding)
    logger.info(f"Read {len(df)} rows")

    cities = preprocess_dataset(df)

    split_and_save(cities, output_dir, random_state)

    logger.info("Done.")

if __name__ == "__main__":
    main()

It is very simple: define the main function with the mentioned arguments and decorate it using click. In the body, just pass the relevant parameters to the functions we wrote and let them do all the work.
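For example, assuming you run the script from the project root and want the prepared files under data/dataset/cities, the invocation could look like this:

python scripts/build_dataset.py -i data/raw/us-city-place-names.csv -o data/dataset/cities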

Summary

In this post, we performed the EDA and created a script for cleaning and preparing the dataset.

Full script

The complete script is as follows:

import logging
from pathlib import Path
from typing import List

import click
import pandas as pd
from sklearn.model_selection import train_test_split

logger = logging.getLogger(__name__)


def preprocess_dataset(df: pd.DataFrame) -> List[str]:
    """Preprocesses the cities dataset by removing entries with invalid characters

    Args:
        df (pd.DataFrame): Source dataframe, requires column "city"

    Returns:
        List[str]: A list of cleaned cities
    """
    df = df.drop_duplicates("city")
    logger.info(f"Rows after deduplication: {len(df)}")

    cities = df["city"]
    idx_to_drop = cities.str.contains(r"ñ|\/|\(|\)")
    cities = cities[~idx_to_drop]
    logger.info(f"Dropped {idx_to_drop.sum()} outliers")

    return list(cities)


def split_and_save(cities: List[str], out_dir: Path, random_state: int) -> None:
    """Splits dataset into train/val/test and saves to text files

    Args:
        cities (List[str]): List of cleaned cities
        out_dir (Path): Output directory. Will be created if not exists
        random_state (int): Random state for splitting
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    logger.info(f"Saving results to {out_dir}")

    with (out_dir / "full.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(cities))

    keep_set, test_set = train_test_split(
        cities, test_size=0.2, random_state=random_state
    )

    train_set, val_set = train_test_split(
        keep_set, test_size=0.2, random_state=random_state + 1
    )

    with (out_dir / "train.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(train_set))
    with (out_dir / "val.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(val_set))
    with (out_dir / "test.txt").open("wt", encoding="utf-8") as f:
        f.write("\n".join(test_set))


@click.command()
@click.option("--input-path", "-i", required=True)
@click.option("--output-dir", "-o", required=True)
@click.option("--encoding", default="ISO-8859-1")
@click.option("--random-state", type=int, default=42)
def main(input_path: str, output_dir: str, encoding: str, random_state: int = 42):
    logging.basicConfig(level=logging.INFO)

    input_path = Path(input_path)
    output_dir = Path(output_dir)
    logger.info(f"Reading file {input_path} with encoding {encoding}")

    df = pd.read_csv(input_path, encoding=encoding)
    logger.info(f"Read {len(df)} rows")

    cities = preprocess_dataset(df)

    split_and_save(cities, output_dir, random_state)

    logger.info("Done.")


if __name__ == "__main__":
    main()
Info
Instead of print, the script uses logger.info calls. I will write a separate article on how and why to use logging. For now, you can use print if you prefer.
Info
If you liked this post, consider joining the newsletter.