Language Models [02] - EDA
We start the implementation with exploratory data analysis and a project template.
This post is part of a series:
- Introduction and theory
- Dataset exploratory analysis (you are here)
- Tokenizer
- Training the model
- Evaluation of language model
- Experiments with the model
- Exercises for you
Project template
If you want to code along (I strongly recommend such an approach), you can use any project template you like. The template I will use is the following:
Going top-down:
- char_lm - directory with the model's package
- data - all data. It contains multiple subdirectories:
  - raw - data as downloaded/scraped/extracted. Read only, no manual manipulations allowed!
  - interim - intermediate formats, after preprocessing etc.
  - dataset - datasets after preparation, ready to be used by models
- models - storing trained models
- notebooks - Jupyter notebooks used for exploration/rough experimentation. No other use allowed!
- scripts - utility scripts that are not part of the model package. The distinction between scripts and char_lm is pretty arbitrary
- tests - unit tests. Yes, we are going to write them.
- .pre-commit-config.yaml - I use pre-commit to help me keep up code quality. Optional
- poetry.lock / pyproject.toml - I use poetry for installing packages. I encourage you to give it a try, but it is fine to stick with conda/pip
You can download the complete project (once it is completed, a link will appear here).
Dataset
As I stated in the first part, we will start with a character language model and switch to subwords later. For now, a suitable dataset is US Census: City and Place Names. Download and extract it to data/raw/us-city-place-names.csv. Spend a few minutes familiarizing yourself with the data, and then we will move on to exploration.
The complete notebook is available on GitHub: https://github.com/artofai/char-lm/blob/master/notebooks/cities-eda.ipynb
EDA
The first issue is encoding. If we try to load the file with default pd.read_csv settings, it will throw a Unicode exception. Unfortunately, there is no information about the encoding on the dataset page, so I did a small investigation, and it looks like ISO-8859-1 (or Latin-1 - it's the same) is the correct one.
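Loading then boils down to passing the encoding explicitly. A minimal sketch (the file path follows the project layout above; the variable name df is mine):

```python
import pandas as pd

# Latin-1 (ISO-8859-1) was identified during the investigation above
df = pd.read_csv("data/raw/us-city-place-names.csv", encoding="ISO-8859-1")
df.head()
```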
There are three columns. We are interested only in city. Interestingly, we could also utilize the information about the state, but let's leave that for another time.
There are no nulls (yay), and after deduplicating city, we have about 25,000 examples - decent.
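A minimal sketch of these checks (variable names are my own):

```python
# Any missing values per column?
print(df.isnull().sum())

# Deduplicate on the column we care about
cities = df["city"].drop_duplicates()
print(f"{len(cities)} unique city names")
```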
Analysis of city lengths (in characters):
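A sketch of the length analysis, continuing from the cities series above (the plot assumes matplotlib is available, as in a notebook):

```python
# Length of each name in characters
lengths = cities.str.len()
print(lengths.describe())

# Histogram of the length distribution
lengths.plot.hist(bins=30, title="City name lengths (characters)")
```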
The distribution looks pretty nice - a skewed normal distribution with a bit of a tail. Let's take a look at the longest examples:
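For example, sorting by length and peeking at the top (a sketch using the lengths series from above):

```python
# The ten longest names - outliers tend to hide at the tail
print(cities.loc[lengths.nlargest(10).index])
```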
I don't live in the US, so names with (balance) are a mystery to me. I will remove them later.
Next, let’s take a look at character distributions:
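A sketch of the character count, again over the cities series:

```python
from collections import Counter

# Count every character across all names
char_counts = Counter("".join(cities))
for char, count in char_counts.most_common():
    print(f"{char!r}: {count}")
```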
Looks good, except for the characters ñ and /.
Removing outliers
According to our analysis, we are going to remove entries containing any of ()ñ/.
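A sketch of that filter together with the sanity check discussed below (the character class mirrors the list above):

```python
# Entries containing any of the outlier characters: ( ) ñ /
mask = cities.str.contains(r"[()ñ/]", regex=True)

# Sanity check: how many entries, and which ones, will be dropped
print(f"Removing {mask.sum()} entries")
print(cities[mask].tolist())

cleaned_cities = cities[~mask]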
It is always good to do a sanity check - how much data will be removed, and what exactly. This operation will remove 20 examples - for me, the removal list looks great.
At this point, I stop the EDA. In cleaned_cities, there is the list of cities used for modeling.
Build dataset script
We could, in theory, save the dataset from the exploratory notebook. But it would be bad practice - data engineering and exploration shouldn't be mixed. Because of this, let's create a data preprocessing script in scripts/build_dataset.py.
Our knowledge from the EDA can be transferred to a function. We already know what kind of preprocessing we want to apply, so the implementation is straightforward:
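A minimal sketch of such a function (the name clean_cities and the exact signature are my assumptions, consistent with the EDA above):

```python
import pandas as pd

def clean_cities(df: pd.DataFrame) -> pd.Series:
    """Apply the preprocessing decided during EDA."""
    # Unique city names only
    cities = df["city"].drop_duplicates()
    # Drop entries with the outlier characters found during EDA
    mask = cities.str.contains(r"[()ñ/]", regex=True)
    return cities[~mask].reset_index(drop=True)
```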
The second piece is to take the cleaned cities, split them into train/val/test sets, and save them to separate files. I also save the entire dataset at this step - maybe it will be helpful later. To make the process deterministic, a random state can be provided.
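A sketch of the splitting step (the 80/10/10 proportions and the use of scikit-learn here are my assumptions):

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_save(cities: pd.Series, output_dir, random_state=None):
    """Split into train/val/test and save each part, plus the full dataset."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Assumed 80/10/10 split; random_state makes it deterministic
    train, rest = train_test_split(cities, test_size=0.2, random_state=random_state)
    val, test = train_test_split(rest, test_size=0.5, random_state=random_state)

    # One city per line, no index or header
    for name, part in [("train", train), ("val", val), ("test", test), ("full", cities)]:
        part.to_csv(output_dir / f"{name}.txt", index=False, header=False)
```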
The building blocks are ready. We can create a simple command-line interface using click. We want to pass the raw input file and the output directory, and to specify the input file's encoding and a random state.
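A sketch of the CLI wiring (argument and option names are my own; it calls the hypothetical helpers sketched above):

```python
import click
import pandas as pd

@click.command()
@click.argument("input_file", type=click.Path(exists=True))
@click.argument("output_dir", type=click.Path())
@click.option("--encoding", default="ISO-8859-1", help="Encoding of the input file")
@click.option("--random-state", default=42, type=int, help="Seed for the train/val/test split")
def main(input_file, output_dir, encoding, random_state):
    df = pd.read_csv(input_file, encoding=encoding)
    cities = clean_cities(df)
    split_and_save(cities, output_dir, random_state=random_state)

if __name__ == "__main__":
    main()
```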
It is very simple: define a main function with the mentioned arguments and decorate it using click. In the body, just pass the relevant parameters to the functions and let them do all the work.
Summary
In this post, we performed the EDA and created a script for cleaning and preparing the dataset.
Full script
The complete script is the following:
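A condensed sketch assembling the pieces above into scripts/build_dataset.py, with logging in place of print (see the note below); details may differ from the original:

```python
"""Build the cities dataset from the raw CSV (sketch)."""
import logging
from pathlib import Path

import click
import pandas as pd
from sklearn.model_selection import train_test_split

logger = logging.getLogger(__name__)


def clean_cities(df):
    """Apply the preprocessing decided during EDA."""
    cities = df["city"].drop_duplicates()
    mask = cities.str.contains(r"[()ñ/]", regex=True)
    logger.info("Removing %d outlier entries", mask.sum())
    return cities[~mask].reset_index(drop=True)


def split_and_save(cities, output_dir, random_state=None):
    """Split into train/val/test (assumed 80/10/10) and save, plus the full set."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    train, rest = train_test_split(cities, test_size=0.2, random_state=random_state)
    val, test = train_test_split(rest, test_size=0.5, random_state=random_state)
    for name, part in [("train", train), ("val", val), ("test", test), ("full", cities)]:
        path = output_dir / f"{name}.txt"
        part.to_csv(path, index=False, header=False)
        logger.info("Saved %d examples to %s", len(part), path)


@click.command()
@click.argument("input_file", type=click.Path(exists=True))
@click.argument("output_dir", type=click.Path())
@click.option("--encoding", default="ISO-8859-1", help="Encoding of the input file")
@click.option("--random-state", default=42, type=int, help="Seed for the split")
def main(input_file, output_dir, encoding, random_state):
    logging.basicConfig(level=logging.INFO)
    df = pd.read_csv(input_file, encoding=encoding)
    cities = clean_cities(df)
    split_and_save(cities, output_dir, random_state=random_state)


if __name__ == "__main__":
    main()
```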
Note that instead of print there are logger.info calls. I will write a separate article on how and why to use a logger. For now, you can use print if you prefer.