The Ultimate Guide to Preparing Text Data for Language Modeling with PyTorch
Master tokenization, Byte Pair Encoding, sampling windows, and embeddings
Introduction
When working with large language models (LLMs), one of the most crucial steps is preparing the textual data in a format that these models can understand and learn from. This process involves converting raw text into numerical vectors, known as embeddings, as LLMs cannot directly process plain text.
In this post, we'll take a deep dive into the techniques and best practices for text preprocessing and embedding generation using PyTorch, a popular deep learning framework. We'll cover everything from basic tokenization to implementing advanced algorithms like Byte Pair Encoding (BPE), creating efficient data sampling techniques, and building embedding layers from scratch. By the end, you'll have a solid understanding of how to prepare text data for training powerful language models. To explore the concepts further and see the code in action, check out the accompanying Colab notebook here and follow along with the step-by-step examples.
Let's get started!
Understanding Embeddings: The Bridge Between Text and Mathematics
Before we dive into the technical implementation details, let's understand what embeddings are and why they're crucial for language models. Think of embeddings as a way to translate words into numbers – but not just any numbers. They're carefully crafted numerical representations that capture the meaning, relationships, and context of words in a way that computers can process.

What Are Embeddings and Why Do We Need Them?
At their core, embeddings are dense vectors (arrays of numbers) that represent words or tokens in a continuous vector space. When you feed the word "cat" to a computer, you can't just use the letters "c-a-t" - computers need numbers to perform calculations. An embedding transforms "cat" into a vector like [0.2, -0.5, 0.8, ...], where each number helps represent different aspects of the word's meaning.
What makes embeddings powerful is their ability to capture semantic relationships. Words with similar meanings end up having similar numerical representations. For example, the embeddings for "cat" and "kitten" would be more similar to each other than to the embedding for "submarine". This similarity can be measured mathematically, allowing models to understand relationships between words.
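To make "measured mathematically" concrete, here is a minimal sketch using cosine similarity on made-up 3-dimensional vectors. Real embeddings are learned and have hundreds of dimensions; the numbers below are purely illustrative:
import torch
import torch.nn.functional as F

# Purely illustrative 3-dimensional "embeddings" (real ones are learned, not hand-written)
cat = torch.tensor([0.8, 0.1, 0.3])
kitten = torch.tensor([0.7, 0.2, 0.4])
submarine = torch.tensor([-0.5, 0.9, -0.2])

print(F.cosine_similarity(cat, kitten, dim=0))     # high similarity (close to 1)
print(F.cosine_similarity(cat, submarine, dim=0))  # much lower similarity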
Modern embedding systems typically represent words in high-dimensional spaces. For example:
GPT-2 uses 768-dimensional embeddings for its smallest model
GPT-3's largest model uses 12,288-dimensional embeddings
BERT-base uses 768-dimensional embeddings
The real power of embeddings comes from their ability to learn from data. During model training, these embeddings are automatically adjusted to capture relationships present in the training data, adapting to specific domains and discovering nuanced patterns that might not be obvious to human designers.
Text Tokenization and Preprocessing Techniques
The first step in preparing text for LLMs is tokenization - breaking down raw text into smaller units called tokens. Tokens can be individual words, subwords, or even characters. The goal is to create a finite set of meaningful units that the model can learn from.

However, raw text often contains noise and inconsistencies that can hinder the tokenization process. These include:
Inconsistent casing (e.g., "Hello" vs "hello")
Punctuation attached to words (e.g., "world!")
Special characters and contractions (e.g., "don't", "U.S.A.")
Unknown or rare words
To handle these issues and perform effective tokenization, we can use a combination of text preprocessing techniques and regular expressions in Python. Here's an example code snippet that demonstrates this:
import re

UNK = '<unk>'  # Token for unknown words
EOS = '<eos>'  # Token for end of text

def tokenize(text, known_words):
    # Lowercase the text
    text = text.lower()
    # Split on whitespace and punctuation using regular expressions
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Replace unknown words with <unk> token
    tokens = [t if t in known_words else UNK for t in tokens]
    # Append <eos> token to the end of the text
    tokens.append(EOS)
    return tokens

# Example usage
text = "Hello, world! This is a sample sentence."
known_words = {'this', 'is', 'a', 'sample', 'sentence'}
print(tokenize(text, known_words))
['<unk>', '<unk>', '<unk>', '<unk>', 'this', 'is', 'a', 'sample', 'sentence', '<unk>', '<eos>']
Let's break down the tokenization process step by step:
First, we convert the entire text to lowercase using text.lower(). This ensures consistent casing across all words.
Next, we use the regular expression r"\w+|[^\w\s]" to split the text on whitespace and punctuation. The pattern \w+ matches one or more word characters, while [^\w\s] matches any single character that is not a word character or whitespace. This effectively separates words and punctuation into individual tokens.
We then replace any unknown words (i.e., words not in the known_words set) with a special <unk> token. This helps the model handle out-of-vocabulary words gracefully during training and inference.
Finally, we append an <eos> token to the end of the tokenized text to mark the end of the sequence. This helps the model learn where a text or document ends.
Understanding and Implementing Byte Pair Encoding
While the tokenization approach we discussed so far works well in many cases, it has some limitations. One major drawback is the handling of unknown or rare words. Replacing all uncommon words with a generic <unk> token can lead to loss of information and hinder the model's ability to understand the nuances of the text.

This is where Byte Pair Encoding (BPE) comes into play. BPE is a subword tokenization algorithm that iteratively builds a vocabulary of subword units based on their frequency in the training corpus. It starts with individual characters and progressively merges them into larger subword units until a desired vocabulary size is reached. This allows BPE to effectively handle out-of-vocabulary words by representing them as combinations of subword units.
Let's walk through a step-by-step example to better understand how BPE constructs its vocabulary. Imagine we have the following list of words:
['low', 'lower', 'newest', 'widest']
Step 1: Initialization
BPE begins by splitting each word into individual characters and appending a special end-of-word symbol, typically denoted by </w>, to mark the end of each word. This initial segmentation looks like this:
['l o w </w>', 'l o w e r </w>', 'n e w e s t </w>', 'w i d e s t </w>']
Step 2: Frequency Counting
Next, BPE counts the frequency of each adjacent symbol pair in the corpus. In this example, the pair e followed by s is among the most frequent: it appears in two words, newest and widest (several pairs are tied at two occurrences here, so assume the tie is broken in favor of e s).
Step 3: Merging
BPE merges the most frequent pair into a new subword unit. In our example, e and s become the single unit es, which is now treated as a single token in the vocabulary:
['l o w </w>', 'l o w e r </w>', 'n e w es t </w>', 'w i d es t </w>']
Step 4: Iteration
The process of frequency counting and merging is repeated iteratively. In the next iteration, the most frequent pair is es followed by t (again appearing in newest and widest), so they get merged into est:
['l o w </w>', 'l o w e r </w>', 'n e w est </w>', 'w i d est </w>']
This iterative process continues until one of two conditions is met:
A desired vocabulary size is reached (e.g., 10,000 subword units).
No more frequent pairs are found (i.e., all possible merges have been performed).
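Here is a minimal sketch of this merge loop in plain Python, assuming the corpus is represented as a dictionary mapping character-split words (with </w> as a separate symbol) to their frequencies. Ties between equally frequent pairs are broken arbitrarily here, so the printed merge order may differ from the walkthrough above; this is meant to illustrate the algorithm, not replace an optimized tokenizer:
from collections import Counter

def get_pair_counts(words):
    # Count how often each adjacent symbol pair occurs across all words
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols[:-1], symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in words.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged[tuple(new_symbols)] = freq
    return merged

# Toy corpus: each word split into characters plus the end-of-word marker
words = {
    ('l', 'o', 'w', '</w>'): 1,
    ('l', 'o', 'w', 'e', 'r', '</w>'): 1,
    ('n', 'e', 'w', 'e', 's', 't', '</w>'): 1,
    ('w', 'i', 'd', 'e', 's', 't', '</w>'): 1,
}

for step in range(3):  # in practice, merge until the target vocabulary size is reached
    pair_counts = get_pair_counts(words)
    if not pair_counts:
        break
    best_pair = max(pair_counts, key=pair_counts.get)
    words = merge_pair(words, best_pair)
    print(f"Merge {step + 1}: {best_pair} -> {''.join(best_pair)}")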
The resulting set of subword units, along with their frequencies, forms the final BPE vocabulary. To further illustrate how BPE handles out-of-vocabulary words, let's consider an example. Suppose we have a BPE vocabulary that includes the subword units low, est, and </w>, but not the word lowest. When encountering lowest, BPE would break it down into the known subword units:
['low', 'est', '</w>']
By representing lowest as a combination of subword units, BPE enables the model to process and generate words it hasn't seen during training.
Byte Pair Encoding in Python
Now, let's see how we can implement BPE in Python using the tiktoken library. tiktoken is an open-source Python library (https://github.com/openai/tiktoken) that implements the BPE algorithm very efficiently, with its core written in Rust. It can be installed as follows:
pip install tiktoken
import tiktoken
# Load the BPE tokenizer
bpe_tokenizer = tiktoken.get_encoding("gpt2")
text = "This is an example of byte pair encoding! xhsbfubs"
# Tokenize the text using BPE
tokens = bpe_tokenizer.encode(text)
decoded = bpe_tokenizer.decode(tokens)
print(f"Encoded tokens: {tokens}")
print(f"Decoded text: {decoded}")
# Encoded tokens: [1212, 318, 281, 1672, 286, 18022, 5166, 21004, 0, 2124, 11994, 19881, 23161]
# Decoded text: This is an example of byte pair encoding! xhsbfubs
The tiktoken library provides an efficient implementation of the BPE algorithm used by OpenAI's GPT models. We first load the BPE tokenizer with tiktoken.get_encoding("gpt2"), which gives us access to the same tokenizer used by the GPT-2 model.
We then encode our text using bpe_tokenizer.encode(text), which applies the BPE algorithm and returns a list of token IDs. These IDs correspond to the subwords in the BPE vocabulary.
Finally, we can decode the token IDs back into the original text using bpe_tokenizer.decode(tokens). This demonstrates that BPE can tokenize and reconstruct the text without losing information.
The real power of BPE lies in its ability to handle out-of-vocabulary words. Since it breaks down words into subwords, even if a word is not explicitly present in the vocabulary, it can still be represented by a combination of subwords. This allows the model to understand and generate words it hasn't seen during training. By understanding and implementing Byte Pair Encoding, you can take your text preprocessing to the next level and build more powerful and versatile language models.
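As a quick way to see this in action, you can decode each token ID from the earlier example individually to inspect the subword pieces the GPT-2 tokenizer chose for the made-up word. The exact split depends on the learned merges, so treat the output as illustrative:
# Decode each token ID on its own to reveal the subword pieces
word_ids = bpe_tokenizer.encode(" xhsbfubs")
print([bpe_tokenizer.decode([tid]) for tid in word_ids])
# Prints the individual subword strings that together spell " xhsbfubs"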
Creating and Managing Sampling Windows
Now that we have our text data tokenized into a sequence of token IDs, the next step is to prepare it for training our language model. But how exactly do we feed this data to the model?
To answer that, let's first understand how language models like GPT learn. During training, the model tries to predict the next token in a sequence given the tokens that come before it. For example, if the input is "The cat sat on the", the model learns to predict the next most likely token, such as "mat" or "couch".
To facilitate this learning process, we need to create input-target pairs from our tokenized text. The input will be a sequence of tokens, and the target will be the next token that follows this sequence. We can generate these pairs using a technique called sampling windows. The size of the sampling window is also commonly referred to as the context length.
Imagine our tokenized text as a long ribbon. We take a small window of a fixed size (say, 50 tokens) and slide it over the ribbon. At each step, the tokens inside the window become our input, and the token immediately following the window becomes the target. (In the implementation below, the target is actually the whole input sequence shifted one token to the right, which gives the model a next-token target at every position in the window.) We keep sliding the window until we reach the end of the ribbon.
Here's a visual representation:
Tokens:    [The, cat, sat, on, the, mat, ., <eos>]
Window 1:  [The, cat, sat, on, the]   → target: mat
Window 2:  [cat, sat, on, the, mat]   → target: .
Window 3:  [sat, on, the, mat, .]     → target: <eos>
In window 1, the input is [The, cat, sat, on, the] and the target is mat. In window 2, the input is [cat, sat, on, the, mat] and the target is the period. And so on.
By creating these sampling windows, we break down our long text into manageable sequences that the model can learn from. The size of the window is a hyperparameter that we can tune. A larger window allows the model to learn from more context, but it also increases the computational complexity.
Now, let's see how we can implement this in Python. We'll use PyTorch's Dataset and DataLoader classes to create an efficient data pipeline.
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, tokens, window_size):
        # Store the tokenized text
        self.tokens = tokens
        # Store the size of the sampling window
        self.window_size = window_size

    def __len__(self):
        # Return the total number of sampling windows
        return len(self.tokens) - self.window_size

    def __getitem__(self, idx):
        # Get the input-target pair for the given index
        input_seq = self.tokens[idx:idx + self.window_size]           # Input sequence
        target_seq = self.tokens[idx + 1:idx + self.window_size + 1]  # Target sequence (shifted by 1)
        return torch.tensor(input_seq), torch.tensor(target_seq)

# Example usage
tokens = [1212, 318, 281, 1672, 286, 2419, 683, 26254, 0]    # Tokenized text
dataset = TextDataset(tokens, window_size=5)                 # Dataset with a window size of 5
dataloader = DataLoader(dataset, batch_size=2, shuffle=True) # Batch size of 2 with shuffling enabled

for inputs, targets in dataloader:
    print(inputs)   # Print the input sequences
    print(targets)  # Print the corresponding target sequences
    break           # Stop after the first batch (for demonstration purposes)
tensor([[ 1672, 286, 2419, 683, 26254],
[ 1212, 318, 281, 1672, 286]])
tensor([[ 286, 2419, 683, 26254, 0],
[ 318, 281, 1672, 286, 2419]])
Let's break this down step by step:
We define a custom TextDataset class that inherits from PyTorch's Dataset class. This class takes the tokenized text and the window size as input.
The __len__ method returns the total number of sampling windows we can create from the tokenized text. We subtract the window size to avoid going out of bounds.
The __getitem__ method is the heart of the dataset. It takes an index idx and returns the input-target pair for the corresponding sampling window. The input is tokens[idx:idx+window_size] and the target is tokens[idx+1:idx+window_size+1], i.e., the input sequence shifted by one token.
We then create an instance of the TextDataset with our tokenized text and a window size of 5.
We wrap the dataset in a DataLoader, which allows us to batch the data and shuffle it for training. Here, we use a batch size of 2.
Finally, we loop over the dataloader to get batches of input-target pairs. Each input is a tensor of shape (batch_size, window_size), and each target is a tensor of the same shape.
Now consider the following code, which demonstrates best practices for creating datasets and dataloaders:
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        # Tokenize the entire text
        token_ids = tokenizer.encode(text)
        # Create overlapping sequences
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader(text, batch_size=4, max_length=256, stride=128):
    """Create an efficient data loader for training"""
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDataset(text, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        drop_last=True,   # Drop the last incomplete batch for stable batch shapes
        num_workers=4,    # Load batches in parallel worker processes
        pin_memory=torch.cuda.is_available()  # Speed up host-to-GPU transfers when a GPU is present
    )
    return dataloader
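As a quick sanity check, here is how the loader above might be used. The sample text is arbitrary, and the guard around the call matters on some platforms because num_workers=4 spawns worker processes:
if __name__ == "__main__":
    # Arbitrary sample text, repeated so there are enough tokens to form batches
    sample_text = "The quick brown fox jumps over the lazy dog. " * 200
    loader = create_dataloader(sample_text, batch_size=2, max_length=8, stride=4)
    inputs, targets = next(iter(loader))
    print(inputs.shape)   # torch.Size([2, 8])
    print(targets.shape)  # torch.Size([2, 8])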
Building Token Embeddings from Scratch
So far, we've seen how to preprocess text data and convert it into sequences of token IDs using techniques like tokenization and Byte Pair Encoding. The next crucial step is to transform these discrete token IDs into continuous vector representations, known as embeddings.
Embeddings are dense, low-dimensional vectors that capture semantic and syntactic information about the tokens. By representing tokens as embeddings, we enable the language model to learn meaningful relationships and patterns in the text data.
In PyTorch, we can create embeddings using the nn.Embedding layer. This layer maps each token ID to a corresponding vector of a specified size.
Here's an example of how to create an embedding layer in PyTorch:
import torch
import torch.nn as nn
vocab_size = 10000 # Size of the vocabulary (number of unique tokens)
embed_size = 128 # Dimensionality of the embedding vectors
embedding_layer = nn.Embedding(vocab_size, embed_size)
In this code snippet, we define an embedding layer with a vocabulary size of 10,000 and an embedding size of 128. This means that each token ID will be mapped to a 128-dimensional vector.
To use the embedding layer, we simply pass the token IDs through it:
token_ids = torch.tensor([1, 2, 3, 4]) # Example token IDs
embeddings = embedding_layer(token_ids)
print(embeddings.shape)
# Output: torch.Size([4, 128])
Here, we pass a tensor of token IDs through the embedding layer, and it returns the corresponding embeddings. The resulting embeddings tensor has a shape of (4, 128), indicating that we have 4 tokens, each represented by a 128-dimensional vector.
But how does the embedding layer know what values to assign to each token's embedding vector? Initially, the embedding layer is randomly initialized. During the training process, the language model learns to adjust these embeddings based on the patterns and relationships in the text data.
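You can verify this by inspecting the layer's weight matrix: it is an ordinary learnable parameter of shape (vocab_size, embed_size), updated by the optimizer like any other weight:
print(embedding_layer.weight.shape)          # torch.Size([10000, 128])
print(embedding_layer.weight.requires_grad)  # True -- adjusted during training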
However, the token embedding layer represents each word independently, regardless of where it appears in the sequence. This is where positional embeddings come in: they help the model understand where each word appears in the sequence.
Adding Positional Embeddings
The self-attention mechanism in transformer models is inherently position-agnostic. When looking at token embeddings alone, the model has no way to know whether "cat" appears at the beginning, middle, or end of the sentence. Positional embeddings solve this by adding position-specific information to each token embedding.
Think of it this way: if token embeddings tell us "what" the word is, positional embeddings tell us "where" it appears. When we combine them, the model gets both pieces of information simultaneously.
Implementing Positional Embeddings
Let's now add positional embeddings to our token embeddings step by step. We choose max_sequence_length based on how long our input sequences might be:
# Define max_sequence_length as 512
max_sequence_length = 512
# Create positional embedding layer
position_embedding = nn.Embedding(max_sequence_length, embed_size)
Now, generate position indices for our sequence:
# If our token_ids has length 4, we need positions [0, 1, 2, 3]
positions = torch.arange(len(token_ids))
print(f"Position indices: {positions}")
# Output: Position indices: tensor([0, 1, 2, 3])
Now get the embeddings for these positions or indices:
# Get embeddings from position_embeddings layer
position_embeddings = position_embedding(positions)
print(f"Position embedding shape: {position_embeddings.shape}")
# Output: Position embedding shape: torch.Size([4, 128])
Now combine both token embeddings and positional embeddings:
combined_embeddings = embeddings + position_embeddings
print(f"Combined embedding shape: {combined_embeddings.shape}")
# Output: Combined embedding shape: torch.Size([4, 128])
Implementing Embeddings (Best Practice)
Let's implement a complete embedding system that combines both token and positional embeddings:
# Best-practices implementation
class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, embedding_dim, max_sequence_length):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.position_embedding = nn.Embedding(max_sequence_length, embedding_dim)
        # Scale factor applied to token embeddings (square root of the embedding dimension)
        self.scale = embedding_dim ** 0.5
        # Initialize weights using a normal distribution
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        nn.init.normal_(self.position_embedding.weight, std=0.02)

    def forward(self, token_ids):
        # token_ids has shape (batch_size, seq_len)
        seq_len = token_ids.size(1)
        # Apply scaling to token embeddings
        token_embeddings = self.token_embedding(token_ids) * self.scale
        # Position indices [0, 1, ..., seq_len - 1] on the same device as the input
        position_ids = torch.arange(seq_len, device=token_ids.device)
        # Add positional embeddings (broadcast across the batch dimension)
        return token_embeddings + self.position_embedding(position_ids)
Let's break down how this works:
Token Embeddings: Each word gets transformed into a dense vector through the token_embedding layer, just as we discussed earlier.
Position Numbers: We create a sequence of position indices (0, 1, 2, ...) for each position in our input sequence.
Position Embeddings: These indices get transformed into position-specific vectors through the position_embedding layer.
Combination: We add the token and positional embeddings together. This addition preserves both the meaning of the word (from token embeddings) and its position (from positional embeddings).
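To tie it together, here is a short, hypothetical usage of the EmbeddingLayer above; the vocabulary size, dimensions, and random token IDs are just placeholders:
embedding = EmbeddingLayer(vocab_size=50257, embedding_dim=768, max_sequence_length=1024)
token_batch = torch.randint(0, 50257, (2, 16))  # (batch_size, seq_len) of random token IDs
output = embedding(token_batch)
print(output.shape)  # torch.Size([2, 16, 768])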
Wrapping Up
In this comprehensive guide, we've explored the fundamental building blocks of text preprocessing for language modeling. We started by diving into tokenization techniques, learning how to break down raw text into meaningful units while handling challenges like punctuation, casing, and special characters. Next, we discovered the power of Byte Pair Encoding (BPE) for creating subword vocabularies that effectively handle rare and unknown words. We then learned how to construct efficient sampling windows to prepare tokenized text for training, and finally, we built token embeddings from scratch using PyTorch, incorporating positional information to capture word order and context.
Remember, the techniques and concepts we've discussed are not just theoretical - they have practical applications in a wide range of natural language processing tasks, such as language translation, text summarization, sentiment analysis, and more. By mastering these fundamentals, you'll be equipped to tackle complex language modeling challenges and build impressive AI systems.
Thanks for reading NeuraForge: AI Unleashed!
If you enjoyed this deep dive into AI/ML concepts, please consider subscribing to our newsletter for more technical content and practical insights. Your support helps grow our community and keeps the learning going! Don't forget to share with peers who might find it valuable. 🧠✨