Unlock the Power of Generative AI: Mastering Personalized Model Development
#7 Generative AI with LLMs: A Step-by-Step Guide to Fine-Tuning GPT-2 with Custom Data Using Transformers
Introduction
The impact of OpenAI's Generative Pre-trained Transformer (GPT) series on the large language model revolution has been undeniable. In this blog, we delve into fine-tuning the small variant of GPT-2 using Hugging Face's Transformers library, a process crucial for tailoring these models to specific language tasks.
Decoding GPT-2 & Transformers: The Power Duo in Language Modelling
The GPT-2 model is a prominent example of a decoder-based language model developed by OpenAI. Transformer models, like GPT-2, are distinguished by their ability to process words in relation to all other words in a sentence. This contrasts with earlier models that processed text in sequential order.
Hugging Face's Transformers library is a comprehensive suite that simplifies using thousands of pre-trained models for various natural language processing tasks. It bridges the gap between these complex models and developers, providing tools for easy implementation, customization, and deployment of many open-source AI models, including GPT-2. The library is widely recognized for its user-friendly interface and extensive documentation, making cutting-edge AI accessible to both novice and expert practitioners. The combination of GPT-2's advanced capabilities and the Transformers library's ease of use offers an unparalleled toolkit for developing sophisticated language processing applications.
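As a quick illustration of that ease of use, the short snippet below (a minimal sketch; the prompt and generation settings are arbitrary choices for demonstration) loads GPT-2 through the library's pipeline API and generates a continuation for a prompt:
from transformers import pipeline

# Load GPT-2 behind the high-level text-generation pipeline
generator = pipeline("text-generation", model="gpt2")

# Generate a short continuation for an arbitrary prompt
result = generator("Large language models are", max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])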
Language Modelling Deep Dive
In this section, let’s dive deep into different stages of fine-tuning the GPT-2 model with a custom dataset.
Setting Up the Development Environment
Begin by preparing your development environment. With Python installed, add Hugging Face's Transformers library and the supporting packages below, which are essential for this endeavour. Use Python's package installer, pip, for a smooth installation:
!pip install transformers
!pip install datasets
!pip install accelerate -U
!pip install apache-beam
Additionally, consider leveraging Google Colab for your project. Google Colab provides a cloud-based platform that offers free access to powerful GPUs, which can significantly accelerate the training and fine-tuning of models like GPT-2. This makes it an ideal choice, especially if you have limited local resources or require advanced computational capabilities. With these tools and platforms at your disposal, you are now equipped to dive into fine-tuning the GPT-2 language model. Google Colab can be accessed at https://colab.research.google.com/.
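If you do work in Colab, it can help to confirm that a GPU runtime is actually active before training; the optional check below is a small sketch using PyTorch:
import torch

# Optional sanity check: confirm a GPU is visible (e.g., on a Colab GPU runtime)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))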
Delving into the Wikipedia Dataset
For fine-tuning GPT-2, the Wikipedia dataset from Hugging Face offers a comprehensive collection of articles, ideal for training language models. This dataset provides a diverse range of topics and writing styles, making it perfect for enhancing the linguistic understanding and generative capabilities of GPT-2. Load the dataset from the datasets library:
from datasets import load_dataset
dataset = load_dataset("wikipedia", "20220301.en")
We can view the contents of the dataset by printing the structure.
print(dataset)
Output:
DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 6458670
    })
})
Given the vast dataset size, which includes over 6 million rows, we'll focus on a subset for this tutorial. Training on the entire dataset would demand extensive computational resources.
# Choose a data subset
data_subset = dataset['train'].select(range(1000))
print(data_subset)
# Split the dataset into training and validation sets
split = data_subset.train_test_split(test_size=0.1)
train_dataset = split['train']
val_dataset = split['test']
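Before tokenizing, it can be useful to inspect a single record to see what the articles look like; the snippet below simply prints the title and the first few hundred characters of one training example (the index is arbitrary):
# Peek at one example from the training split
sample = train_dataset[0]
print(sample["title"])
print(sample["text"][:300])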
Loading the GPT-2 Model & Tokenizer
For this task, let’s use the small variant of the GPT-2 model. Load the GPT-2 model and its tokenizer from Hugging Face's Transformers library:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default; reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
Now, let’s process the Wikipedia dataset to prepare it for model training.
Data preprocessing: Tokenization
Before diving into the fine-tuning process, it is essential to prepare the Wikipedia dataset appropriately. This preparation involves tokenizing the text data and converting it into a format the GPT-2 model can understand and process efficiently. We use the GPT2Tokenizer loaded earlier to tokenize the text, breaking it down into smaller pieces called tokens that align with the model's vocabulary. Once tokenized, the dataset is ready for model training, ensuring that the input data aligns with the internal workings of GPT-2. This step is crucial for effective training and fine-tuning of the model on the dataset.
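To see what the tokenizer produces, here is a small illustrative example on a single sentence (the sentence itself is arbitrary):
# Illustrative example: tokenize a single sentence
sample_text = "GPT-2 is a decoder-based language model."
print(tokenizer.tokenize(sample_text))      # sub-word tokens
print(tokenizer(sample_text)["input_ids"])  # corresponding token IDs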
With the training and validation splits created earlier, let’s tokenize both before proceeding with model training.
# Function to tokenize the text
def tokenize_function(inputs):
    return tokenizer(inputs['text'],
                     padding="max_length",
                     max_length=512,
                     truncation=True,
                     return_overflowing_tokens=True,
                     return_length=True)

# Apply the tokenization to the dataset splits, dropping the raw columns so that
# the overflowing chunks do not clash with the original column lengths
tokenized_train = train_dataset.map(tokenize_function, batched=True,
                                    remove_columns=train_dataset.column_names)
tokenized_valid = val_dataset.map(tokenize_function, batched=True,
                                  remove_columns=val_dataset.column_names)
Fine-Tuning GPT-2 with the Wikipedia Dataset: Model Training
In the fine-tuning process, the TrainingArguments and Trainer classes from the Transformers library play pivotal roles. We can implement them in code as follows:
import torch
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

# Data collator: builds language-modelling labels from the input IDs (mlm=False means causal LM)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy="epoch",
    save_steps=10_000,
    save_total_limit=2,
    report_to="none"
)

# Custom function to compute perplexity from the evaluation predictions
def compute_perplexity(eval_pred):
    logits, labels = eval_pred
    logits = torch.tensor(logits)
    labels = torch.tensor(labels)
    # Shift so that each position predicts the next token
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = torch.nn.CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    return {"perplexity": torch.exp(loss).item()}
A more detailed explanation is provided below.
TrainingArguments: We define various parameters for training our model. Some of the important parameters of this class are listed here (an illustrative sketch using these parameters follows the list):
- Output Directory (output_dir): Specifies where to save the model.
- Number of Epochs (num_train_epochs): Sets how many times the model will see the entire dataset.
- Batch Sizes (per_device_train_batch_size, per_device_eval_batch_size): Determine the number of samples processed before the model is updated.
- Evaluation and Save Steps (eval_steps, save_steps): Define the frequency of evaluation and model saving.
- Warmup Steps (warmup_steps): Adjusts the learning rate in the initial training phase.
- Logging Directory (logging_dir): Location for storing training logs.
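For reference, a TrainingArguments configuration that uses the additional parameters mentioned above might look like the following sketch (the values are illustrative, not tuned, and differ slightly from the configuration used in this tutorial):
# Illustrative sketch: TrainingArguments with the parameters discussed above
example_args = TrainingArguments(
    output_dir="./results",                  # where checkpoints are saved
    num_train_epochs=3,                      # passes over the training data
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy="steps",             # evaluate every eval_steps
    eval_steps=500,
    save_steps=10_000,
    warmup_steps=100,                        # learning-rate warmup at the start
    logging_dir="./logs",                    # where training logs are written
)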
Trainer: The Trainer is a powerful class in the Transformers library that abstracts away much of the training loop. Some essential parameters passed to this class are:
- Model (model): The pre-trained GPT-2 model to be fine-tuned.
- Training Arguments (args): The TrainingArguments instance detailed above.
- Data Collator (data_collator): Batches the tokenized examples and builds the language-modelling labels.
- Training Dataset (train_dataset): The subset of the Wikipedia dataset used for training.
- Evaluation Dataset (eval_dataset): The subset of the Wikipedia dataset used for evaluation.
- Metrics Function (compute_metrics): The compute_perplexity function defined above, used to evaluate model performance with the perplexity metric.
This structured approach to configuring the TrainingArguments and Trainer ensures an optimized environment for effectively fine-tuning the GPT-2 model on the chosen dataset.
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    compute_metrics=compute_perplexity,
)
# Training the model
trainer.train()
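After training, you can run an evaluation pass to inspect the validation loss and the perplexity metric defined above (note that accumulating logits for compute_metrics can be memory-hungry for large validation sets):
# Evaluate the fine-tuned model on the validation split
eval_results = trainer.evaluate()
print(eval_results)  # includes eval_loss and eval_perplexity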
Save the Model
Once the model is trained, we can save it for further fine-tuning or for running inference on new data.
# Save the model and tokenizer
model.save_pretrained("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")
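To use the saved model for inference, you can reload it together with the tokenizer and generate text; a minimal sketch, assuming the directory name used above (the prompt and generation settings are arbitrary):
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Reload the fine-tuned model and tokenizer from the saved directory
model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-finetuned")

# Generate a continuation for a sample prompt
inputs = tokenizer("The history of artificial intelligence", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))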
Conclusion
In this blog, we explored how to prepare and tokenize a dataset for causal language modelling and how to fine-tune the pre-trained GPT-2 model using the Hugging Face Transformers library. Using the code and method above, we can fine-tune any model on any dataset of our choice, simply by replacing the model checkpoint and the dataset source. Just ensure you have enough compute available in terms of GPU and memory.
Thank you for reading!