Understanding LLMs: Your Guide to Transformers
#2 Generative AI with LLMs: Understanding the Transformer Architecture - I
Introduction
The Generative AI revolution, powered by LLMs, began with the introduction of the Transformer architecture in 2017. Transformers revolutionized natural language processing and made it possible to build large-scale language models such as BERT and GPT-2, which demonstrated exceptional capabilities in understanding and generating natural language. In this blog, let's explore the algorithms used for text generation before Transformers and then take a brief look at how the model powering all modern LLMs, the Transformer architecture, works.
Text Generation Pre-Transformers: RNNs and LSTMs
Before the advent of the transformer architecture, the most prominent architectures powering text generation applications were RNNs and LSTMs.
RNNs, or Recurrent Neural Networks, are a class of neural networks that process sequential data such as time series or natural text. They process the input one step at a time, from left to right, feeding the output of the previous step into the next step.
LSTMs, or Long Short-Term Memory networks, are a modified form of RNN that can remember essential information from the past. Because both LSTMs and RNNs depend on the time step or index, past data must be processed sequentially before new outputs can be generated for future time steps.
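To make this sequential nature concrete, here is a minimal NumPy sketch of a vanilla RNN's hidden-state update. The weights are random and untrained, and the dimensions are made up purely for illustration; note how each step needs the hidden state from the previous step before it can run.

```python
import numpy as np

# Toy sizes, chosen only for illustration
hidden_size, input_size, seq_len = 4, 3, 5

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights (untrained)
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights (untrained)
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))     # a made-up input sequence
h = np.zeros(hidden_size)                           # initial hidden state

# Each step needs the hidden state from the previous step,
# so the loop cannot be parallelized across time steps.
for t in range(seq_len):
    h = np.tanh(W_xh @ inputs[t] + W_hh @ h + b_h)
    print(f"step {t}: hidden state = {np.round(h, 2)}")
```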
However, these algorithms pose some severe challenges:
RNNs and LSTMs encode information about nearby words better than about words far away in the sentence.
RNNs have limited short-term memory. They are inefficient at capturing long-term dependencies and the context of large documents.
RNNs process inputs in the order of their index or timestamp, which makes the training process difficult to scale.
GPU parallelization is not possible because an RNN cannot compute future hidden states until all past hidden states have been computed, so it cannot exploit the GPU's ability to run many computations in parallel.
Now, let's explore the transformer model and the attention mechanism that addresses the problems posed by RNNs effectively.
Introduction to Transformer
The Transformer model was first introduced in the paper "Attention Is All You Need" by Google Brain and the University of Toronto in 2017. This revolutionary paper changed the entire landscape of text generation and language model training, leading to modern generative AI.
The self-attention mechanism described in the paper paved the way for a new architecture, the Transformer, which effectively addresses the drawbacks of RNNs. Modern LLMs based on the Transformer architecture have the following advantages:
Training can be scaled easily because the computation does not depend on processing words in order of their position in the sentence.
They process tokens in parallel, enabling GPUs to compute and train models faster.
They learn the relevance and context of every word in the sentence, across context lengths of 1,000 tokens or more.
The Transformer architecture introduced in the paper comprises two parts: the encoder and the decoder. The primary task the model was designed to improve in the paper is language translation. Let's understand the transformer architecture in more detail.
The Transformer Architecture
Machine learning models require tokenized text as input, meaning that the raw text must be converted into tokens before being fed into the model. This conversion process is called tokenization. It is important to note that the tokenizer used during model training must also be used during model inference.
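As a quick, hedged illustration, the snippet below uses a pre-trained tokenizer from the Hugging Face transformers library to show what tokenization looks like in practice (it assumes the transformers package is installed and downloads the bert-base-uncased tokenizer on first use):

```python
from transformers import AutoTokenizer

# Load the tokenizer that was used when BERT was trained (downloads on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Transformers changed natural language processing."
encoded = tokenizer(text)

print(tokenizer.tokenize(text))   # sub-word tokens
print(encoded["input_ids"])       # integer token IDs that the model actually sees
```

The same tokenizer object would then be reused at inference time, which is why the training and inference tokenizers must match.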
The transformer architecture is composed of two main components:
The Encoder: Converts the input sequence of tokens into a sequence of high-dimensional vectors called embeddings. These vectors are also called hidden states. In simpler terms, the embeddings are the features the encoder extracts from the provided inputs.
The Decoder: Uses the encoder's hidden states together with the model's inputs to generate an output sequence of tokens iteratively. The decoder is responsible for the text generation, or next-word prediction, task.
[Figure: The Transformer architecture. Source: https://arxiv.org/abs/1706.03762]
The transformer architecture was initially introduced as an encoder-decoder model designed primarily for sequence-to-sequence tasks such as machine translation. However, researchers have since found ways to use the encoder and decoder components of the transformer as stand-alone models.
Most open-source pre-trained models can therefore be classified by the parts of the transformer they use, and most belong to one of the following three types (a short code sketch after the list shows how each family is typically loaded):
Encoder-only: These models convert text input sequences into high-dimensional embedding vectors that are rich numerical representations of the inputs. These embeddings are further processed to perform tasks like sentiment analysis or named entity recognition. The embedding computed for a particular token takes both the left-side context (text before the token) and the right-side context (text after the token) into account, which is why this is often called bidirectional attention. Some of the models in this category are BERT and its variants, such as DistilBERT.
Decoder-only: These models predict the next word based only on the preceding input tokens; they do not use an encoder at all. Since the decoder deals primarily with next-word prediction, the representation computed for a given token attends only to the left context, i.e., the words preceding that token. This is known as causal or autoregressive attention. Some of the models belonging to this category are the GPT series, Llama and BLOOM.
Encoder-decoder: These models use both the encoder and decoder components to capture complex relationships between input and output sequences. They are usually fine-tuned for sequence-to-sequence tasks like machine translation and summarization. Some of the models belonging to this category are T5 and BART.
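As a rough sketch rather than a definitive recipe, the three families map onto different Auto classes in the Hugging Face transformers library; the checkpoints bert-base-uncased, gpt2 and t5-small are simply well-known public examples chosen for illustration:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only: outputs embeddings / hidden states (e.g. for classification or NER)
encoder_only = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only: causal language model used for next-token prediction
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: sequence-to-sequence model (translation, summarization)
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

print(type(encoder_only).__name__)     # e.g. BertModel
print(type(decoder_only).__name__)     # e.g. GPT2LMHeadModel
print(type(encoder_decoder).__name__)  # e.g. T5ForConditionalGeneration
```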
Transformers are a breakthrough in artificial intelligence, revolutionizing natural language processing and beyond. At the center of this innovation is the transformer architecture, with its two essential components: the encoder and the decoder. The encoder processes input sequences efficiently, while the decoder generates meaningful outputs. What makes transformers unique is self-attention: a mechanism that allows each element in a sequence to weigh its importance in relation to all other elements, enabling the model to capture intricate dependencies and contextual information effectively. This innovation has led to remarkable advances in machine translation, text summarization, language modeling and other applications, making transformers the go-to architecture across many fields.
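To make the idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention with toy, untrained projection weights; the optional causal flag shows how decoder-style attention restricts each token to its left context, as described earlier:

```python
import numpy as np

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    d_model = x.shape[-1]
    rng = np.random.default_rng(0)
    # Toy projection matrices; a real model learns these during training.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d_model)   # how strongly each token attends to every other token

    if causal:  # decoder-style attention: block each token from seeing future tokens
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: attention weights
    return weights @ V                               # context-aware representation per token

x = np.random.default_rng(1).normal(size=(5, 8))     # 5 tokens, model dimension 8
print(self_attention(x).shape)                        # (5, 8) - bidirectional (encoder-style)
print(self_attention(x, causal=True).shape)           # (5, 8) - causal (decoder-style)
```

Because the attention weights for all tokens are computed as a few matrix multiplications, every position is processed at once, which is what allows transformers to be parallelized on GPUs in a way RNNs cannot.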
Summary
To summarise,
Recurrent Neural Networks (RNNs) and LSTMs were the model architectures most prominently used for text generation tasks before the introduction of the Transformer architecture.
RNNs were not fully capable of learning the global context of the provided input text, and they couldn't be easily scaled due to their dependence on time/index for order of computation.
Transformer architecture easily handles and captures the global context of the input through self-attention layers.
The Transformer consists of two main components: the encoder and the decoder blocks.
An encoder converts the input sequence of tokens into high-dimensional vectors called embeddings. Embeddings are the features extracted from the input sequences, which the decoder then uses for text generation. Examples of encoder-based models are BERT, RoBERTa, etc.
Encoder-only models use the encoder part of the transformer as a stand-alone model. Some common NLP tasks handled by encoder-only models are text classification, named entity recognition, keyword extraction, etc.
The decoder performs next-word prediction, i.e., the text generation task, by taking the inputs and the embeddings generated by the encoder.
Decoder-only models use only the decoder part of the transformer, eliminating the encoder. Examples are the GPT series, BLOOM, Llama, etc.
The encoder-decoder architecture uses both the encoder and the decoder to perform sequence-to-sequence tasks like machine translation, text summarization and question answering. T5 and BART are some examples of encoder-decoder models.
Thank you and Happy Learning!
Please share the post if you find it interesting!
Please subscribe to my newsletter to receive new posts and understand AI concepts better. Thank you!