Understanding the Architecture of Transformers

Transformer Architecture Overview

Introduction

Transformers are a neural network architecture designed for processing sequential data. The architecture was introduced by Vaswani et al. in the seminal 2017 paper “Attention Is All You Need”. Central to its design is the self-attention mechanism, which lets the model dynamically focus on different parts of the input sequence, regardless of their positions. This capability is crucial for detecting patterns and dependencies that span long stretches of the data.

The Components of Transformer Architecture

Input Processing

Input

In a Transformer model, the input is a sequence of words or tokens, commonly referred to as the context or prompt.

Tokenization

Tokenization is the process of decomposing input text into a sequence of tokens, which can be words, subwords, punctuation marks, or individual characters. Each token is then mapped to a unique integer identifier: its index in a predefined vocabulary.

Note

The tokenizer used during training should be the same one used for generating text.
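
The mapping from text to token identifiers can be sketched as follows. This is a minimal illustration, assuming a toy word-and-punctuation tokenizer with a tiny hand-built vocabulary; the names `VOCAB`, `tokenize`, `encode`, and `decode` are hypothetical, and production tokenizers usually rely on learned subword schemes instead.

```python
import re

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, ".": 6}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def tokenize(text):
    """Split lowercased text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text):
    """Map each token to its ID, falling back to [UNK] for unknown tokens."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokenize(text)]

def decode(ids):
    """Map IDs back to tokens using the same vocabulary."""
    return [ID_TO_TOKEN[i] for i in ids]

ids = encode("The cat sat on the mat.")
print(ids)          # [1, 2, 3, 4, 1, 5, 6]
print(decode(ids))  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

Decoding these IDs with a different vocabulary would produce different tokens, which is why the same tokenizer has to be used for training and for generation.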

Hands On for Tokenization

Embedding Layer

The embedding layer transforms the input tokens into dense, multidimensional vector representations. It operates as a learnable embedding space, where each unique token from the vocabulary is associated with a specific vector in a high-dimensional space. This configuration allows the model to capture the semantic meaning of each token; the attention layers that follow then refine these vectors with the contextual nuances of the input sequence.
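
A minimal sketch of the embedding lookup, assuming a randomly initialized table in place of trained weights. The sizes `VOCAB_SIZE` and `D_MODEL` and the function `embed` are hypothetical names chosen for illustration.

```python
import random

rng = random.Random(0)
VOCAB_SIZE, D_MODEL = 7, 4  # hypothetical sizes for illustration

# One row per token ID; in a real model these values are learned during training.
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([1, 2, 1])
print(len(vectors), len(vectors[0]))  # 3 4
```

Note that the two occurrences of token 1 receive the same vector: the lookup itself does not depend on the surrounding tokens.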

Word embeddings — From Words to Contextual Embeddings

Hands On for Word Embeddings

Positional Encoding

Transformers do not inherently capture the order of tokens. To compensate, positional encodings are added to the token embeddings to provide information about each token’s position within the sequence. The original Transformer paper proposed a method that uses sine and cosine functions of different frequencies to generate a distinct encoding for each position, although other methods can also be used.
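
The sinusoidal scheme from the original paper can be sketched as follows: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the cosine of the same angle. The function names `positional_encoding` and `add_positions` are hypothetical, and the sketch uses plain Python lists rather than a tensor library.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model list of sinusoidal position vectors."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so i / d_model equals 2k / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

def add_positions(embeddings):
    """Add the position vector to each token embedding element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(row, pe_row)]
            for row, pe_row in zip(embeddings, pe)]

print(positional_encoding(1, 4))  # [[0.0, 1.0, 0.0, 1.0]]
```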

Hands on for Positional encoding

Core Mechanisms

Self-Attention Layer

The self-attention mechanism allows the model to dynamically weigh and interpret different segments of the input sequence, regardless of how far apart they are. This is achieved through attention weights that are computed from learned parameters refined during the training process. By adjusting these parameters, the model can capture implicit patterns and dependencies.
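
A minimal sketch of self-attention, assuming the scaled dot-product form from the original paper, softmax(QK^T / sqrt(d_k))V. The query, key, and value projections (`w_q`, `w_k`, `w_v`) are fixed to identity matrices here purely for illustration; in a trained model they are learned.

```python
import math

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x (seq_len x d_model)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Every position scores every other position, however far apart they are.
    scores = [[sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in k] for q_row in q]
    weights = [softmax(row) for row in scores]  # one attention row per position
    return matmul(weights, v), weights

# Hypothetical fixed projections; in a trained model these are learned.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` is one row of an attention map: it shows how strongly that position attends to every position in the sequence.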

Attention weights — Example of an attention map

Hands on for Self-Attention Layer

Multi-Head Attention

Multi-head attention extends the self-attention mechanism, allowing the model to learn multiple sets of self-attention weights, or “heads”, in parallel and independently. This design lets the model attend to several aspects of language at once, with the number of attention heads varying across models. Each head can capture a different aspect of the information in the input sequence.
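
A minimal sketch of multi-head attention, assuming the common arrangement in which each head has its own query, key, and value projections, the head outputs are concatenated, and a final projection (`w_o`) mixes them. All sizes and the randomly initialized matrices are hypothetical stand-ins for learned weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(k[0]))
    weights = [softmax([sum(a * b for a, b in zip(qr, kr)) / scale for kr in k])
               for qr in q]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """Run each head independently, concatenate the results, then project."""
    head_outputs = [attention(matmul(x, h["q"]), matmul(x, h["k"]), matmul(x, h["v"]))
                    for h in heads]
    # Concatenate along the feature axis: one row per position.
    concat = [sum((out[pos] for out in head_outputs), []) for pos in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
D_MODEL, NUM_HEADS = 4, 2           # hypothetical sizes
D_HEAD = D_MODEL // NUM_HEADS
heads = [{name: random_matrix(D_MODEL, D_HEAD, rng) for name in ("q", "k", "v")}
         for _ in range(NUM_HEADS)]
w_o = random_matrix(D_MODEL, D_MODEL, rng)
x = random_matrix(3, D_MODEL, rng)  # three token vectors
y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))            # 3 4
```

Because every head works in a smaller subspace (`D_HEAD` instead of `D_MODEL`), the total cost stays close to that of a single full-width head.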

Hands on for Multi-Head Attention

Network Layers

Feedforward Neural Network

The feedforward neural network in a Transformer block is a dense, fully connected network that processes the output of the attention mechanism, applying the same weights to every position independently. It typically consists of two linear layers with a non-linear activation between them, refining the attention-derived information into a richer representation of each token. The vector of logits, in which each logit scores a token in the tokenizer’s vocabulary, is produced later by the output layer applied after the last Transformer block, not by the feedforward network itself. Together, attention and the feedforward network ensure that the model not only identifies key patterns through attention but also transforms them into features that help it predict the sequence’s structure and content accurately.
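
A minimal sketch of a position-wise feedforward network, assuming the two-linear-layer form with a ReLU activation used in the original Transformer paper. In this sketch the output has the same width as the input, as it does inside each Transformer block; the sizes `D_MODEL` and `D_FF` and the random weights are hypothetical.

```python
import random

def linear(x, w, b):
    """Apply y = xW + b to one vector (w has one row per input feature)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def relu(v):
    """Zero out negative values."""
    return [max(0.0, a) for a in v]

def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden width, apply ReLU, then project back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

rng = random.Random(0)
D_MODEL, D_FF = 4, 16  # hypothetical sizes; D_FF is usually several times D_MODEL
w1 = [[rng.gauss(0, 0.1) for _ in range(D_FF)] for _ in range(D_MODEL)]
b1 = [0.0] * D_FF
w2 = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(D_FF)]
b2 = [0.0] * D_MODEL

sequence = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4]]
# The same weights are applied to every position independently.
outputs = [feed_forward(tok, w1, b1, w2, b2) for tok in sequence]
```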

Hands on for Feedforward Neural Network

Residual Connections

Residual connections, or skip connections, add the input of each sub-layer, whether self-attention or the feedforward neural network, to its output before layer normalization is applied. This technique is important because it allows gradients to flow directly through the network, mitigating the vanishing gradient problem and enabling the training of deeper and more complex models.

In simpler words: residual connections help the model preserve information from earlier layers and integrate it with the knowledge gained in subsequent layers.

Layer Normalization

Layer normalization is applied to the outputs of both the self-attention and feedforward neural network sub-layers, standardizing each token’s vector across its features to a mean of zero and a standard deviation of one. This normalization helps stabilize the training of deep neural networks by mitigating covariate shift, thereby facilitating faster convergence.
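
The combination of a residual connection and layer normalization can be sketched as LayerNorm(x + Sublayer(x)), the arrangement described in the original paper. This is a simplified illustration: the learned scale and shift parameters that usually follow the normalization are omitted, and `double` is a hypothetical stand-in for an attention or feedforward sub-layer.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and unit variance across its features."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for one position."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

def double(v):
    """Hypothetical sub-layer standing in for attention or the feedforward network."""
    return [2.0 * a for a in v]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], double)
```

The small constant `eps` only guards against division by zero when all features are equal.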

Hands on for Residual Connections & Layer Normalization

Output Layer

The final output is a probability distribution over the vocabulary, representing the likelihood of each token being the next token in the sequence. This distribution is obtained by passing the output of the last Transformer block through a linear layer, which produces one logit per vocabulary token, and then through a softmax function. The softmax converts the logits into probabilities that sum to one, giving a quantifiable likelihood for each possible next token.
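
A minimal sketch of the output layer: a linear projection to vocabulary-sized logits followed by a softmax. The three-token vocabulary, the weight matrix `w_out`, and the hidden vector are hypothetical values chosen so the result is easy to follow.

```python
import math

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(hidden, w_out):
    """Project the last hidden state to vocabulary logits, then apply softmax."""
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_out)]
    return softmax(logits)

VOCAB = ["the", "cat", "sat"]   # hypothetical three-token vocabulary
w_out = [[1.0, 0.0, -1.0],      # d_model x vocab_size, hypothetical values
         [0.0, 2.0, 0.0]]
probs = next_token_distribution([0.5, 1.0], w_out)
best = VOCAB[probs.index(max(probs))]
print(best)  # cat
```

Here the logits are 0.5, 2.0, and -0.5, so "cat" receives the highest probability. Picking the most likely token is only one option; sampling from `probs` is another.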

Full example here

Types of Transformer Architectures

Encoder-Only Models

Overview

Encoder-only models, also called auto-encoding models, are a class of Transformer-based architectures designed primarily for understanding and interpreting text. Unlike their encoder-decoder counterparts, they do not generate new text but focus on analyzing and extracting meaning from input sequences.

Encoder-only models
Examples of Existing Models	BERT, RoBERTa
Focus	Understanding text
Applications	Sentence embedding; sentiment analysis; named entity recognition; text classification; feature extraction
Limitations	Not designed for text generation; may require large datasets for fine-tuning; can be computationally intensive

Masked Language Modeling in Encoder-Only Models

Encoder-only models, such as BERT (Bidirectional Encoder Representations from Transformers), use a training technique called “masked language modeling” (MLM) to learn bidirectional representations of the input text.

Masking Tokens

During the training phase, some tokens in the input sequence are randomly selected and replaced with a special [MASK] token. The model’s task is to predict these masked tokens based on the context provided by the surrounding (unmasked) tokens.
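
A minimal sketch of the masking step. The default rate of 0.15 follows the proportion reported for BERT, but it is only a parameter here; the function name `mask_tokens` is hypothetical, and this sketch always substitutes [MASK], whereas real training recipes may vary the replacement.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)  # position is ignored by the loss
    return inputs, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, targets = mask_tokens(tokens, mask_prob=0.3, seed=42)
```

The `targets` list keeps the original token only at masked positions, so the training objective can be restricted to those positions.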

Bidirectional Context

Encoder-only models leverage context from both directions (bidirectional): the model considers both the preceding and the following tokens to predict the masked token.

Objective function

The objective of the masked language modeling task is to minimize the prediction error on the masked tokens. The model’s predictions are compared to the actual tokens, typically with a cross-entropy loss, and the parameters of the model are updated to reduce the difference between the predicted and actual tokens.
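
A minimal sketch of this objective, assuming the prediction error is measured as cross-entropy (negative log-likelihood), a common choice that the text does not prescribe. The function name `masked_lm_loss` and the example distributions are hypothetical.

```python
import math

def masked_lm_loss(predicted_probs, targets):
    """Average negative log-likelihood over masked positions only."""
    losses = [-math.log(probs[target])
              for probs, target in zip(predicted_probs, targets)
              if target is not None]
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical model output: one distribution over a 3-token vocabulary per position.
predicted = [
    [0.7, 0.2, 0.1],    # unmasked position, ignored by the loss
    [0.1, 0.8, 0.1],    # masked position, true token ID is 1
    [0.25, 0.25, 0.5],  # masked position, true token ID is 2
]
targets = [None, 1, 2]
loss = masked_lm_loss(predicted, targets)
```

The loss shrinks as the model assigns more probability to the true token at each masked position, which is the quantity the parameter updates try to reduce.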

Denoising

The model performs a denoising task, where it attempts to reconstruct the original sentence from a corrupted version (with masked tokens).

Decoder-Only Models

Overview

Decoder-only models, also called auto-regressive models, are a class of Transformer-based architectures designed primarily for generating text. Unlike their encoder-decoder counterparts, they focus solely on producing new text based on the input sequence, and are often used in tasks like language modeling and text generation.

Decoder-only models
Examples of Existing Models	GPT, BLOOM
Focus	Text generation
Applications	Text completion; language modeling; chatbots; text summarization
Limitations	May generate incoherent or biased text; requires substantial computational resources; limited understanding of context compared to encoder-decoder models

Causal Language Modeling in Decoder-Only Models

Decoder-only models, such as GPT (Generative Pretrained Transformer), use a training technique called “causal language modeling” (CLM) to learn sequential representations of the input text.

Sequential Prediction

During the training phase, the model predicts each token in the input sequence based on the preceding tokens. Unlike masked language modeling, which predicts randomly masked tokens, causal language modeling predicts each token in the sequence in order.
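
A minimal sketch of how one training sequence yields next-token prediction examples. The function names `next_token_pairs` and `training_examples` are hypothetical; real training code usually builds the same pairs by shifting a tensor of token IDs by one position.

```python
def next_token_pairs(token_ids):
    """Pair every token with the token that follows it (inputs shifted by one)."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def training_examples(token_ids):
    """List (context, next_token) pairs; each context holds only preceding tokens."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

print(next_token_pairs([5, 6, 7]))   # [(5, 6), (6, 7)]
print(training_examples([5, 6, 7]))  # [([5], 6), ([5, 6], 7)]
```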

Unidirectional Context

Decoder-only models leverage context from only one direction (unidirectional): they consider only the preceding tokens when predicting the next token, ensuring that each prediction depends causally only on tokens that are already known.
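
One common way to enforce this restriction is a causal attention mask. The sketch below assumes a boolean mask in which position i may attend only to positions j <= i; the names `causal_mask` and `masked_softmax` are hypothetical. Implementations often add negative infinity to the disallowed scores before the softmax, which has the same effect as zeroing them here.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over allowed positions; disallowed ones get probability 0."""
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0, 1.0, 2.0]] * 3  # hypothetical raw attention scores
mask = causal_mask(3)
weights = [masked_softmax(row, ok) for row, ok in zip(scores, mask)]
print(weights[0])  # [1.0, 0.0, 0.0]
```

The first position can only attend to itself, while the last position attends to the whole prefix.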

Objective Function

The objective is to minimize the prediction error of the next token in the sequence. The model’s predictions are compared to the actual tokens, and the parameters of the model are updated to reduce the difference between the predicted and actual tokens.

Text Generation

The autoregressive nature of causal language modeling makes decoder-only models particularly well suited to text generation tasks. After training, the model can generate new text by predicting one token at a time, using its own previous outputs as part of the input for the next token’s prediction.
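
A minimal sketch of the generation loop, assuming greedy decoding where the model returns a single next token. The function `generate` and its parameters are hypothetical, and `toy_model` is a stand-in for a trained network.

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive decoding: append one predicted token at a time."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)  # the model sees everything generated so far
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def toy_model(ids):
    """Hypothetical stand-in for a trained model: predicts (last token + 1) mod 10."""
    return (ids[-1] + 1) % 10

print(generate(toy_model, [3], max_new_tokens=4))            # [3, 4, 5, 6, 7]
print(generate(toy_model, [7], max_new_tokens=9, eos_id=9))  # [7, 8, 9]
```

Generation stops either when the token budget is used up or when the model emits the end-of-sequence token.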

Encoder-Decoder Models

Overview

Encoder-decoder models, also called sequence-to-sequence models, are a class of Transformer-based architectures designed for tasks that involve both understanding and generating text. They combine two main components: an encoder that processes the input sequence and a decoder that generates the output sequence.

Encoder-decoder models
Example of Existing Models	T5 (Text-to-Text Transfer Transformer)
Focus	Text understanding and generation
Applications	Machine translation; text summarization; question answering; conversational agents; text-to-speech synthesis; language translation systems; image caption generation; speech-to-text systems
Advantages	Versatility; enhanced accuracy
Limitations	Complex training; information loss
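
The division of labor between the encoder and the decoder can be sketched as follows. This is only a structural illustration: `toy_encode` and `toy_decode_step` are hypothetical stand-ins for trained networks (the toy "task" is to reverse the input), and `BOS` and `EOS` are assumed start and end markers.

```python
def seq2seq_generate(encode, decode_step, source_ids, bos_id, eos_id, max_len=20):
    """Encode the source once, then decode one token at a time."""
    memory = encode(source_ids)                # encoder: reads the whole input
    output = [bos_id]
    for _ in range(max_len):
        next_id = decode_step(memory, output)  # decoder: sees memory + own outputs
        output.append(next_id)
        if next_id == eos_id:
            break
    return output[1:]

BOS, EOS = 0, 1  # assumed special token IDs

def toy_encode(source_ids):
    """Hypothetical encoder: simply keeps the input as its 'representation'."""
    return list(source_ids)

def toy_decode_step(memory, output_so_far):
    """Hypothetical decoder step: emits the input in reverse, then EOS."""
    produced = len(output_so_far) - 1          # tokens generated after BOS
    return memory[-1 - produced] if produced < len(memory) else EOS

print(seq2seq_generate(toy_encode, toy_decode_step, [5, 6, 7], BOS, EOS))  # [7, 6, 5, 1]
```

The encoder runs once per input, while the decoder runs once per generated token and consults the encoder’s output at every step.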