How a Large Language Model (LLM) predicts the next word

 Step-by-Step Mathematical Breakdown of LLM’s Prediction Process

 

  1. Text Tokenization

 

The first step converts raw text into tokens, the basic units the model processes. Tokenization typically splits the text into words or subwords, for example with Byte-Pair Encoding (BPE) or WordPiece.

 

Text: “I love programming”

Tokens: [“I”, “love”, “programming”]

Tokenization maps the text to a sequence of discrete units (tokens); each token is then mapped to an embedding vector in the next step.
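
As an illustration, here is a minimal sketch of this step, assuming a toy whitespace tokenizer and a hand-made three-word vocabulary; real LLMs use learned subword tokenizers such as BPE or WordPiece, so the token ids below are purely hypothetical.

```python
# Toy whitespace tokenizer with a hypothetical, hand-made vocabulary.
# Real LLMs use learned subword tokenizers (e.g. BPE or WordPiece).
vocab = {"I": 0, "love": 1, "programming": 2}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each token to its integer id."""
    return [vocab[token] for token in text.split()]

print(tokenize("I love programming"))  # [0, 1, 2]
```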

 

  2. Embedding Layer (Token to Vector Mapping)

 

The tokenized input is then mapped to vectors using an embedding matrix. In this step, each token is transformed into a continuous vector of real numbers.

 

Embedding Matrix (\(\mathbf{E}\)): This is a matrix where each row corresponds to the embedding of a token in the vocabulary.

 

Let \(V\) be the size of the vocabulary and \(d\) be the embedding dimension. The embedding matrix \(\mathbf{E}\) is of size \(V \times d\).

 

\[\mathbf{E} = \begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \\ \vdots \\ \mathbf{e}_V \end{bmatrix}\]

 

Where \(\mathbf{e}_i\) is the \(i\)-th token embedding.

 

Example: The word “I” is mapped to its embedding vector:

\[\mathbf{E}(\text{“I”}) = [0.1, 0.2, 0.3, \dots] \quad (\text{a vector of dimension } d)\]

 

This transforms each token \(t_i\) into an embedding vector \(\mathbf{E}(t_i)\) where:

 

\[\mathbf{x}_i = \mathbf{E}(t_i) \quad \text{for each token } t_i\]
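
A minimal NumPy sketch of the embedding lookup, assuming a toy vocabulary size \(V = 3\), embedding dimension \(d = 4\), and a randomly initialized embedding matrix standing in for learned weights:

```python
import numpy as np

V, d = 3, 4                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))       # embedding matrix E, shape (V, d); random stand-in for learned weights

token_ids = [0, 1, 2]             # ids for "I love programming" from the tokenization step
X = E[token_ids]                  # row lookup: x_i = E(t_i) for each token, shape (3, d)
print(X.shape)                    # (3, 4)
```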

 

  3. Positional Encoding (Injecting Word Order Information)

 

Since Transformers, unlike RNNs, do not process tokens sequentially, Positional Encoding is added to give the model information about the order of tokens in the sequence.

 

Positional encodings are typically added to embeddings. They can be learned or predefined (e.g., sinusoidal functions).

 

Positional Encoding (\(\mathbf{P}_i\)): For each position \(i\) in the sequence, a positional encoding vector is generated.

 

The positional encoding vector is typically computed as:

 

\[P_{i,\,2j} = \sin\left(\frac{i}{10000^{2j/d}}\right), \qquad P_{i,\,2j+1} = \cos\left(\frac{i}{10000^{2j/d}}\right)\]

Where \(i\) is the position in the sequence, \(j\) indexes the dimension pairs of the vector, and \(d\) is the dimensionality of the embedding.

 

After computing positional encodings, each token’s embedding \(\mathbf{x}_i\) is modified by adding its corresponding positional encoding:

 

\[\mathbf{x}'_i = \mathbf{E}(t_i) + \mathbf{P}_i\]
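
A sketch of the sinusoidal variant in NumPy, assuming an even embedding dimension; the sine terms go to the even dimensions and the cosine terms to the odd ones:

```python
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal positional encodings (assumes an even dimension d)."""
    positions = np.arange(seq_len)[:, None]      # token positions i, shape (seq_len, 1)
    dims = np.arange(0, d, 2)[None, :]           # even dimension indices 2j
    angles = positions / (10000 ** (dims / d))
    P = np.zeros((seq_len, d))
    P[:, 0::2] = np.sin(angles)                  # sin on even dimensions
    P[:, 1::2] = np.cos(angles)                  # cos on odd dimensions
    return P

seq_len, d = 3, 4
P = positional_encoding(seq_len, d)
X = np.zeros((seq_len, d))                       # stand-in for the token embeddings
X_prime = X + P                                  # x'_i = E(t_i) + P_i
```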

 

  4. Self-Attention Mechanism (Key, Query, and Value)

 

The model uses the Self-Attention mechanism to understand the relationships between all tokens in a sequence, irrespective of their positions.

 

For each token, three vectors are computed:

 

– Query (Q): Captures what the token is looking for.

– Key (K): Captures what the token is offering.

– Value (V): Contains the actual information of the token.

 

These vectors are computed as:

 

\[\mathbf{Q}_i = \mathbf{W}_Q \cdot \mathbf{x}'_i\]

\[\mathbf{K}_i = \mathbf{W}_K \cdot \mathbf{x}'_i\]

\[\mathbf{V}_i = \mathbf{W}_V \cdot \mathbf{x}'_i\]

 

Where \(\mathbf{W}_Q\), \(\mathbf{W}_K\), \(\mathbf{W}_V\) are learnable weight matrices for the Query, Key, and Value respectively, and \(\mathbf{x}'_i\) is the embedded token (with positional encoding).
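
A minimal sketch of the three projections in NumPy, with random matrices standing in for the learned weights and the rows of `X_prime` standing in for the position-augmented embeddings (the code uses the row-vector convention, so each projection is written as `x' @ W`):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, d_k = 3, 4, 4                 # toy sizes; d_k is the query/key/value dimension

X_prime = rng.normal(size=(seq_len, d))   # embeddings with positional encoding added (stand-in)

W_Q = rng.normal(size=(d, d_k))           # learnable projection matrices (random stand-ins here)
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

Q = X_prime @ W_Q                         # one query vector per token, shape (seq_len, d_k)
K = X_prime @ W_K                         # one key vector per token
V = X_prime @ W_V                         # one value vector per token
```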

 

  5. Attention Scores (Dot Product)

 

Self-attention computes a score for how much attention a token should pay to another token in the sequence. This is done by computing the dot product of the Query and Key vectors:

 

\[\text{Attention Score}_{ij} = \frac{\mathbf{Q}_i \cdot \mathbf{K}_j}{\sqrt{d_k}}\]

 

Where \(\mathbf{Q}_i\) and \(\mathbf{K}_j\) are the Query and Key vectors of tokens \(i\) and \(j\), respectively, and \(d_k\) is the dimension of the Key vector.
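
In NumPy, with query and key matrices like those from the previous sketch (one row per token), all pairwise scores can be computed at once:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 3, 4
Q = rng.normal(size=(seq_len, d_k))   # stand-ins for the query and key matrices
K = rng.normal(size=(seq_len, d_k))

# Scaled dot-product scores: score_ij = (Q_i . K_j) / sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)       # shape (seq_len, seq_len)
print(scores.shape)                   # (3, 3)
```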

 

  6. Softmax (Normalize Attention Scores)

 

After computing the attention scores for all token pairs, the scores are passed through a Softmax function to normalize them into a probability distribution:

 

\[\alpha_{ij} = \text{Softmax}(\text{Attention Score}_{ij}) = \frac{e^{\text{Attention Score}_{ij}}}{\sum_k e^{\text{Attention Score}_{ik}}}\]

 

For each token \(i\), this turns the attention scores into a probability distribution: the weights \(\alpha_{ij}\) sum to 1 across all tokens \(j\).
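
A small, numerically stable softmax sketch in NumPy; subtracting the row maximum before exponentiating does not change the result but avoids overflow:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 0.5, -1.0]])    # attention scores for one query token
alpha = softmax(scores)
print(alpha, alpha.sum())                # the weights sum to 1
```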

 

  7. Weighted Sum of Values

 

The output of the attention mechanism is a weighted sum of the Value (V) vectors, weighted by the attention scores \(\alpha_{ij}\):

 

\[\mathbf{O}_i = \sum_j \alpha_{ij} \cdot \mathbf{V}_j\]

 

This vector represents the contextualized representation of token \(i\), considering all other tokens in the sequence.
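
A minimal sketch of the weighted sum, using uniform attention weights as a stand-in for the softmax output of the previous step:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 3, 4
alpha = np.full((seq_len, seq_len), 1.0 / seq_len)   # stand-in attention weights; each row sums to 1
V = rng.normal(size=(seq_len, d_k))                  # value vectors, one row per token

O = alpha @ V          # O_i = sum_j alpha_ij * V_j, shape (seq_len, d_k)
print(O.shape)         # (3, 4): one contextualized vector per token
```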

 

  8. Feed-Forward Neural Network (FFNN)

 

The output from the attention mechanism is passed through a Feed-Forward Neural Network (FFNN), typically a fully connected layer with ReLU activation:

 

\[\mathbf{h}_i = \text{FFNN}(\mathbf{O}_i)\]

 

The FFNN consists of two linear layers with a ReLU activation in between:

 

\[\mathbf{h}_i = \mathbf{W}_2 \cdot \text{ReLU}(\mathbf{W}_1 \cdot \mathbf{O}_i + \mathbf{b}_1) + \mathbf{b}_2\]

 

Where \(\mathbf{W}_1\), \(\mathbf{W}_2\) are weight matrices, and \(\mathbf{b}_1\), \(\mathbf{b}_2\) are bias terms.
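
A sketch of this position-wise feed-forward block in NumPy, with random weights as stand-ins; the hidden dimension is typically larger than the model dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 4, 16                  # toy model dimension and (larger) hidden dimension

W1, b1 = rng.normal(size=(d_ff, d)), np.zeros(d_ff)   # random stand-ins for learned parameters
W2, b2 = rng.normal(size=(d, d_ff)), np.zeros(d)

def ffnn(o: np.ndarray) -> np.ndarray:
    """Two linear layers with a ReLU in between: h = W2 . ReLU(W1 . o + b1) + b2."""
    hidden = np.maximum(0.0, W1 @ o + b1)             # ReLU activation
    return W2 @ hidden + b2

o_i = rng.normal(size=d)         # attention output for one token
h_i = ffnn(o_i)
print(h_i.shape)                 # (4,)
```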

 

  9. Output Layer

 

After passing through multiple layers of attention and FFNN, the model outputs a vector for each token, which is then projected onto the vocabulary space to predict the next token.

 

Logits: The output vector for token \(i\) is projected to a vector of size \(V\) (the vocabulary size) to generate logits (unnormalized predictions):

 

\[\mathbf{L}_i = \mathbf{W}_O \cdot \mathbf{h}_i + \mathbf{b}_O\]

 

Where \(\mathbf{W}_O\) is the weight matrix, and \(\mathbf{b}_O\) is the bias.
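
A minimal sketch of the projection onto the vocabulary, again with toy sizes and random weights standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 3                      # toy model dimension and vocabulary size

W_O = rng.normal(size=(V, d))    # output projection matrix (random stand-in)
b_O = np.zeros(V)

h_i = rng.normal(size=d)         # final hidden state of the last token (stand-in)
logits = W_O @ h_i + b_O         # L_i = W_O . h_i + b_O: one unnormalized score per vocabulary entry
print(logits.shape)              # (3,)
```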

 

  10. Softmax (Final Prediction)

 

The logits are passed through a Softmax function to convert them into a probability distribution over the vocabulary:

 

\[P(w_j) = \frac{e^{L_{i,j}}}{\sum_{k=1}^{V} e^{L_{i,k}}}\]

Where \(L_{i,j}\) is the \(j\)-th component of the logit vector \(\mathbf{L}_i\).

 

This step converts the raw scores into a probability distribution over all possible next tokens.
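
For example, a toy logit vector over a three-word vocabulary can be turned into probabilities like this (subtracting the maximum is only for numerical stability):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])      # toy logits over a 3-token vocabulary
exp = np.exp(logits - logits.max())     # numerically stable softmax
probs = exp / exp.sum()
print(probs, probs.sum())               # roughly [0.66, 0.24, 0.10]; sums to 1
```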

 

  11. Next Word Selection

 

Finally, the model selects the token with the highest probability as the next word in the sequence:

 

\[\hat{y} = \arg\max_{j} \; P(w_j)\]

 

The token corresponding to the highest probability is predicted as the next word.
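
A sketch of this greedy selection with a hypothetical id-to-token map; in practice, decoding often samples from the distribution rather than always taking the argmax, but the greedy case matches the formula above:

```python
import numpy as np

id_to_token = {0: "I", 1: "love", 2: "programming"}   # hypothetical vocabulary mapping
probs = np.array([0.10, 0.24, 0.66])                  # probabilities from the previous step

next_id = int(np.argmax(probs))                       # greedy choice: highest-probability token
print(id_to_token[next_id])                           # "programming"
```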

 

Summary of Mathematical Operations

 

  1. Tokenization: Splitting the text into tokens.
  2. Embedding: Mapping each token to a continuous vector \(\mathbf{E}(t)\).
  3. Positional Encoding: Adding positional encodings to the embeddings.
  4. Self-Attention: Computing Query, Key, and Value vectors and attention scores.
  5. Softmax: Normalizing attention scores and logits.
  6. Feed-Forward Network: Applying a neural network layer to contextualized representations.
  7. Logits: Projecting the model’s output onto the vocabulary space.
  8. Final Softmax: Converting logits into a probability distribution.
  9. Next Word Selection: Selecting the next word based on the probabilities.

 

By applying these operations across the stacked layers of a Transformer, an LLM turns an input sequence into a probability distribution over its vocabulary and predicts the next word. Repeating this step, appending each predicted token to the input, is how longer passages of text are generated.

 


 
