What Are LLMs?
LLMs are machine learning models trained on vast amounts of text data. They use the transformer architecture, a neural network design introduced in the 2017 paper “Attention Is All You Need”. Transformers excel at capturing context and relationships within data, making them well suited to natural language tasks.
1. Architectural Types of Language Models
(Expanded with Technical Nuances)
A. Decoder-Only (Autoregressive) Models
– Core Mechanism: Processes text sequentially from left-to-right using masked self-attention. Each token prediction depends only on previous tokens.
– Key Innovations:
– Sparse Attention (e.g., GPT-3’s block-sparse patterns) for long-context efficiency.
– Rotary Positional Embeddings (RoPE) in LLaMA for better positional encoding.
– Limitations: Struggles with bidirectional context understanding (e.g., fill-in-the-blank tasks).
– Examples: GPT-4 (parameter count undisclosed), Mistral 7B (sliding-window attention), PaLM 2 (Google’s Pathways-based scaling).
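To make the left-to-right restriction concrete, here is a minimal, self-contained sketch (random tensors, illustrative sizes) of how a causal mask hides future tokens in the attention scores:

```python
# Illustrative only: how a decoder-only model restricts each token to
# attending over previous tokens via a lower-triangular (causal) mask.
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)   # queries for 4 tokens
k = torch.randn(seq_len, d_k)   # keys for 4 tokens

scores = q @ k.T / d_k ** 0.5                              # raw attention scores
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))   # hide future tokens
weights = F.softmax(scores, dim=-1)                        # each row sums to 1
print(weights)  # upper triangle is exactly 0: token i never attends to j > i
```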
B. Encoder-Only (Autoencoding) Models
– Core Mechanism: Uses full bidirectional attention to reconstruct masked tokens (e.g., BERT masks 15% of input tokens).
– Training Tricks:
– Dynamic Masking (RoBERTa): Changes masked tokens per epoch.
– Whole-Word Masking: Masks all subword pieces of a word together rather than individual pieces (used, e.g., in Chinese and Japanese BERT variants).
– Use Cases:
– Sentence embeddings (e.g., SBERT for semantic search).
– Low-latency classification (DistilBERT’s 66M parameters make it markedly faster than full BERT); see the fill-mask sketch below.
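For contrast with the decoder-only models above, a quick fill-in-the-blank demo, assuming the Hugging Face `pipeline` API and the public `bert-base-uncased` checkpoint, shows bidirectional masked-token prediction in action:

```python
# Masked-LM demo: BERT uses context on BOTH sides of [MASK] to rank candidates,
# something a left-to-right decoder cannot do directly.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The weather is [MASK] today."):
    print(pred["token_str"], round(pred["score"], 3))
```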
C. Encoder-Decoder (Sequence-to-Sequence) Models
– Hybrid Approach: Encoder processes input bidirectionally; decoder generates output autoregressively.
– Specialized Variants:
– T5: Treats all tasks as text-to-text (e.g., “translate English to German: …”).
– BART: Optimized for denoising (e.g., document reconstruction).
– Efficiency Trade-offs: Typically on the order of 30-40% more compute than a similarly sized decoder-only model, since inputs pass through both the encoder and the decoder. A text-to-text usage sketch follows below.
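A minimal sketch of T5’s text-to-text pattern, assuming the public `t5-small` checkpoint and the Hugging Face `AutoModelForSeq2SeqLM` API:

```python
# Every task is phrased as an input string; the encoder reads it bidirectionally
# and the decoder generates the answer autoregressively.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tok("translate English to German: The weather is nice.", return_tensors="pt")
out = t5.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```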
2. Training Objectives & Pretraining Strategies
(Beyond Basic Causal/Masked LM)
A. Multitask Pretraining
– FLAN-T5: Instruction-tuned on a collection of 1,800+ tasks phrased as natural-language instruction templates.
– UniLM: Combines causal, masked, and seq2seq objectives in one model.
B. Reinforcement Learning from Human Feedback (RLHF)
– Process: Supervised fine-tuning → Reward modeling → PPO optimization.
– Critical for Alignment: Substantially reduces harmful or untruthful outputs (reductions on the order of 60% have been reported).
C. Denoising Objectives
– BART’s Approach: Randomly corrupts text (deletion, permutation, masking) and learns to reconstruct.
– PEGASUS: Specifically designed for summarization via gap-sentence generation.
3. Specialized LLMs & Emerging Categories
A. Vision-Language Models (VLMs)
– Architecture:
– Single-Stream (Flamingo): Interleaves image and text tokens in one transformer.
– Dual-Encoder (CLIP): Separate image/text encoders with contrastive learning.
– Breakthrough Models:
– GPT-4V: Processes images via vision encoder + LLM fusion.
– Kosmos-2: Grounds text to image regions (e.g., “click on the red car”).
B. Code-Specialized LLMs
– Training Data:
– StarCoder: 80+ programming languages from GitHub (1TB code).
– Code LLaMA: Infill-compatible (e.g., predicts missing code segments).
– Unique Features:
– Repository-Level Context: AlphaCode processes entire GitHub repos.
– Unit Test Execution: CodeT5+ validates outputs against test cases.
C. Domain-Specific LLMs
– Medical:
– Med-PaLM 2: Scores above 85% on USMLE-style medical exam questions.
– BioBERT: Pretrained on PubMed abstracts.
– Legal:
– LegalGPT: Fine-tuned on 2M court opinions.
– Harvey AI: Used by Allen & Overy for contract review.
4. Multimodal & Embodied AI Frontiers
A. Audio-Language Models
– Whisper: ASR + translation via encoder-decoder.
– AudioPaLM: Merges speech and text tokenizers for voice assistants.
B. Robotics Integration
– RT-2: Uses VLMs to convert camera inputs to robot actions (“pick up the banana”).
– PaLM-E: Embodied model handling sensor data + language.
C. Agentic LLMs
– AutoGPT: Recursively decomposes goals into sub-tasks.
– Voyager: Minecraft AI that learns from environment feedback.
Comparative Summary Table
| Category | Key Differentiators | Example Models | Benchmark Performance (approx.) |
|---|---|---|---|
| Decoder-Only | Fast generation, left-context only | GPT-4, Mistral 7B | ~75% |
| Encoder-Only | Bidirectional, no generation | BERT, LegalBERT | ~92% |
| VLMs | Fuses vision + text | GPT-4V, Kosmos-2 | ~88% |
| Code LLMs | Repository-aware, test-passing | Code LLaMA, AlphaCode | ~54% |
| Medical LLMs | FDA-compliant fine-tuning | Med-PaLM 2 | ~85% |

(Scores are indicative percentages on each category’s standard benchmarks; exact figures vary by benchmark and model version.)
Core Components of an LLM
- Tokenizer: Splits text into smaller units like words or subwords.
- Embedding Layer: Converts tokens into dense vector representations.
- Transformer Blocks: Layers that use self-attention mechanisms to process and understand input sequences.
- Output Layer: Generates predictions, such as the next word in a sentence.
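These four components map directly onto the Hugging Face objects used in the walkthrough below; a minimal sketch, assuming the `distilgpt2` checkpoint, chains them end to end:

```python
# The four core components in order: tokenizer -> embeddings -> transformer
# blocks -> output layer, using distilgpt2.
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("distilgpt2")
lm = AutoModelForCausalLM.from_pretrained("distilgpt2")

ids = tok("The weather is", return_tensors="pt").input_ids   # 1. tokenizer
emb = lm.transformer.wte(ids)                                # 2. embedding layer
hidden = lm.transformer(ids).last_hidden_state               # 3. transformer blocks
logits = lm.lm_head(hidden)                                  # 4. output layer
next_id = logits[0, -1].argmax()
print(tok.decode(next_id.item()))                            # most likely next token
```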
Now let’s create the LLM step by step:
I’ll use Google Colab, and we’ll walk through every step of the pipeline using the “distilgpt2” model.
```python
# Install only if not already available
!pip install transformers torch

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F

def log_step(step_num, title, data, data_type="tensor", max_len=5):
    """Helper function for consistent logging"""
    print(f"\n{'='*50}")
    print(f" STEP {step_num}: {title.upper()}")
    print("-"*50)
    if data_type == "tensor":
        print(f"Shape: {data.shape}")
        if len(data.shape) <= 2:
            print(data)
        else:
            print(f"First {max_len} elements of last dimension:", data[..., :max_len])
    elif data_type == "text":
        print(data)
    elif data_type == "dict":
        for k, v in data.items():
            print(f"{k}: {v}")

# Load tokenizer and model
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
Model Name Definition:

```python
model_name = "distilgpt2"
```

- This sets the variable `model_name` to the string "distilgpt2".
- "distilgpt2" refers to a distilled (smaller, faster) version of the GPT-2 model released by Hugging Face.
- It is a pretrained language model that can generate human-like text.
Loading the Tokenizer:

```python
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

- `AutoTokenizer` is a class from the Hugging Face transformers library that automatically selects the appropriate tokenizer based on the model name.
- `from_pretrained(model_name)` loads the tokenizer that was trained alongside the "distilgpt2" model.
- The tokenizer handles:
  - Splitting text into tokens (words/subwords)
  - Converting tokens to numerical IDs (tokenization)
  - Converting numerical IDs back to text (detokenization)
  - Handling special tokens (for GPT-2 this is mainly the end-of-text token `<|endoftext|>`; BERT-style models use [CLS], [SEP], etc.)
Loading the Model:

```python
model = AutoModelForCausalLM.from_pretrained(model_name)
```

- `AutoModelForCausalLM` is a class that automatically selects the appropriate model architecture for causal language modeling.
- `from_pretrained(model_name)` downloads and loads:
  - The model architecture (in this case, a distilled GPT-2 architecture)
  - The pretrained weights (the knowledge the model learned during training)
- This is a "causal" language model, meaning it is designed to predict the next word in a sequence (used for text generation).

Key points about what happens under the hood:

- Both operations download the model/tokenizer from Hugging Face's model hub if they are not already cached locally.
- The downloaded files are stored in a cache directory (typically ~/.cache/huggingface).
- The model is loaded in evaluation mode by default (no training/gradient computation).
- The tokenizer includes all the special tokens and vocabulary needed to preprocess text for this specific model.

After these steps, you'll have:

- A `tokenizer` ready to convert between text and token IDs
- A `model` ready to make predictions (generate text) based on input token IDs
```
==================================================
 STEP 1: ORIGINAL TEXT
--------------------------------------------------
The weather is

==================================================
 STEP 2: TOKENIZATION
--------------------------------------------------
Tokens: ['The', 'Ġweather', 'Ġis']
Token IDs: tensor([[ 464, 6193,  318]])
Attention Mask: tensor([[1, 1, 1]])

==================================================
 STEP 3: EMBEDDING MATRIX
--------------------------------------------------
Shape: torch.Size([50257, 768])
Parameter containing:
tensor([[-0.1445, -0.0455,  0.0042,  ..., -0.1523,  0.0184,  0.0991],
        [ 0.0573, -0.0722,  0.0234,  ...,  0.0603, -0.0042,  0.0478],
        [-0.1106,  0.0386,  0.1948,  ...,  0.0421, -0.1141, -0.1455],
        ...,
        [-0.0710, -0.0173,  0.0176,  ...,  0.0834,  0.1340, -0.0746],
        [ 0.1993,  0.0201,  0.0151,  ..., -0.0829,  0.0750, -0.0294],
        [ 0.0342,  0.0640,  0.0305,  ...,  0.0291,  0.0942,  0.0639]],
       requires_grad=True)
```
Let me break down each step mathematically with clear explanations and key points regarding the output above.
Step 1: Original Text Processing
Text: "The weather is"
– This is a sequence of 3 words; the leading spaces before “weather” and “is” become part of their tokens.
– The tokenizer will split this into subword tokens based on the vocabulary.
Mathematical Representation:
– Let the input text be a string \( S = \) "The weather is".
– The tokenizer \( \mathcal{T} \) maps \( S \) to a sequence of tokens \( T \):
\[
\mathcal{T}(S) = T = [t_1, t_2, t_3] = \text{[‘The’, ‘Ġweather’, ‘Ġis’]}
\]
– Ġ indicates a preceding space in GPT-2’s byte-level BPE tokenization.
Step 2: Tokenization
Output:
– Tokens: ['The', 'Ġweather', 'Ġis']
– Token IDs: tensor([[464, 6193, 318]])
– Attention Mask: tensor([[1, 1, 1]])
Mathematical Explanation:
1. Token → ID Mapping:
– The tokenizer has a vocabulary \( V \) where each token \( t_i \) is assigned a unique integer \( x_i \).
– The mapping is:
\[
\begin{cases}
t_1 = \text{‘The’} & \rightarrow x_1 = 464 \\
t_2 = \text{‘Ġweather’} & \rightarrow x_2 = 6193 \\
t_3 = \text{‘Ġis’} & \rightarrow x_3 = 318 \\
\end{cases}
\]
– The input sequence becomes:
\[
X = [x_1, x_2, x_3] = [464, 6193, 318]
\]
2. Attention Mask:
– Since all tokens are valid (not padding), the mask is [1, 1, 1].
– If padding were present, the padded positions would be 0.
Key Points:
– The tokenizer converts text into numerical IDs that the model can process.
– The attention mask helps the model ignore padding tokens during computation.
Step 3: Embedding Matrix
Output:
– Shape: torch.Size([50257, 768])
– Description: A matrix of size (vocab_size, embedding_dim).
Mathematical Explanation:
1. Embedding Matrix Definition:
– Let \( W_e \in \mathbb{R}^{V \times d} \), where:
– \( V = 50257 \) (vocabulary size)
– \( d = 768 \) (embedding dimension)
– Each row \( W_e[i] \) is the embedding vector for token ID \( i \).
2. Embedding Lookup:
– For each token ID \( x_i \), the embedding is:
\[
e_i = W_e[x_i]
\]
– For our input \( X = [464, 6193, 318] \), we get:
\[
\begin{cases}
e_1 = W_e[464] \\
e_2 = W_e[6193] \\
e_3 = W_e[318] \\
\end{cases}
\]
– The final embedded sequence is:
\[
E = [e_1, e_2, e_3] \in \mathbb{R}^{3 \times 768}
\]
Key Points:
– The embedding matrix is a trainable lookup table that maps discrete token IDs to continuous vectors.
– Each token is represented as a dense vector in \( \mathbb{R}^{768} \).
– Similar words tend to have similar embeddings (closer in vector space).
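As a sanity check (continuing the Colab session above, reusing `embedding_layer` and `input_ids`), the lookup really is plain row indexing into \( W_e \):

```python
# Verify that the embedding layer is literally a row lookup into W_e.
import torch

manual = embedding_layer.weight[input_ids]   # index rows 464, 6193, 318
via_layer = embedding_layer(input_ids)       # same thing through the module
print(torch.allclose(manual, via_layer))     # True
print(manual.shape)                          # torch.Size([1, 3, 768])
```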
Step 4: Embedded Input Vectors
Output:
– Shape: torch.Size([1, 3, 768]) (batch_size=1, sequence_length=3, embedding_dim=768)
– First 5 elements shown for each token.
Mathematical Explanation:
1. Embedding Output:
– The embedded sequence is:
\[
E = \begin{bmatrix}
e_{1,1} & e_{1,2} & \cdots & e_{1,768} \\
e_{2,1} & e_{2,2} & \cdots & e_{2,768} \\
e_{3,1} & e_{3,2} & \cdots & e_{3,768} \\
\end{bmatrix}
\]
– The example shows:
\[
e_1 = [-0.0626, -0.0449, 0.0559, -0.0547, -0.1171, \dots]
\]
\[
e_2 = [0.1632, 0.1023, 0.0634, 0.1102, -0.0860, \dots]
\]
\[
e_3 = [-0.0006, 0.0075, 0.0307, -0.1343, -0.1336, \dots]
\]
2. Batch Dimension:
– Since the input is a single sequence, the output has shape (1, 3, 768).
– For a batch of size \( B \), the shape would be (B, seq_len, 768).
Key Points:
– The embeddings capture semantic and syntactic features of the tokens.
– These vectors will be fed into the transformer layers for further processing.
– The embedding step is differentiable, allowing gradients to flow back during training.
Summary of Mathematical Flow
1. Text → Tokens:
\[
S \rightarrow \mathcal{T}(S) = [t_1, t_2, t_3]
\]
2. Tokens → IDs:
\[
[t_1, t_2, t_3] \rightarrow [x_1, x_2, x_3]
\]
3. IDs → Embeddings:
\[
[x_1, x_2, x_3] \rightarrow [W_e[x_1], W_e[x_2], W_e[x_3]] = [e_1, e_2, e_3]
\]
4. Final Embedding Tensor:
\[
E \in \mathbb{R}^{1 \times 3 \times 768}
\]
Why This Matters
– Discrete → Continuous: Converts words into numerical vectors.
– Semantic Similarity: Words with similar meanings have closer embeddings.
– Downstream Processing: These embeddings are the input to transformer layers for tasks like text generation or classification.
```python
# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Ensure model is in evaluation mode
model.eval()

# Input text
text = "The weather is"
log_step(1, "Original Text", text, "text")

# Tokenize with attention to special tokens
inputs = tokenizer(text, return_tensors="pt", return_attention_mask=True)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
log_step(2, "Tokenization", {
    "Tokens": tokenizer.convert_ids_to_tokens(input_ids[0]),
    "Token IDs": input_ids,
    "Attention Mask": attention_mask
}, "dict")

# Get the embedding layer
embedding_layer = model.transformer.wte
log_step(3, "Embedding Matrix", embedding_layer.weight)

# Get the embedded vectors for input token IDs
input_embeddings = embedding_layer(input_ids)
log_step(4, "Embedded Input Vectors", input_embeddings)

# Forward pass through first transformer block with detailed logging
with torch.no_grad():
    # Get model components
    block = model.transformer.h[0]
    ln1 = block.ln_1
    attn = block.attn
    mlp = block.mlp
    ln2 = block.ln_2

    # === Layer Norm 1 ===
    normed_input = ln1(input_embeddings)
    log_step(5, "LayerNorm 1 Output", normed_input)

    # === Self-Attention ===
    # Project to Q, K, V
    qkv = attn.c_attn(normed_input)
    q, k, v = torch.chunk(qkv, 3, dim=-1)
    log_step(6, "QKV Projections", {
        "Query (Q)": q,
        "Key (K)": k,
        "Value (V)": v
    }, "dict")
```
Let me break down the mathematical transformations from Step 4 (Embeddings) → Step 5 (LayerNorm) → Step 6 (QKV Projections) in detail, with clear equations and conceptual explanations.
```
==================================================
 STEP 4: EMBEDDED INPUT VECTORS
--------------------------------------------------
Shape: torch.Size([1, 3, 768])
First 5 elements of last dimension: tensor([[[-0.0626, -0.0449,  0.0559, -0.0547, -0.1171],
         [ 0.1632,  0.1023,  0.0634,  0.1102, -0.0860],
         [-0.0006,  0.0075,  0.0307, -0.1343, -0.1336]]],
       grad_fn=<SliceBackward0>)

==================================================
 STEP 5: LAYERNORM 1 OUTPUT
--------------------------------------------------
Shape: torch.Size([1, 3, 768])
First 5 elements of last dimension: tensor([[[-0.1117, -0.0560,  0.0642, -0.0921, -0.2085],
         [ 0.2921,  0.1648,  0.0607,  0.1642, -0.1441],
         [ 0.0089,  0.0305,  0.0303, -0.2418, -0.2519]]])

==================================================
 STEP 6: QKV PROJECTIONS
--------------------------------------------------
Query (Q): tensor([[[-0.8680, -1.4208,  0.2183,  ..., -0.5652, -0.9652, -0.0693],
         [ 1.3773, -0.1343, -1.3214,  ...,  1.2513, -0.0020, -0.7276],
         [ 0.7626, -1.5240, -0.9271,  ...,  0.2431, -1.1848,  0.1435]]])
Key (K): tensor([[[ 1.7470, -1.8736,  0.3305,  ..., -1.3672,  0.3262, -1.2373],
         [ 1.1331, -0.9781, -0.8564,  ...,  0.4419, -0.3670,  0.2714],
         [ 0.4763, -1.3152, -0.3092,  ..., -0.3529,  0.3742, -0.4410]]])
Value (V): tensor([[[ 0.0318, -0.0429,  0.5845,  ...,  0.1157, -0.0642, -0.4520],
         [ 0.3991,  0.1155,  0.1320,  ...,  0.0198, -0.2356,  0.1931],
         [ 0.4426, -0.4031,  0.7601,  ..., -0.1701,  0.3393, -0.2028]]])
```
Step 4: Embedded Input Vectors
Mathematical Representation
– Input: Token IDs [464, 6193, 318], embedded via the lookup table \( W_e \in \mathbb{R}^{50257 \times 768} \).
\[
E = \begin{bmatrix}
[-0.0626, -0.0449, 0.0559, \dots] & \text{(Token 1: “The”)} \\
[0.1632, 0.1023, 0.0634, \dots] & \text{(Token 2: “weather”)} \\
[-0.0006, 0.0075, 0.0307, \dots] & \text{(Token 3: “is”)}
\end{bmatrix}
\]
Key Points
✔ Raw embeddings capture initial semantic representations of tokens.
✔ Here the values come from distilgpt2’s pretrained weights; a freshly initialized model would start from random values.
✔ Next step: Normalization to stabilize training.
Step 5: LayerNorm Output
Mathematical Transformation
Applies Layer Normalization to each token’s embedding independently:
\[
\text{LayerNorm}(E_i) = \gamma \odot \frac{E_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta
\]
where:
– \( E_i \in \mathbb{R}^{768} \): Embedding of the \(i\)-th token.
– \( \mu_i, \sigma_i^2 \): Mean and variance of \( E_i \), computed over its 768 dimensions.
– \( \gamma, \beta \in \mathbb{R}^{768} \): Learnable scale and shift parameters.
– \( \epsilon \approx 10^{-5} \): Small constant for numerical stability.
Example Calculation (Token 1)
1. Compute mean and std:
\[
\mu_1 = \text{mean}([-0.0626, -0.0449, \dots]) \approx -0.042
\]
\[
\sigma_1 = \text{std}([-0.0626, -0.0449, \dots]) \approx 0.078
\]
2. Normalize and scale:
\[
\text{LayerNorm}(E_1) = \gamma \odot \frac{[-0.0626, -0.0449, \dots] + 0.042}{0.078} + \beta
\]
Result:
\[
[-0.1117, -0.0560, 0.0642, \dots]
\]
Why LayerNorm?
✔ Stabilizes training by normalizing activations.
✔ Token-wise normalization (unlike BatchNorm).
✔ Preserves sequence-length independence.
Output
\[
E_{\text{norm}} = \begin{bmatrix}
[-0.1117, -0.0560, 0.0642, \dots] \\
[0.2921, 0.1648, 0.0607, \dots] \\
[0.0089, 0.0305, 0.0303, \dots]
\end{bmatrix}
\]
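A short verification sketch, reusing `input_embeddings` and `ln1` from the walkthrough code, reproduces the LayerNorm output for the first token by hand (note the \( \mu_1, \sigma_1 \) values quoted above are rounded):

```python
# Reproduce LayerNorm manually for the first token and compare with ln1.
import torch

x = input_embeddings[0, 0]                        # embedding of "The", shape [768]
mu = x.mean()
var = x.var(unbiased=False)
manual = ln1.weight * (x - mu) / torch.sqrt(var + ln1.eps) + ln1.bias
print(torch.allclose(manual, ln1(x), atol=1e-5))  # True
```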
Step 6: QKV Projections
Mathematical Operations
Projects normalized embeddings into Query (Q), Key (K), Value (V) using learned matrices:
\[
Q = E_{\text{norm}} W_Q, \quad K = E_{\text{norm}} W_K, \quad V = E_{\text{norm}} W_V
\]
where:
– \( W_Q, W_K, W_V \in \mathbb{R}^{768 \times 768} \) (for single-head attention).
– In multi-head attention, these are split into smaller matrices.
Intuition
– Query (Q): “What am I looking for?”
– Key (K): “What information do I contain?”
– Value (V): “What should I output?”
Example Calculation (Token 1)
\[
Q_1 = [-0.1117, -0.0560, \dots] \cdot W_Q = [-0.8680, -1.4208, \dots]
\]
\[
K_1 = [-0.1117, -0.0560, \dots] \cdot W_K = [1.7470, -1.8736, \dots]
\]
\[
V_1 = [-0.1117, -0.0560, \dots] \cdot W_V = [0.0318, -0.0429, \dots]
\]
Output Shapes
All Q, K, V have shape [1, 3, 768] (the same as the input).
Key Observations
1. Q and K are used for attention scores (dot products).
2. V stores the actual content to be weighted by attention.
3. Projections are linear but enable non-linear interactions via attention.
These Q, K, V projections are the foundation of self-attention: each token gets a query (what it is looking for), a key (what it offers to others), and a value (what content it carries). They feed the scaled dot-product attention computed in the next step:
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
Outputs for our example:
\[
Q = \begin{bmatrix}
[-0.8680, -1.4208, \dots] \\
[1.3773, -0.1343, \dots] \\
[0.7626, -1.5240, \dots]
\end{bmatrix}, \quad
K = \begin{bmatrix}
[1.7470, -1.8736, \dots] \\
[1.1331, -0.9781, \dots] \\
[0.4763, -1.3152, \dots]
\end{bmatrix}, \quad
V = \begin{bmatrix}
[0.0318, -0.0429, \dots] \\
[0.3991, 0.1155, \dots] \\
[0.4426, -0.4031, \dots]
\end{bmatrix}
\]
```
==================================================
 STEP 7: ATTENTION SCORES (BEFORE SOFTMAX)
--------------------------------------------------
Shape: torch.Size([1, 3, 3])
First 5 elements of last dimension: tensor([[[21.3379,  0.7819,  2.9506],
         [ 1.0591, 23.2175, -2.8248],
         [ 5.9320,  1.0623, 22.1555]]])

==================================================
 STEP 8: ATTENTION WEIGHTS (AFTER SOFTMAX)
--------------------------------------------------
Shape: torch.Size([1, 1, 3, 3])
First 5 elements of last dimension: tensor([[[[1.0000e+00, 1.1821e-09, 1.0340e-08],
          [2.3810e-10, 1.0000e+00, 4.8974e-12],
          [9.0001e-08, 6.9081e-10, 1.0000e+00]]]])

==================================================
 STEP 9: ATTENTION OUTPUT + RESIDUAL
--------------------------------------------------
Shape: torch.Size([1, 1, 3, 768])
First 5 elements of last dimension: tensor([[[[ 0.8015, -2.0873,  3.8054, -0.1192, -0.1380],
          [ 2.5527,  1.6109,  0.4329,  0.1603, -0.1754],
          [-3.7829,  3.6026, -2.2009, -0.2250,  0.0787]]]])

==================================================
 STEP 10: LAYERNORM 2 OUTPUT
--------------------------------------------------
Shape: torch.Size([1, 1, 3, 768])
First 5 elements of last dimension: tensor([[[[ 0.0943, -0.1983,  0.4484, -0.0909, -0.1237],
          [ 0.2161,  0.2401,  0.0613,  0.1264, -0.1022],
          [-0.2260,  0.4635, -0.2279, -0.1290,  0.0470]]]])

==================================================
 STEP 11: MLP INTERMEDIATE (GELU ACTIVATION)
--------------------------------------------------
Shape: torch.Size([1, 1, 3, 3072])
First 5 elements of last dimension: tensor([[[[ 0.5316, -0.0780, -0.0685, -0.0302, -0.0143],
          [ 0.5049, -0.1657,  0.3253, -0.0413, -0.1682],
          [-0.1299, -0.0223, -0.1658, -0.1523,  0.0550]]]])

==================================================
 STEP 12: FINAL BLOCK OUTPUT
--------------------------------------------------
Shape: torch.Size([1, 1, 3, 768])
First 5 elements of last dimension: tensor([[[[ 2.2350, -2.3735,  2.1591,  2.2254,  2.8077],
          [ 2.8378,  2.3826, -0.2443, -0.2247, -2.0033],
          [-2.1502,  3.4678, -1.4818,  0.3594,  0.3218]]]])

==================================================
 STEP 13: FULL TRANSFORMER OUTPUT
--------------------------------------------------
Shape: torch.Size([1, 3, 768])
First 5 elements of last dimension: tensor([[[-0.0301,  0.3668,  0.0901,  0.2023, -0.2082],
         [ 0.5815,  0.2135,  0.1955,  0.2233, -0.8497],
         [ 0.2250,  0.5953, -0.3460,  0.1568,  0.2529]]])

==================================================
 STEP 14: RAW LOGITS FOR NEXT TOKEN PREDICTION
--------------------------------------------------
Shape: torch.Size([1, 3, 50257])
First 5 elements of last dimension: tensor([[[-31.6921, -29.4775, -31.2145, -30.9288, -31.5420],
         [-61.4353, -62.0023, -67.4475, -69.2026, -66.4335],
         [-77.1769, -78.6993, -84.0041, -84.2638, -81.8153]]])

==================================================
 STEP 15: TOP PREDICTIONS
--------------------------------------------------
Token IDs: tensor([1972, 5609,  407, 2938,  922,  257, 1016,  845,  991,  523])
Tokens: [' getting', ' changing', ' not', ' expected', ' good', ' a', ' going', ' very', ' still', ' so']
Probabilities: tensor([0.0424, 0.0402, 0.0368, 0.0302, 0.0298, 0.0291, 0.0241, 0.0240, 0.0199, 0.0165])

==================================================
 STEP 16: FINAL PREDICTION
--------------------------------------------------
Token ID: 1972
Word: getting

==================================================
 STEP 17: FULL TEXT GENERATION
--------------------------------------------------
Generating continuation...

==================================================
 STEP 18: COMPLETE GENERATED TEXT
--------------------------------------------------
The weather is getting better and more dangerous.
```
Step 7: Attention Scores
Equation:
\[
\text{Scores} = \frac{QK^\top}{\sqrt{d_k}}, \quad d_k = 768
\]
(For clarity this walkthrough treats attention as a single 768-dimensional head; distilgpt2 actually splits the 768 dimensions across 12 heads of 64 each.)
Computed Scores:
\[
\text{Scores} = \begin{bmatrix}
[21.34, 0.78, 2.95] \\
[1.06, 23.22, -2.82] \\
[5.93, 1.06, 22.16]
\end{bmatrix}
\]
Step 8: Attention Weights (Softmax)
Equation:
\[
\text{Weights} = \text{Softmax}(\text{Scores}) = \begin{bmatrix}
[1.0, 1e-9, 1e-8] \\
[2e-10, 1.0, 5e-12] \\
[9e-8, 7e-10, 1.0]
\end{bmatrix}
\]
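You can reproduce Step 8 directly from Step 7’s printed scores; because the diagonal entries dominate, the softmax comes out nearly one-hot:

```python
# Softmax over each row of the Step 7 score matrix reproduces the Step 8 weights.
import torch
import torch.nn.functional as F

scores = torch.tensor([[21.3379,  0.7819,  2.9506],
                       [ 1.0591, 23.2175, -2.8248],
                       [ 5.9320,  1.0623, 22.1555]])
weights = F.softmax(scores, dim=-1)
print(weights)              # matches the near-identity matrix printed above
print(weights.sum(dim=-1))  # each row sums to 1
```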
Step 9: Attention Output
Equation (attention context, output projection, then residual):
\[
\text{Output} = (\text{Weights} \cdot V)\,W_O + E
\]
where \( W_O \) is the attention output projection (`c_proj`) and \( E \) is the original (pre-LayerNorm) embedding added back through the residual connection.
Result:
\[
\text{Output} = \begin{bmatrix}
[0.8015, -2.0873, \dots] \\
[2.5527, 1.6109, \dots] \\
[-3.7829, 3.6026, \dots]
\end{bmatrix}
\]
Step 10-12: MLP Processing
1. LayerNorm: Normalize attention output.
2. MLP Expansion:
\[
\text{MLP}_{\text{intermediate}} = \text{GELU}(X W_1), \quad W_1 \in \mathbb{R}^{768 \times 3072}
\]
3. Projection Back + Residual:
\[
\text{Output} = \text{MLP}_{\text{intermediate}} W_2 + \text{Residual}, \quad W_2 \in \mathbb{R}^{3072 \times 768}
\]
Step 13: Final Transformer Output
Contextual Embeddings:
\[
\text{Output} = \begin{bmatrix}
[-0.0301, 0.3668, \dots] \\
[0.5815, 0.2135, \dots] \\
[0.2250, 0.5953, \dots]
\end{bmatrix}
\]
Step 14-15: Next-Token Prediction
Logits Calculation:
\[
\text{Logits} = \text{Output} W_{\text{lm\_head}}, \quad W_{\text{lm\_head}} \in \mathbb{R}^{768 \times 50257}
\]
Top Predictions for “is”:
\[
P(\text{next token}) = \text{Softmax}(\text{Logits}[-1]) \implies \text{“getting”}\ (p \approx 0.042)
\]
Step 16-18: Autoregressive Generation
Final Output:
```
The weather is getting better and more dangerous.
```
Key Mathematical Flow
1. Token → Embedding: \( X \rightarrow E = W_e[X] \)
2. LayerNorm: \( E \rightarrow E_{\text{norm}} \)
3. QKV Projections: \( E_{\text{norm}} \rightarrow Q, K, V \)
4. Attention: \( \text{Softmax}(QK^T/\sqrt{d_k}) \cdot V \)
5. MLP: \( \text{GELU}(X W_1) W_2 \)
6. Prediction: \( \text{Output} W_{\text{lm\_head}} \rightarrow \text{Softmax} \)
All matrices (\( W_e, W_Q, W_K, W_V, W_1, W_2, W_{\text{lm\_head}} \)) are learned during training. This end-to-end process enables transformers to generate coherent text.
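To tie the flow together, here is a self-contained toy version with tiny dimensions and randomly initialized weights (not distilgpt2’s trained parameters, and simplified to a single block with one attention head); it follows the same six steps:

```python
# Toy decoder flow: embedding -> LayerNorm -> QKV -> attention -> MLP -> logits.
# Weights are random and sizes are shrunk for readability.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, d, d_ff, T = 100, 16, 64, 3            # vocab, hidden, MLP, sequence length

W_e = torch.randn(V, d) * 0.02            # token embedding matrix
W_q, W_k, W_v = (torch.randn(d, d) * 0.02 for _ in range(3))
W_1, W_2 = torch.randn(d, d_ff) * 0.02, torch.randn(d_ff, d) * 0.02
W_lm = W_e                                 # weight tying: lm_head shares W_e

X = torch.tensor([4, 61, 3])               # pretend token IDs for "The weather is"

E = W_e[X]                                           # 1. tokens -> embeddings
E_norm = F.layer_norm(E, (d,))                       # 2. LayerNorm
Q, K, Vv = E_norm @ W_q, E_norm @ W_k, E_norm @ W_v  # 3. QKV projections
scores = Q @ K.T / d ** 0.5
mask = torch.tril(torch.ones(T, T)).bool()           #    causal mask
attn = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ Vv  # 4. attention
h = E + attn                                         #    residual connection
h = h + F.gelu(F.layer_norm(h, (d,)) @ W_1) @ W_2    # 5. MLP + residual
logits = F.layer_norm(h, (d,)) @ W_lm.T              # 6. project to vocabulary
print(int(logits[-1].softmax(-1).argmax()))          #    predicted next token ID
```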
Summary steps:
Step 1: Define the Use Case
Identify the purpose of your LLM. Different applications require different designs and datasets.
- General-Purpose LLM: Trained on diverse data for broad tasks (e.g., GPT, BERT).
- Domain-Specific LLM: Focused on specialized fields like legal, medical, or financial text.
- Task-Specific LLM: Designed for tasks such as summarization, translation, or sentiment analysis.
Step 2: Gather and Prepare Data
High-quality data is the backbone of any LLM.
Data Collection
- Sources: Open datasets (e.g., Common Crawl, Wikipedia), proprietary data, or domain-specific corpora.
- Quantity: A typical LLM requires hundreds of gigabytes to terabytes of text data.
Data Cleaning
- Remove duplicates, noise, and irrelevant content.
- Normalize text by converting it to lowercase, fixing encoding issues, etc.
Data Annotation
For supervised learning tasks, annotated datasets (e.g., labeled sentiment data) enhance performance.
Step 3: Build a Tokenizer
What is Tokenization?
Tokenization is the process of splitting text into smaller units, such as words, subwords, or characters.
Common Tokenization Methods
- Word Tokenization: Splits text by spaces.
- Subword Tokenization: Breaks rare words into subwords (e.g., “unbelievable” → “un”, “believable”).
- Character Tokenization: Uses individual characters as tokens.
Example Tool: Byte Pair Encoding (BPE) is widely used for subword tokenization. Libraries like Hugging Face’s Tokenizers make implementation easier; a minimal training sketch follows below.
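A minimal training sketch using the `tokenizers` library; the two-sentence corpus is a placeholder for your cleaned dataset:

```python
# Train a small BPE tokenizer from scratch with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = ["The weather is nice today.", "Language models predict the next token."]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("The weather is unbelievable").tokens)
```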
Step 4: Architect the Model
The transformer architecture is the foundation of LLMs.
Key Components of a Transformer
- Self-Attention: Captures relationships between words regardless of their position in a sentence.
- Positional Encoding: Adds information about the order of words.
- Feedforward Layers: Process outputs from the attention mechanism.
Design Choices
- Depth: Number of transformer layers.
- Width: Size of hidden layers and embedding vectors.
- Attention Heads: Number of parallel attention mechanisms.
For large-scale models, consider using a prebuilt architecture like GPT, BERT, or T5 as a blueprint.
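A sketch of how those design choices are expressed in code, using GPT-2’s architecture as the blueprint via `GPT2Config` (the sizes below are illustrative, not a recommendation):

```python
# Depth, width, and attention heads expressed as a config; the model starts
# from random weights and is ready for pretraining.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,   # must match your tokenizer
    n_positions=1024,    # maximum context length
    n_embd=512,          # width: embedding / hidden size
    n_layer=8,           # depth: number of transformer blocks
    n_head=8,            # parallel attention heads
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```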
Step 5: Choose a Training Framework
Leverage machine learning frameworks to implement your model.
Popular Frameworks
- PyTorch: Great for custom implementations.
- TensorFlow: Offers robust tools for scalability.
- Hugging Face Transformers: Provides prebuilt models and training utilities.
Step 6: Train the Model
Training an LLM is resource-intensive and requires careful planning.
Pretraining vs. Fine-Tuning
- Pretraining: Train the model on large, unlabeled datasets for general language understanding.
- Fine-Tuning: Adapt the pretrained model to specific tasks using labeled data.
Compute Resources
- Hardware: Use GPUs or TPUs for faster training.
- Distributed Training: Split the workload across multiple devices or machines.
Training Steps
- Load Data: Feed batches of tokenized text into the model.
- Backpropagation: Adjust weights using loss functions like cross-entropy.
- Optimization: Use optimizers like AdamW to minimize loss.
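A bare-bones sketch of one training step, assuming `model` and `tokenizer` are a matching Hugging Face causal-LM pair (e.g., the distilgpt2 objects loaded earlier); real pretraining adds data loaders, schedulers, gradient accumulation, and distributed training:

```python
# One causal-LM training step: forward pass with labels, backprop, AdamW update.
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=3e-4)
model.train()

batch = tokenizer(["The weather is nice today."], return_tensors="pt")
# For causal LM, labels are the input IDs; the library shifts them internally
# and computes cross-entropy loss.
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
outputs.loss.backward()   # backpropagation
optimizer.step()          # AdamW weight update
optimizer.zero_grad()
print(f"loss: {outputs.loss.item():.3f}")
```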
Step 7: Evaluate the Model
Metrics
- Perplexity: Measures how well the model predicts sequences.
- BLEU/ROUGE: Evaluates text generation quality.
- Accuracy/F1 Score: Measures performance on classification tasks.
Test Dataset
Use unseen data to assess generalization capabilities.
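A quick perplexity check under the same assumptions (a Hugging Face causal-LM `model` and matching `tokenizer`); perplexity is simply the exponential of the average cross-entropy loss on held-out text:

```python
# Perplexity = exp(mean cross-entropy) on unseen text.
import torch

model.eval()
eval_text = "The forecast says it will rain tomorrow."
enc = tokenizer(eval_text, return_tensors="pt")
with torch.no_grad():
    loss = model(input_ids=enc["input_ids"], labels=enc["input_ids"]).loss
print(f"perplexity: {torch.exp(loss).item():.2f}")
```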
Step 8: Optimize the Model
Large models often need optimization to improve efficiency.
Techniques
- Quantization: Reduce the precision of weights (e.g., float32 → int8).
- Pruning: Remove unnecessary connections.
- Distillation: Train a smaller model (student) using the outputs of the large model (teacher).
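A sketch of post-training dynamic quantization with PyTorch; it is shown on DistilBERT because its dense layers are `nn.Linear` (GPT-2’s `Conv1D` layers need a different route, such as ONNX export or bitsandbytes):

```python
# Dynamic quantization: Linear weights stored as int8, computed on the fly (CPU inference).
import torch
from transformers import AutoModel

fp32_model = AutoModel.from_pretrained("distilbert-base-uncased")
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {torch.nn.Linear}, dtype=torch.qint8
)
print(int8_model)  # Linear layers are replaced by dynamically quantized versions
```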
Step 9: Deploy the Model
An LLM’s value comes from its ability to serve real-world applications.
Serving Options
- REST APIs: Serve the model through a web interface.
- Edge Deployment: Deploy lightweight versions on devices.
- Cloud Services: Use platforms like AWS, Azure, or Google Cloud.
Scaling
Use containerization (e.g., Docker) and orchestration tools (e.g., Kubernetes) for scalability.
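A minimal REST-serving sketch with FastAPI wrapping the distilgpt2 text-generation pipeline (the file name `app.py` and the endpoint path are illustrative):

```python
# Serve the model behind a simple REST endpoint.
# Run with: uvicorn app:app --host 0.0.0.0 --port 8000  (assuming this file is app.py)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="distilgpt2")

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 20

@app.post("/generate")
def generate(prompt: Prompt):
    out = generator(prompt.text, max_new_tokens=prompt.max_new_tokens,
                    num_return_sequences=1)
    return {"generated_text": out[0]["generated_text"]}
```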
Step 10: Maintain and Update
Monitor the model’s performance and retrain it periodically with fresh data.
Best Practices
- Implement logging to track predictions and errors.
- Use feedback loops to incorporate user corrections.
Challenges and Considerations
- Cost: Training large models requires significant computational resources.
- Ethics: Ensure the model doesn’t propagate biases or generate harmful content.
- Regulations: Adhere to data privacy laws like GDPR.
```python
# Install only if not already available
# !pip install transformers torch --quiet

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F

def log_step(step_num, title, data, data_type="tensor", max_len=5):
    """Helper function for consistent logging"""
    print(f"\n{'='*50}")
    print(f" STEP {step_num}: {title.upper()}")
    print("-"*50)
    if data_type == "tensor":
        print(f"Shape: {data.shape}")
        if len(data.shape) <= 2:
            print(data)
        else:
            print(f"First {max_len} elements of last dimension:", data[..., :max_len])
    elif data_type == "text":
        print(data)
    elif data_type == "dict":
        for k, v in data.items():
            print(f"{k}: {v}")

# Load tokenizer and model
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Ensure model is in evaluation mode
model.eval()

# Input text
text = "The weather is"
log_step(1, "Original Text", text, "text")

# Tokenize with attention to special tokens
inputs = tokenizer(text, return_tensors="pt", return_attention_mask=True)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
log_step(2, "Tokenization", {
    "Tokens": tokenizer.convert_ids_to_tokens(input_ids[0]),
    "Token IDs": input_ids,
    "Attention Mask": attention_mask
}, "dict")

# Get the embedding layer
embedding_layer = model.transformer.wte
log_step(3, "Embedding Matrix", embedding_layer.weight)

# Get the embedded vectors for input token IDs
input_embeddings = embedding_layer(input_ids)
log_step(4, "Embedded Input Vectors", input_embeddings)

# Forward pass through first transformer block with detailed logging
with torch.no_grad():
    # Get model components
    block = model.transformer.h[0]
    ln1 = block.ln_1
    attn = block.attn
    mlp = block.mlp
    ln2 = block.ln_2

    # === Layer Norm 1 ===
    normed_input = ln1(input_embeddings)
    log_step(5, "LayerNorm 1 Output", normed_input)

    # === Self-Attention ===
    # Project to Q, K, V
    qkv = attn.c_attn(normed_input)
    q, k, v = torch.chunk(qkv, 3, dim=-1)
    log_step(6, "QKV Projections", {
        "Query (Q)": q,
        "Key (K)": k,
        "Value (V)": v
    }, "dict")

    # Scaled dot-product attention
    d_k = q.size(-1)
    attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    log_step(7, "Attention Scores (before softmax)", attn_scores)

    # Apply attention mask (important for accurate predictions)
    if attention_mask is not None:
        # Create extended attention mask for broadcasting
        extended_mask = (1.0 - attention_mask[:, None, None, :]) * -10000.0
        attn_scores = attn_scores + extended_mask

    attn_weights = F.softmax(attn_scores, dim=-1)
    log_step(8, "Attention Weights (after softmax)", attn_weights)

    # Context vector calculation
    context = torch.matmul(attn_weights, v)
    attn_output = attn.c_proj(context)

    # Residual connection
    attn_residual = input_embeddings + attn_output
    log_step(9, "Attention Output + Residual", attn_residual)

    # === Layer Norm 2 ===
    normed_attn = ln2(attn_residual)
    log_step(10, "LayerNorm 2 Output", normed_attn)

    # === MLP ===
    mlp_hidden = mlp.c_fc(normed_attn)
    mlp_activation = F.gelu(mlp_hidden)
    log_step(11, "MLP Intermediate (GELU activation)", mlp_activation)
    mlp_output = mlp.c_proj(mlp_activation)

    # Final residual
    block_output = attn_residual + mlp_output
    log_step(12, "Final Block Output", block_output)

    # === Pass through remaining layers ===
    # (For accurate predictions, we should process through all layers)
    transformer_output = model.transformer(input_ids=input_ids,
                                           attention_mask=attention_mask).last_hidden_state
    log_step(13, "Full Transformer Output", transformer_output)

    # === Final Layer: Project to Vocabulary ===
    logits = model.lm_head(transformer_output)
    log_step(14, "Raw Logits for Next Token Prediction", logits)

    # === Softmax over logits ===
    # Focus on last token (most relevant for next word prediction)
    last_token_logits = logits[:, -1, :]
    probs = F.softmax(last_token_logits, dim=-1)

    # Get top predictions
    top_probs, top_ids = torch.topk(probs, 10)
    log_step(15, "Top Predictions", {
        "Token IDs": top_ids[0],
        "Tokens": [tokenizer.decode([tid]) for tid in top_ids[0]],
        "Probabilities": top_probs[0]
    }, "dict")

    # === Final prediction ===
    predicted_token_id = torch.argmax(probs, dim=-1)
    predicted_word = tokenizer.decode(predicted_token_id)
    log_step(16, "Final Prediction", {
        "Token ID": predicted_token_id.item(),
        "Word": predicted_word.strip()
    }, "dict")

# === Generate full continuation ===
log_step(17, "Full Text Generation", "Generating continuation...", "text")
generated = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=20,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id
)
full_output = tokenizer.decode(generated[0], skip_special_tokens=True)
log_step(18, "Complete Generated Text", full_output, "text")
```