Implementing KV Cache from Scratch in nanoVLM: A 38% Speedup in Autoregressive Generation

Introduction

Autoregressive language models generate text one token at a time. Each new prediction requires a full forward pass through all transformer layers, leading to redundant computations.

For example, generating the next token in:
[What, is, in,] → [the]
requires recomputing attention over [What, is, in,] even though these tokens haven’t changed.

KV Caching solves this inefficiency by storing and reusing intermediate computations. In this post, we’ll:

  1. Revisit transformer attention mechanics.

  2. Identify where redundancy occurs.

  3. Implement KV Caching in nanoVLM (a minimal VLM built with PyTorch).

  4. Benchmark the speedup (~38% improvement in generation speed).

[Figure: bar plot showing the improvement in generation speed]

Revisiting Transformer Attention

A transformer layer consists of:

  • Multi-head self-attention

  • Feed-forward network (MLP)

  • Residual connections & layer norm

Self-attention computes:

  • Queries (Q), Keys (K), and Values (V) from input embeddings.

  • Attention scores via softmax(QKᵀ / √dₖ).

  • Output as a weighted sum of V.

Here’s a minimal PyTorch sketch of the attention computation (single-head and causal for simplicity; nanoVLM itself uses grouped multi-head attention):
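```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention: softmax(QK^T / sqrt(d_k)) V."""
    # x: (batch, seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices
    Q, K, V = x @ w_q, x @ w_k, x @ w_v
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5            # (batch, seq_len, seq_len)
    seq_len = x.size(1)
    causal = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))      # each token sees only the past
    return F.softmax(scores, dim=-1) @ V                     # weighted sum of the values

# Toy usage: four tokens, e.g. "What is in the"
x = torch.randn(1, 4, 32)
w_q, w_k, w_v = (torch.randn(32, 32) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                       # (1, 4, 32)
```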

[Figure: diagram of autoregressive token generation]

Where Redundancy Creeps In

During autoregressive generation:

  1. The model predicts tᵢ₊₁ given [t₀...tᵢ].

  2. At each step, it recomputes K and V for the entire sequence—even though only the newest token changes.

Example: when predicting token 4, the K and V for tokens 1–3 were already computed at the previous step (and could simply be reused), yet they are recomputed from scratch — unnecessary recomputation.

How KV Caching Fixes It

Instead of recomputing K and V for the entire sequence:

  1. Cache K and V after the first pass.

  2. For new tokens, compute only the latest K_new and V_new.

  3. Concatenate them with the cached values (see the snippet below).
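In tensor terms, the update is just a concatenation along the sequence dimension. Here is a tiny illustrative snippet (shapes and variable names are ours, not nanoVLM’s):

```python
import torch

batch, cached_len, d_k = 1, 3, 8                 # e.g. "What is in" already processed
k_cache = torch.randn(batch, cached_len, d_k)    # keys cached from previous steps
v_cache = torch.randn(batch, cached_len, d_k)    # values cached from previous steps

k_new = torch.randn(batch, 1, d_k)               # K projection of the newest token only
v_new = torch.randn(batch, 1, d_k)               # V projection of the newest token only

k = torch.cat([k_cache, k_new], dim=1)           # (1, 4, 8): keys for the whole sequence
v = torch.cat([v_cache, v_new], dim=1)           # (1, 4, 8): values for the whole sequence
# The new token's query then attends over k and v as usual --
# the earlier tokens are never re-projected.
```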

Key Insight:

  • Prefill Phase: Process the full prompt and populate the cache.

  • Decode Phase: Generate tokens incrementally using cached K/V (sketched below).
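In code, the two phases look roughly like the loop below. This is a sketch under assumptions: the model is taken to expose a forward(tokens, kv_cache, start_pos) call that returns logits plus the updated cache, and greedy decoding is used; nanoVLM’s actual generate method differs in its details.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens):
    """Two-phase generation: prefill fills the cache, decode reuses it token by token."""
    # Prefill: a single forward pass over the full prompt populates the KV cache.
    logits, kv_cache = model(prompt_ids, kv_cache=None, start_pos=0)
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    # Decode: feed only the newest token; cached K/V cover everything before it.
    pos = prompt_ids.size(1)
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model(next_token, kv_cache=kv_cache, start_pos=pos)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        pos += 1

    return torch.cat(generated, dim=1)
```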

Implementing KV Cache in nanoVLM

We modified three components (a simplified sketch follows the list):

1. Attention Block (LanguageModelGroupedAttention)

2. Layer-Wise Cache Tracking (LanguageModel)

3. Two-Phase Generation (VisionLanguageModel)
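As a rough picture of the first two changes, the sketch below shows a cache-aware attention block and a model that keeps one (K, V) entry per layer. It is deliberately simplified (single head, no rotary embeddings, invented class names) and is not the actual LanguageModelGroupedAttention or LanguageModel code; the VisionLanguageModel change corresponds to the two-phase generate loop sketched earlier.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Simplified stand-in for a cache-aware attention block."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        # x holds only the new token(s); the cache covers everything before them.
        # NOTE: causal masking for multi-token prefill is omitted for brevity.
        q = self.q_proj(x)
        k, v = self.k_proj(x), self.v_proj(x)
        if kv_cache is not None:
            k = torch.cat([kv_cache[0], k], dim=1)   # append along the sequence axis
            v = torch.cat([kv_cache[1], v], dim=1)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        return self.out_proj(out), (k, v)            # hand the updated cache back up

class TinyLanguageModel(nn.Module):
    """Keeps one (K, V) cache entry per layer, mirroring the layer-wise bookkeeping."""
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.blocks = nn.ModuleList([CachedSelfAttention(d_model) for _ in range(n_layers)])

    def forward(self, x, block_kv_cache=None):
        if block_kv_cache is None:
            block_kv_cache = [None] * len(self.blocks)
        new_cache = []
        for block, cache in zip(self.blocks, block_kv_cache):
            x, cache = block(x, cache)
            new_cache.append(cache)
        return x, new_cache
```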

Results & Takeaways

  • ✅ 38% faster generation (benchmarked on nanoVLM).
  • ✅ Memory-efficient (the cache grows only linearly with sequence length).
  • ✅ Position-aware (correct rotary embeddings via start_pos).

Trade-offs:

  • Slightly more complex code.

  • Restricts some advanced inference methods (e.g., beam search).

Conclusion

KV Caching is a game-changer for autoregressive models. By eliminating redundant computations, it enables faster, longer, and more efficient generation—critical for real-world applications.

Let us know your thoughts in the comments!
