A Deep Dive into Modern Vision Architectures: ViTs, Mamba Layers, STORM, SigLIP, and Qwen


Introduction

As the AI landscape rapidly evolves, vision architectures are undergoing a revolution. We've moved beyond CNNs into the age of Vision Transformers (ViTs), contrastive vision-language models like SigLIP, long-sequence state space models such as Mamba, and powerful multimodal models like Qwen-VL. Then there's STORM, a newer hybrid that combines selective token retention, Mamba layers, and cross-layer memory.

This blog walks you through:

  1. Vision Transformers (ViTs) – Core architecture

  2. Mamba Layers – State space models for long-sequence efficiency

  3. STORM – A hybrid transformer architecture

  4. SigLIP – Efficient CLIP replacement

  5. Qwen-VL – Open-source multimodal giant

  6. Full architectural breakdowns and how to build them

Let’s begin.


1. Vision Transformers (ViTs): The Foundation


Architecture

  1. Patch Embedding: Split an image (e.g., 224×224) into fixed-size patches (e.g., 16×16); a short worked example follows this list.

  2. Linear Projection: Each patch is flattened and linearly projected into a vector.

  3. Positional Embedding: Add positional info to preserve spatial structure.

  4. Transformer Encoder Blocks:

    • Multi-head self-attention

    • LayerNorm

    • MLP with GELU activation

    • Skip connections

  5. Classification Head: [CLS] token is passed to an MLP head for classification.
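
To make steps 1–2 concrete, here is a quick worked example in PyTorch (tensor names are illustrative): a 224×224 RGB image cut into 16×16 patches yields (224/16)² = 196 patches of 16·16·3 = 768 values each, which are then linearly projected.

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                        # one RGB image
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)        # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 768)
project = nn.Linear(768, 768)                            # step 2: linear projection
tokens = project(patches)                                # (1, 196, 768) patch tokens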

Key Equations

Q, K, V = Linear(X)
Attention(X) = Softmax(QKᵀ / √d_k) V
X′ = X + MSA(LayerNorm(X))          # attention sub-block (pre-norm, with residual)
Output = X′ + MLP(LayerNorm(X′))    # MLP sub-block (pre-norm, with residual)
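
As a minimal sketch of one encoder block implementing these equations (the class name and sizes are ours, not from a specific library):

import torch.nn as nn

class ViTBlock(nn.Module):
    # One pre-norm encoder block: X′ = X + MSA(LN(X)), Output = X′ + MLP(LN(X′))
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                    # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # multi-head self-attention + residual
        return x + self.mlp(self.norm2(x))   # MLP + residual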

2. Mamba Layers: Sequence Modeling Revolution


Mamba (from Albert Gu and Tri Dao, of CMU and Princeton) introduces a selective state space model for high-throughput, long-sequence modeling.

Why It Matters

  • Efficient for long sequences

  • Linear time complexity with sequence length

  • Ideal for time-series, audio, video, and vision

Architecture

Each Mamba layer processes inputs using:

  1. A learned input projection

  2. A selective scan mechanism: a recurrent state-space update whose parameters depend on the input

  3. Gating for dynamic selection

  4. Normalization + residuals

⛓️ How It Works

def mamba_block(x):
    x_proj = input_proj(x)                    # 1. learned input projection
    memory_out = scan_over_sequence(x_proj)   # 2. selective state-space scan
    gated = gate(memory_out)                  # 3. input-dependent gating
    return residual_connection(x, gated)      # 4. residual connection + normalization
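
For intuition, here is a minimal, unoptimized reference of the selective-scan recurrence hidden inside scan_over_sequence above; the real Mamba kernel fuses this into a parallel scan on GPU, and the shapes here are illustrative.

import torch

def selective_scan(x, A, B, C, delta):
    # x: (batch, length, dim); A: (dim, n); B, C: (batch, length, n); delta: (batch, length, dim)
    # Recurrence: h_t = exp(delta_t * A) * h_{t-1} + (delta_t * B_t) * x_t ;  y_t = C_t · h_t
    batch, length, dim = x.shape
    h = torch.zeros(batch, dim, A.shape[-1], device=x.device)
    ys = []
    for t in range(length):
        dA = torch.exp(delta[:, t].unsqueeze(-1) * A)                         # input-dependent decay
        dBx = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
        h = dA * h + dBx                                                      # update hidden state
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))                         # read out (batch, dim)
    return torch.stack(ys, dim=1)                                             # (batch, length, dim)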

Use in Vision:

Mamba layers can replace the attention blocks in a ViT, or be hybridized with attention (as in STORM).


3. ⚡ STORM: Speed Meets Scale


STORM (Selective Token Retention and Mamba) is a hybrid architecture that combines:

  • Selective attention: Keeps only important tokens (like TokenLearner).

  • Mamba layers: For long-range sequence efficiency.

  • Cross-layer memory: Preserves contextual info between layers.

Architecture Step-by-Step

  1. Patchify input image

  2. Project patches + add positional embeddings

  3. Alternate blocks:

    • Token Retention module: reduces the token count dynamically (a sketch follows this list)

    • Mamba Layer: scan across reduced tokens

  4. Global memory pool updated across layers

  5. Final aggregation + classification head
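
A minimal sketch of the token-retention idea (our own simplified version, not the paper's exact module): score each token with a small learned head and keep only the top-k.

import torch.nn as nn

class TokenRetention(nn.Module):
    # TokenLearner-style pruning: keep the k highest-scoring tokens.
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)            # per-token importance score
        self.keep_ratio = keep_ratio

    def forward(self, x):                          # x: (B, N, D)
        scores = self.scorer(x).squeeze(-1)        # (B, N)
        k = max(1, int(x.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices        # indices of retained tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return x.gather(1, idx)                    # (B, k, D)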

✅ Benefits

  • Fewer tokens → faster inference

  • Handles longer input contexts (e.g., video frames, high-res images)

  • Preserves accuracy by keeping informative tokens


4. SigLIP: Efficient CLIP-style Vision-Language Learning


SigLIP (by Google Research) modifies CLIP’s contrastive loss:

  • Uses a pairwise sigmoid loss (binary cross-entropy per image-text pair) instead of the batch-wise softmax contrastive loss.

  • Improves stability and training efficiency.

How SigLIP Works

  1. Encode the image and text independently with a ViT image encoder and a Transformer text encoder.

  2. Normalize embeddings

  3. Compute pairwise similarity matrix

  4. Apply sigmoid loss (binary cross-entropy with labels as 1 if match, else 0).

SigLIP Loss

loss = BCEWithLogits(similarity_matrix, match_labels)
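
Spelled out a bit more, the pairwise sigmoid loss from the SigLIP paper looks like the sketch below (variable names are ours; t is the learnable log-temperature and b the learnable bias):

import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    # img_emb, txt_emb: (N, D), L2-normalized; t: log-temperature, b: bias (learnable scalars)
    logits = img_emb @ txt_emb.T * t.exp() + b                          # (N, N) pairwise similarities
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1    # +1 on matches, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)        # averaged over images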

✅ Advantages

  • No batch-wide softmax normalization or negative mining needed

  • More robust to noise

  • Compatible with massive datasets (LAION, CC3M)


5. Qwen-VL: Open Source Multimodal Powerhouse


Qwen-VL is Alibaba’s open multimodal LLM, extending Qwen-7B with:

  • Vision encoder: an OpenCLIP ViT-bigG initialized from pre-trained weights

  • Multimodal Adapter: projects (and compresses) vision embeddings into the LLM token space (a sketch follows below)

  • Qwen LLM Decoder: generates answers conditioned on the vision tokens inserted into its input sequence

Qwen Architecture

Image → Vision Encoder → Adapter → vision tokens
Text Prompt → Tokenizer → text tokens
[vision tokens; text tokens] → Qwen LLM → response

Supports vision-language instruction tuning and few-shot multimodal prompting.
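
The adapter can be pictured with a small sketch: a fixed set of learnable queries cross-attends to the patch features and emits a short sequence of tokens in the LLM's embedding space. The class name and layer sizes below are illustrative, not the released model's exact configuration.

import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    # Learnable queries compress a variable number of patch features into a fixed number of LLM tokens.
    def __init__(self, vision_dim=1664, llm_dim=4096, num_queries=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, heads, batch_first=True)

    def forward(self, patch_feats):                    # (B, N_patches, vision_dim)
        kv = self.kv_proj(patch_feats)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)                  # cross-attention: queries attend to patches
        return out                                     # (B, num_queries, llm_dim) vision tokens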


How to Build These Architectures

Let’s outline steps for each:

✅ Vision Transformer

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)  # patchify + project
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, activation="gelu",
                                           batch_first=True, norm_first=True)
        self.transformer_blocks = nn.TransformerEncoder(block, depth)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.transformer_blocks(x)
        return self.cls_head(x[:, 0])                        # CLS token

✅ Mamba Layer

Use the official state-spaces/mamba package (pip install mamba-ssm):

import torch.nn as nn
from mamba_ssm import Mamba   # the package exposes Mamba, not MambaBlock

class VisionMamba(nn.Module):
    def __init__(self, dim=768, depth=12):
        super().__init__()
        self.blocks = nn.Sequential(
            *[Mamba(d_model=dim) for _ in range(depth)]
        )

    def forward(self, x):         # x: (B, num_patches, dim) patch tokens
        return self.blocks(x)

✅ STORM

  • Use a TokenLearner-style module to dynamically drop less important tokens (a minimal block sketch follows this list)

  • Insert Mamba or attention blocks

  • Maintain memory vector updated each layer
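
Putting those pieces together, one alternating block might look like the sketch below (our own composition using the real mamba-ssm package; the cross-layer memory pool is omitted for brevity):

import torch.nn as nn
from mamba_ssm import Mamba    # pip install mamba-ssm

class STORMBlock(nn.Module):
    # One alternating block: prune tokens, then mix the survivors with a Mamba layer.
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio
        self.norm = nn.LayerNorm(dim)
        self.mixer = Mamba(d_model=dim)

    def forward(self, x):                                          # x: (B, N, D)
        k = max(1, int(x.size(1) * self.keep_ratio))
        idx = self.scorer(x).squeeze(-1).topk(k, dim=1).indices
        x = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return x + self.mixer(self.norm(x))                        # residual Mamba over kept tokens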

✅ SigLIP

  • Use CLIP-style architecture

  • Modify the loss to:

loss = torch.nn.BCEWithLogitsLoss()(similarity_matrix, labels)

✅ Qwen-VL

Use the Hugging Face checkpoints or Alibaba's official repo:

# Conceptual flow (pseudocode, not a specific library API):
image_embeds = vision_encoder(image)            # patch features from the vision encoder
vision_tokens = adapter(image_embeds)           # compress/project into LLM token space
output = qwen_llm.generate([vision_tokens, prompt_tokens])   # decode conditioned on both
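
For an end-to-end example, the released Qwen-VL-Chat checkpoint can be driven roughly as in its model card; the helpers below (from_list_format, chat) come from the repo's custom code loaded via trust_remote_code, and the image URL is a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},      # placeholder image
    {"text": "What is shown in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)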

Final Thoughts

These architectures—ViTs, Mamba, STORM, SigLIP, Qwen—are shaping the future of efficient, scalable vision and multimodal understanding. Whether you’re building research-grade models or production ML systems, understanding their internals will let you innovate and optimize.


Bonus: Combine Them!

Imagine building a pipeline where:

  • STORM encodes long-form video

  • Mamba compresses tokens for efficient modeling

  • SigLIP aligns vision and language

  • Qwen decodes responses from rich vision-text input

This is not science fiction. This is today.
