Vision Language Models

Advancing Vision-Language Models: New Alignment Methods in TRL

Vision-Language Models (VLMs) are becoming more capable, but aligning them with human preferences remains crucial. Previously, we introduced Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for VLMs in TRL. Now we're pushing further with three advanced alignment techniques: Mixed Preference Optimization (MPO), which combines DPO, SFT, and a quality loss for better reasoning; Group Relative Policy…
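The MPO idea mentioned above, combining a DPO preference term, an SFT generation term, and a per-response quality term into one weighted loss, can be sketched with plain scalars. This is not the TRL implementation: the function name, the weights, and the BCO-style quality term below are illustrative assumptions, and the inputs are assumed to be per-sequence log-probabilities under the policy and a frozen reference model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
             beta=0.1, w_dpo=0.8, w_sft=1.0, w_quality=0.2):
    """Illustrative MPO-style loss: weighted sum of DPO + SFT + quality terms.

    Inputs are sequence log-probs of the chosen/rejected responses under the
    policy and a frozen reference model. Weights are made-up examples.
    """
    # DPO (preference) term: -log sigmoid of the implicit reward margin.
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    dpo = -math.log(sigmoid(margin))
    # SFT (generation) term: negative log-likelihood of the chosen response.
    sft = -pol_chosen
    # Quality term (BCO-style sketch): score each response on its own,
    # pushing the chosen response up and the rejected one down.
    q_chosen = -math.log(sigmoid(beta * (pol_chosen - ref_chosen)))
    q_rejected = -math.log(sigmoid(-beta * (pol_rejected - ref_rejected)))
    quality = 0.5 * (q_chosen + q_rejected)
    return w_dpo * dpo + w_sft * sft + w_quality * quality
```

When the policy assigns a higher log-prob to the chosen response than to the rejected one (relative to the reference), all three terms shrink, which is the behavior the combined objective is after.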

Read More
VISION MODELS

A Deep Dive into Modern Vision Architectures: ViTs, Mamba Layers, STORM, SigLIP, and Qwen

Introduction: As the AI landscape rapidly evolves, vision architectures are undergoing a revolution. We've moved beyond CNNs into the age of Vision Transformers (ViTs), contrastive vision-language encoders like SigLIP, long-sequence state-space models such as Mamba, and powerful multimodal models like Qwen-VL. Then there's STORM, a new architecture combining selective attention, token reduction, and memory. This blog walks you…
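The common first step shared by the ViT-family models in that lineup is patchification: the image is cut into fixed-size patches that become the transformer's input tokens. A minimal NumPy sketch (patch size and image shape are illustrative, not taken from any specific model):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    the tokenization step at the front of a Vision Transformer."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    # Reshape into a grid of patches, then flatten each patch to one vector.
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))
    return patches  # shape: (num_patches, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)
# A 224x224 image with 16x16 patches yields a 14x14 grid: 196 tokens,
# each of dimension 16*16*3 = 768, before the linear projection.
```

In a real ViT each flattened patch is then linearly projected and given a positional embedding; the sketch stops at the token grid itself.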

Read More