Vision Language Models

Advancing Vision-Language Models: New Alignment Methods in TRL

Vision-Language Models (VLMs) are becoming more capable, but aligning them with human preferences remains crucial. Previously, we introduced Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for VLMs in TRL. Now we're pushing further with three advanced alignment techniques: Mixed Preference Optimization (MPO), which combines DPO, SFT, and a quality loss for better reasoning; Group Relative Policy…
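The MPO idea mentioned above, combining a DPO preference term, an SFT generation term, and a per-response quality term into one weighted loss, can be sketched with plain scalars. This is not the TRL implementation: the function name, the weights, and the BCO-style quality term below are illustrative assumptions, and the inputs are assumed to be per-sequence log-probabilities under the policy and a frozen reference model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
             beta=0.1, w_dpo=0.8, w_sft=1.0, w_quality=0.2):
    """Illustrative MPO-style loss: weighted sum of DPO + SFT + quality terms.

    Inputs are sequence log-probs of the chosen/rejected responses under the
    policy and a frozen reference model. Weights are made-up examples.
    """
    # DPO (preference) term: -log sigmoid of the implicit reward margin.
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    dpo = -math.log(sigmoid(margin))
    # SFT (generation) term: negative log-likelihood of the chosen response.
    sft = -pol_chosen
    # Quality term (BCO-style sketch): score each response on its own,
    # pushing the chosen response up and the rejected one down.
    q_chosen = -math.log(sigmoid(beta * (pol_chosen - ref_chosen)))
    q_rejected = -math.log(sigmoid(-beta * (pol_rejected - ref_rejected)))
    quality = 0.5 * (q_chosen + q_rejected)
    return w_dpo * dpo + w_sft * sft + w_quality * quality
```

When the policy assigns a higher log-prob to the chosen response than to the rejected one (relative to the reference), all three terms shrink, which is the behavior the combined objective is after.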

Read More
VISION MODELS

A Deep Dive into Modern Vision Architectures: ViTs, Mamba Layers, STORM, SigLIP, and Qwen

Introduction: As the AI landscape rapidly evolves, vision architectures are undergoing a revolution. We've moved beyond CNNs into the age of Vision Transformers (ViTs), contrastive vision-language encoders like SigLIP, long-sequence state-space models such as Mamba, and powerful multimodal models like Qwen-VL. Then there's STORM, a new architecture combining selective attention, token reduction, and memory. This blog walks you…
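The common first step shared by the ViT-family models in that lineup is patchification: the image is cut into fixed-size patches that become the transformer's input tokens. A minimal NumPy sketch (patch size and image shape are illustrative, not taken from any specific model):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    the tokenization step at the front of a Vision Transformer."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    # Reshape into a grid of patches, then flatten each patch to one vector.
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))
    return patches  # shape: (num_patches, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)
# A 224x224 image with 16x16 patches yields a 14x14 grid: 196 tokens,
# each of dimension 16*16*3 = 768, before the linear projection.
```

In a real ViT each flattened patch is then linearly projected and given a positional embedding; the sketch stops at the token grid itself.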

Read More