Advancing Vision-Language Models: New Alignment Methods in TRL

Beyond DPO: New Multimodal Alignment Methods in TRL

Vision-Language Models (VLMs) are becoming more capable, but aligning them with human preferences remains crucial. Previously, we introduced Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for VLMs in TRL. Now, we’re pushing further with three advanced alignment techniques:

1️⃣ Mixed Preference Optimization (MPO) – Combines DPO, SFT, and a quality loss for better reasoning
2️⃣ Group Relative Policy Optimization (GRPO) – Optimizes over response groups for robustness
3️⃣ Group Sequence Policy Optimization (GSPO) – Stable sequence-level updates (great for MoE models)

All three are now available in TRL!


Why These Methods Matter

Traditional alignment for VLMs involves:
1️⃣ SFT – Teach the model to follow instructions
2️⃣ DPO – Optimize preferences between pairs of responses

But DPO has limitations:

  • Struggles with multi-aspect preferences (e.g., correctness + coherence)

  • Can lead to repetitive or incoherent reasoning

The three methods below address these limitations.


Deep Dive: The Three New Techniques

1️⃣ Mixed Preference Optimization (MPO)

Problem: DPO-aligned VLMs sometimes fail at complex reasoning tasks.
Solution: MPO combines three losses:
✔ DPO (sigmoid loss) – Optimizes pairwise preferences
✔ Binary Classifier Optimization (BCO) – Ensures response quality
✔ SFT loss – Maintains instruction-following ability

Result: +6.2 pts improvement on MathVista!

How to Use MPO in TRL
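
In TRL, MPO is run through DPOTrainer by mixing several loss types in one config. The snippet below is a minimal sketch, assuming a recent TRL release where DPOConfig accepts a list for `loss_type` together with matching `loss_weights`; the model, dataset, and hyperparameters are illustrative, not a tuned recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # any VLM supported by DPOTrainer
model = AutoModelForImageTextToText.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Example image preference dataset with prompt / chosen / rejected columns.
dataset = load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train")

training_args = DPOConfig(
    output_dir="qwen2.5-vl-mpo",
    # MPO = weighted mix of preference (sigmoid), quality (BCO), and SFT losses
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],  # illustrative split; tune for your task
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()
```

The `loss_weights` entries set the relative contribution of each term, so you can lean more heavily on the preference signal or on plain SFT depending on how your model behaves.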

2️⃣ Group Relative Policy Optimization (GRPO)

Problem: DPO is noise-sensitive: a single bad preference pair can misguide training.
Solution: GRPO optimizes over groups of responses, averaging noise.

Key Features:
✔ Batch-level policy updates (more stable than per-sample)
✔ Better generalization (learns broader “good response” patterns)
✔ Originally used in DeepSeek Math & DeepSeek R1

GRPO Training in TRL
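
Below is a minimal sketch of a GRPO run. It assumes a TRL release with VLM support in GRPOTrainer (as described in this post); the dataset name is a placeholder for any prompt dataset (add an "image" column for multimodal prompts), and the length-based reward is a toy stand-in for a real, task-specific reward such as answer checking or a reward model.

```python
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import GRPOConfig, GRPOTrainer

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder dataset: anything with a "prompt" column works; add an "image"
# column for multimodal prompts.
dataset = load_dataset("your-org/your-vlm-prompt-dataset", split="train")

def toy_reward(completions, **kwargs):
    # Toy reward that prefers answers around 100 characters. Completions are
    # message lists for conversational prompts, plain strings otherwise.
    texts = [c[-1]["content"] if isinstance(c, list) else c for c in completions]
    return [-abs(len(t) - 100) / 100 for t in texts]

training_args = GRPOConfig(
    output_dir="qwen2.5-vl-grpo",
    num_generations=8,           # size of the response group per prompt
    max_completion_length=256,
    bf16=True,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=toy_reward,
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()
```

Because rewards are normalized within each group of `num_generations` samples, one noisy generation has far less influence than one noisy preference pair does in DPO.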

3️⃣ Group Sequence Policy Optimization (GSPO)

Problem: GRPO computes importance weights per token, which can be unstable.
Solution: GSPO (from the Qwen team) computes them at the sequence level, which makes it well suited to Mixture-of-Experts (MoE) models.

Enabling GSPO in GRPOTrainer
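
GSPO reuses GRPOTrainer; only the config changes. The sketch below assumes a TRL release where GRPOConfig exposes an `importance_sampling_level` option; the clipping values follow the ranges suggested in the GSPO paper and are worth validating on your own setup.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2.5-vl-gspo",
    importance_sampling_level="sequence",  # "token" is the GRPO default
    epsilon=3e-4,        # tighter clipping range, as suggested in the GSPO paper
    epsilon_high=4e-4,
    num_generations=8,
    max_completion_length=256,
    bf16=True,
)

# Pass `training_args` to GRPOTrainer exactly as in the GRPO example above.
```

With sequence-level weighting, the importance ratio is computed once per completion rather than once per token, which avoids the per-token variance that tends to destabilize large MoE runs.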

⚡ vLLM Integration for Faster Training

TRL now supports vLLM for on-the-fly generation during alignment:

  • Colocate mode (shared GPU)

  • Server mode (separate process)

Example:
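
A sketch of both modes, assuming a TRL release where GRPOConfig exposes `use_vllm` and `vllm_mode`; the memory setting is illustrative and should be adapted to your hardware.

```python
from trl import GRPOConfig

# Colocate mode: vLLM shares the training GPUs inside the same process.
colocate_args = GRPOConfig(
    output_dir="qwen2.5-vl-grpo-vllm",
    use_vllm=True,
    vllm_mode="colocate",
    vllm_gpu_memory_utilization=0.3,  # leave headroom for training tensors
)

# Server mode: start a standalone generation server first, e.g.
#   trl vllm-serve --model Qwen/Qwen2.5-VL-3B-Instruct
# then let the trainer connect to it:
server_args = GRPOConfig(
    output_dir="qwen2.5-vl-grpo-vllm",
    use_vllm=True,
    vllm_mode="server",
)
```

Colocate mode is the simplest to set up on a single node; server mode keeps generation in a separate process, which helps when training already saturates GPU memory.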

Key Takeaways

✅ MPO = Better reasoning (DPO + SFT + BCO)
✅ GRPO = Noise-resistant group optimization
✅ GSPO = Stable sequence-level updates (great for MoE)
✅ vLLM = Faster generation during training

Explore the notebooks and start aligning your VLMs today!
