Beyond DPO: New Multimodal Alignment Methods in TRL
Vision-Language Models (VLMs) are becoming more capable, but aligning them with human preferences remains crucial. Previously, we introduced Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for VLMs in TRL. Now, we’re pushing further with three advanced alignment techniques:
Mixed Preference Optimization (MPO) – Combines DPO, SFT, and quality loss for better reasoning
Group Relative Policy Optimization (GRPO) – Optimizes over response groups for robustness
Group Sequence Policy Optimization (GSPO) – Stable sequence-level updates (great for MoE models)
All three are now available in TRL!
Why These Methods Matter
Traditional alignment for VLMs involves:
1️⃣ SFT – Teach the model to follow instructions
2️⃣ DPO – Optimize preferences between pairs of responses
But DPO has limitations:
- Struggles with multi-aspect preferences (e.g., correctness + coherence)
- Can lead to repetitive or incoherent reasoning
Our new methods solve these issues!
Deep Dive: The Three New Techniques
1️⃣ Mixed Preference Optimization (MPO)
Problem: DPO-aligned VLMs sometimes fail at complex reasoning tasks.
Solution: MPO combines three losses:
✔ DPO (sigmoid loss) – Optimizes pairwise preferences
✔ Binary Classifier Optimization (BCO) – Ensures response quality
✔ SFT loss – Maintains instruction-following ability
Result: +6.2 pts improvement on MathVista!
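Under the hood, the MPO objective is simply a weighted sum of those three losses. A minimal sketch of that combination, using the same weights as the config below (illustrative only, not TRL internals):

```python
def mpo_loss(dpo_loss, bco_loss, sft_loss, weights=(0.8, 0.2, 1.0)):
    # Preference term + quality term + instruction-following term,
    # mirroring loss_type=["sigmoid", "bco_pair", "sft"] and loss_weights=[0.8, 0.2, 1.0]
    w_dpo, w_bco, w_sft = weights
    return w_dpo * dpo_loss + w_bco * bco_loss + w_sft * sft_loss

print(mpo_loss(0.65, 0.40, 1.20))  # 0.8*0.65 + 0.2*0.40 + 1.0*1.20 ≈ 1.8
```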
How to Use MPO in TRL
```python
from trl import DPOConfig, DPOTrainer

mpo_config = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # MPO components
    loss_weights=[0.8, 0.2, 1.0],              # Weighting scheme from the MPO paper
    # ... other DPO settings
)

mpo_trainer = DPOTrainer(
    model=model_id,
    args=mpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
mpo_trainer.train()
```
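MPO consumes the same preference data as DPO: each row pairs a prompt (and, for VLMs, its image) with a chosen and a rejected response. A hypothetical row, just to show the shape (field names follow TRL's preference-dataset convention; adapt them to your dataset):

```python
from PIL import Image

# Hypothetical preference example; the "images" field carries the picture(s) the prompt refers to
example = {
    "images": [Image.new("RGB", (64, 64))],  # stand-in for the real image
    "prompt": "What trend does the chart show?",
    "chosen": "Revenue rises steadily from 2019 to 2023.",
    "rejected": "The chart shows nothing in particular.",
}
```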
2️⃣ Group Relative Policy Optimization (GRPO)
Problem: DPO is noise-sensitive—single bad samples can misguide training.
Solution: GRPO optimizes over groups of responses, averaging noise.
Key Features:
✔ Batch-level policy updates (more stable than per-sample)
✔ Better generalization (learns broader “good response” patterns)
✔ Originally used in DeepSeek Math & DeepSeek R1
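The "group relative" part is straightforward to picture: several completions are sampled per prompt, each is scored by the reward functions, and its advantage is the reward standardized against the rest of its group. A rough sketch of that idea (illustrative, not TRL's internal code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, group_size) rewards for completions of the same prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # above-average completions get positive advantage

# Four completions sampled for one prompt, scored by reward functions like the ones below
rewards = torch.tensor([[1.0, 0.0, 2.0, 1.0]])
print(group_relative_advantages(rewards))  # ~[[0.0, -1.22, 1.22, 0.0]]
```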
GRPO Training in TRL
```python
import re

from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Check response structure: <think>...</think> followed by <answer>...</answer>
    return [
        1.0 if re.match(r"<think>.*?</think>.*<answer>.*?</answer>", c, re.DOTALL) else 0.0
        for c in completions
    ]

def accuracy_reward(completions, solutions, **kwargs):
    # Verify correctness against the reference solutions (verify() is your answer checker)
    return [
        1.0 if verify(answer, solution) else 0.0
        for answer, solution in zip(completions, solutions)
    ]

grpo_trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, accuracy_reward],  # Multiple reward signals!
    args=GRPOConfig(
        learning_rate=1e-5,
        # ... other GRPO settings
    ),
    train_dataset=train_dataset,
)
grpo_trainer.train()
```
3️⃣ Group Sequence Policy Optimization (GSPO)
Problem: GRPO computes importance weights per-token, which can be unstable.
Solution: GSPO (from Qwen) uses sequence-level weighting—ideal for Mixture-of-Experts (MoE) models.
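Concretely, GRPO forms one importance ratio per token, while GSPO forms a single ratio per sequence from the length-averaged log-probability difference between the current and old policy. A toy illustration of the two weighting schemes (not the TRL implementation; the numbers are made up):

```python
import torch

# Per-token log-probs of one sampled completion under the current and old policies (made-up values)
logp_new = torch.tensor([-0.9, -1.2, -0.4, -2.0])
logp_old = torch.tensor([-1.0, -1.0, -0.5, -1.8])

# GRPO: one importance ratio per token
token_ratios = torch.exp(logp_new - logp_old)
print(token_ratios)  # ~[1.105, 0.819, 1.105, 0.819]

# GSPO: a single sequence-level ratio from the length-averaged log-ratio
seq_ratio = torch.exp((logp_new - logp_old).mean())
print(seq_ratio)  # ~0.951
```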
Enabling GSPO in GRPOTrainer
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",  # Key difference!
    epsilon=3e-4,                          # From the Qwen paper
    loss_type="grpo",
    # ... other GRPO settings
)
```
⚡ vLLM Integration for Faster Training
TRL now supports vLLM for on-the-fly generation during alignment:
- Colocate mode (vLLM shares the training GPUs; see the sketch after the example below)
- Server mode (vLLM runs as a separate process that the trainer queries)
Example:
```bash
# Start the vLLM server
trl vllm-serve --model Qwen/Qwen2.5-VL-3B-Instruct

# Train with GRPO + vLLM (in a separate process)
python grpo_vlm.py --use_vllm --vllm_mode server
```
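The commands above cover server mode. For colocate mode there is no separate server; generation runs inside the trainer process and is configured entirely in the training config. A minimal sketch of that setup, assuming the GRPO example from earlier (the memory fraction is just a placeholder to tune for your GPU):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="colocate",             # vLLM engine shares the training GPUs
    vllm_gpu_memory_utilization=0.3,  # placeholder: fraction of GPU memory reserved for generation
    # ... other GRPO settings
)
```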
Key Takeaways
✅ MPO = Better reasoning (DPO + SFT + BCO)
✅ GRPO = Noise-resistant group optimization
✅ GSPO = Stable sequence-level updates (great for MoE)
✅ vLLM = Faster generation during training
Explore the notebooks and start aligning your VLMs today!