Beyond DPO: New Multimodal Alignment Methods in TRL
Vision-Language Models (VLMs) are becoming more capable, but aligning them with human preferences remains crucial. Previously, we introduced Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for VLMs in TRL. Now, we’re pushing further with three advanced alignment techniques:
Mixed Preference Optimization (MPO) – Combines DPO, SFT, and quality loss for better reasoning
Group Relative Policy Optimization (GRPO) – Optimizes over response groups for robustness
Group Sequence Policy Optimization (GSPO) – Stable sequence-level updates (great for MoE models)
All three are now available in TRL!
Why These Methods Matter
Traditional alignment for VLMs involves:
1️⃣ SFT – Teach the model to follow instructions
2️⃣ DPO – Optimize preferences between pairs of responses
But DPO has limitations:
- Struggles with multi-aspect preferences (e.g., correctness + coherence)
- Can lead to repetitive or incoherent reasoning
Our new methods solve these issues!
Deep Dive: The Three New Techniques
1️⃣ Mixed Preference Optimization (MPO)
Problem: DPO-aligned VLMs sometimes fail at complex reasoning tasks.
Solution: MPO combines three losses:
✔ DPO (sigmoid loss) – Optimizes pairwise preferences
✔ Binary Classifier Optimization (BCO) – Ensures response quality
✔ SFT loss – Maintains instruction-following ability
Result: +6.2 pts improvement on MathVista!
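Under the hood, the MPO objective is simply a weighted sum of those three losses. A minimal sketch of that combination, using the same weights as the config below (illustrative only, not TRL internals):

```python
def mpo_loss(dpo_loss, bco_loss, sft_loss, weights=(0.8, 0.2, 1.0)):
    # Preference term + quality term + instruction-following term,
    # mirroring loss_type=["sigmoid", "bco_pair", "sft"] and loss_weights=[0.8, 0.2, 1.0]
    w_dpo, w_bco, w_sft = weights
    return w_dpo * dpo_loss + w_bco * bco_loss + w_sft * sft_loss

print(mpo_loss(0.65, 0.40, 1.20))  # 0.8*0.65 + 0.2*0.40 + 1.0*1.20 ≈ 1.8
```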
How to Use MPO in TRL
```python
from trl import DPOConfig, DPOTrainer

mpo_config = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # MPO components
    loss_weights=[0.8, 0.2, 1.0],              # Weighting scheme from the MPO paper
    # ... other DPO settings
)

mpo_trainer = DPOTrainer(
    model=model_id,
    args=mpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
mpo_trainer.train()
```
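MPO consumes the same preference data as DPO: each row pairs a prompt (and, for VLMs, its image) with a chosen and a rejected response. A hypothetical row, just to show the shape (field names follow TRL's preference-dataset convention; adapt them to your dataset):

```python
from PIL import Image

# Hypothetical preference example; the "images" field carries the picture(s) the prompt refers to
example = {
    "images": [Image.new("RGB", (64, 64))],  # stand-in for the real image
    "prompt": "What trend does the chart show?",
    "chosen": "Revenue rises steadily from 2019 to 2023.",
    "rejected": "The chart shows nothing in particular.",
}
```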
2️⃣ Group Relative Policy Optimization (GRPO)
Problem: DPO is noise-sensitive—single bad samples can misguide training.
Solution: GRPO optimizes over groups of responses, averaging noise.
Key Features:
✔ Batch-level policy updates (more stable than per-sample)
✔ Better generalization (learns broader “good response” patterns)
✔ Originally used in DeepSeek Math & DeepSeek R1
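The "group relative" part is straightforward to picture: several completions are sampled per prompt, each is scored by the reward functions, and its advantage is the reward standardized against the rest of its group. A rough sketch of that idea (illustrative, not TRL's internal code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, group_size) rewards for completions of the same prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # above-average completions get positive advantage

# Four completions sampled for one prompt, scored by reward functions like the ones below
rewards = torch.tensor([[1.0, 0.0, 2.0, 1.0]])
print(group_relative_advantages(rewards))  # ~[[0.0, -1.22, 1.22, 0.0]]
```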
GRPO Training in TRL
```python
import re

from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Check response structure: <think>...</think> followed by <answer>...</answer>
    return [
        1.0 if re.match(r"<think>.*?</think>.*<answer>.*?</answer>", c, re.DOTALL) else 0.0
        for c in completions
    ]

def accuracy_reward(completions, solutions, **kwargs):
    # Verify correctness against the reference solutions (verify() is your answer checker)
    return [
        1.0 if verify(answer, solution) else 0.0
        for answer, solution in zip(completions, solutions)
    ]

grpo_trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, accuracy_reward],  # Multiple reward signals!
    args=GRPOConfig(
        learning_rate=1e-5,
        # ... other GRPO settings
    ),
    train_dataset=train_dataset,
)
grpo_trainer.train()
```
3️⃣ Group Sequence Policy Optimization (GSPO)
Problem: GRPO computes importance weights per-token, which can be unstable.
Solution: GSPO (from Qwen) uses sequence-level weighting—ideal for Mixture-of-Experts (MoE) models.
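Concretely, GRPO forms one importance ratio per token, while GSPO forms a single ratio per sequence from the length-averaged log-probability difference between the current and old policy. A toy illustration of the two weighting schemes (not the TRL implementation; the numbers are made up):

```python
import torch

# Per-token log-probs of one sampled completion under the current and old policies (made-up values)
logp_new = torch.tensor([-0.9, -1.2, -0.4, -2.0])
logp_old = torch.tensor([-1.0, -1.0, -0.5, -1.8])

# GRPO: one importance ratio per token
token_ratios = torch.exp(logp_new - logp_old)
print(token_ratios)  # ~[1.105, 0.819, 1.105, 0.819]

# GSPO: a single sequence-level ratio from the length-averaged log-ratio
seq_ratio = torch.exp((logp_new - logp_old).mean())
print(seq_ratio)  # ~0.951
```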
Enabling GSPO in GRPOTrainer
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",  # Key difference!
    epsilon=3e-4,                          # From the Qwen paper
    loss_type="grpo",
    # ... other GRPO settings
)
```
⚡ vLLM Integration for Faster Training
TRL now supports vLLM for on-the-fly generation during alignment:
- Colocate mode (vLLM shares the training GPUs; see the sketch after the example below)
- Server mode (vLLM runs as a separate process that the trainer queries)
Example:
```bash
# Start the vLLM server
trl vllm-serve --model Qwen/Qwen2.5-VL-3B-Instruct

# Train with GRPO + vLLM (in a separate process)
python grpo_vlm.py --use_vllm --vllm_mode server
```
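The commands above cover server mode. For colocate mode there is no separate server; generation runs inside the trainer process and is configured entirely in the training config. A minimal sketch of that setup, assuming the GRPO example from earlier (the memory fraction is just a placeholder to tune for your GPU):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="colocate",             # vLLM engine shares the training GPUs
    vllm_gpu_memory_utilization=0.3,  # placeholder: fraction of GPU memory reserved for generation
    # ... other GRPO settings
)
```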
Key Takeaways
✅ MPO = Better reasoning (DPO + SFT + BCO)
✅ GRPO = Noise-resistant group optimization
✅ GSPO = Stable sequence-level updates (great for MoE)
✅ vLLM = Faster generation during training
Explore the notebooks and start aligning your VLMs today!