
Advancing Vision-Language Models: New Alignment Methods in TRL
Beyond DPO: New Multimodal Alignment Methods in TRL

Vision-Language Models (VLMs) are becoming more capable, but aligning them with human preferences remains crucial. Previously, we introduced Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for VLMs in TRL. Now, we're pushing further with three advanced alignment techniques:

- Mixed Preference Optimization (MPO) – combines DPO, SFT, and a quality loss for better reasoning
- Group Relative Policy…
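To make the MPO idea concrete, here is a minimal sketch of how a mixed objective can combine a DPO-style preference term, a per-response quality term, and an SFT generation term. The function names, inputs, and loss weights (`w_pref`, `w_quality`, `w_sft`) are hypothetical illustrations, not TRL's actual API or default configuration:

```python
import math

def dpo_preference_loss(chosen_logratio, rejected_logratio, beta=0.1):
    # DPO-style sigmoid preference loss on the log-probability ratios
    # (policy minus reference model) of the chosen vs. rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def mpo_loss(chosen_logratio, rejected_logratio, sft_nll, quality_loss,
             w_pref=0.8, w_quality=0.1, w_sft=0.1):
    # Hypothetical weighted mix: a preference term (relative ranking),
    # a quality term (is each response good in absolute terms?), and
    # an SFT negative log-likelihood term (generation quality).
    return (w_pref * dpo_preference_loss(chosen_logratio, rejected_logratio)
            + w_quality * quality_loss
            + w_sft * sft_nll)
```

The intuition: DPO alone only ranks responses relative to each other, so adding an absolute quality signal and an SFT term helps keep generations fluent while still learning the preference ordering. The real TRL implementation operates on batched model log-probabilities rather than scalars.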