Perceptual Fragility of Multimodal Models
Multimodal Large Language Models (MLLMs), despite their impressive capabilities, exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems largely from their reliance on static, finite training datasets, which are prohibitively expensive to scale.
AOT-SFT and AOT: A New Approach
To address this issue, the authors introduce AOT-SFT, a large-scale adversarial dataset designed to improve MLLM robustness. They further propose AOT (Adversarial Opponent Training), a self-play framework that develops MLLM robustness by autonomously creating its own training data.
Attacker-Defender Co-evolution
The AOT method orchestrates a co-evolution between an image-editing Attacker and a Defender MLLM. The Attacker generates a diverse, dynamic curriculum of image manipulations, forcing the Defender to adapt and improve its perceptual abilities. Experiments show that AOT significantly enhances the Defender's perceptual robustness and reduces hallucinations, establishing a scalable paradigm for training more reliable MLLMs.
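The attacker-defender loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's actual implementation: the `Attacker`, `Defender`, and `Example` classes, the edit names, and the `aot_round` function are all hypothetical stand-ins (the real framework uses an image-editing model as the Attacker and an MLLM as the Defender), assuming one plausible round structure: perturb images, collect Defender failures, and fine-tune the Defender on those failures.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    image: str      # placeholder string standing in for image data
    question: str
    answer: str


class Attacker:
    """Toy image-editing Attacker: applies a random edit to each image.

    The edit names here are hypothetical; a real Attacker would produce
    actual image manipulations targeting the Defender's weaknesses.
    """

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.edits = ["occlude_object", "add_distractor", "recolor_region"]

    def perturb(self, ex: Example) -> Example:
        edit = self.rng.choice(self.edits)
        return Example(f"{ex.image}+{edit}", ex.question, ex.answer)


class Defender:
    """Toy MLLM stand-in: answers correctly only on images it has trained on."""

    def __init__(self):
        self.seen: set[str] = set()

    def answer(self, ex: Example) -> str:
        return ex.answer if ex.image in self.seen else "unknown"

    def finetune(self, examples: list[Example]) -> None:
        # Stand-in for supervised fine-tuning on the failure cases.
        self.seen.update(ex.image for ex in examples)


def aot_round(attacker: Attacker, defender: Defender,
              dataset: list[Example]) -> list[Example]:
    """One co-evolution round: attack, collect failures, retrain the Defender."""
    adversarial = [attacker.perturb(ex) for ex in dataset]
    failures = [ex for ex in adversarial if defender.answer(ex) != ex.answer]
    defender.finetune(failures)  # Defender adapts to the new curriculum
    return failures


data = [Example("img0", "What is shown?", "cat"),
        Example("img1", "What is shown?", "dog")]
attacker, defender = Attacker(seed=0), Defender()
failures = aot_round(attacker, defender, data)  # all new edits fool the Defender
# After fine-tuning, the Defender handles the same adversarial examples.
recovered = all(defender.answer(ex) == ex.answer for ex in failures)
```

Iterating `aot_round` is what makes the curriculum dynamic: each round, the Attacker probes for remaining weaknesses while the Defender consolidates the previous round's failures.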