Perceptual Fragility of Multimodal Models

Multimodal Large Language Models (MLLMs), despite their impressive capabilities, exhibit perceptual fragility when confronted with visually complex scenes. This weakness primarily stems from reliance on finite training datasets, which are prohibitively expensive to scale.

AOT-SFT and AOT: A New Approach

To address this limitation, AOT-SFT, a large-scale adversarial dataset designed to improve MLLM robustness, is introduced, alongside AOT (Adversarial Opponent Training), a self-play framework that builds MLLM robustness by autonomously generating its own adversarial training data.

Attacker-Defender Co-evolution

The AOT method orchestrates co-evolution between an image-editing Attacker and a Defender MLLM. The Attacker generates a diverse, continually shifting curriculum of image manipulations, forcing the Defender to adapt and sharpen its perceptual abilities. Experiments demonstrate that AOT significantly enhances the Defender's perceptual robustness and reduces hallucinations, establishing a new, scalable paradigm for training more reliable MLLMs.
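The co-evolution loop described above can be sketched in miniature. The code below is an illustrative toy, not the paper's actual method: the classes `Attacker` and `Defender`, the edit operations, and all update rules are assumptions chosen to show the zero-sum dynamic, in which the Attacker up-weights manipulations that fool the Defender, and only failed cases become new training data.

```python
import random

class Attacker:
    """Toy attacker: samples image-edit operations, preferring ones that
    previously fooled the Defender (hypothetical update rule)."""
    def __init__(self, edit_ops):
        self.weights = {op: 1.0 for op in edit_ops}

    def propose_edit(self, rng):
        ops, weights = zip(*self.weights.items())
        return rng.choices(ops, weights=weights, k=1)[0]

    def reward(self, op, fooled):
        # Zero-sum signal: reinforce edits that caused Defender errors,
        # decay edits the Defender already handles.
        self.weights[op] *= 1.5 if fooled else 0.9

class Defender:
    """Toy stand-in for the MLLM: per-edit robustness scores in [0, 1]
    replace actual visual question answering."""
    def __init__(self, edit_ops):
        self.robustness = {op: 0.2 for op in edit_ops}

    def answer_correctly(self, op, rng):
        return rng.random() < self.robustness[op]

    def train_on(self, op):
        # SFT-style update on the manipulation that just fooled it.
        self.robustness[op] = min(1.0, self.robustness[op] + 0.05)

def co_evolve(rounds=500, seed=0):
    rng = random.Random(seed)
    ops = ["occlude", "recolor", "insert_object", "remove_object"]
    attacker, defender = Attacker(ops), Defender(ops)
    for _ in range(rounds):
        op = attacker.propose_edit(rng)
        fooled = not defender.answer_correctly(op, rng)
        attacker.reward(op, fooled)
        if fooled:  # only Defender failures become training data
            defender.train_on(op)
    return defender
```

The key design point this sketch illustrates is the self-generated curriculum: as the Defender hardens against one edit type, the Attacker's weighting automatically shifts toward the remaining weak spots, so no fixed dataset is needed.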