Bias in Benchmarks for Autonomous Driving

Multiple Choice Question Answering (MCQA) benchmarks are widely used to evaluate the performance of Vision Language Models (VLMs) in autonomous driving scenarios. However, a recent study highlights how these benchmarks are susceptible to hidden textual biases, which allow models to exploit linguistic patterns rather than ground their answers in the visual scene.
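A standard way to surface such biases is a "blind" evaluation: answer the questions from text alone, with the image withheld. The sketch below is a hypothetical illustration (the items, the `longest_option_baseline` heuristic, and the helper names are all invented for this example, not taken from the study): a trivial text-only policy that picks the longest option can beat chance when correct answers are systematically more verbose.

```python
# Hypothetical MCQA items; a "blind" baseline never sees the image.
ITEMS = [
    {"q": "What should the ego vehicle do?",
     "options": ["Stop", "Accelerate",
                 "Slow down and yield to the pedestrian crossing ahead"],
     "answer": 2},
    {"q": "What is the state of the traffic light?",
     "options": ["Red", "Green",
                 "Come to a complete stop because the light is red"],
     "answer": 2},
]

def longest_option_baseline(item):
    """Exploit a common textual artifact: the correct option is often the longest."""
    opts = item["options"]
    return max(range(len(opts)), key=lambda i: len(opts[i]))

def blind_accuracy(items, policy):
    """Accuracy of a text-only policy that ignores the visual input entirely."""
    return sum(policy(it) == it["answer"] for it in items) / len(items)

acc = blind_accuracy(ITEMS, longest_option_baseline)
chance = sum(1 / len(it["options"]) for it in ITEMS) / len(ITEMS)
print(f"blind accuracy: {acc:.2f} vs. chance: {chance:.2f}")
```

If a blind policy scores far above chance, the benchmark leaks answers through its text, and a VLM's score no longer measures visual understanding.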

Bias Reduction with a New Method

The study first quantifies the problem: a VLM fine-tuned on synthetic data can, even without any visual input, reach accuracy comparable to that obtained on human-validated benchmarks, evidence that the answers are largely predictable from text alone. The proposed debiasing method reduces the accuracy gain attributable to textual shortcuts from +66.9% to +2.9%, eliminating most linguistic exploits.
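One natural reading of those figures is blind accuracy measured above the chance level; the metric below formalizes that reading as a sketch (the function name and the illustrative numbers are assumptions for this example, not values confirmed by the study beyond the reported +66.9% and +2.9%).

```python
def shortcut_gain(blind_acc: float, chance: float) -> float:
    """Accuracy attributable to textual shortcuts: blind (text-only)
    accuracy minus the random-guessing baseline.

    A hypothetical reading of the reported +66.9% -> +2.9% figures;
    the study's exact definition may differ.
    """
    return blind_acc - chance

# Illustrative only: with 4 options (25% chance), a blind accuracy
# of 91.9% would correspond to a +66.9 percentage-point gain.
print(f"{shortcut_gain(0.919, 0.25):+.1%}")
```

Under this reading, driving the gain down to +2.9% means a blind model performs barely better than guessing, so the remaining benchmark score must come from the image.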

Curriculum Learning and Visual Grounding

By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, the model is forced to rely on visual grounding. This ensures that performance accurately reflects perceptual understanding, improving the reliability of VLMs in autonomous driving applications.
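A curriculum learning strategy typically exposes the model to progressively harder examples as training advances. The following is a generic sketch of such a schedule, assuming a scalar difficulty score per example; the study's actual schedule, difficulty measure, and hyperparameters are not specified here.

```python
import math
import random

def curriculum_batches(examples, difficulty, n_steps, batch_size, seed=0):
    """Yield training batches whose admissible difficulty grows over training.

    `difficulty` maps an example to a comparable score (easier = smaller).
    A generic linear-competence curriculum sketch, not the paper's method.
    """
    rng = random.Random(seed)
    ranked = sorted(examples, key=difficulty)
    for step in range(n_steps):
        # Competence: fraction of the ranked pool available at this step.
        competence = (step + 1) / n_steps
        pool = ranked[: max(batch_size, math.ceil(competence * len(ranked)))]
        yield [rng.choice(pool) for _ in range(batch_size)]

# Usage: early batches draw only from the easiest examples,
# late batches from the full pool.
data = list(range(100))                      # toy examples
batches = list(curriculum_batches(data, difficulty=lambda x: x,
                                  n_steps=10, batch_size=4))
```

Sampling easy, visually unambiguous items first and withholding the hardest ones until later gives the model no incentive to fall back on textual shortcuts while its visual grounding is still weak.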