A recent benchmark tested 53 large language models (LLMs) with a seemingly trivial question: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" The goal was to assess the models' ability to apply basic logical reasoning.

Test Results

Initially, only 11 of the 53 models answered correctly on the first try. A deeper analysis, repeating the test 10 times for each model, revealed an even more disappointing picture: only 5 models answered correctly reliably across runs.

Some open-source models, despite failing the initial test, showed improvement in subsequent runs. For example, GLM-4.7 answered correctly 6 out of 10 times.
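The repeated-run methodology above can be sketched as a simple grading harness. This is a minimal illustration, not the benchmark's actual code: the function names (`pass_rate`, `is_reliable`) and the all-runs-correct threshold are assumptions; the example run counts mirror the figures reported in the article.

```python
# Minimal sketch of a repeated-trial reliability check, assuming each
# model's answers have already been graded as correct (True) or
# incorrect (False). Helper names and the reliability threshold are
# illustrative assumptions, not details from the benchmark itself.

def pass_rate(runs: list[bool]) -> float:
    """Fraction of repeated runs in which the model answered correctly."""
    return sum(runs) / len(runs)

def is_reliable(runs: list[bool], min_rate: float = 1.0) -> bool:
    """One possible bar for reliability: the pass rate must meet
    min_rate (defaulting here to correct on every single run)."""
    return pass_rate(runs) >= min_rate

# Example mirroring the article: GLM-4.7 was correct 6 of 10 times.
glm_runs = [True] * 6 + [False] * 4
print(pass_rate(glm_runs))        # 0.6
print(is_reliable(glm_runs))      # False

# A model that is correct on all 10 runs clears this strict bar.
print(is_reliable([True] * 10))   # True
```

Repeating the prompt and aggregating graded outcomes like this is what separates a lucky first-try answer from genuinely consistent reasoning.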

Analysis by Model Family

The results vary significantly depending on the model family:

  • Anthropic: Only Opus 4.6 achieved a perfect score (10/10).
  • OpenAI: Only GPT-5 passed the test satisfactorily (7/10).
  • Google: Gemini 3 models and Flash Lite all scored 10/10.
  • xAI: Grok-4 (10/10) and its Reasoning variant (8/10) performed well.

Models from Meta (Llama), Mistral, and DeepSeek failed the test.

This experiment highlights that even in simple scenarios, reliable reasoning in language models remains an open challenge. For those evaluating on-premise deployments, the trade-offs deserve careful consideration; AI-RADAR offers analytical frameworks at /llm-onpremise to evaluate these aspects.