A recent study evaluated the performance of several large language models (LLMs) on a specific task: a multiple-choice test in the field of neuroscience and brain-computer interfaces (BCI). The dataset of 500 questions was generated automatically under strict constraints, without human review.

Key Findings

The results showed that the most advanced models, including LLaMA-3.3 70B, achieved similar accuracy, hovering around 88%. Surprisingly, the Qwen3 235B MoE model exceeded this plateau, reaching 90.4% accuracy. Smaller models (8B–14B) showed a gradual performance decline, without sharp drops.

Analysis of Limitations

The errors shared across models suggest that the difficulties stem not so much from a lack of knowledge as from problems of epistemic calibration, i.e., the ability to assess the reliability of one's own answers in contexts with real-world constraints such as latency, biological noise, and methodological feasibility.
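The study does not report a calibration metric, but the notion of epistemic calibration can be made concrete. A minimal sketch, assuming per-question confidences were available (the values below are illustrative, not from the study), is the expected calibration error (ECE): the gap between how confident a model is and how often it is actually right, averaged over confidence bins.

```python
# Sketch: expected calibration error (ECE) over hypothetical per-question
# confidences. A well-calibrated model has ECE near 0; a model that is
# 100% confident but right only half the time has ECE 0.5.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; return the size-weighted mean
    absolute gap between accuracy and confidence per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == lo)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece
```

For instance, two answers given at full confidence with only one correct yield an ECE of 0.5, quantifying exactly the overconfidence the analysis points to.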

Methodology

The tests were conducted with rigorous parameters: temperature set to 0, a maximum of 5 output tokens, and responses constrained to a single letter. One item in the dataset was excluded due to incorrect wording.