Altered Riddles: A New Benchmark to Test Large Language Models' Understanding

Introduction: When Memory Trumps Logic

As Large Language Models (LLMs) evolve rapidly, their ability to understand instructions and respond to them consistently is crucial. A recurring failure, however, is that these models fall back on memorized information or familiar patterns even when the text of the prompt explicitly points to a different answer. This behavior can lead to significant errors, especially where precision and adherence to the provided data are paramount.

An emblematic example of this problem is the surgeon riddle: "The surgeon, who is the boy's father, says, 'I cannot operate on this boy: he's my son!' Who is the surgeon to the boy?" Although the text clearly states that the surgeon is the father, many LLMs still answer "the mother," influenced by the more common version of the riddle. "Altered Riddles" is a new benchmark developed to address this gap: it is specifically designed to evaluate whether LLMs can set aside memorized answers when the prompt contains explicit, contradictory information.

How Altered Riddles Works

Altered Riddles is based on a dataset of common riddles, each modified so that the correct answer differs from the original, widely known solution. The goal is to observe whether an LLM processes the literal text of the altered prompt or reverts to the default, memorized answer. The benchmark penalizes models that give the original riddle's answer when it is clearly incorrect for the modified version.
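The evaluation idea described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual implementation: the AlteredRiddle record, the classify and summarize helpers, the three labels, and the whole-word matching rule are all assumptions made for illustration, and the real scoring scheme may differ.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlteredRiddle:
    """Hypothetical record layout; the real dataset schema may differ."""
    prompt: str
    altered_answer: str   # correct answer for the modified riddle
    original_answer: str  # well-known answer to the unmodified riddle


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def mentions(response: str, answer: str) -> bool:
    """True if the answer appears in the response as a whole word or phrase."""
    return f" {normalize(answer)} " in f" {normalize(response)} "


def classify(riddle: AlteredRiddle, response: str) -> str:
    """Label a response as 'correct', 'memorized', or 'other'."""
    has_altered = mentions(response, riddle.altered_answer)
    has_original = mentions(response, riddle.original_answer)
    if has_altered and not has_original:
        return "correct"
    if has_original and not has_altered:
        return "memorized"  # the model fell back on the original riddle
    return "other"  # neither answer, or both (ambiguous)


def summarize(labels: list[str]) -> dict[str, float]:
    """Fraction of responses per label (all zeros for an empty list)."""
    counts = Counter(labels)
    total = len(labels) or 1
    return {name: counts[name] / total
            for name in ("correct", "memorized", "other")}


surgeon = AlteredRiddle(
    prompt=("The surgeon, who is the boy's father, says, 'I cannot operate "
            "on this boy: he's my son!' Who is the surgeon to the boy?"),
    altered_answer="father",
    original_answer="mother",
)

# Example responses written by hand for illustration, not real model output.
responses = ["The surgeon is the boy's father.", "The mother!", "I am not sure."]
print(summarize([classify(surgeon, r) for r in responses]))
```

Keeping "memorized" as a separate label, rather than folding it into a generic "wrong" bucket, is what lets such a harness report how often a model reverts to the familiar answer. Simple string matching is brittle, so a real evaluation would likely need stricter answer extraction.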

Initially conceived as a small dataset, the project has recently been revived and expanded into a full benchmark that aims to measure LLMs' contextual understanding more robustly. The results and analysis for the tested models are available on Hugging Face, where the dataset and the updated leaderboard can be consulted; a dedicated benchmark page and a GitHub repository provide further technical details.

Implications for LLM Development and Deployment

An LLM's ability to set aside pre-existing knowledge in favor of explicit instructions is vital, especially for companies considering on-premise deployment. In self-hosted or air-gapped environments, where data sovereignty and regulatory compliance are absolute priorities, a model must operate exclusively on internal data and the instructions it is given, without hallucinations or responses based on external information that might be irrelevant, outdated, or non-compliant. An LLM that fails to distinguish between memorized knowledge and explicit input can generate unreliable responses, which compromises the integrity of decision-making processes and raises the total cost of ownership (TCO) through the need for extensive fine-tuning or manual validation.

A benchmark like Altered Riddles offers infrastructure architects and CTOs a valuable tool for evaluating model reliability in critical scenarios. Understanding an LLM's limitations in adhering to context can influence decisions on model selection, fine-tuning strategies, and hardware requirements for inference. Those evaluating on-premise deployment face significant trade-offs between performance, cost, and control, and a model's robustness in handling contradictory inputs is a key factor in that equation.

Future Prospects and Collaboration

Currently, the Altered Riddles benchmark faces some limitations, primarily due to computational and financial constraints. These have prevented testing on a wide range of proprietary models, so the evaluation focuses for now on more accessible ones. However, the project creator has expressed a willingness to invest further resources to refine the benchmark and expand model coverage if the project gains sufficient interest and support from the community.

The project's openness to suggestions and discussion underscores its collaborative approach, inviting developers, researchers, and industry professionals to contribute to its improvement. Open-source initiatives of this type are fundamental for advancing the understanding of LLMs' capabilities and limitations, and they provide useful tools for model selection and optimization across deployment contexts, including those that prioritize data control and sovereignty.