The Challenge of Creativity in LLMs

Recent advancements in Large Language Models (LLMs) have led to strong performance across a wide range of tasks, from complex reasoning to environment interaction. However, their capacity for creative problem-solving, particularly through unconventional tool use, remains underexplored. In practice, LLMs tend to fall back on the canonical, predefined uses of objects, which limits their flexibility in scenarios that demand lateral thinking.

The concept of "creative tool use" refers to a model's ability to repurpose available objects by reasoning about their "affordances" and attributes, rather than relying on their primary function. Affordances are the possibilities for action that an object offers an agent, based on its physical properties and context. Understanding and leveraging these implicit possibilities is crucial for artificial intelligence that can adapt and innovate in dynamic environments.

CreativityBench: A New Benchmark for AI Ingenuity

To address this gap, CreativityBench has been introduced as a new benchmark specifically designed to evaluate affordance-based creativity in LLMs. The benchmark represents a first step towards a deeper understanding of the creative reasoning capabilities of current models. It is built upon a large-scale affordance knowledge base (KB), which includes 4,000 entities and over 150,000 annotations. This KB explicitly links objects, parts, attributes, and actionable uses, providing rich grounding for the evaluations.
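
To make the structure concrete, a single entry in such a knowledge base could be represented roughly as follows. This is a minimal sketch with hypothetical field names and content, not the actual CreativityBench schema.

```python
from dataclasses import dataclass

@dataclass
class AffordanceEntry:
    """Hypothetical record linking an object, its parts, attributes, and uses."""
    entity: str            # the object, e.g. "metal ruler"
    parts: list[str]       # salient parts, e.g. ["flat edge", "thin body"]
    attributes: list[str]  # physical attributes, e.g. ["rigid", "thin", "straight"]
    affordances: list[str] # actionable uses grounded in those attributes

# Illustrative entry (content invented for this example, not taken from the KB)
ruler = AffordanceEntry(
    entity="metal ruler",
    parts=["flat edge", "thin body"],
    attributes=["rigid", "thin", "straight"],
    affordances=["pry open a lid", "spread adhesive", "scrape off residue"],
)
```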

Leveraging this knowledge base, CreativityBench generates 14,000 grounded tasks that require identifying non-obvious yet physically plausible solutions under specific constraints. These tasks are designed to push LLMs beyond simple memorization or pattern recognition, demanding deep reasoning about the physical properties and potential interactions of objects. The goal is to measure a model's true ability to "think outside the box" in a simulated physical context.
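
Concretely, a grounded task of this kind might resemble the following sketch. The field names, objects, and expected solution are assumptions made for illustration, not the benchmark's actual task format.

```python
# Hypothetical task structure (illustrative only; not the benchmark's schema).
task = {
    "goal": "Open a sealed paint can without a dedicated opener",
    "available_objects": ["metal ruler", "sponge", "rubber band"],
    "constraints": ["do not damage the lid", "use only one object"],
    # The non-obvious but physically plausible solution the model should find:
    "reference_solution": {
        "object": "metal ruler",
        "part": "flat edge",
        "affordance": "prying",
        "mechanism": "the thin, rigid edge works as a lever under the lid rim",
    },
}
```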

Current Limitations and Implications for On-Premise Deployment

Evaluations conducted across ten state-of-the-art Large Language Models, including both closed and open-source models, have yielded a consistent finding. While models can often select a plausible object for a task, they struggle to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the problem. As a result, performance falls well short of what the models' general reasoning ability would suggest.
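
One way to picture this gap is a hierarchical scoring scheme in which an answer only counts as fully correct when every level of the solution matches. The function below is a simplified sketch under that assumption, not the benchmark's actual metric.

```python
def score_response(prediction: dict, reference: dict) -> dict:
    """Score a model answer level by level (illustrative sketch only).

    Both dicts are assumed to carry the keys: object, part, affordance, mechanism.
    """
    levels = ["object", "part", "affordance", "mechanism"]
    scores = {lvl: prediction.get(lvl) == reference.get(lvl) for lvl in levels}
    # The reported pattern: object-level accuracy can be high while part,
    # affordance, and mechanism accuracy drop sharply, so few answers pass here.
    scores["fully_correct"] = all(scores[lvl] for lvl in levels)
    return scores
```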

It was also observed that improvements from model scaling quickly saturate, indicating that simply increasing model size is not enough to unlock creative reasoning. Furthermore, strong general reasoning does not reliably translate into creative affordance discovery, and common inference-time strategies such as Chain-of-Thought prompting yield only limited gains. For those evaluating on-premise LLM deployment, these results underscore the importance of considering not just standard performance metrics, but also deeper reasoning capabilities. Open-source models, often preferred for data sovereignty and total cost of ownership (TCO), may require additional fine-tuning or integration with specialized reasoning modules to address these challenges. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate the trade-offs between model capabilities and infrastructure requirements.
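
For teams experimenting with inference-time prompting on tasks like these, a Chain-of-Thought style prompt might look like the sketch below. The wording is an assumption made for illustration, and, as noted above, such prompting alone tends to bring only limited gains.

```python
# Hypothetical Chain-of-Thought prompt template for an affordance task.
# The wording is an assumption, not a prompt used by the benchmark.
COT_PROMPT = """You must achieve the goal using only the objects listed.
Goal: {goal}
Objects: {objects}
Constraints: {constraints}

Think step by step:
1. For each object, list its parts and physical attributes.
2. For each part, list the actions it could physically support (its affordances).
3. Pick the object, part, and affordance that satisfy the goal and the constraints.
4. Explain the physical mechanism that makes the solution work.

Answer with: object, part, affordance, mechanism."""

prompt = COT_PROMPT.format(
    goal="Open a sealed paint can without a dedicated opener",
    objects="metal ruler, sponge, rubber band",
    constraints="do not damage the lid; use only one object",
)
```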

Future Prospects for Artificial Intelligence

The CreativityBench results make clear that creative tool use remains a major challenge for current artificial intelligence models. The benchmark provides a useful testbed for studying this missing dimension of intelligence, offering crucial insights for the development of planning and reasoning modules in future agents. An AI's ability to creatively repurpose tools is fundamental not only for complex problem-solving but also for autonomy and adaptability in real-world scenarios.

Addressing these limitations will likely require new model architectures or hybrid approaches that combine the power of LLMs with symbolic or physics-based reasoning mechanisms. The CreativityBench study paves the way for more targeted research toward creating AI agents that are not limited to executing predefined tasks, but can innovate and adapt in truly intelligent ways, regardless of their deployment environment, be it cloud or self-hosted.