A February 2026 study by researchers from Amazon, Carnegie Mellon, Stanford, UC Berkeley, and Oxford highlighted how providing AI agents with specific skills can significantly increase their performance. The benchmark, called SkillsBench, evaluated seven AI agent configurations, including Anthropic's Claude Code, Google's Gemini CLI, and OpenAI's Codex CLI, on 84 real-world tasks across more than 7,300 attempts.

What Are AI Skills?

An AI agent is a model, such as Claude or GPT, equipped with access to tools and software that allow it to perform tasks autonomously, step by step, rather than simply answering questions. These agents are increasingly used to manage complex activities in various sectors, from analyzing financial reports to processing medical data and managing cybersecurity.

AI skills bridge the gap between the general capabilities of agents and the specialized knowledge needed for specific tasks. Each skill is a structured document that provides instructions, code examples, and reference material to address a particular type of task in a given sector. No further training is needed: the agent reads the skill and applies it.
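The study does not reproduce its skill files here, but to make the format concrete: in Anthropic's Agent Skills convention, a skill is a Markdown file (SKILL.md) with YAML frontmatter that tells the agent when and how to apply it. The sketch below is a hypothetical example of that structure; the name, description, and instructions are illustrative, not taken from the benchmark:

```markdown
---
name: quarterly-report-analysis
description: Use when asked to extract and compare figures from quarterly financial reports.
---

# Quarterly Report Analysis

## When to use this skill
Apply this skill when the task involves reading a quarterly financial
report and extracting revenue, margin, or year-over-year comparisons.

## Methodology
1. Locate the income statement before reading narrative sections.
2. Normalize all figures to the same currency and reporting period.
3. Report year-over-year changes as percentages, citing the source page.

## Common pitfalls
- Do not mix GAAP and non-GAAP figures in the same comparison.
```

The agent reads this file at task time and follows its instructions directly, which is why, as the article notes, no additional model training is needed.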

Study Results

The study found that providing agents with skills written by experts improved their average success rate by 16.2%. In particular, tasks related to the healthcare sector saw a 51.9% improvement, manufacturing 41.9%, cybersecurity 23.2%, and the energy sector 17.9%.

A significant example concerns flood risk analysis: agents operating without guidance achieved a success rate of only 2.9%. When they were provided with a skill that specified the correct statistical methodology, the success rate rose to 80%.

Smaller vs. Larger Models

Another important finding concerns the impact of skills on costs. Anthropic's Claude Haiku 4.5, the smallest and most economical model tested, achieved a 27.7% success rate when given curated skills, surpassing the 22% achieved by Claude Opus 4.5, a significantly more expensive model operating without skills. This suggests that a smaller, well-instructed model can outperform a larger model left to its own devices.

Focusing Skills

The study also highlighted that the best performance is achieved with two or three focused skill modules, which improve success rates by an average of 18.6%. Overly long skills can consume resources without providing effective guidance.

Human Expertise

Agents that autonomously generated their own skills performed worse than agents given no skills at all, underscoring the importance of human expertise: effective skills require specialized knowledge curated by human experts, which models cannot reliably produce on their own.