Evaluations (evals) are fundamental to defining and improving the behavior of AI agents, such as those used in Deep Agents, an open-source framework. A thoughtful approach to creating evals is essential to ensure that agents behave as expected in production.
How we curate data for evaluations
There are several ways to source data for evals:
- Feedback from dogfooding our agents.
- Selected evals from external benchmarks, adapted for a specific agent.
- Evals and unit tests written manually for behaviors considered important.
Tracing each eval run lets us analyze failures and assess the value of a given eval. The workflow is: understand the failure mode, propose a fix, rerun the agent, and track progress over time.
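The curation sources and tracing workflow above suggest keeping a small record per eval case. This is a minimal sketch with an illustrative schema; the field names and the `EvalCase` class are assumptions, not the Deep Agents framework's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical schema for a curated eval case. "source" records where the
# case came from (dogfooding feedback, an external benchmark, or a manually
# written behavior test); "tags" lets a runner select subsets of evals.
@dataclass
class EvalCase:
    task: str                                        # prompt given to the agent
    source: str                                      # "dogfooding" | "benchmark" | "manual"
    tags: list[str] = field(default_factory=list)    # used to run targeted subsets
    known_failure_modes: list[str] = field(default_factory=list)

case = EvalCase(
    task="locate the failing test and propose a fix",
    source="dogfooding",
    tags=["debugging", "ci"],
)
```

Annotating each case with its observed failure modes over time makes it easy to see which evals still catch regressions and which have lost their value.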
How we define metrics
Correctness is the starting point when choosing a model for an agent; once a model is correct enough, we move on to efficiency. Each eval run measures the following metrics:
- Correctness: indicates whether the model completed the task correctly.
- Step ratio: ratio between observed agent steps and ideal steps.
- Tool call ratio: ratio between observed tool calls and ideal calls.
- Latency ratio: ratio between observed latency and ideal latency.
- Solve rate: expected steps divided by observed latency, scored 0 if the task was not solved correctly.
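The metrics above can be computed from an observed run and an ideal reference run. This is a minimal sketch under assumed names (`RunMetrics`, `score`); the actual schema in the eval harness may differ:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    solved: bool        # did the agent complete the task correctly?
    steps: int          # number of agent steps taken
    tool_calls: int     # number of tool calls issued
    latency_s: float    # wall-clock latency in seconds

def score(run: RunMetrics, ideal: RunMetrics) -> dict[str, float]:
    """Compute correctness and efficiency ratios against an ideal run."""
    return {
        "correctness": 1.0 if run.solved else 0.0,
        "step_ratio": run.steps / ideal.steps,
        "tool_call_ratio": run.tool_calls / ideal.tool_calls,
        "latency_ratio": run.latency_s / ideal.latency_s,
        # Solve rate as defined above: expected steps / observed latency,
        # zeroed out when the task was not solved correctly.
        "solve_rate": (ideal.steps / run.latency_s) if run.solved else 0.0,
    }

ideal = RunMetrics(solved=True, steps=5, tool_calls=3, latency_s=10.0)
run = RunMetrics(solved=True, steps=10, tool_calls=6, latency_s=20.0)
metrics = score(run, ideal)
```

A ratio of 1.0 means the agent matched the ideal run; values above 1.0 indicate wasted steps, extra tool calls, or slower execution.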
How we run evals
Evals are run in CI (Continuous Integration) using pytest with GitHub Actions, ensuring a clean and reproducible environment. Each eval creates a Deep Agent instance with a given model, provides it with a task, and calculates correctness and efficiency metrics. Tags make it possible to run a subset of evals, which saves cost and enables targeted experiments.
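A tagged pytest eval might look like the sketch below. The `run_agent` helper is stubbed here so the example is self-contained; in the real harness it would instantiate a Deep Agent with the given model and execute the task. Model ids and thresholds are placeholders:

```python
import pytest

def run_agent(model: str, task: str) -> dict:
    """Stub standing in for the real harness, which would build a
    Deep Agent with `model`, run `task`, and return run metrics."""
    return {"solved": True, "steps": 4, "tool_calls": 2}

@pytest.mark.eval  # custom marker: select only evals with `pytest -m eval`
@pytest.mark.parametrize("model", ["model-a", "model-b"])  # placeholder ids
def test_agent_solves_task(model):
    result = run_agent(model, task="summarize the README")
    assert result["solved"]
    assert result["steps"] <= 8  # per-eval efficiency budget
```

Registering the `eval` marker in `pytest.ini` and invoking `pytest -m eval` in the GitHub Actions workflow restricts a CI run to the tagged subset.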