SanityHarness: A new benchmark for coding agents
A developer has released SanityHarness, a new benchmark tool designed to evaluate the capabilities of coding agents and large language models (LLMs) in an agent-agnostic manner. The goal is to measure model understanding and agent capability rather than reward models that merely repeat their training data.
The benchmark consists of tasks in six different programming languages and is available on GitHub for those who want to use it independently.
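The article does not describe the harness's internals, but a minimal sketch can illustrate what "agent-agnostic" means in practice: the benchmark defines tasks and a verification step, while each agent or model is wrapped behind a uniform adapter. All names below are hypothetical and are not taken from the SanityHarness repository.

```python
# Hypothetical sketch of an agent-agnostic harness: tasks carry their own
# verification command, and agents are hidden behind an adapter so the
# harness never depends on a specific agent's API or CLI.
from abc import ABC, abstractmethod
from dataclasses import dataclass
import subprocess


@dataclass
class Task:
    name: str
    language: str          # e.g. "python", "go", "rust"
    prompt: str            # task description handed to the agent
    workdir: str           # checkout containing the task skeleton
    verify_cmd: list[str]  # command whose exit code decides pass/fail


class AgentAdapter(ABC):
    """Wraps one agent/model combination behind a uniform interface."""

    @abstractmethod
    def solve(self, task: Task) -> None:
        """Let the agent edit files in task.workdir based on task.prompt."""


def run_task(agent: AgentAdapter, task: Task) -> bool:
    """Run one task and return True if the verification step passes."""
    agent.solve(task)
    result = subprocess.run(task.verify_cmd, cwd=task.workdir)
    return result.returncode == 0
```

Keeping the pass/fail decision in an external verification command is what lets the same task set cover six languages without the harness knowing anything about the agent driving the edits.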
SanityBoard: The coding model leaderboard
The results of tests performed with SanityHarness are published on SanityBoard, a leaderboard that compares the performance of 49 agent and model combinations. Each entry records run metadata such as the execution date and the agent version number.
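The SanityBoard schema is not documented in the article; the sketch below only illustrates the kind of record such a leaderboard entry could be, with field names chosen for illustration.

```python
# Hypothetical shape of a single leaderboard entry and a best-first ranking.
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LeaderboardEntry:
    agent: str            # agent/CLI used to drive the model
    agent_version: str    # version recorded alongside the run
    model: str            # underlying LLM
    run_date: date        # when the benchmark run was executed
    pass_rate: float      # fraction of tasks solved, 0.0 to 1.0


def rank(entries: list[LeaderboardEntry]) -> list[LeaderboardEntry]:
    """Sort entries best-first by pass rate."""
    return sorted(entries, key=lambda e: e.pass_rate, reverse=True)
```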
The developer invites the community to contribute API keys and credits to test a larger number of models and agents. He is committed to maintaining maximum transparency and impartiality in the tests.
Usage costs and future plans
The author pointed out that some credit-based pricing models are excessively expensive. Comparing the costs of different services, he noted that some plans offer significantly better value for money than others.
In the future, he plans to test different MCP (Model Context Protocol) tools to evaluate their impact on agents' coding capabilities, and to compare different open-source agent configurations such as Oh-My-Opencode.
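For context on what "MCP tools" are, the sketch below shows a minimal tool server using the official MCP Python SDK (the `mcp` package, installable with `pip install mcp`); the `run_tests` tool itself is hypothetical and only illustrates the kind of capability an agent could be given during a benchmark run.

```python
# Minimal MCP tool server sketch: exposes one tool over stdio so any
# MCP-capable agent can call it. The tool body is illustrative only.
import subprocess

from mcp.server.fastmcp import FastMCP

server = FastMCP("sanity-tools")


@server.tool()
def run_tests(workdir: str) -> str:
    """Run the project's test suite and return its combined output."""
    result = subprocess.run(
        ["make", "test"], cwd=workdir, capture_output=True, text=True
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Serves the tool over stdio, the default transport for local agents.
    server.run()
```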