TraderBench: Realistic Evaluation of AI Agents in Financial Markets
Evaluating AI agents in the financial sector presents significant challenges. Static benchmarks require costly expert annotation but fail to capture the dynamic decision-making essential in real-world trading. The use of LLM-based judges introduces uncontrolled variance in domain-specific tasks.
TraderBench addresses these issues by combining expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations. Performance is scored on objective market metrics, including the Sharpe ratio, returns, and maximum drawdown, which eliminates judge-induced variance entirely.
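To make these metrics concrete, here is a minimal sketch of how an annualized Sharpe ratio and maximum drawdown can be computed from a series of periodic returns. This is an illustrative implementation, not TraderBench's actual scoring code; the function names and the 252-trading-day annualization convention are assumptions.

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its volatility."""
    excess = np.asarray(returns) - risk_free / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative equity curve.

    Returned as a negative fraction, e.g. -0.25 means a 25% drawdown.
    """
    equity = np.cumprod(1.0 + np.asarray(returns))
    running_peak = np.maximum.accumulate(equity)
    return ((equity - running_peak) / running_peak).min()
```

Because both quantities are deterministic functions of the trade history, two evaluation runs on the same market data produce identical scores, which is the property that removes judge variance.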
Key Features
The framework includes two novel tracks:
- Crypto trading with four progressive market-manipulation transforms.
- Options derivatives, scored across P&L accuracy, Greeks, and risk management.
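For the options track, scoring along Greeks implies comparing an agent's quoted sensitivities against closed-form values. As a hedged illustration (not the benchmark's own reference implementation, and assuming the standard Black-Scholes model for a European call), the price and delta can be computed as:

```python
from math import log, sqrt, exp, erf

def _norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(spot, strike, rate, vol, t):
    """Black-Scholes price and delta of a European call option."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    d2 = d1 - vol * sqrt(t)
    price = spot * _norm_cdf(d1) - strike * exp(-rate * t) * _norm_cdf(d2)
    delta = _norm_cdf(d1)  # sensitivity of price to a unit move in spot
    return price, delta
```

An agent's P&L accuracy and risk management can then be checked against such closed-form references without any LLM judge in the loop.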
Trading scenarios can be refreshed with new market data to prevent benchmark contamination. Evaluating 13 models (from 8B open-source to frontier) on ~50 tasks revealed that 8 of the 13 cluster around a score of ~33 on crypto trading, with less than 1 point of variation across adversarial conditions, exposing fixed, non-adaptive strategies. Extended "thinking" helps retrieval (+26 points) but has a negligible effect on trading (+0.3 crypto, -0.1 options).
These findings highlight that current agents lack genuine market adaptation, underscoring the need for performance-grounded evaluation in the financial sector.
For teams weighing on-premise deployments, additional trade-offs apply; AI-RADAR provides analytical frameworks for evaluating them at /llm-onpremise.