Arbor: Autonomous LLM Inference Optimization with Intelligent Agents

Optimizing large language model (LLM) inference is a complex problem that has historically required close coordination among engineering teams across multiple layers of the technology stack: application, framework, compiler, kernel, and hardware. Arbor is a new multi-agent framework that addresses this with structured tree search, which serves as a cognitive layer for autonomous agents. These agents operate in large, stateful action spaces with the goal of maximizing the efficiency of LLM workloads.

Unlike prior autonomous optimization systems, which often operated on isolated targets with stateless evaluation, Arbor maintains an explicit search tree of scored hypotheses. The tree serves as shared working memory across agents: it is updated with every measurement, and failures are treated as diagnostic signals that reshape subsequent exploration. The tree also keeps expanding as prior successes shift the bottleneck distribution, so the search adapts as the limiting factor changes.
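To make the idea of a search tree of scored hypotheses concrete, here is a minimal sketch in Python. It is not Arbor's implementation: the class name Hypothesis, the helper functions, the example descriptions, and the score values are all hypothetical and chosen only for illustration. The sketch shows two properties described above: every node carries a measured score, and a failed trial is stored as a diagnostic note rather than discarded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    """One node in the tree: a proposed optimization and its measured outcome.

    All names here are illustrative assumptions, not Arbor's actual data model.
    """
    description: str
    score: Optional[float] = None        # measured gain; None until evaluated or if failed
    failure_note: Optional[str] = None   # diagnostic text recorded when a trial fails
    children: List["Hypothesis"] = field(default_factory=list)

    def add_child(self, description: str) -> "Hypothesis":
        """Attach a follow-up hypothesis that builds on this node."""
        child = Hypothesis(description)
        self.children.append(child)
        return child

    def record(self, score: Optional[float], failure_note: Optional[str] = None) -> None:
        """Store the result of evaluating this hypothesis."""
        self.score = score
        self.failure_note = failure_note


def best_successful_node(root: Hypothesis) -> Optional[Hypothesis]:
    """Return the highest-scoring node that did not fail (a candidate to expand next)."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.score is not None and node.failure_note is None:
            if best is None or node.score > best.score:
                best = node
        stack.extend(node.children)
    return best


def collect_failure_notes(root: Hypothesis) -> List[str]:
    """Gather every recorded failure so later proposals can take them into account."""
    notes = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.failure_note is not None:
            notes.append(node.failure_note)
        stack.extend(node.children)
    return notes


# Illustrative usage with placeholder values (not measurements from the source).
root = Hypothesis("baseline serving configuration")
root.record(score=0.0)
batching = root.add_child("increase batch size")
batching.record(score=12.5)
kernel = root.add_child("swap in a different attention kernel")
kernel.record(score=None, failure_note="out of memory at long sequence lengths")
```

In this toy run, best_successful_node(root) selects the batching node for further expansion, while collect_failure_notes(root) keeps the out-of-memory diagnostic available to whichever agent proposes the next hypothesis.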

Agent Architecture and Operation

At the core of Arbor are two main agents: an Orchestrator and a Critic. The Orchestrator drives the optimization process, delegating tasks to Domain Specialists distributed across the inference stack. These specialists operate at the application, framework, compiler, or hardware level, each contributing expertise specific to its layer.

The Critic works alongside the Orchestrator, and its role is to safeguard system stability. It performs root-cause analysis, introspection, and measurement validation, creating a "checks-and-balances" architecture in which neither agent can unilaterally drive the system. This division of labor and cross-checking is fundamental to Arbor's robustness and reliability. Agent capabilities are decomposed into "hard skills" (domain expertise) and "soft skills" (coordination protocols), enabling fully autonomous multi-day optimization campaigns.
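The checks-and-balances idea can be sketched as a simple gate: one component proposes a change, and a second, independent component must validate the measurements before the change is accepted. The Python below is a toy illustration under stated assumptions, not Arbor's protocol. The names Proposal, critic_validates, and orchestrator_step, the 5% noise threshold, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Proposal:
    """A change put forward by the proposing agent, with repeated measurements."""
    name: str
    baseline_runs: List[float]    # throughput samples before the change
    candidate_runs: List[float]   # throughput samples after the change


def critic_validates(p: Proposal, max_rel_noise: float = 0.05) -> bool:
    """Hypothetical validation rule: reject noisy or non-improving measurements.

    max_rel_noise is an assumed threshold on (standard deviation / mean).
    """
    for runs in (p.baseline_runs, p.candidate_runs):
        if len(runs) < 2:
            return False              # a single sample cannot be cross-checked
        avg = mean(runs)
        if avg <= 0:
            return False              # throughput must be positive to be meaningful
        if pstdev(runs) / avg > max_rel_noise:
            return False              # measurement too unstable to trust
    return mean(p.candidate_runs) > mean(p.baseline_runs)


def orchestrator_step(p: Proposal, accepted: List[str]) -> bool:
    """Accept a proposal only if the critic signs off; neither side acts alone."""
    if critic_validates(p):
        accepted.append(p.name)
        return True
    return False


# Illustrative usage with placeholder numbers.
accepted: List[str] = []
stable = Proposal("tune scheduler", [100.0, 101.0, 99.0], [120.0, 121.0, 119.0])
noisy = Proposal("experimental flag", [100.0, 101.0, 99.0], [80.0, 200.0, 150.0])
stable_ok = orchestrator_step(stable, accepted)
noisy_ok = orchestrator_step(noisy, accepted)
```

Here the stable proposal is accepted, while the noisy one is rejected even though its average is higher, because its measurements vary too much to be trusted. A real critic would do far more (the article mentions root-cause analysis and introspection); this gate only illustrates the veto structure.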

Performance Impact and On-Premise Relevance

The reported results are significant. The framework has demonstrated up to a 193% Pareto improvement in inference throughput-latency, surpassing vendor-optimized baselines. By comparison, a single-agent approach without the Arbor framework's support achieves only a 33% throughput improvement and crashes irrecoverably within hours. The gap highlights Arbor's ability to manage complexity and avoid unrecoverable failures.
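For readers unfamiliar with the term, a Pareto improvement in a throughput-latency setting means one operating point is at least as good on both metrics and strictly better on at least one. The sketch below shows that standard dominance test in Python. It does not reproduce how the 193% figure was computed, which the article does not specify; the OperatingPoint type, the units in the comments, and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    throughput: float  # higher is better (for example, tokens per second)
    latency: float     # lower is better (for example, milliseconds per token)


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.throughput >= b.throughput and a.latency <= b.latency
    strictly_better = a.throughput > b.throughput or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative placeholder configurations, not data from the source.
points = [
    OperatingPoint(throughput=100.0, latency=50.0),
    OperatingPoint(throughput=150.0, latency=40.0),
    OperatingPoint(throughput=120.0, latency=60.0),
    OperatingPoint(throughput=200.0, latency=80.0),
]
front = pareto_front(points)
```

In this example the front contains the (150, 40) and (200, 80) points: the first beats two others on both metrics, and the second trades higher latency for the highest throughput, so neither dominates the other.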

For organizations evaluating on-premise LLM deployments, inference optimization is a critical factor in maximizing utilization of existing hardware and controlling the Total Cost of Ownership (TCO). Arbor stands out for its ability to generalize across multiple generations of hardware platforms, with run-to-run variance within 2 percentage points. This suggests that the method is hardware-agnostic and highly reproducible, essential qualities for self-hosted environments where stability and predictability are priorities.

Outlook for AI Infrastructure

Frameworks like Arbor mark a step forward in automating AI infrastructure optimization. For CTOs, DevOps leads, and infrastructure architects, such solutions promise to substantially reduce the manual work involved in tuning LLM performance. The ability to run multi-day optimization campaigns autonomously, with reproducible and hardware-agnostic results, could improve the operational efficiency and scalability of on-premise deployments.

Arbor's ability to learn from failures and adapt dynamically to shifting bottlenecks is a promising model for managing complex AI systems. Beyond improving performance, this kind of optimization makes self-hosted deployments more practical, which supports data sovereignty and control over the entire inference pipeline, both crucial for companies operating under stringent compliance and security requirements. For those evaluating the trade-offs between self-hosted and cloud solutions, AI-RADAR continues to explore analytical frameworks and innovative technologies on /llm-onpremise that can support these strategic decisions.