Optimizing LLM Agent Communication: PACT Reduces Inference Costs

The Communication Challenge in Multi-Agent Systems

Multi-agent systems (MAS) built on Large Language Models (LLMs) represent a promising frontier in the development of complex AI applications. These systems, often organized around roles, pipelines, and turn schedules, allow agents to collaborate on complex goals. However, the way agents communicate with each other can compromise their efficiency.

Traditionally, inter-agent communication has been left to unconstrained natural language. While intuitive, this approach rapidly inflates token usage and saturates the shared context window, which hurts both system performance and inference costs. For organizations running LLM workloads on self-hosted infrastructure, saving tokens is central to controlling the Total Cost of Ownership (TCO) and maximizing hardware utilization.
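To make the inflation concrete, here is a minimal sketch under a simplifying assumption that is ours, not the research's: every turn, the acting agent re-reads the entire shared history and then appends one message of fixed length. The function name and the token figures are hypothetical.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Total input tokens read over a run in which each turn re-reads
    the whole shared history (assumed model, for illustration only)."""
    total = 0
    history = 0
    for _ in range(turns):
        total += history               # the acting agent reads everything so far
        history += tokens_per_message  # then appends its own message
    return total


# Hypothetical message sizes: verbose free text vs. a compact record.
verbose = cumulative_input_tokens(turns=20, tokens_per_message=400)
compact = cumulative_input_tokens(turns=20, tokens_per_message=100)
print(verbose, compact)
```

Under this model the input cost grows with the square of the number of turns, so shrinking each message shrinks the cost of every later turn as well.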

PACT: A Protocol for Efficiency

To address these challenges, recent research analyzed five common inter-agent communication strategies across two MAS topologies and found that no fixed strategy is universally optimal. What effective messages have in common is that they consistently preserve the action-centered information that downstream agents need.

Building on this insight, the PACT (Protocolized Action-state Communication and Transmission) protocol was proposed. PACT treats inter-agent communication as a public state-update problem, projecting each raw agent output into a compact action-state record before it enters shared history. This mechanism ensures that only essential information is transmitted, reducing noise and inefficiency.
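The projection step can be pictured with a minimal sketch. Everything in it is an assumption for illustration: the ActionStateRecord fields, the project function, and the rule-based "KEY: value" parsing are hypothetical stand-ins, not PACT's actual schema or extraction method, which the published code defines.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ActionStateRecord:
    """Hypothetical compact record; the field names are assumptions."""
    agent: str
    action: str
    target: str
    outcome: str
    next_step: str


def project(agent: str, raw_output: str) -> ActionStateRecord:
    """Keep only 'KEY: value' lines whose key is a known field and
    drop all other text (a simple stand-in for the real projection)."""
    fields = {"action": "", "target": "", "outcome": "", "next_step": ""}
    for line in raw_output.splitlines():
        key, sep, value = line.partition(":")
        name = key.strip().lower().replace(" ", "_")
        if sep and name in fields:
            fields[name] = value.strip()
    return ActionStateRecord(agent=agent, **fields)


raw = """I looked through the repository for a while and I think the bug is in the parser.
ACTION: edit
TARGET: parser.py
OUTCOME: tests pass locally
NEXT STEP: run full test suite
Let me know if you would like more detail on my reasoning!"""

record = project("coder", raw)
# Only the compact record enters shared history, not the raw output.
shared_history = [json.dumps(asdict(record))]
print(shared_history[0])
```

The design point is the ordering: the raw output is projected before it is written to shared history, so downstream agents only ever read the compact record.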

Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT raises OpenHands' resolve rate while using 10% fewer tokens per resolved task, and it leaves the resolve rate on SWE-agent unchanged while halving input tokens. The code is publicly available, offering a concrete starting point for those seeking greater efficiency.
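The tokens-per-resolved metric is worth spelling out. The sketch below assumes it is defined as total tokens spent divided by the number of resolved tasks, which is our reading of the term. All numbers are made up for illustration and are not results from the research.

```python
def tokens_per_resolved(total_tokens: int, resolved: int) -> float:
    """Assumed definition: total tokens spent divided by resolved tasks."""
    if resolved == 0:
        raise ValueError("no resolved tasks; the metric is undefined")
    return total_tokens / resolved


def relative_change(baseline: float, candidate: float) -> float:
    """Signed fractional change of candidate relative to baseline."""
    return (candidate - baseline) / baseline


# Made-up run totals, chosen only to show how the metric behaves.
baseline = tokens_per_resolved(total_tokens=1_000_000, resolved=50)
candidate = tokens_per_resolved(total_tokens=990_000, resolved=55)
change = relative_change(baseline, candidate)
print(f"{change:+.0%}")
```

Because the metric divides by resolved tasks, it rewards a higher resolve rate as much as a lower token count: in this made-up case total tokens fall by only 1%, yet tokens per resolved task fall by 10%.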

Impact on On-Premise Deployments and TCO

Communication optimizations such as PACT have a direct impact on on-premise deployments. Lower token consumption means less pressure on GPU VRAM, since shorter contexts need a smaller key-value cache, which leaves room for larger batch sizes or bigger models on existing hardware. The result is higher throughput and lower operating costs, both crucial to the TCO of a self-hosted AI infrastructure.
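The link between context length and VRAM can be estimated with the standard key-value cache formula for transformer inference. The sketch below is a back-of-the-envelope calculation; the model dimensions (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are hypothetical and do not describe any model used in the research.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor 2 accounts for storing one key tensor and
    one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model configuration (assumption, for illustration only).
MODEL = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

full_context = kv_cache_bytes(tokens=16_000, **MODEL)
half_context = kv_cache_bytes(tokens=8_000, **MODEL)
print(full_context / 2**30, half_context / 2**30)  # sizes in GiB
```

The cache grows linearly with the number of tokens held in context, so halving the input tokens of a sequence frees roughly half of its cache memory for additional concurrent sequences.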

For CTOs, DevOps leads, and infrastructure architects evaluating self-hosted alternatives to the cloud, protocol-level efficiency of the kind PACT provides helps extract more value from dedicated hardware while preserving data sovereignty and full control over the execution environment, both of which are often priorities in regulated industries or under stringent security requirements. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs and their cost and performance implications.

Towards More Sustainable AI Systems

Research into optimizing communication in multi-agent systems underscores a fundamental trend in the LLM landscape: the relentless pursuit of efficiency. As models grow larger and systems become more complex, intelligent resource management, particularly of tokens, becomes not just a competitive advantage but an operational necessity.

Solutions like PACT demonstrate that it is possible to achieve high performance while reducing the computational footprint. This is particularly relevant for companies aiming to build and maintain sustainable, scalable, and controllable AI infrastructures. Because PACT's code is publicly available, others can adopt it and develop it further.