AI-RADAR.IT · AI-RADAR.NET · AI-RADAR.TECH

News & analysis on local LLMs, stack & on-prem hardware.

📁 Frameworks AI generated

AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

Published on 2026-03-05 05:05 · Source: arXiv cs.LG

🏷️ LLM On-Premise 🏷️ Fine-Tuning 🏷️ DevOps

AOI: Autonomous Cloud Diagnosis Through Learning from Failures

Managing cloud infrastructure demands increasingly sophisticated diagnostic systems. AOI (Autonomous Operations Intelligence) is a multi-agent framework that addresses this by turning failed operational trajectories into training signals for AI agents.

AOI aims to automate Site Reliability Engineering (SRE) tasks with LLMs while addressing three obstacles: restricted access to proprietary data, the risk of unsafe action execution, and the inability of existing agents to improve from their failures.

Key Components of AOI

Trainable Diagnostic System: Uses Group Relative Policy Optimization (GRPO) to transfer expert knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data.
Read-Write Separated Execution Architecture: Divides operational trajectories into observation, reasoning, and action phases, ensuring safe learning and preventing unauthorized state mutation.
Evolver for Error Trajectories: Analyzes unsuccessful trajectories and transforms them into corrective supervision signals, enabling continuous data augmentation.
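The first component relies on GRPO, whose core idea is to score each sampled response against the other responses drawn for the same prompt instead of against a learned value function. The sketch below shows only that group-relative normalization step. The function name, the reward values, and the use of the population standard deviation are illustrative assumptions, not details taken from the AOI paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per response sampled for the same prompt.
    Using the population standard deviation is an assumption of this sketch;
    some implementations use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four diagnoses sampled for the same incident:
# 1.0 = correct root cause, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group average receive a positive advantage and the rest a negative one, so the policy update favors relatively better diagnoses without needing a separate critic model.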

Results

Evaluated on the AIOpsLab benchmark, AOI reports the following results:

AOI runtime achieves a best@5 success rate of 66.3% on 86 tasks, surpassing the previous state-of-the-art (41.9%).
Adding Observer GRPO training to a locally deployed 14B model yields 42.9% avg@1 on 63 tasks with unseen fault types, surpassing Claude Sonnet 4.5.
The Evolver converts 37 failed trajectories into diagnostic guidance, improving the end-to-end avg@5 by 4.8 points and reducing variance by 35%.
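The article does not define best@k and avg@k. A common reading, assumed here, is that best@k counts a task as solved if any of its k runs succeeds, while avg@k averages the success rate over all k runs. The sketch below computes both under that assumption; the function names and the outcome data are made up for illustration.

```python
def best_at_k(outcomes):
    """Fraction of tasks solved in at least one of their k runs."""
    return sum(any(runs) for runs in outcomes) / len(outcomes)


def avg_at_k(outcomes):
    """Per-task success rate over k runs, averaged across tasks."""
    return sum(sum(runs) / len(runs) for runs in outcomes) / len(outcomes)


# Made-up outcomes: 4 tasks, 5 runs each (True = correct diagnosis).
outcomes = [
    [True, True, False, True, True],
    [False, False, False, False, True],
    [False, False, False, False, False],
    [True, True, True, True, True],
]

print(best_at_k(outcomes))  # 0.75
print(avg_at_k(outcomes))   # 0.5
```

Under this reading, a best@5 of 66.3% on 86 tasks is consistent with 57 tasks solved at least once (57/86 ≈ 66.3%), against 36 for the previous 41.9% (36/86 ≈ 41.9%). The gap between best@k and avg@k in the toy data also shows why the two metrics are reported separately: a task solved once in five runs counts fully toward best@5 but only 20% toward avg@5.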

AI-Radar Takeaway

AOI (Autonomous Operations Intelligence) is a new multi-agent framework that uses failed operational trajectories to improve automated cloud diagnosis. It combines preference-based learning, a safe execution architecture, and continuous error correction, and it outperforms the previous state of the art on the AIOpsLab benchmark.

