AOI: Autonomous Cloud Diagnosis Through Learning from Failures
Managing cloud infrastructure requires increasingly sophisticated diagnostic systems. AOI (Autonomous Operations Intelligence) is a framework that addresses this challenge, leveraging operational failures as learning opportunities for AI agents.
AOI aims to automate Site Reliability Engineering (SRE) tasks using LLMs while addressing three limitations of existing approaches: restricted access to proprietary data, unsafe action execution, and the inability to improve from failures.
Key Components of AOI
- Trainable Diagnostic System: Uses Group Relative Policy Optimization (GRPO) to transfer expert knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data.
- Read-Write Separated Execution Architecture: Divides operational trajectories into observation, reasoning, and action phases, ensuring safe learning and preventing unauthorized state mutation.
- Evolver for Error Trajectories: Analyzes unsuccessful trajectories and transforms them into corrective supervision signals, enabling continuous data augmentation.
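The "group relative" part of GRPO in the first component can be illustrated with a small sketch. In the commonly described form of GRPO, several responses are sampled for the same prompt, each is scored, and each reward is normalized against the mean and standard deviation of its own group, so no separate value model is needed. The function name, the binary reward values, and the `eps` constant below are illustrative assumptions and are not taken from AOI's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scores for several responses sampled from the SAME prompt.
    Returns one advantage per response: positive if the response scored
    above the group mean, negative if below.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four diagnostic attempts on one incident,
# scored 1.0 when the diagnosis was judged correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # roughly +1 for the successes and -1 for the failures
```

When every attempt in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a mix of successful and failed trajectories is useful for this style of training.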
Results
On the AIOpsLab benchmark, AOI reports the following results:
- AOI runtime achieves a best@5 success rate of 66.3% on 86 tasks, surpassing the previous state-of-the-art (41.9%).
- Adding Observer GRPO training to a locally deployed 14B model yields 42.9% avg@1 on 63 tasks with unseen fault types, surpassing Claude Sonnet 4.5.
- The Evolver converts 37 failed trajectories into diagnostic guidance, improving the end-to-end avg@5 by 4.8 points and reducing variance by 35%.
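The results above mix two kinds of metric, best@k and avg@k. The sketch below shows one common reading of them, assuming each task is attempted k times with a pass/fail outcome: avg@k is the mean per-run success rate, and best@k counts a task as solved if any of its k runs succeeds. These definitions, the function names, and the sample data are assumptions for illustration; the benchmark's exact scoring rules may differ.

```python
from statistics import mean


def avg_at_k(results):
    """Mean success rate over all runs, averaged per task.

    results: dict mapping task id -> list of k booleans (one per run).
    """
    return mean(sum(runs) / len(runs) for runs in results.values())


def best_at_k(results):
    """Fraction of tasks where at least one of the k runs succeeded."""
    return mean(1.0 if any(runs) else 0.0 for runs in results.values())


# Hypothetical outcomes for three tasks, five runs each.
sample = {
    "task-a": [True, False, False, False, False],
    "task-b": [False, False, False, False, False],
    "task-c": [True, True, True, True, True],
}
print(f"avg@5  = {avg_at_k(sample):.3f}")
print(f"best@5 = {best_at_k(sample):.3f}")
```

Under this reading best@k can never be lower than avg@k on the same runs, so the two kinds of figure are not directly comparable with each other.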