1 of 10

AIOpsLab Evaluation

Kosumi Chan

Chaitanya (CC)

2 of 10

Motivation

  • Modern applications are built on complex, distributed architectures like microservices.
  • This complexity increases the frequency and blast radius of system incidents; outages can cost millions of dollars.
  • Manual, human-driven operations are becoming slow, error-prone, and unsustainable.
  • The Goal: A fundamental shift towards intelligent automation to ensure system reliability and resilience.

Image generated by ChatGPT

3 of 10

The DevOps Evolution

  • Ops: running systems in production (availability, performance, cost, security)
  • DevOps: culture + practices to ship fast and safe (collaboration, automation, CI/CD, IaC)
  • Incident lifecycle: Detect → Triage/Localize → RCA → Mitigate → Postmortem
  • Observability pillars: logs, metrics, traces (+ events)
  • Core reliability concepts: SLI/SLO, error budgets, on-call & runbooks
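The error-budget idea can be made concrete with a little arithmetic. The sketch below assumes an availability SLO measured over a 30-day window; the function names and the example numbers are illustrative and are not taken from any particular tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget is left.
    print(round(budget_remaining(0.999, 10.8), 2))
```

A team that has spent most of its budget would typically slow down risky releases until the window rolls over.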

Image from https://www.shutterstock.com/image-vector/devops-process-8-stages-software-development-1957460116?dd_referrer=https%3A%2F%2Fwww.google.com%2F

4 of 10

Microservices

  • Small, independently deployable services with clear APIs
  • Pros: autonomy, tech heterogeneity, targeted scaling, failure isolation
  • Cons: distributed complexity, network hops, data consistency
  • Operational needs: distributed tracing, centralized logging, dependency maps, chaos/fault injection

https://medium.com/javarevisited/5-essential-frameworks-every-java-developer-should-learn-6ed83315f1fb

5 of 10

Managing Microservices at Scale - Kubernetes

  • Orchestration platform - applications packaged in lightweight Containers
  • Schedules services across a cluster of machines - Nodes
  • Controllers - reconcile desired vs. actual state
  • Automation: autoscaling, rolling updates, rollbacks, health checks, configs
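The "desired vs. actual state" idea behind controllers can be sketched as a small reconcile function. This is a toy model in plain Python, not the Kubernetes API: the `Deployment` class, its fields, and the action strings are hypothetical, and every action is assumed to succeed.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    desired_replicas: int
    running_replicas: int


def reconcile(dep: Deployment) -> list[str]:
    """One reconcile pass: return the actions needed to reach the desired state."""
    diff = dep.desired_replicas - dep.running_replicas
    if diff > 0:
        actions = [f"start pod for {dep.name}"] * diff
    elif diff < 0:
        actions = [f"stop pod for {dep.name}"] * (-diff)
    else:
        actions = []
    # Toy assumption: every action succeeds, so actual now matches desired.
    dep.running_replicas = dep.desired_replicas
    return actions


if __name__ == "__main__":
    frontend = Deployment("frontend", desired_replicas=3, running_replicas=1)
    print(reconcile(frontend))  # two "start pod" actions
    print(reconcile(frontend))  # nothing left to do
```

A real controller runs this comparison continuously, which is why a crashed pod is replaced without anyone issuing a command.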

https://medium.com/@anujuly92/explain-the-key-components-of-kubernetes-master-node-worker-node-etcd-kubelet-kube-proxy-etc-2e9154c890d0

6 of 10

The Next Step: AIOps

  • AIOps - Artificial Intelligence for IT Operations
  • Where can AI help?
    • Anomaly Detection: Automatically finding strange patterns in logs and metrics.
    • Event Correlation: Reducing alert noise by grouping related events.
    • Automated Root Cause Analysis: Identifying the source of an incident.
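Two of these tasks can be illustrated with deliberately simple baselines. The sketch below uses a z-score against a known-good baseline for anomaly detection and a time-window grouping for event correlation. Production AIOps systems use far richer models; the function names, thresholds, and sample data here are assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(baseline: list[float], observed: list[float],
                   threshold: float = 3.0) -> list[int]:
    """Indices of observed values more than `threshold` std devs from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(observed) if abs(v - mu) / sigma > threshold]


def correlate(alerts: list[tuple[float, str]],
              window_s: float = 60.0) -> list[list[str]]:
    """Group alerts that fire within `window_s` seconds of the previous alert."""
    groups: list[list[str]] = []
    last_ts = None
    for ts, name in sorted(alerts):
        if last_ts is not None and ts - last_ts <= window_s:
            groups[-1].append(name)
        else:
            groups.append([name])
        last_ts = ts
    return groups


if __name__ == "__main__":
    baseline_ms = [100, 102, 98, 101, 99, 100, 103, 97]
    print(find_anomalies(baseline_ms, [101, 99, 140, 102]))
    alerts = [(0, "db latency high"), (20, "api 5xx"),
              (45, "checkout timeout"), (600, "disk usage")]
    print(correlate(alerts))
```

In this example the 140 ms sample is flagged, and the first three alerts collapse into one group that an operator can treat as a single incident.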

7 of 10

Background - Compound AI Systems (LLM agents)

  • Maximize LLM capability by integrating the model into a compound system of multiple components, tailored to the complex problem being solved

Image from https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/

ReAct architecture

Image from https://research.google/blog/react-synergizing-reasoning-and-acting-in-language-models/
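The ReAct pattern interleaves reasoning ("Thought"), tool use ("Action"), and tool output ("Observation") until the agent decides to finish. The sketch below shows only the control loop. The LLM is replaced by a hard-coded `scripted_policy`, and the `get_logs` tool and its output are hypothetical stand-ins, so nothing here reflects a real model or a real cluster.

```python
from typing import Callable


# Hypothetical tool; a real agent would call a shell, kubectl, or a telemetry API.
def get_logs(service: str) -> str:
    fake_logs = {"checkout": "ERROR: connection refused to payment-db"}
    return fake_logs.get(service, "no errors")


TOOLS: dict[str, Callable[[str], str]] = {"get_logs": get_logs}


def scripted_policy(history: list[str]) -> tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action, argument)."""
    if not any(line.startswith("Observation:") for line in history):
        return ("I should inspect the failing service.", "get_logs", "checkout")
    return ("The logs point at the database.", "finish", "payment-db")


def react(task: str, policy=scripted_policy,
          max_steps: int = 5) -> tuple[str, list[str]]:
    """Run the thought/action/observation loop until 'finish' or max_steps."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        history.append(f"Action: {action}({arg})")
        if action == "finish":
            return arg, history
        history.append(f"Observation: {TOOLS[action](arg)}")
    return "no answer", history


if __name__ == "__main__":
    answer, trace = react("Find the root cause of checkout failures")
    print("\n".join(trace))
    print("Answer:", answer)
```

Swapping `scripted_policy` for a function that prompts a model with the history is the only change needed to turn the stub into an actual agent loop; the `max_steps` cap bounds cost if the model never finishes.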

8 of 10

Background: AIOpsLab

A holistic framework to enable the design, development, and evaluation of autonomous AIOps agents.

Extensible parts:

  • Agents: different base models and different architectures (GPT-4-W-SHELL, FLASH, ReAct, …)
  • Faults: symptomatic faults (surface-level) and functional faults (deeper)
  • Workloads: service traffic
  • Evaluations: performance metrics including correctness and Time-To-Detect
  • Services: the cloud application under test
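To make the evaluation metrics concrete, the sketch below computes accuracy and mean Time-To-Detect over a handful of trials. It does not use the AIOpsLab API: the `Trial` record, its fields, and the sample values are hypothetical and only illustrate how such metrics might be aggregated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    injected_at: float            # seconds: when the fault was injected
    detected_at: Optional[float]  # seconds: when the agent reported it (None = missed)
    predicted: str                # component the agent blamed
    expected: str                 # ground-truth faulty component


def summarize(trials: list[Trial]) -> dict:
    """Aggregate accuracy and mean Time-To-Detect (over detected trials only)."""
    detected = [t for t in trials if t.detected_at is not None]
    accuracy = sum(t.predicted == t.expected for t in trials) / len(trials)
    mean_ttd = (
        sum(t.detected_at - t.injected_at for t in detected) / len(detected)
        if detected else None
    )
    return {"accuracy": accuracy, "mean_ttd_s": mean_ttd}


if __name__ == "__main__":
    trials = [
        Trial(0.0, 30.0, "payment-db", "payment-db"),
        Trial(0.0, 90.0, "frontend", "cart"),
        Trial(0.0, None, "", "redis"),
    ]
    print(summarize(trials))
```

Reporting missed detections separately from mean Time-To-Detect avoids rewarding an agent that answers quickly but only on easy cases.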

Image from AIOpsLab 2025 paper: https://arxiv.org/pdf/2501.06706

9 of 10

Research questions

  • How do state-of-the-art models (GPT-5, Gemini 2.5 Pro) perform? In the paper, GPT-4-W-SHELL significantly outperforms GPT-3.5-W-SHELL.
  • How does an agent's performance improve with a larger thinking budget?
  • AIOpsLab provides a set of simplified, isolated faults, but how does an agent perform when facing real-world systemic faults?
  • How can the safety of an agent be evaluated? It does not have to outperform a human operator, but it should at least not make a bad situation worse.

10 of 10

Approach

  1. Partially reproduce the evaluation results from the paper
  2. Add state-of-the-art models, including GPT-5 and Gemini 2.5 Pro
  3. Evaluate new models and the impact of thinking budget
  4. Construct new faults that are closer to real-world settings
  5. Design safety metrics for the cluster
  6. Evaluate agent performance