1 of 10

AIOpsLab Evaluation

Kosumi Chan

Chaitanya (CC)

2 of 10

Motivation

Modern applications are built on complex, distributed architectures like microservices.
This complexity increases the frequency and blast radius of system incidents; outages can cost millions of dollars.
Manual, human-driven operations are becoming slow, error-prone, and unsustainable.
The Goal: A fundamental shift towards intelligent automation to ensure system reliability and resilience.

Image generated by ChatGPT

Good morning. I want to start by talking about the fundamental challenge driving our work. The way we build and run software has radically changed. We've moved from simple, monolithic applications—where everything was in one large codebase—to highly distributed systems composed of hundreds, or even thousands, of independent microservices.

While this architectural shift has given us incredible development speed and scalability, it has created a massive operational challenge. When something goes wrong in a monolith, you generally know where to look. But in a microservice environment, a single user request might travel through dozens of services, making incident diagnosis like finding a needle in a global haystack.

The result is that human operators are facing immense pressure. They are trying to manually sift through mountains of data to find the root cause of failures. This process is not only slow, leading to longer downtimes, but it's also prone to human error. We've simply reached a point where the complexity of our systems has outpaced our ability to manage them manually. This sets the stage for our core motivation: the urgent need for a new paradigm of intelligent, automated operations to keep these critical systems running.

Previous content:

Businesses increasingly rely on cloud services like AWS to scale their applications.

Cloud faults can lead to outages, which may cost millions of dollars.

Currently, Site Reliability Engineers (SREs) must remain on call to manually detect, diagnose, and remediate non-trivial faults when they occur.

However, AI agents show promise for automating fault detection and executing remediation actions in real time.

3 of 10

The DevOps Evolution

Ops: running systems in production—availability, performance, cost, security
DevOps: culture + practices to ship fast and safe (collaboration, automation, CI/CD, IaC)
Incident lifecycle: Detect → Triage/Localize → RCA → Mitigate → Postmortem
Observability pillars: logs, metrics, traces (+ events)
Core reliability concepts: SLI/SLO, error budgets, on-call & runbooks

Image from https://www.shutterstock.com/image-vector/devops-process-8-stages-software-development-1957460116?dd_referrer=https%3A%2F%2Fwww.google.com%2F

Traditionally, we had a model where developers would write code and then essentially 'throw it over a wall' to the operations team to run and maintain. This created what was known as the 'wall of confusion.' The development team's goal was to push out new features quickly, while the operations team's goal was to keep everything stable, which often meant resisting change.

DevOps isn’t a role so much as a set of practices that align Dev and Ops—automated tests, continuous delivery, infrastructure as code—so changes are frequent but safe.

Development and Ops teams work together, sharing responsibility for the application throughout its entire lifecycle. This is enabled by a heavy focus on automation—automating everything from building and testing the code to deploying it and monitoring its performance.

When incidents happen, the journey is detect, localize, explain, fix, and learn.

Observability is how we see the system—metrics for time-series health, logs for events and context, traces for cross-service flows.

Reliability is formalized through SLIs and SLOs, with an error budget that tells us how much risk we can take on. Practically, we care about shrinking time to detect and time to restore, while keeping change failure rate low.
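
To make the error-budget idea concrete, here is a minimal sketch. It assumes a request-based availability SLI and a 99.9% SLO over a 30-day window; the function names and all numbers are illustrative and not taken from any particular system.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests allowed to fail in the window without breaching the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached).

    Assumes slo_target < 1.0, so the budget is never zero.
    """
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget


if __name__ == "__main__":
    # Hypothetical 30-day window: 10 million requests, 4,000 failures.
    total, failed, slo = 10_000_000, 4_000, 0.999
    sli = 1 - failed / total
    print(f"SLI = {sli:.4%}")
    print(f"budget = {error_budget(slo, total):.0f} failed requests")
    print(f"remaining = {budget_remaining(slo, total, failed):.0%}")
```

When the remaining budget is large, the team can afford riskier changes; when it approaches zero, the usual response is to slow down releases and focus on stability.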

4 of 10

Microservices

Small, independently deployable services with clear APIs
Pros: autonomy, tech heterogeneity, targeted scaling, failure isolation
Cons: distributed complexity, network hops, data consistency
Operational needs: distributed tracing, centralized logging, dependency maps, chaos/fault injection

https://medium.com/javarevisited/5-essential-frameworks-every-java-developer-should-learn-6ed83315f1fb

5 of 10

Managing Microservices at Scale - Kubernetes

Orchestration platform: applications are packaged in lightweight containers
Schedules services across a cluster of machines (nodes)
Controllers reconcile desired vs. actual state
Automation: autoscaling, rolling updates, rollbacks, health checks, configs

https://medium.com/@anujuly92/explain-the-key-components-of-kubernetes-master-node-worker-node-etcd-kubelet-kube-proxy-etc-2e9154c890d0

So, if we have hundreds of microservices, how do we actually run them? Managing them manually across a fleet of servers is simply not feasible. This is the problem that Kubernetes solves.

At its core, Kubernetes is an orchestration engine. You package your microservices into standard units called containers, and you give them to Kubernetes. From there, Kubernetes takes over the management. It decides which servers to run them on. If a service becomes unhealthy, Kubernetes can automatically restart it. If a service is receiving too much traffic, Kubernetes can automatically scale it up by deploying more copies. And if an entire server fails, Kubernetes will move all the services that were running on it to healthy servers in the cluster.

It accomplishes this through a declarative model. You don't tell Kubernetes how to do something; you simply declare the state you want—for example, 'I want three copies of my API server running at all times'—and Kubernetes works continuously to make that desired state a reality. It has become the foundational operating system for the cloud, but it's important to remember that while it solves the problem of running services, it doesn't by itself solve the problem of diagnosing failures within those services.
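
The declarative model can be sketched as a control loop. This is a toy illustration in plain Python, not Kubernetes code: `reconcile`, the `api-N` replica names, and the in-memory set are hypothetical stand-ins for a controller, pods, and cluster state.

```python
import itertools

# Hypothetical ID source so every new replica gets a unique name.
_ids = itertools.count(1)


def reconcile(desired_replicas: int, running: set[str]) -> set[str]:
    """Return the set of running replicas after one reconciliation pass."""
    running = set(running)  # work on a copy, leave the caller's set untouched
    # Too few replicas: start new ones until actual matches desired.
    while len(running) < desired_replicas:
        running.add(f"api-{next(_ids)}")
    # Too many replicas: stop the surplus.
    while len(running) > desired_replicas:
        running.remove(sorted(running)[-1])
    return running


if __name__ == "__main__":
    state = reconcile(3, set())        # initial rollout
    print(sorted(state))
    state.discard(sorted(state)[0])    # simulate a crashed replica
    state = reconcile(3, state)        # the next pass closes the gap
    print(sorted(state))
```

The caller only ever states the desired count; the loop works out whether to start or stop replicas, which is the essence of "declare the state, don't script the steps."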

6 of 10

The Next Step: AIOps

AIOps - Artificial Intelligence for IT Operations
Where can AI help?

Anomaly Detection: Automatically finding strange patterns in logs and metrics.
Event Correlation: Reducing alert noise by grouping related events.
Automated Root Cause Analysis: Identifying the source of an incident.
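
As a rough illustration of anomaly detection on a metric, the sketch below flags points that deviate strongly from a trailing window, using a simple z-score. The function name, window size, threshold, and latency values are assumptions chosen for the example; production systems typically use more robust methods.

```python
from statistics import mean, stdev


def find_anomalies(series: list[float], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indices whose value deviates from the trailing window
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds with one spike.
    latency_ms = [101, 99, 100, 102, 98, 100, 101, 350, 99, 100]
    print(find_anomalies(latency_ms))
```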

All these trends converge: complex microservices running on Kubernetes, operated by DevOps teams awash in logs, metrics, and traces—so the human-only approach is no longer enough.

AIOps applies AI to make sense of this data and automate operational work.

In practice, it’s not “push button, fix prod.”

The near-term wins are reducing noise, correlating dozens of alerts into one actionable incident, spotting anomalies earlier, forecasting risk to flag dangerous deploys, and accelerating human decisions with better context.
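
Alert correlation can be sketched in a few lines. The example below groups alerts purely by time proximity; the `Alert` fields and the 60-second gap are assumptions for illustration, and real correlation engines usually also use service topology and alert content.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary reference point
    service: str
    message: str


def correlate(alerts: list[Alert], max_gap: float = 60.0) -> list[list[Alert]]:
    """Group alerts into incidents: an alert joins the current incident
    if it fires within `max_gap` seconds of the previous alert."""
    incidents: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= max_gap:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


if __name__ == "__main__":
    # Hypothetical burst of three related alerts, then an unrelated one later.
    alerts = [
        Alert(0, "db", "connection pool exhausted"),
        Alert(20, "api", "5xx rate high"),
        Alert(45, "cart", "checkout latency high"),
        Alert(600, "db", "disk usage warning"),
    ]
    for incident in correlate(alerts):
        print([a.service for a in incident])
```

Here four alerts collapse into two incidents, so the on-call engineer sees one page for the burst instead of three.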

LLM assistants help translate messy telemetry and guide on-call through runbooks.

The sensible path is crawl-walk-run: start with assistive insights, move to recommendations, then human-approved automation for low-risk tasks—always with strong guardrails, access controls, and clear explanations.

The goal is to shift Ops from reactive firefighting to proactive prevention, and ultimately toward self-healing systems that can detect, diagnose, and remediate incidents on their own.

7 of 10

Background - Compound AI Systems (LLM agent)

Maximize LLM capability by integrating the model into a compound system with multiple components, tailored to the complex problem being solved

Image from https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/

ReAct architecture

Image from https://research.google/blog/react-synergizing-reasoning-and-acting-in-language-models/
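
The ReAct pattern interleaves reasoning ("Thought"), tool calls ("Action"), and tool results ("Observation") until the agent produces an answer. The sketch below shows only the loop structure: `scripted_model` is a hard-coded stand-in for an LLM, and the tool names, service names, and outputs are hypothetical.

```python
from typing import Callable

# Hypothetical tools; a real agent would wrap shell commands or API calls.
TOOLS: dict[str, Callable[[str], str]] = {
    "get_logs": lambda svc: f"{svc}: connection refused to db:5432",
    "get_metrics": lambda svc: f"{svc}: error_rate=0.42",
}


def scripted_model(transcript: list[str]) -> str:
    """Stand-in for an LLM: picks the next step from the number of observations so far."""
    observations = sum(line.startswith("Observation:") for line in transcript)
    if observations == 0:
        return "Thought: check the error rate first\nAction: get_metrics[checkout]"
    if observations == 1:
        return "Thought: errors are high, read the logs\nAction: get_logs[checkout]"
    return "Thought: logs point at the database\nFinal Answer: checkout cannot reach db"


def react(question: str, model: Callable[[list[str]], str], max_steps: int = 5) -> str:
    """Run the Thought/Action/Observation loop until a final answer or the step limit."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)
        transcript.append(step)
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Parse "Action: tool[argument]" and run the named tool.
        action = step.split("Action:", 1)[1].strip()
        name, arg = action.rstrip("]").split("[", 1)
        transcript.append(f"Observation: {TOOLS[name](arg)}")
    return "no answer within step budget"


if __name__ == "__main__":
    print(react("Why is checkout failing?", scripted_model))
```

The `max_steps` cap is a simple guard against an agent that never converges; the parsing assumes well-formed model output and would need error handling with a real LLM.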

8 of 10

Background: AIOpsLab

A holistic framework to enable the design, development, and evaluation of autonomous AIOps agents.

Extensible parts:

Agents: Different base models and different architectures (GPT-4-W-SHELL, FLASH, ReAct, …)
Faults: Symptomatic faults (surface-level) and functional faults (deeper)
Workloads: service traffic
Evaluations: performance metrics including correctness and Time-To-Detect
Services: the cloud application under test

Image from AIOpsLab 2025 paper: https://arxiv.org/pdf/2501.06706

9 of 10

Research questions

How do state-of-the-art models (GPT-5, Gemini 2.5 Pro) perform? In the paper, GPT-4-W-SHELL significantly outperforms GPT-3.5-W-SHELL.
How does an agent's performance change with a larger thinking budget?
AIOpsLab has a set of simplified, isolated faults, but how does an agent perform when facing real-world systemic faults?
How can we evaluate the safety of an agent? It does not have to outperform a human, but it should at least not make a bad situation worse.

10 of 10

Approach

Partially reproduce the results from the paper's evaluation
Add state-of-the-art models, including GPT-5 and Gemini 2.5 Pro
Evaluate new models and the impact of thinking budget
Construct new faults that are closer to real-world settings
Design safety metrics for the cluster
Evaluate agent performance