What is an AI SRE?
Engineering teams are deploying AI agents to handle production operations. This deep dive shows how AI SREs work: building system understanding, investigating issues, and driving resolution. You'll learn their current capabilities and limitations, and how they will change the way engineering teams operate.
LLM-powered agents enable a new approach to production operations. These AI agents can investigate issues and drive them to resolution - using your existing tools and processes.
Today, the more grounded reality is that these agents are effective at specific operational tasks, particularly investigating and diagnosing issues. While fully autonomous operation remains a future goal, current agents are already saving engineering teams countless hours of investigation work. Think of them as AI Site Reliability Engineers (SREs): teammates who never sleep and love diving into those 3AM alerts.
This post explores how AI SREs work: how agents build system understanding, investigate issues, and learn from each interaction. We'll examine current agent capabilities, practical limitations, and the path towards autonomous operation.
The Breaking Point
Every engineering team with production responsibilities faces the same challenge: production environments generate an endless amount of operational work. Each new service multiplies potential failure modes, creating a constant stream of alerts and incidents. Engineers often spend their days context switching between building software and investigating issues, while important development work gets delayed.
Many teams have accepted triage as normal. Lower priority alerts wait while critical issues are handled. Engineers learn to ignore alerts until they boil over. Minor problems accumulate until they suddenly cascade into major incidents. Then, at 3AM, an engineer has to scramble to resolve a customer-facing incident. The reactive approach works until it doesn't.
Traditional approaches haven't solved this problem. Adding more engineers runs into Brooks's Law: coordination overhead grows roughly as n², since every pair of engineers is another communication path, quickly overwhelming any linear capacity gains. Adding more tools creates sprawl. Scripts break when systems change. Automation requires constant maintenance. Monitoring tools generate more noise than signal. The result is engineers spending hours investigating issues that could have been prevented, while actual development work sits idle.
AI SRE: From Reactive to Proactive Operations
Large language models, when deployed as autonomous agents, enable AI SREs that make independent operational decisions. Unlike copilots that require human guidance, these agents determine which tools to use and when — querying Datadog metrics, checking PagerDuty alerts, running kubectl commands. They can process thousands of signals simultaneously, maintain context across interactions, and learn continuously from experience.
An AI SRE connects directly to your production environment through existing APIs and permissions. It builds system understanding through documentation, metrics, logs, and alerts. Unlike static runbooks or traditional automation, agents can handle novel situations they haven’t been trained on by reasoning through them from first principles.
These agents excel at investigation and diagnosis, processing thousands of signals to identify potential issues. They analyze system metrics, logs, and traces - presenting both their findings and the evidence chain that led to their conclusions.
While an agent may misinterpret correlations or lack complete system context, it consistently narrows the search space by surfacing relevant patterns and potential causes, while engineers maintain full control over its actions.
An AI SRE at Work: When Minor Alerts Matter
Often, a low-severity alert contains early warning signs of an impending incident. An AI SRE processes these signals to prevent cascade failures. Here's an all-too-common example.
At 3AM, a Redis latency alert triggers in your recommendation service. Average command latency: 2.3ms, up from baseline 0.8ms. Memory fragmentation ratio: 1.89. Used memory: 12.8GB of 16GB allocated. The alert is marked P3 — low priority. Your on-call engineer, already dealing with a deployment issue, queues it for morning review. These metrics are well below crisis thresholds, and similar alerts have resolved themselves before.
The AI SRE sees something different. The recommendation service handles both product suggestions and session management — a design choice documented in last month's architecture review. Morning traffic will bring 30x more requests. Memory pressure now means potential cascade failure during peak hours.
The AI SRE expands the investigation rapidly. Memory usage in related services shows subtle pressure: cart service queue depth growing 15% hour over hour, session service error rate creeping up 0.01% every 30 minutes. A backup process, scheduled during "low traffic" hours, is competing for resources. Three downstream services show early warning signs in their connection pools.
Each metric individually looks minor. Together, they reveal an imminent system-wide failure:
- Growing memory pressure will trigger Redis evictions
- Session data loss will force re-authentication
- Shopping cart operations will timeout
- Product recommendations will degrade
- Customer experience impacts will compound
- Recovery will require multiple service restarts
The morning scenario is clear: Engineers would arrive to multiple incidents, customer complaints, and pressure to fix everything immediately. Recovery would take hours. Revenue impact would be significant. The post-mortem would show all the warning signs were there — buried in dashboards, hidden in metrics, lost in alert noise.
Instead, with high confidence in its analysis, the AI SRE triggers a PagerDuty escalation at 3 AM:
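A hedged sketch of what that escalation could look like, using PagerDuty's public Events API v2, is below; the routing key, summary, and custom details are illustrative placeholders rather than Cleric's actual integration.

```python
# Sketch only: an AI SRE escalation sent through PagerDuty's Events API v2.
# The routing key and all details below are placeholders.
import requests

requests.post(
    "https://events.pagerduty.com/v2/enqueue",
    json={
        "routing_key": "YOUR_INTEGRATION_KEY",
        "event_action": "trigger",
        "payload": {
            "summary": "Redis memory pressure in recommendation-service will cascade at morning peak",
            "source": "ai-sre",
            "severity": "warning",
            "custom_details": {
                "used_memory": "12.8GB of 16GB",
                "fragmentation_ratio": 1.89,
                "evidence": "cart queue depth +15%/hr, session errors +0.01%/30min, backup job competing for resources",
                "proposed_remediation": "low-risk configuration change, attached for review",
            },
        },
    },
    timeout=10,
)
```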
The on-call engineer, presented with clear evidence and a low-risk remediation, merges the fix. By morning:
- Memory usage stabilized at 68%
- Queue depths normal
- Error rates baseline
- No customer impact
- Knowledge captured and shared
This case shows the compound value of automated prevention. The immediate win was avoiding a morning outage, but the lasting benefit is capturing this failure pattern for future detection. Better system understanding means fewer incidents and more engineering time spent on actual development.
The key AI SRE capabilities that enabled this prevention:
- Building operational knowledge
- Awareness
- Investigation
- Resolution
Let's examine each of these capabilities in detail.
Core Capabilities
Building Operational Knowledge
Operational knowledge enables an AI SRE to quickly diagnose issues and eventually anticipate failures by understanding how systems interact and depend on each other.
The agent builds a knowledge graph from multiple sources: system queries, infrastructure code, monitoring data, documentation, and team communications. This captures service relationships and operational state, while a language model further infers additional connections from unstructured data - like linking services through shared variables or identifying implicit dependencies.
Graph building typically happens as a background process. Consider a simple Kubernetes deployment that the agent discovers during a routine cluster scan.
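As a rough stand-in for that scan, here is a sketch using the official `kubernetes` Python client; the deployments, environment variables, and limits it surfaces are illustrative, not a real cluster.

```python
# A minimal sketch of a background cluster scan with the official `kubernetes`
# Python client. Deployment names, env vars, and limits are illustrative.
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

for dep in apps.list_deployment_for_all_namespaces().items:
    container = dep.spec.template.spec.containers[0]
    env_names = [e.name for e in (container.env or [])]   # e.g. REDIS_HOST hints at a Redis dependency
    limits = container.resources.limits                   # resource constraints feed the knowledge graph
    print(dep.metadata.namespace, dep.metadata.name, env_names, limits)
```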
This Kubernetes object is processed, and a structured representation of its properties and relations is inserted into the AI SRE's knowledge graph:
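One possible structured form, sketched here with `networkx`; the node names, properties, and relations are illustrative rather than the agent's real schema.

```python
# Sketch: the deployment becomes nodes and edges in the knowledge graph.
# Names and attribute values are illustrative.
import networkx as nx

graph = nx.DiGraph()
graph.add_node("recommendation-service", kind="Deployment", replicas=3,
               workload="ml-inference", memory_limit="2Gi")
graph.add_node("redis-sessions", kind="Cache")
graph.add_edge("recommendation-service", "redis-sessions",
               relation="depends_on", evidence="REDIS_HOST env var in deployment spec")
```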
From this deployment spec alone, the agent learns the service's Redis dependency, its ML inference requirements, and its resource constraints. Additional API calls then reveal actual usage patterns and performance characteristics:
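For example, that enrichment might look like the query below, sketched with the legacy `datadog` Python client; the metric and tag names are assumptions, not the agent's actual queries.

```python
# Sketch: enriching the graph with observed Redis behaviour from Datadog.
# Metric and tag names are assumptions.
import time
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

now = int(time.time())
series = api.Metric.query(
    start=now - 24 * 3600,
    end=now,
    query="avg:redis.mem.fragmentation_ratio{service:recommendation-service}",
)
# The returned time series becomes an observed property on the Redis node:
# baseline latency, memory pressure, and daily traffic patterns.
```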
The agent also extracts critical context from team discussions in Slack:
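A sketch of that extraction with the `slack_sdk` client follows; the channel ID, token, and the simple keyword filter are assumptions about how this could work.

```python
# Sketch: pulling recent team discussion for operational context.
# Token and channel ID are placeholders.
from slack_sdk import WebClient

slack = WebClient(token="xoxb-REPLACE_ME")
history = slack.conversations_history(channel="C0123456789", limit=200)

for message in history["messages"]:
    text = message.get("text", "")
    # Messages mentioning known services are candidate context for the graph,
    # e.g. last month's decision to put session management on the same Redis.
    if "redis" in text.lower() or "recommendation" in text.lower():
        print(message.get("user"), text)
```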
These different sources combine into a more complete knowledge graph: the deployment spec supplies structure, metrics and logs supply observed behavior, and team discussions supply intent and history.
The graph updates through continuous background processes and active investigations - each interaction with production systems either validates existing knowledge or reveals new relationships. During investigations, the agent traverses these relationships to identify potential causes - from a service to its dependencies to their resource constraints to known failure patterns.
This knowledge building is crucial because modern systems are deeply interconnected: a single symptom often implicates services, caches, and queues far from where the alert fired.
The graph doesn't need to be perfect to be useful. Some relationships may be tentative, some properties may be outdated. What matters is capturing enough structure to reason about system behavior and guide investigations effectively.
Awareness
Engineers are flooded with operational noise—alerts, tickets, deployment issues, configuration changes, and support questions. An AI SRE integrates with your entire operational stack to filter this stream, detecting which signals need attention and which can be deprioritized. The goal is simple: take action when needed, stay quiet when not.
The agent processes signals across your environment, combining them to understand impact. A developer question about Redis might add context to minor latency spikes. A support ticket could connect seemingly unrelated errors. A configuration change could explain recent performance patterns. Over time, it learns which combinations deserve attention and which can be safely ignored.
The agent starts by learning from engineer feedback—which patterns matter, what impacts business, when to act. As it processes more signals and observes their business impact, it builds confidence in autonomous decisions. A latency spike that once required engineer review becomes a known pattern with clear action thresholds. A deployment pattern that caused customer impact becomes an automatic investigation trigger.
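One way to picture those learned thresholds is a simple rule table, sketched below; the patterns, confidence values, and actions are illustrative, not the agent's internal representation.

```python
# Toy sketch of learned triage rules; patterns, confidences, and actions are illustrative.
from dataclasses import dataclass

@dataclass
class TriageRule:
    pattern: str        # e.g. "redis latency spike + growing queue depth"
    confidence: float   # built up from engineer feedback and observed outcomes
    action: str         # "investigate", "escalate", or "suppress"

RULES = [
    TriageRule("redis latency spike + growing queue depth", 0.92, "investigate"),
    TriageRule("dev-environment disk warning", 0.85, "suppress"),
]

def triage(signal: str) -> str:
    for rule in RULES:
        if rule.pattern == signal and rule.confidence >= 0.8:
            return rule.action
    return "ask_engineer"   # low confidence: defer to a human instead of acting
```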
This autonomous triage lets engineers step back from constant operational noise. Instead of reviewing every alert or signal, they can focus on building resilient systems and addressing structural problems. Through its investigations, the AI SRE tells engineers which alerts genuinely need their attention.
Investigation
An AI SRE investigates like an engineer, but operates concurrently across many paths. When an issue occurs, the agent immediately draws on its awareness of the environment—recent deployments, team discussions, past incidents, and known failure patterns. Using its knowledge graph, it identifies which systems and dependencies could be involved in the issue.
The agent generates multiple hypotheses about potential root causes and tests them in parallel. It uses the same tools engineers use—querying Datadog metrics, checking Kubernetes logs, examining traces—but can investigate dozens of paths simultaneously. Each API response or command output either supports or challenges a hypothesis.
The depth of investigation scales with business impact. A critical authentication issue might warrant a deep investigation across multiple systems. A development environment warning might only justify a cursory search. An engineer-defined budget determines how long the agent spends gathering and analyzing evidence.
Consider a login failure investigation. The agent starts with three parallel paths based on historical patterns, sketched in code after this list:
- Direct dependencies: database connections, auth service health
- Related services: cache layer, session management
- Environmental factors: recent deployments, configuration changes
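A minimal sketch of that fan-out follows; the `check_*` helpers are hypothetical stand-ins for the real metric, log, and deployment queries.

```python
# Sketch: testing several hypotheses concurrently. The check_* helpers are
# hypothetical stand-ins for real Datadog, Kubernetes, and deployment queries.
import asyncio

async def check_database():      # connection pools, auth service health
    return {"hypothesis": "database", "supported": True}

async def check_cache_layer():   # cache hit rates, session store
    return {"hypothesis": "cache/session", "supported": False}

async def check_environment():   # recent deploys, config changes
    return {"hypothesis": "environment", "supported": False}

async def investigate():
    results = await asyncio.gather(check_database(), check_cache_layer(), check_environment())
    # Findings that support a hypothesis get expanded further; the rest are pruned.
    return [r for r in results if r["supported"]]

print(asyncio.run(investigate()))
```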
Within minutes, the investigation narrows to database connectivity:
- Connection pool at 95/100 capacity [1]
- Query latency increased 5x over baseline [2]
- No recent deployment or config changes [3]
- Similar patterns in past incidents [4]
Each finding increases confidence in a path. Error logs show connection failures starting exactly when pool usage hit 95%. Traces reveal query times climbing as connections are exhausted. Other paths show normal behavior. The evidence chain builds high confidence in connection pool saturation as the root cause.
The agent documents every step: the commands run, the data collected, the paths explored, and its findings. The investigation continues until a root cause is identified with high confidence, human input is needed, or the agent consumes its budget. Engineers can then review the agent's findings and take over, or provide more context and ask the agent to continue its search.
There are, of course, limitations. Complex interactions between systems can hide root causes. Tribal knowledge or context may be missing or inconsistent. The agent may not have access to a certain VPC. But even when the AI SRE doesn't find the exact cause, its iterative process produces findings that significantly reduce the search space for engineers—turning hours of investigation into minutes.
Resolution
Resolution turns investigation findings and a root cause into concrete changes in your infrastructure, such as updating Kubernetes configurations or service parameters. The goal is to close the loop from detection to fix with minimal human intervention, but this requires earning trust through consistently successful changes.
The agent operates under environment-specific rules that determine its level of autonomy; a sketch of such a policy follows the list:
- Development: Can auto-implement previously approved changes
- Staging: Can adjust resource limits and common configurations with team lead approval
- Production: All changes require engineering review and explicit approval
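One way such a policy might be written down; the structure and field names are illustrative, not Cleric's configuration format.

```python
# Illustrative autonomy policy per environment; not an actual configuration schema.
AUTONOMY_POLICY = {
    "development": {
        "auto_apply": "previously_approved_changes",
        "approval": None,
    },
    "staging": {
        "auto_apply": "resource_limits_and_common_configs",
        "approval": "team_lead",
    },
    "production": {
        "auto_apply": None,
        "approval": "engineering_review_and_explicit_sign_off",
    },
}
```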
Consider the login failure investigation from earlier: with connection pool saturation identified as the root cause, the agent proposes a bounded configuration change, raising the pool limit, and opens it for engineering review.
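A hedged sketch of what that proposed change could look like; the ConfigMap name, namespace, and `DB_POOL_SIZE` key are hypothetical, and in production the change would ship as a reviewed pull request rather than a direct API call.

```python
# Sketch only: raising the auth service's DB connection pool via a ConfigMap patch.
# Names and values are hypothetical; production changes require explicit approval.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
core.patch_namespaced_config_map(
    name="auth-service-config",               # hypothetical ConfigMap
    namespace="auth",
    body={"data": {"DB_POOL_SIZE": "200"}},   # pool previously saturating at 100 connections
)
```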
After changes deploy, the agent monitors the signals (system health, metrics, logs) that triggered the original investigation. While it can track if error rates or latency change, it can't definitively prove these changes caused any improvements. Production systems are too complex for simple before/after validation.
Successfully merged changes become part of the agent's knowledge base - not as guaranteed solutions or scripts, but as directionally useful approaches worth considering for future issues. Teams typically start the agent with read-only access in production, gradually expanding its capabilities based on its track record of suggesting useful changes.
The Road Ahead
Production systems have grown beyond human scale. The complexity and speed of modern infrastructure means engineers can't keep up with operational demands through traditional approaches—no matter how many dashboards we build or people we hire.
Engineering teams are starting to deploy AI SREs to handle this operational load. These systems work alongside engineers to investigate and solve problems in production environments, reducing the time spent on operational tasks from hours to minutes. They build deep understanding of your systems through each interaction, accumulating knowledge that would typically be scattered across wikis, tickets, and team chat.
Today's AI SREs are most effective at investigation and diagnosis. They can explore system relationships and test multiple hypotheses quickly, but they're gated in their ability to make production changes. Teams start by having agents monitor low priority workloads in production, gradually expanding to more critical services as reliability is demonstrated.
The path forward is through progressive trust building. Start with specific subsystems, expand scope as the AI SRE proves reliable, and gradually increase autonomy. Each successful investigation and resolution builds confidence in the system's capabilities.
AI agents are already changing how engineering teams operate. Engineers spend less time investigating issues and more time improving systems. Knowledge transfers automatically between teams and shifts. Operational overhead no longer scales linearly with system growth. While fully self-healing systems remain a distant goal, the path forward is clear - teams will progressively delegate more operational responsibility to AI agents, focusing their attention on building better products.
Get Started with Cleric
We're building Cleric to help engineering teams take the first step toward self-healing systems.
- Try Cleric: Get early access to start reducing operational overhead
- Join our team: We're hiring engineers in San Francisco to build self-healing systems