
In an increasingly complex digital environment, enterprises and cloud providers face significant challenges in developing, deploying, and maintaining advanced IT applications. While the widespread adoption of microservices and cloud-based serverless architectures has streamlined certain aspects of application development, it has also introduced new operational challenges, especially in fault diagnosis and mitigation. These complexities can result in outages and large-scale business disruptions, highlighting the critical need for robust solutions that ensure high availability and reliability of cloud services. As expectations of five-nines availability grow, organizations must navigate complex operational demands to maintain customer satisfaction and business continuity.
To address these challenges, recent research on using AIOps agents for cloud operations, such as AI agents for incident root cause analysis (RCA) and triage, has relied on proprietary services and datasets. Other previous studies use frameworks specific to the solutions they are building, or ad hoc, static benchmarks and metrics that fail to capture the dynamic nature of real-world cloud services. Moreover, current approaches do not agree on standard metrics or a standard taxonomy of operational tasks. This calls for a standardized, principled research framework to build, test, compare, and improve AIOps agents. Such a framework should allow agents to interact with realistic service operation tasks in a reproducible manner, and it must scale flexibly to new applications, workloads, and faults. Crucially, the goal is not only to evaluate AI agents but also to enable users to improve the agents themselves; for example, the framework should provide sufficient observability and also serve as a training environment (a “gym”) to generate samples for learning. Users developing agents for cloud operations tasks, for example with Azure AI Agent Service, can use AIOpsLab to evaluate and improve them.
We developed AIOpsLab, a holistic evaluation framework that enables researchers and developers to design, develop, evaluate, and enhance AIOps agents, while also serving the purposes of reproducibility, standardization, interoperability, and scalable benchmarking. AIOpsLab is open-sourced on GitHub under the MIT license, allowing researchers and engineers to use it to evaluate AIOps agents at scale. The AIOpsLab research paper was accepted at SoCC’24 (the Annual ACM Symposium on Cloud Computing).

Agent Cloud Interface (ACI)
AIOpsLab strictly separates the agent from the application service using an intermediate orchestrator, and it provides several interfaces for integrating and extending other system parts. First, the orchestrator establishes a session with the agent to share information about the benchmark problem: (1) the problem description, (2) instructions (e.g., the response format), and (3) the APIs available as actions.
The APIs are a set of documented tools, such as those for retrieving logs, retrieving metrics, and running shell commands, designed to help the agent solve tasks. There are no restrictions on the agent’s implementation. The orchestrator presents the problem and polls the agent for its next action given the previous results. Each action must be a valid API call, which the orchestrator validates and executes. The orchestrator has privileged access to the deployment and can take arbitrary actions (scaling up, redeploying, etc.) using appropriate tools (Helm, kubectl, etc.) to resolve problems on the agent’s behalf. Finally, the orchestrator calls the workload and fault generators to create service disruptions, which serve as live benchmark problems. AIOpsLab provides additional APIs to extend to new services and generators.
The example below shows how to onboard an agent to AIOpsLab:
```python
import asyncio

from aiopslab import Orchestrator

class Agent:
    def __init__(self, prob, instructs, apis):
        self.prompt = self.set_prompt(prob, instructs, apis)
        self.llm = GPT4()  # the agent's LLM wrapper, defined elsewhere

    async def get_action(self, state: str) -> str:
        return self.llm.generate(self.prompt + state)

# initialize the orchestrator
orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)

# register and evaluate the agent
agent = Agent(prob_desc, instructs, apis)
orch.register_agent(agent, name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))
```
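To make the interaction concrete, here is a minimal sketch of the orchestrator-side poll-validate-execute cycle described above. The method names (`problem_description`, `is_valid`, `execute`, `is_resolved`) are hypothetical illustrations, not AIOpsLab’s actual API:

```python
# A minimal, hypothetical sketch of the orchestrator's interaction loop; the
# method names below are illustrative and not part of AIOpsLab's actual API.
async def run_episode(orch, agent, max_steps: int = 10):
    state = orch.problem_description()          # problem statement shared in the session
    for _ in range(max_steps):
        action = await agent.get_action(state)  # poll the agent for its next action
        if not orch.is_valid(action):           # every action must be a valid API call
            state = "Invalid action; see the documented APIs."
            continue
        state = orch.execute(action)            # orchestrator runs it with privileged access
        if orch.is_resolved():                  # stop once the problem is mitigated
            break
```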
Service
AIOpsLab abstracts over a diverse set of services to reflect differences in operational environments. These include live, running services implemented with various architectural principles, such as microservices, serverless, and monoliths.
We also leverage a suite of open-source applications, such as DeathStarBench, which provide artifacts like source code and commit histories, as well as runtime telemetry. Adding tools like BluePrint can help extend AIOpsLab to other academic and production services.
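For illustration only, a service abstraction along these lines could look like the following; the `Service` class and its Helm-based deployment are a hypothetical sketch, not AIOpsLab’s actual interface:

```python
import subprocess

# Hypothetical sketch of a service abstraction; AIOpsLab's actual interface may differ.
class Service:
    """A deployable application under test (microservice, serverless, or monolith)."""
    def __init__(self, name: str, helm_chart: str, namespace: str):
        self.name, self.helm_chart, self.namespace = name, helm_chart, namespace

    def deploy(self):
        # Deploy the application with Helm, as the orchestrator would on the agent's behalf.
        subprocess.run(
            ["helm", "install", self.name, self.helm_chart, "-n", self.namespace],
            check=True,
        )

    def teardown(self):
        subprocess.run(["helm", "uninstall", self.name, "-n", self.namespace], check=True)
```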
Workload generator
AIOpsLab’s workload generator plays a key role by creating simulations of both faulty and normal scenarios. It receives specifications from the orchestrator, such as the task, desired effects, scale, and duration. The generator can use models trained on real production traces to produce workloads that match these specifications. Faulty scenarios, inspired by real incidents, may simulate conditions such as resource exhaustion, exercise edge cases, or trigger cascading failures. Normal scenarios mimic typical operational patterns, such as daily activity cycles and multi-user interactions. When different characteristics (service calls, user distributions, arrival times, etc.) can lead to the desired effect, multiple workloads can be stored in a problem cache for the orchestrator to use. Together with the fault generator, the workload generator can also use workloads to create complex fault scenarios.
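As a rough illustration of the idea, the sketch below generates request arrival times that follow a daily activity cycle; it is a simplification under assumed parameters (`base_rps`, `peak_rps`), not the trace-trained generator described above:

```python
import math
import random

# Hypothetical sketch: Poisson arrivals whose rate follows a daily activity cycle.
# The real generator can instead use models trained on production traces.
def generate_arrivals(duration_s: float, base_rps: float, peak_rps: float):
    """Yield request arrival times (in seconds) following a diurnal load pattern."""
    t = 0.0
    while t < duration_s:
        frac = (t % 86400) / 86400  # position within a 24-hour cycle
        rate = base_rps + (peak_rps - base_rps) * max(0.0, math.sin(2 * math.pi * frac))
        t += random.expovariate(max(rate, 1e-6))  # Poisson inter-arrival at current rate
        yield t
```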
Fault generator
AIOpsLab includes a novel push-button fault generator designed to be universally applicable across diverse cloud scenarios. Our approach integrates application and domain knowledge to create adaptable policies and “oracles” compatible with AIOps scenarios. These include fine-grained fault injection that can simulate complex failures inspired by real-world operational incidents. Additionally, the generator can inject faults at various system levels while maintaining semantic integrity, and it can expose root causes while accounting for interdependencies between cloud microservices. The versatility of the fault injector enables thorough testing and evaluation of AIOps capabilities, increasing the reliability and robustness of cloud systems.
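As a concrete, hypothetical illustration of push-button injection, the sketch below wires named faults to single kubectl commands; the fault names, target service, and namespace are assumptions for illustration, not AIOpsLab’s actual fault catalog:

```python
import subprocess

# Hypothetical sketch of push-button fault injection at the Kubernetes level;
# fault names, services, and namespaces below are illustrative assumptions.
FAULTS = {
    # Misconfiguration: point a service at a non-existent target port.
    "misconfig_port": [
        "kubectl", "patch", "svc", "user-service", "-n", "test-social-network",
        "-p", '{"spec":{"ports":[{"port":9090,"targetPort":9999}]}}',
    ],
    # Resource fault: scale a deployment down to zero replicas.
    "scale_to_zero": [
        "kubectl", "scale", "deploy", "user-service",
        "-n", "test-social-network", "--replicas=0",
    ],
}

def inject_fault(name: str):
    subprocess.run(FAULTS[name], check=True)  # push-button: one call injects the fault
```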
Observability
AIOpsLab is equipped with an extensible observability layer designed to provide comprehensive monitoring across the various system layers of any AIOps tool. AIOpsLab collects a wide range of telemetry data: (1) traces from Jaeger detailing the end-to-end path of a request through the distributed system, (2) application logs formatted and collected by Filebeat and Logstash, and (3) system metrics monitored by Prometheus. Additionally, AIOpsLab captures low-level system information such as syscall logs and cluster information. As mentioned earlier, the flexible APIs address potential data overload by surfacing the telemetry data relevant to the AIOps tool.
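To illustrate how an agent might consume this layer, here is a minimal sketch that gathers the three telemetry streams; `get_logs` mirrors the API shown in the example later, while `get_traces` and `get_metrics` are assumed stand-ins whose exact names and signatures may differ:

```python
# Hypothetical stand-ins for the telemetry APIs the orchestrator exposes as
# actions; exact names and signatures in AIOpsLab may differ.
def get_traces(service, namespace): ...   # Jaeger: end-to-end request paths
def get_logs(service, namespace): ...     # Filebeat/Logstash: application logs
def get_metrics(service, namespace): ...  # Prometheus: system metrics

def collect_telemetry(service: str, namespace: str) -> dict:
    """Gather the three main telemetry streams for one service."""
    return {
        "traces": get_traces(service, namespace),
        "logs": get_logs(service, namespace),
        "metrics": get_metrics(service, namespace),
    }
```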
AIOpsLab currently supports four key tasks within the AIOps domain: incident detection, localization, root cause diagnosis, and mitigation. It also supports several popular agent frameworks, including ReAct, AutoGen, and TaskWeaver. Two key insights from our research highlight the importance of observability and a well-designed ACI. Observability is critical for clearly diagnosing root causes; for example, pinpointing a misconfigured API gateway is crucial to preventing service downtime.
Flexibility is also an important factor: the ability to execute arbitrary shell commands enables effective troubleshooting in real-time scenarios. Finally, robust error handling is essential; providing agents with quality feedback about execution barriers, such as database connection failures, enables quick resolution and continuous improvement.
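One way to realize that error-handling principle, sketched here under assumptions (`run_api_call` is a hypothetical helper, not part of AIOpsLab), is to return execution failures to the agent as its next observation rather than aborting:

```python
# Hypothetical sketch: surface execution failures back to the agent as feedback,
# so it can adapt its next action instead of stalling. run_api_call is assumed.
def run_api_call(action: str) -> str: ...  # parses and executes a validated API call

def execute_with_feedback(action: str) -> str:
    try:
        return run_api_call(action)
    except ConnectionError as e:
        # Quality feedback: tell the agent why the action failed.
        return f"Error: could not reach the service ({e}); check connectivity and retry."
    except ValueError as e:
        return f"Error: invalid API call ({e}); consult the documented APIs."
```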
ACI example for diagnostic tasks
Agent (action):

```
get_logs("compose-post-service", "test-social-network")
```

Service: [2024-08-04 23:18:49.365494] … Thrift: Sun Aug 4 23:19:19 2024 TSocket::open() connect(): Connection refused …

Agent: The “Connection refused” errors when connecting to user-service at port 9090, even though the pod is running, suggest a possible network issue or a misconfiguration in service discovery.
Next steps
This research project adheres to Microsoft’s security standards and responsible AI principles, and we envision it developing into a vital resource for organizations looking to optimize their IT operations. Additionally, we plan to collaborate with various generative AI teams to incorporate AIOpsLab as a benchmark scenario for evaluating state-of-the-art models. In doing so, we aim to foster innovation and accelerate the development of more advanced AIOps solutions. This research matters not only to IT professionals but to anyone invested in the future of technology, because an increasingly automated world has the potential to redefine how organizations manage operations, respond to incidents, and ultimately serve their customers.
Acknowledgments
We would like to thank Yingfang Chen, Manish Shetty, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Dax Vandevoorde, Pedro Las-Casas, Shachee Mishra Gupta, and Suman Nath for their contributions to this project.