
In an increasingly complex digital environment, enterprises and cloud providers face significant challenges in developing, deploying, and maintaining advanced IT applications. While the widespread adoption of microservices and cloud-based serverless architectures has streamlined certain aspects of application development, it has also introduced new operational challenges, especially in fault diagnosis and mitigation. These complexities can result in outages and large-scale business disruptions, highlighting the critical need for robust solutions that ensure high availability and reliability of cloud services. As expectations of five-nines availability grow, organizations must navigate complex operational demands to maintain customer satisfaction and business continuity.
To address these challenges, recent research on using AIOps agents for cloud operations, such as AI agents for incident root cause analysis (RCA) and triage, has relied on proprietary services and datasets. Other previous studies use frameworks specific to the solutions they are building, or ad hoc, static benchmarks and metrics that fail to capture the dynamic nature of real-world cloud services. Moreover, current approaches do not agree on standard metrics or a standard taxonomy of operational tasks. This calls for a standardized, principled research framework to build, test, compare, and improve AIOps agents. Such a framework should allow agents to interact with realistic service operation tasks in a reproducible manner, and it must scale flexibly to new applications, workloads, and faults. Crucially, the goal is not only to evaluate AI agents but also to enable users to improve the agents themselves; for example, the framework should provide sufficient observability and also serve as a training environment (a “gym”) to generate samples for learning. Users developing agents for cloud operations tasks, for example with Azure AI Agent Service, can use AIOpsLab to evaluate and improve them.
We developed AIOpsLab, a holistic evaluation framework that enables researchers and developers to design, develop, evaluate, and enhance AIOps agents, while also serving the purposes of reproducibility, standardization, interoperability, and scalable benchmarking. AIOpsLab is open-sourced on GitHub under the MIT license, allowing researchers and engineers to use it to evaluate AIOps agents at scale. The AIOpsLab research paper was accepted at SoCC’24 (the Annual ACM Symposium on Cloud Computing).

Agent Cloud Interface (ACI)
AIOpsLab strictly separates the agent from the application service using an intermediate orchestrator, and it provides several interfaces for integrating and extending other system parts. First, the orchestrator establishes a session with the agent to share information about the benchmark problem: (1) the problem description, (2) instructions (e.g., the response format), and (3) the APIs available as actions.
The APIs are a set of documented tools, such as those for retrieving logs, retrieving metrics, and running shell commands, designed to help the agent solve tasks. There are no restrictions on the agent’s implementation. The orchestrator presents the problem and polls the agent for its next action given the previous results. Each action must be a valid API call, which the orchestrator validates and executes. The orchestrator has privileged access to the deployment and can take arbitrary actions (scaling up, redeploying, etc.) using appropriate tools (Helm, kubectl, etc.) to resolve problems on the agent’s behalf. Finally, the orchestrator calls the workload and fault generators to create service disruptions, which serve as live benchmark problems. AIOpsLab provides additional APIs to extend to new services and generators.
The example below shows how to onboard an agent to AIOpsLab:
```python
import asyncio

from aiopslab import Orchestrator

class Agent:
    def __init__(self, prob, instructs, apis):
        self.prompt = self.set_prompt(prob, instructs, apis)
        self.llm = GPT4()  # the agent's LLM wrapper, defined elsewhere

    async def get_action(self, state: str) -> str:
        return self.llm.generate(self.prompt + state)

# initialize the orchestrator
orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)

# register and evaluate the agent
agent = Agent(prob_desc, instructs, apis)
orch.register_agent(agent, name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))
```
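To make the interaction concrete, here is a minimal sketch of the orchestrator-side poll-validate-execute cycle described above. The method names (`problem_description`, `is_valid`, `execute`, `is_resolved`) are hypothetical illustrations, not AIOpsLab’s actual API:

```python
# A minimal, hypothetical sketch of the orchestrator's interaction loop; the
# method names below are illustrative and not part of AIOpsLab's actual API.
async def run_episode(orch, agent, max_steps: int = 10):
    state = orch.problem_description()          # problem statement shared in the session
    for _ in range(max_steps):
        action = await agent.get_action(state)  # poll the agent for its next action
        if not orch.is_valid(action):           # every action must be a valid API call
            state = "Invalid action; see the documented APIs."
            continue
        state = orch.execute(action)            # orchestrator runs it with privileged access
        if orch.is_resolved():                  # stop once the problem is mitigated
            break
```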
Service
AIOpsLab abstracts over a diverse set of services to reflect differences in operational environments. These include live, running services implemented with various architectural principles, such as microservices, serverless, and monoliths.
We also leverage a suite of open-source applications, such as DeathStarBench, which provide artifacts like source code and commit histories, as well as runtime telemetry. Adding tools like BluePrint can help extend AIOpsLab to other academic and production services.
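For illustration only, a service abstraction along these lines could look like the following; the `Service` class and its Helm-based deployment are a hypothetical sketch, not AIOpsLab’s actual interface:

```python
import subprocess

# Hypothetical sketch of a service abstraction; AIOpsLab's actual interface may differ.
class Service:
    """A deployable application under test (microservice, serverless, or monolith)."""
    def __init__(self, name: str, helm_chart: str, namespace: str):
        self.name, self.helm_chart, self.namespace = name, helm_chart, namespace

    def deploy(self):
        # Deploy the application with Helm, as the orchestrator would on the agent's behalf.
        subprocess.run(
            ["helm", "install", self.name, self.helm_chart, "-n", self.namespace],
            check=True,
        )

    def teardown(self):
        subprocess.run(["helm", "uninstall", self.name, "-n", self.namespace], check=True)
```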
Workload generator
AIOpsLab’s workload generator plays a key role by creating simulations of both faulty and normal scenarios. It receives specifications from the orchestrator, such as the task, desired effects, scale, and duration. The generator can use models trained on real production traces to produce workloads that match these specifications. Faulty scenarios, inspired by real incidents, may simulate conditions such as resource exhaustion, exercise edge cases, or trigger cascading failures. Normal scenarios mimic typical operational patterns, such as daily activity cycles and multi-user interactions. When different characteristics (service calls, user distributions, arrival times, etc.) can lead to the desired effect, multiple workloads can be stored in a problem cache for the orchestrator to use. Together with the fault generator, the workload generator can also use workloads to create complex fault scenarios.
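As a rough illustration of the idea, the sketch below generates request arrival times that follow a daily activity cycle; it is a simplification under assumed parameters (`base_rps`, `peak_rps`), not the trace-trained generator described above:

```python
import math
import random

# Hypothetical sketch: Poisson arrivals whose rate follows a daily activity cycle.
# The real generator can instead use models trained on production traces.
def generate_arrivals(duration_s: float, base_rps: float, peak_rps: float):
    """Yield request arrival times (in seconds) following a diurnal load pattern."""
    t = 0.0
    while t < duration_s:
        frac = (t % 86400) / 86400  # position within a 24-hour cycle
        rate = base_rps + (peak_rps - base_rps) * max(0.0, math.sin(2 * math.pi * frac))
        t += random.expovariate(max(rate, 1e-6))  # Poisson inter-arrival at current rate
        yield t
```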
Fault generator
AIOpsLab includes a novel push-button fault generator designed to be universally applicable across diverse cloud scenarios. Our approach integrates application and domain knowledge to create adaptable policies and “oracles” compatible with AIOps scenarios. These include fine-grained fault injection that can simulate complex failures inspired by real-world operational incidents. Additionally, the generator can inject faults at various system levels while maintaining semantic integrity, and it can expose root causes while accounting for interdependencies between cloud microservices. The versatility of the fault injector enables thorough testing and evaluation of AIOps capabilities, increasing the reliability and robustness of cloud systems.
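As a concrete, hypothetical illustration of push-button injection, the sketch below wires named faults to single kubectl commands; the fault names, target service, and namespace are assumptions for illustration, not AIOpsLab’s actual fault catalog:

```python
import subprocess

# Hypothetical sketch of push-button fault injection at the Kubernetes level;
# fault names, services, and namespaces below are illustrative assumptions.
FAULTS = {
    # Misconfiguration: point a service at a non-existent target port.
    "misconfig_port": [
        "kubectl", "patch", "svc", "user-service", "-n", "test-social-network",
        "-p", '{"spec":{"ports":[{"port":9090,"targetPort":9999}]}}',
    ],
    # Resource fault: scale a deployment down to zero replicas.
    "scale_to_zero": [
        "kubectl", "scale", "deploy", "user-service",
        "-n", "test-social-network", "--replicas=0",
    ],
}

def inject_fault(name: str):
    subprocess.run(FAULTS[name], check=True)  # push-button: one call injects the fault
```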
Observability
AIOpsLab is equipped with an extensible observability layer designed to provide comprehensive monitoring across the various system layers of any AIOps tool. AIOpsLab collects a wide range of telemetry data: (1) traces from Jaeger detailing the end-to-end path of a request through the distributed system, (2) application logs formatted and collected by Filebeat and Logstash, and (3) system metrics monitored by Prometheus. Additionally, AIOpsLab captures low-level system information such as syscall logs and cluster information. As mentioned earlier, the flexible APIs address potential data overload by surfacing the telemetry data relevant to the AIOps tool.
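To illustrate how an agent might consume this layer, here is a minimal sketch that gathers the three telemetry streams; `get_logs` mirrors the API shown in the example later, while `get_traces` and `get_metrics` are assumed stand-ins whose exact names and signatures may differ:

```python
# Hypothetical stand-ins for the telemetry APIs the orchestrator exposes as
# actions; exact names and signatures in AIOpsLab may differ.
def get_traces(service, namespace): ...   # Jaeger: end-to-end request paths
def get_logs(service, namespace): ...     # Filebeat/Logstash: application logs
def get_metrics(service, namespace): ...  # Prometheus: system metrics

def collect_telemetry(service: str, namespace: str) -> dict:
    """Gather the three main telemetry streams for one service."""
    return {
        "traces": get_traces(service, namespace),
        "logs": get_logs(service, namespace),
        "metrics": get_metrics(service, namespace),
    }
```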
AIOpsLab currently supports four key tasks within the AIOps domain: incident detection, localization, root cause diagnosis, and mitigation. It also supports several popular agent frameworks, including ReAct, AutoGen, and TaskWeaver. Two key insights from our research highlight the importance of observability and a well-designed ACI. Observability is critical for clearly diagnosing root causes; for example, pinpointing a misconfigured API gateway is crucial to preventing service downtime.
Flexibility is also an important factor: the ability to execute arbitrary shell commands enables effective troubleshooting in real-time scenarios. Finally, robust error handling is essential; providing agents with quality feedback about execution barriers, such as database connection failures, enables quick resolution and continuous improvement.
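One way to realize that error-handling principle, sketched here under assumptions (`run_api_call` is a hypothetical helper, not part of AIOpsLab), is to return execution failures to the agent as its next observation rather than aborting:

```python
# Hypothetical sketch: surface execution failures back to the agent as feedback,
# so it can adapt its next action instead of stalling. run_api_call is assumed.
def run_api_call(action: str) -> str: ...  # parses and executes a validated API call

def execute_with_feedback(action: str) -> str:
    try:
        return run_api_call(action)
    except ConnectionError as e:
        # Quality feedback: tell the agent why the action failed.
        return f"Error: could not reach the service ({e}); check connectivity and retry."
    except ValueError as e:
        return f"Error: invalid API call ({e}); consult the documented APIs."
```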
ACI example for diagnostic tasks
Agent (action):

```
get_logs("compose-post-service", "test-social-network")
```

Service: [2024-08-04 23:18:49.365494] … Thrift: Sun Aug 4 23:19:19 2024 TSocket::open() connect(): Connection refused …

Agent: The “Connection refused” errors when connecting to user-service at port 9090, even though the pod is running, suggest a possible network issue or a misconfiguration in service discovery.
Next steps
This research project adheres to Microsoft’s security standards and responsible AI principles, and we envision it developing into a vital resource for organizations looking to optimize their IT operations. Additionally, we plan to collaborate with various generative AI teams to incorporate AIOpsLab as a benchmark scenario for evaluating state-of-the-art models. In doing so, we aim to foster innovation and accelerate the development of more advanced AIOps solutions. This research matters not only to IT professionals but to anyone invested in the future of technology, because an increasingly automated world has the potential to redefine how organizations manage operations, respond to incidents, and ultimately serve their customers.
Acknowledgments
We would like to thank Yingfang Chen, Manish Shetty, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Dax Vandevoorde, Pedro Las-Casas, Shachee Mishra Gupta, and Suman Nath for their contributions to this project.