ChaosEater: Fully Automating Chaos Engineering with Large Language Models

Background

Chaos Engineering (CE) is an engineering technique aimed at improving the resiliency of distributed systems. It involves artificially injecting specific failures into a distributed system and observing its behavior in response. Based on the observation, the system can be proactively reconfigured to handle those failures. Systematically, these CE operations are organized into a CE cycle consisting of four phases:

1. Hypothesis: Define steady states (i.e., normal behavior) of the system and injected failures. Then, make a hypothesis that the steady states are maintained in the system even when the failures occur.
2. Experiment: Inject the failures into the system while logging the system's response behavior.
3. Analysis: Analyze the logged data and check if the hypothesis is satisfied. If so, this CE cycle is finished here. If not, move to (4).
4. Improvement: Reconfigure the system to satisfy the hypothesis. The reconfigured system is tested again in (2) and (3), i.e., repeat (2) to (4) until the hypothesis is satisfied.
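
The control flow of this cycle can be sketched in a few lines of Python. This is only an illustration of the loop structure, not ChaosEater's implementation: the names `Hypothesis`, `run_ce_cycle`, `toy_experiment`, and `toy_improve`, as well as the toy "system" (a dictionary with a replica count), are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

System = Dict[str, int]          # hypothetical stand-in for a system configuration
Observations = Dict[str, float]  # hypothetical stand-in for logged behavior


@dataclass
class Hypothesis:
    description: str
    holds: Callable[[Observations], bool]  # phase (1): steady state as a check


def run_ce_cycle(
    system: System,
    hypothesis: Hypothesis,
    experiment: Callable[[System], Observations],
    improve: Callable[[System, Observations], System],
    max_iterations: int = 5,
) -> Tuple[System, int]:
    """Repeat phases (2) to (4) until the hypothesis is satisfied."""
    for iteration in range(1, max_iterations + 1):
        observations = experiment(system)        # (2) Experiment
        if hypothesis.holds(observations):       # (3) Analysis
            return system, iteration
        system = improve(system, observations)   # (4) Improvement
    raise RuntimeError("hypothesis still unsatisfied after max_iterations")


def toy_experiment(system: System) -> Observations:
    # Toy failure: one replica is killed; availability is the surviving fraction.
    replicas = system["replicas"]
    return {"availability": (replicas - 1) / replicas}


def toy_improve(system: System, observations: Observations) -> System:
    # Toy reconfiguration: add one replica.
    return {**system, "replicas": system["replicas"] + 1}


hypothesis = Hypothesis(
    description="availability stays at or above 50% when one replica fails",
    holds=lambda obs: obs["availability"] >= 0.5,
)
final_system, iterations = run_ce_cycle(
    {"replicas": 1}, hypothesis, toy_experiment, toy_improve
)
print(final_system, iterations)
```

In this toy run, the first iteration fails the hypothesis with a single replica, and the second iteration passes after the improvement step adds one.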

Recent CE tools, such as Chaos Monkey and Chaos Mesh, automate the execution of predefined CE experiments. Moreover, monitoring tools, such as Prometheus and k6, automate metric collection and threshold-based testing during chaos experiments. However, generative tasks, such as defining these experiments and reconfiguring the system after them, remain manual. To reduce the cost of these manual operations, we propose ChaosEater, a system for automating the entire CE cycle with Large Language Models (LLMs).

Figure 1: The systematic CE cycle. Existing tools partially automate the experiment and analysis, whereas hypothesis and improvement remain manual. ChaosEater automates all of them.

Proposed System: ChaosEater

ChaosEater is the first LLM-based system designed to fully automate the systematic CE cycle.
Its three key features to achieve fully automated CE are as follows:

1. Infrastructure as Code x LLMs for Software Engineering

In modern software systems, the entire system configuration can be managed as code through Infrastructure as Code (IaC). Additionally, existing CE tools enable the code-based management of failure injection. Therefore, CE for software systems can be regarded as a Software Engineering (SE) task. Meanwhile, LLMs have achieved significant success in SE tasks, such as code generation and debugging. Given the compatibility between IaC and LLMs for SE, we propose applying LLMs to CE for software systems. In particular, we focus here on Kubernetes (K8s) systems.

2. Agentic Workflow for Chaos Engineering

Agentic workflows, which define task flows (transitions) among multiple LLM agents, have been actively designed in various fields. Similarly, we design a novel agentic workflow for CE according to the systematic CE cycle (Figure 2). When a user's K8s system is provided as input, the workflow allows each LLM agent to autonomously perform hypothesis formulation, chaos experiments using Chaos Mesh, result analysis, and system reconfiguration. As a result, we obtain a K8s system reconfigured to satisfy the hypothesis as the final output. This is the first attempt to design an agentic workflow for fully automating the systematic CE cycle.

3. New components: VaC, Interfaces between LLMs and CE tools

We propose not only the novel workflow and instruction prompts for LLM agents but also component technologies essential for full CE automation, such as unit test-based hypothesis validation and integration with existing CE tools. For the former, we introduce a novel concept called Validation as Code (VaC). Instead of validating hypotheses by inspecting logs on the spot, the LLM generates unit test code during hypothesis formulation, and this code is then used to validate the hypothesis during chaos experiments. By explicitly representing and fixing the validation process as code, this approach ensures transparency and consistency in LLMs' decision-making. For the latter, we develop interfaces to bridge LLMs with existing CE tools, such as a rule-based algorithm that converts the LLM's experimental plans into Chaos Mesh workflow manifests for scheduling complex chaos experiments.
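
To make the idea of a rule-based plan-to-manifest conversion concrete, here is a minimal Python sketch. It is not the algorithm used in ChaosEater, whose details are not given here. The plan format (ordered stages, with the faults inside a stage running in parallel), the function names, and all resource names and labels are assumptions for illustration. The manifest field names (`entry`, `templates`, `templateType`, `children`, `deadline`, `podChaos`, `networkChaos`) follow our understanding of the Chaos Mesh Workflow schema and should be checked against the Chaos Mesh documentation before use.

```python
import json
from typing import Any, Dict, List

# Hypothetical plan format: stages run in order; faults within a stage run in parallel.
plan: List[Dict[str, Any]] = [
    {"stage": "network-delay", "faults": [{
        "name": "delay-traffic", "type": "NetworkChaos", "deadline": "30s",
        "spec": {"action": "delay", "mode": "all",
                 "selector": {"labelSelectors": {"app": "example"}},
                 "delay": {"latency": "100ms"}}}]},
    {"stage": "pod-failure", "faults": [{
        "name": "kill-pod", "type": "PodChaos", "deadline": "30s",
        "spec": {"action": "pod-kill", "mode": "one",
                 "selector": {"labelSelectors": {"app": "example"}}}}]},
]


def spec_key(template_type: str) -> str:
    """Map a chaos type to its spec field name, e.g. 'PodChaos' -> 'podChaos'."""
    return template_type[0].lower() + template_type[1:]


def plan_to_workflow(name: str, plan: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply fixed rules: one Serial entry, one Parallel template per stage."""
    templates: List[Dict[str, Any]] = []
    stage_names: List[str] = []
    for stage in plan:
        children = []
        for fault in stage["faults"]:
            templates.append({
                "name": fault["name"],
                "templateType": fault["type"],
                "deadline": fault["deadline"],
                spec_key(fault["type"]): fault["spec"],
            })
            children.append(fault["name"])
        templates.append({
            "name": stage["stage"],
            "templateType": "Parallel",
            "children": children,
        })
        stage_names.append(stage["stage"])
    entry = {"name": "entry", "templateType": "Serial", "children": stage_names}
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Workflow",
        "metadata": {"name": name},
        "spec": {"entry": "entry", "templates": [entry] + templates},
    }


workflow = plan_to_workflow("example-workflow", plan)
print(json.dumps(workflow, indent=2))
```

Keeping this step rule-based means the LLM only has to produce a small structured plan, while the nesting and naming of the manifest are produced deterministically.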

Figure 2: The agentic workflow of ChaosEater and its input and output. The internal agents autonomously complete the systematic CE cycle using CE tools.

Examples

Example #1: Nginx

System description

Nginx is a small-scale system that consists of two K8s manifests (i.e., two resources): pod.yaml and service.yaml. The former defines a Pod resource that includes an Nginx container, and the latter defines a Service resource that routes TCP traffic to the Pod.

Problem setting

To verify whether ChaosEater can improve the system when there are resiliency issues, we intentionally configure the resource with a non-resilient setting: we set the Pod's restartPolicy to Never in pod.yaml. With this configuration, once the Pod goes down, it will never restart, resulting in extended service outages. We validate whether ChaosEater correctly identifies and addresses this resiliency issue through a reasonable CE cycle.

Results

Given the Nginx system, ChaosEater defined "The Pod should be running at least 90% of the time during the check period" as one of the steady states during the hypothesis phase. It then generated a failure scenario for a cyberattack, where the Pod would go down after a network delay.
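
A steady state like this one can be expressed as test code in the spirit of VaC. The sketch below is not the test that ChaosEater generated; it is a minimal Python illustration. The function names, the sample count, and the stubbed `get_phase` callables are assumptions. A real test would query the K8s API for the Pod's phase and wait between samples, whereas the stubs here replay fixed sequences so that the example runs without a cluster.

```python
from typing import Callable, List


def running_ratio(phases: List[str]) -> float:
    """Fraction of samples in which the Pod phase was 'Running'."""
    if not phases:
        raise ValueError("no samples collected")
    return sum(phase == "Running" for phase in phases) / len(phases)


def check_steady_state(
    get_phase: Callable[[], str],  # hypothetical probe returning the Pod phase
    samples: int = 10,
    threshold: float = 0.9,
) -> float:
    """Raise AssertionError if the Pod ran for less than `threshold` of the samples."""
    phases = [get_phase() for _ in range(samples)]
    ratio = running_ratio(phases)
    if ratio < threshold:
        raise AssertionError(
            f"Pod was Running in {ratio:.0%} of samples, expected at least {threshold:.0%}"
        )
    return ratio


# Stub 1: a healthy Pod that is Running in every sample.
healthy = iter(["Running"] * 10)
healthy_ratio = check_steady_state(lambda: next(healthy))

# Stub 2: a Pod that fails after three samples and never restarts.
crashed = iter(["Running"] * 3 + ["Failed"] * 7)
try:
    check_steady_state(lambda: next(crashed))
    violated = False
except AssertionError as error:
    violated = True
    print(error)
```

Because the threshold and the sampling procedure are fixed in code, the same check can be rerun unchanged before and after reconfiguration.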

In the experiment phase, ChaosEater executed the chaos experiment to validate the steady states and successfully discovered that the Pod had not restarted after its failure.

In the analysis and improvement phases, ChaosEater analyzed the results and identified that the issue was caused by the restartPolicy being set to Never. It then replaced the Pod resource with a Deployment resource with three replicas.
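
The shape of this reconfiguration can be sketched as a transformation on manifests represented as Python dictionaries. This is an illustration only, not the manifest that ChaosEater produced: the function name `pod_to_deployment` and the resource name, labels, and image in the sample Pod are assumptions. The sketch sets restartPolicy to Always because that is the only value a Deployment's Pod template accepts, and it reuses the Pod's labels on the assumption that the Service selects the Pod by those labels.

```python
import copy
from typing import Any, Dict


def pod_to_deployment(pod: Dict[str, Any], replicas: int = 3) -> Dict[str, Any]:
    """Wrap a Pod manifest in a Deployment manifest with the given replica count."""
    labels = pod["metadata"].get("labels")
    if not labels:
        raise ValueError("the Pod needs labels so the Deployment selector can match it")
    pod_spec = copy.deepcopy(pod["spec"])
    pod_spec["restartPolicy"] = "Always"  # Deployments only allow Always
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": pod["metadata"]["name"], "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {"metadata": {"labels": dict(labels)}, "spec": pod_spec},
        },
    }


# Hypothetical Pod manifest with the non-resilient setting from the problem setting.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-pod", "labels": {"app": "example"}},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {"name": "nginx", "image": "nginx", "ports": [{"containerPort": 80}]}
        ],
    },
}

deployment = pod_to_deployment(pod)
print(deployment["kind"], deployment["spec"]["replicas"])
```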

Finally, ChaosEater re-executed the chaos experiment on the reconfigured Nginx and confirmed that the hypothesis was satisfied.

The cost and time for this CE cycle were approximately 0.21 USD and 11 minutes, respectively.

Example #2: SockShop

System description

SockShop is a practical, large-scale e-commerce system that consists of 29 manifests, which define the resources and databases for front-end pages, user information, orders, payment, shipping, and so on. The number of replicas of all the Deployment resources is originally set to one. However, with this setting, a service becomes unavailable whenever its single replica goes down.

Problem setting

To narrow down this original resiliency issue to a single point, we increase the replicas for Deployment resources other than front-end-dep.yaml to two, while keeping a single replica for front-end-dep.yaml. This relatively reduces the redundancy/resiliency of the front-end resource. We validate whether ChaosEater correctly identifies and addresses this resiliency issue through a reasonable CE cycle.
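
This adjustment amounts to a simple rule over the manifests, sketched below in Python with manifests represented as dictionaries keyed by filename. The function name `set_replicas` is an assumption, and apart from front-end-dep.yaml the filenames and the abbreviated manifest contents are illustrative placeholders rather than the actual SockShop files.

```python
import copy
from typing import Any, Dict, Set


def set_replicas(
    manifests: Dict[str, Dict[str, Any]],
    replicas: int,
    keep_unchanged: Set[str],
) -> Dict[str, Dict[str, Any]]:
    """Set the replica count of every Deployment except the listed files."""
    adjusted = {}
    for filename, manifest in manifests.items():
        manifest = copy.deepcopy(manifest)  # leave the input untouched
        if manifest.get("kind") == "Deployment" and filename not in keep_unchanged:
            manifest["spec"]["replicas"] = replicas
        adjusted[filename] = manifest
    return adjusted


# Abbreviated, illustrative manifests (only the fields used by the rule).
manifests = {
    "front-end-dep.yaml": {"kind": "Deployment", "spec": {"replicas": 1}},
    "carts-dep.yaml": {"kind": "Deployment", "spec": {"replicas": 1}},
    "carts-svc.yaml": {"kind": "Service", "spec": {}},
}

adjusted = set_replicas(manifests, replicas=2, keep_unchanged={"front-end-dep.yaml"})
for filename, manifest in adjusted.items():
    print(filename, manifest["spec"].get("replicas"))
```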

Results

Given the SockShop with adjusted replica counts, ChaosEater defined "front-end resources are always in the Ready state" as one of the steady states during the hypothesis phase. It then generated a failure scenario for a Black Friday sale, where the front-end resource would go down after an increase in CPU usage of the carts-db resource due to excessive access.

In the experiment phase, ChaosEater executed the chaos experiment to validate the steady states and successfully discovered the existence of downtime after the front-end resource failure.

In the analysis and improvement phases, ChaosEater analyzed the results and identified that the downtime was caused by the replica count of the front-end resource being set to 1. It then increased the replica count of the front-end resource to 2.

Finally, ChaosEater re-executed the chaos experiment on the reconfigured SockShop and confirmed that the hypothesis was satisfied.

The cost and time for this CE cycle were approximately 0.84 USD and 25 minutes, respectively.

Discussion

Broader Impacts

In recent years, the creation of software applications from natural language (text2app) using LLMs has been actively explored. However, most text2app systems give little consideration to the resiliency of the underlying system infrastructure of their generated applications. We believe that ChaosEater can effectively address this issue. By seamlessly integrating ChaosEater with these text2app systems, it becomes possible to automatically build resilient application systems in an end-to-end manner.

Moreover, the outputs of ChaosEater can also serve as training materials (including both good and bad practices) for the Chaos Game Day, a hands-on training exercise for CE engineers.

Limitations

ChaosEater currently has three main limitations:

Deployed environment: Although CE is ideally conducted in actual production environments, ChaosEater is currently only supported in development environments.

Limited to K8s manifest reconfiguration: Software systems consist of not only K8s manifests but also other types of codebases, such as HTML/CSS/JS and Python. To optimally improve system resiliency, reconfiguration of all types of codebases is necessary. However, ChaosEater currently supports reconfiguring only K8s manifests.

Vulnerability discovery: In the case studies, ChaosEater successfully improved systems with somewhat obvious resiliency issues. However, for systems that already possess a certain level of resiliency, ChaosEater fails to find new hidden issues through a CE cycle. This is a challenging task even for skilled engineers. Therefore, ChaosEater is currently capable of performing only at a level comparable to or lower than that of engineers.

Future Directions

Given the current limitations above, we share some future directions for ChaosEater:

Production deployment and security: If ChaosEater is deployed in production environments, further research on security will be necessary. This includes controlling more carefully the impact range of artificial failures (i.e., blast radius), preventing ChaosEater from being misused as a proxy to attack production services, and proposing emergency response measures, such as a higher-level monitoring system that continuously monitors ChaosEater and can intervene if necessary.

LLMs x Graphs: When reconfiguring systems across multiple types of code, it is essential to consider their dependencies. To achieve this, LLMs must be able to recognize the complex dependencies as a graph. We believe that leveraging recent advancements in LLMs x Graphs could effectively address this challenge.

Full automation of long-term multiple CE cycles: To overcome the third limitation, it is necessary to conduct multiple CE cycles for more complex systems over extended operational periods. By using ChaosEater's output as input for the next CE cycle, we can automate multiple CE cycles even with the current ChaosEater. However, we additionally need to develop techniques to manage the long-term history of the continuous CE cycles.

Evaluation frameworks: As there are currently no datasets or benchmarks for CE, we will construct them to enable more solid validation of ChaosEater. In addition, we plan to propose new metrics for quantitatively evaluating CE cycles conducted by ChaosEater. This is not easy because, even in cases where no improvements are made, CE cycles can still provide valuable insights. Therefore, the quality of CE cycles should not be judged solely on whether improvements were made; metrics that consider the philosophical aspects of CE are also necessary.

Citation

If you find this work useful, please cite our paper as follows:


@misc{dkiku2025chaoseater,
    title={ChaosEater: Fully Automating Chaos Engineering with Large Language Models}, 
    author={Daisuke Kikuta and Hiroki Ikeuchi and Kengo Tajiri},
    year={2025},
    eprint={2501.11107},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2501.11107}, 
}
