Generates a diagram highlighting what happened during an incident by ingesting the retrospective and associated codebase. LLM-powered.
$ incidentdiagram -f example_incident.txt -u https://github.com/user/impactedcodebase
Chart generated in artifacts/incident.md
In the fictional scenario illustrated above, the diagram shows an outage affecting a blog. A developer introduced a new feature to verify disk space before image uploads, but mistakenly set the requirement to 500GB. Production servers did not have this amount of available space, preventing the blog admin from uploading images to new posts and readers from adding images to their comments.
- .env file with an OpenAI, Gemini, or Anthropic API key (at least one)
- Python > 3.10
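As a quick sanity check of the .env requirement, the sketch below parses a .env file and reports which provider keys are set. The variable names (OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY) and both helper functions are assumptions made for illustration; .example.env is the authoritative source for the names the tool actually reads.

```python
import os

# Assumed variable names; check .example.env for the names the tool really uses.
CANDIDATE_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_providers(values):
    """Return the candidate key names that have a non-empty value."""
    return [name for name in CANDIDATE_KEYS if values.get(name)]


if __name__ == "__main__":
    found = configured_providers(parse_env_file(".env")) if os.path.exists(".env") else []
    print("Configured keys:", found or "none")
```

If the list comes back empty, at least one key still needs to be added to .env.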
python -m venv .venv
source .venv/bin/activate
pip install .
cp .example.env .env # Add api keys to .env after copying
incidentdiagram -f example_incident.txt -u https://github.com/Rootly-AI-Labs/EventOrOutage
The example above uses the app at https://github.com/Rootly-AI-Labs/EventOrOutage as the impacted codebase, together with a fictional incident retrospective in example_incident.txt.
The IncidentDiagram process works in three main steps:
- Understand the impacted codebase: analyze the file structure and the code within the files to generate a description of the components and their relationships. Returned in JSON format.
- Understand the postmortem/incident retrospective and produce a list of the affected components, matched to the components of the codebase. Returned in JSON format.
- Create a Mermaid.js diagram showing the components and their relationships, highlighting the components the incident affected. Returned in Markdown format.
All of these steps are done by prompting LLMs.
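To make the data flow between the three steps concrete, here is a minimal sketch of the last step, assuming the first two steps have already returned JSON. The JSON field names (components, relationships, affected), the component ids, and the render_mermaid function are hypothetical and chosen for illustration; the actual prompts and schemas used by the tool may differ. The sample data loosely follows the blog outage scenario described earlier.

```python
import json

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences

# Hypothetical output of step 1: components and their relationships.
components_json = json.dumps({
    "components": [
        {"id": "web", "label": "Blog Frontend"},
        {"id": "upload", "label": "Image Upload Service"},
        {"id": "disk", "label": "Disk Space Check"},
    ],
    "relationships": [
        {"source": "web", "target": "upload"},
        {"source": "upload", "target": "disk"},
    ],
})

# Hypothetical output of step 2: ids of the components the incident affected.
affected_json = json.dumps({"affected": ["upload", "disk"]})


def render_mermaid(components_json, affected_json):
    """Step 3: build a Markdown-wrapped Mermaid graph, highlighting affected nodes."""
    graph = json.loads(components_json)
    known = {c["id"] for c in graph["components"]}
    # Ignore affected ids that do not match a component found in step 1.
    affected = [c for c in json.loads(affected_json)["affected"] if c in known]

    lines = [FENCE + "mermaid", "graph TD"]
    for component in graph["components"]:
        node_id, label = component["id"], component["label"]
        lines.append(f'    {node_id}["{label}"]')
    for rel in graph["relationships"]:
        lines.append(f"    {rel['source']} --> {rel['target']}")
    if affected:
        lines.append("    classDef affected fill:#f96,stroke:#c00")
        lines.append("    class " + ",".join(affected) + " affected")
    lines.append(FENCE)
    return "\n".join(lines)


print(render_mermaid(components_json, affected_json))
```

Keeping the intermediate results as JSON means the rendering step stays deterministic: only the first two steps depend on model output, and the diagram can be regenerated without calling an LLM again.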
Here are a few ways you can use IncidentDiagram:
- incidentdiagram -f incident.txt -u https://github.com/Rootly-AI-Labs/EventOrOutage
  Downloads the code from GitHub and generates a diagram based on the incident summary in incident.txt.
- incidentdiagram -f incident.txt -u https://github.com/Rootly-AI-Labs/EventOrOutage/tree/main -m gpt-4o
  Uses a different model.
- incidentdiagram -iu www.postmortems.com/1345 -u https://github.com/Rootly-AI-Labs/EventOrOutage -m claude-3.5
  Downloads the incident summary from a URL and generates a diagram.
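The flags used in these examples (-f, -iu, -u, -m) can be summarized with a small argparse sketch. This is not the tool's actual argument parser: the dest names, help texts, the default of None for the model, and the assumption that -f and -iu are mutually exclusive are all illustrative guesses based only on the commands shown above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags seen in the usage examples."""
    parser = argparse.ArgumentParser(prog="incidentdiagram")

    # Assumption: the incident summary comes from a file OR a URL, not both.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", dest="incident_file",
                        help="path to a local incident retrospective")
    source.add_argument("-iu", dest="incident_url",
                        help="URL of an incident retrospective to download")

    parser.add_argument("-u", dest="repo_url", required=True,
                        help="GitHub URL of the impacted codebase")
    parser.add_argument("-m", dest="model", default=None,
                        help="model name, e.g. gpt-4o (default left to the tool)")
    return parser


# Parse the first usage example instead of reading sys.argv.
args = build_parser().parse_args(
    ["-f", "incident.txt", "-u", "https://github.com/Rootly-AI-Labs/EventOrOutage"]
)
print(args.incident_file, args.repo_url, args.model)
```

Running incidentdiagram with its help option, if it provides one, is the reliable way to confirm the real flags and defaults.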
- LLMs: OpenAI, Anthropic, and Gemini models
- Agent: HuggingFace smolagents
- Data Sources: External APIs for holidays, news, and event tracking
- Ability to handle a large application code base
- Ability to ingest multiple code bases and IaC files
- Add support for Ollama models
This is a prototype meant to demonstrate how LLMs can have a positive impact on SRE teams. It is not meant to be used in production.
Explaining an outage can be challenging, especially for complex incidents in distributed systems, which have become the norm. People also have different preferences for how information is presented, and often a visual representation is worth a thousand words. However, manually creating application and infrastructure diagrams is time-consuming, making it impractical to do so for every incident. That's why we believe IncidentDiagram could be a valuable tool for SREs and on-call practitioners, helping them quickly visualize and understand what went wrong.
This project was developed by the Rootly AI Labs. The AI Labs is building the future of system reliability and operational excellence. We operate as an open-source incubator, sharing ideas, experimenting, and rapidly prototyping. We're committed to ensuring our research benefits the entire community.