May 1, 2026

Daniel Anderson

Read Time: ~10 minutes

<aside> 📌

At Tilt, we look for people who challenge their own assumptions, value ownership and trust, and strive for excellence to inspire and empower their teams. If this article resonates with you, come join us!

Join Tilt.

</aside>

<aside> 📎

After 18 months and ~800 incidents, our Rootly classification fields were mostly empty; no one fills in six drop-downs at 2am. This post walks through how Tilt built an automated incident classifier: a scheduled .NET pipeline (sketched just below) that scrapes the incident's Slack channel, sends the conversation to Claude, and writes back the cause, vendor, component, and team, while quietly backing off any field a human has corrected.

</aside>
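In broad strokes, that pipeline looks something like the sketch below. To be clear, this is a simplified illustration rather than our production code: the `ISlackReader`, `IClassifier`, and `IRootlyWriter` interfaces and the field names are stand-ins for the real Slack, Claude, and Rootly integrations walked through later in this post.

```csharp
// Illustrative shape of the classifier pipeline; all names here are stand-ins.
using System.Collections.Generic;
using System.Threading.Tasks;

public record IncidentClassification(
    string? Cause, string? Vendor, string? Component, string? OwningTeam);

public interface ISlackReader
{
    Task<IReadOnlyList<string>> GetChannelMessagesAsync(string channelId);
}

public interface IClassifier
{
    Task<IncidentClassification> ClassifyAsync(IReadOnlyList<string> transcript);
}

public interface IRootlyWriter
{
    Task<ISet<string>> GetHumanEditedFieldsAsync(string incidentId);
    Task WriteFieldAsync(string incidentId, string field, string value);
}

public class ClassificationPipeline
{
    private readonly ISlackReader _slack;
    private readonly IClassifier _claude;
    private readonly IRootlyWriter _rootly;

    public ClassificationPipeline(
        ISlackReader slack, IClassifier claude, IRootlyWriter rootly)
        => (_slack, _claude, _rootly) = (slack, claude, rootly);

    public async Task RunAsync(string incidentId, string channelId)
    {
        // 1. Pull the incident conversation out of Slack.
        var transcript = await _slack.GetChannelMessagesAsync(channelId);

        // 2. Ask the model for a structured classification.
        var result = await _claude.ClassifyAsync(transcript);

        // 3. Write back each field, but never overwrite a human correction.
        var humanEdited = await _rootly.GetHumanEditedFieldsAsync(incidentId);
        var candidates = new Dictionary<string, string?>
        {
            ["cause"] = result.Cause,
            ["vendor"] = result.Vendor,
            ["component"] = result.Component,
            ["owning_team"] = result.OwningTeam,
        };

        foreach (var (field, value) in candidates)
        {
            if (value is not null && !humanEdited.Contains(field))
                await _rootly.WriteFieldAsync(incidentId, field, value);
        }
    }
}
```

The detail worth noticing is step 3: before writing anything, the pipeline checks which fields a human has already touched and skips them, so an engineer's correction always wins over the model.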

https://open.spotify.com/episode/2K1PR5RLw2ZRYoVvFZxUC5?si=zW9O3MBcRsyaYXITf6vekg

At Tilt, we use Rootly (our incident management platform) and Slack for handling production incidents. Something breaks, a Slack channel gets created, engineers jump in, and the conversation plays out. Diagnosis, mitigation, resolution. The usual.

What doesn't happen reliably is the part that follows: classifying the incident. What was the cause? Did the error come from our systems or from an external integration? Which team owns the fix?

No one wants to bother with collating incident data at 2am after dealing with a complicated production issue. But that data is critical for understanding trends in system reliability.

So we engineered a solution using AI.


🔍 The Problem: Incomplete Data, Incomplete Picture

Let me paint the picture. We use Rootly to manage incident data. When you create an incident, there's a slew of fields you can populate: cause, forced/unforced, owning team, vendor, component, and more. After about a year and a half we'd accumulated roughly 800 incidents, and on review, the vast majority were barely classified. The old field system wasn't helping either: causes had cryptic prefixes, there was no clean way to distinguish "which vendor caused this" from "which internal system broke", and half the fields were simply empty.

And honestly, you can't blame anyone. You've just spent hours on a production incident, it's late, you're tired, and now you're expected to fill in six confusing drop-downs' worth of information before you can close the tab? On top of that, there's no forcing function and certainly no reward for following through. The cost of the missing data isn't felt until weeks or even months later, when someone asks a question no one can answer.

Our engineering leadership kept asking the platform team questions like "which vendors are causing us the most incidents?" and the honest answer was... we don't really know.

That needed to change.