AI · Incident Management · Engineering

Why we built AI incident intelligence into Sentinel

8 min read · Sentinel Team

Every engineering team eventually faces the same 2am scenario: an alert fires, someone gets paged, and the first 20 minutes are spent answering the same question — “what actually broke?”

This is the gap we set out to close when we built Sentinel. Not just knowing that something went down, but knowing why it went down — automatically, within seconds of the incident starting.

The problem with traditional uptime monitoring

Most uptime monitoring tools do one thing well: they tell you when a service stops responding. An HTTP check fails, an alert fires, a Slack message appears. That's useful. But it only answers half the question.

The other half — “is this a flapping network issue, a deployment regression, a database problem, or infrastructure saturation?” — still requires a human to investigate. That investigation takes time. And in the meantime, your users are waiting.

Classic monitoring gives you this:

ALERT: api.example.com is DOWN

HTTP 503 · Response time: timeout after 30s

Triggered at 02:14:33 UTC

Now someone on-call has to figure out the rest. Log into the server. Check recent deployments. Look at database metrics. Cross-reference multiple dashboards. On a good day, that's 15 minutes. On a bad day, it's an hour.

What AI incident intelligence adds

When a Sentinel monitor detects a failure, it doesn't just log the HTTP status code. It aggregates everything it knows about the incident: the timeline of failures, which monitors are affected, response patterns over time, the sequence of events leading up to the outage, and any correlated changes across the system.

That context gets sent to an LLM (currently Claude via the Anthropic API, configurable to your own key) with a structured prompt designed to produce actionable SRE analysis — not marketing fluff.
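A minimal sketch of what that structured context might look like — field and type names here are illustrative, not Sentinel's actual schema:

```typescript
// Illustrative shape of the incident context handed to the LLM.
// Field names are hypothetical; Sentinel's real schema may differ.
interface IncidentContext {
  monitorId: string;
  // Timeline of recent check results leading up to the outage.
  failureTimeline: { at: string; statusCode: number | null; latencyMs: number | null }[];
  // Other monitors failing at the same time on the same team.
  concurrentFailures: string[];
  // Latency trend before the failure (e.g. p95 samples).
  recentLatencyP95Ms: number[];
}

// Render the context into a structured prompt that asks for
// SRE-style analysis with machine-parseable fields.
function buildPrompt(ctx: IncidentContext): string {
  return [
    "You are an SRE assistant. Analyze this incident and respond with JSON",
    'containing "rootCause", "suggestedActions", and "impact".',
    `Monitor: ${ctx.monitorId}`,
    `Concurrent failures: ${ctx.concurrentFailures.join(", ") || "none"}`,
    `Latency trend (p95 ms): ${ctx.recentLatencyP95Ms.join(", ")}`,
    `Failure timeline: ${JSON.stringify(ctx.failureTimeline)}`,
  ].join("\n");
}
```

Asking for named JSON fields rather than free-form prose is what makes the response parseable into the typed output shown below.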

The output looks like this:

Root Cause

Database connection pool exhausted due to a slow query on the users table introduced in deploy v2.4.1 (14:23 UTC). P95 latency climbed from 45ms to 8,200ms over 6 minutes before the pool threshold was reached.

Suggested Actions

  1. Roll back v2.4.1 or hotfix the N+1 query on users.last_login_at
  2. Increase connection pool from 20 → 50 as immediate mitigation
  3. Add statement_timeout = 3000 to prevent future pool exhaustion

Affected Services

api.example.com · auth.example.com · app.example.com (degraded)

The difference from generic monitoring is specificity. Not just “database issue” but which table, which deploy, what the causal chain was. Not just “fix the database” but concrete, ordered steps that a junior engineer on call at 2am can actually execute.

How the implementation works

Under the hood, Sentinel's AI pipeline runs as a BullMQ job in the worker service. When a monitor transitions to down or when an incident is manually triggered, the worker:

  1. Fetches the last N check results for the affected monitor (response codes, latency trends, headers)
  2. Queries for concurrently failing monitors on the same team — cross-service correlation
  3. Retrieves any open incidents and their timelines
  4. Constructs a structured context object and submits it to the LLM API
  5. Parses the response into typed fields (rootCause, suggestedActions, impact) and persists them to the incidents table
  6. Deduplicates against existing non-resolved incidents to prevent duplicate AI reports
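The steps above can be sketched as a single job processor. The data-access helpers are stand-ins for Sentinel's real queries (the actual worker runs this inside a BullMQ job), and the dedup check is hoisted to the front so a credit is never spent on a duplicate report:

```typescript
// Hypothetical sketch of the worker's job processor (steps 1-6).
// Helper names and shapes are illustrative, not Sentinel's real API.
interface CheckResult { at: string; statusCode: number | null; latencyMs: number | null }
interface AiAnalysis { rootCause: string; suggestedActions: string[]; impact: string }

async function analyzeIncident(monitorId: string, deps: {
  fetchChecks: (id: string, n: number) => Promise<CheckResult[]>;
  fetchFailingPeers: (id: string) => Promise<string[]>;
  fetchOpenIncidents: (id: string) => Promise<{ id: string; hasAiReport: boolean }[]>;
  callLlm: (prompt: string) => Promise<string>;
  saveAnalysis: (incidentId: string, a: AiAnalysis) => Promise<void>;
}): Promise<AiAnalysis | null> {
  // Step 6 first: if an open incident already has an AI report,
  // skip entirely rather than burn another API call.
  const open = await deps.fetchOpenIncidents(monitorId);
  if (open.some((i) => i.hasAiReport)) return null;

  // Steps 1-3: gather check history and cross-service correlation.
  const checks = await deps.fetchChecks(monitorId, 50);
  const peers = await deps.fetchFailingPeers(monitorId);

  // Step 4: build the structured context and submit it to the LLM.
  const prompt = `Analyze this incident as JSON: ${JSON.stringify({ monitorId, checks, peers })}`;
  const raw = await deps.callLlm(prompt);

  // Step 5: parse into typed fields and persist on the incident.
  const parsed = JSON.parse(raw) as AiAnalysis;
  if (open[0]) await deps.saveAnalysis(open[0].id, parsed);
  return parsed;
}
```

Injecting the queries and the LLM call as dependencies keeps the pipeline testable without Redis or a live API key.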

The AI credits system (10/mo on Business, 50/mo on Enterprise, top-up packs available) exists because LLM API calls have real costs. We wanted to make AI accessible at every tier without hiding it behind a prohibitive paywall — but also without silently eating costs on high-volume incidents.
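The gate itself is simple — something along these lines, where the tier limits mirror the numbers above but the function and field names are illustrative:

```typescript
// Hypothetical credit gate checked before enqueueing an AI analysis job.
// Monthly limits match the post: 10/mo on Business, 50/mo on Enterprise.
type Tier = "business" | "enterprise";

const MONTHLY_CREDITS: Record<Tier, number> = { business: 10, enterprise: 50 };

function canRunAiAnalysis(tier: Tier, usedThisMonth: number, topUpCredits: number): boolean {
  // Top-up packs extend the monthly allowance rather than replacing it.
  const remaining = MONTHLY_CREDITS[tier] - usedThisMonth + topUpCredits;
  return remaining > 0;
}
```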

What it doesn't do (yet)

To be clear about scope: Sentinel's AI analyzes the monitoring data it has access to. It doesn't connect to your application logs, APM metrics, or code repository. The analysis is based on what it can observe from the outside — response codes, latency, concurrent failures, timing patterns.

For most incidents, that's enough context to produce a useful first diagnosis. For complex application-layer bugs, the suggested actions will be less specific. The goal isn't to replace your SRE — it's to give them a head start when they're woken up at 2am.

We're working on a log ingestion pipeline that will let Sentinel correlate monitoring events with structured application logs. When that ships, the analysis will get significantly more specific.

Why self-hosted matters for AI features

There's a conversation that happens in every company that uses SaaS monitoring tools: “should we worry about our infrastructure details going to a third-party SaaS?” For most teams, the answer is “probably fine, we accept the terms.” But for teams in regulated industries, government contractors, or companies with strict data residency requirements, the answer is more complicated.

Because Sentinel runs on your infrastructure, you control where the AI API calls go. You can use our default Anthropic integration, point it at your own API key, or route through an internal proxy. The monitoring data that feeds the AI analysis stays within your network. That's not possible with SaaS-only monitoring tools.
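Resolution of the endpoint might look like this — the environment variable names are hypothetical, not Sentinel's documented configuration:

```typescript
// Illustrative endpoint routing for self-hosted deployments.
// Env var names are assumptions for this sketch.
function resolveLlmEndpoint(env: Record<string, string | undefined>): { baseUrl: string; apiKey?: string } {
  // An internal proxy takes precedence, keeping LLM traffic inside the network.
  if (env.LLM_PROXY_URL) {
    return { baseUrl: env.LLM_PROXY_URL, apiKey: env.LLM_API_KEY };
  }
  // Otherwise call Anthropic directly with the operator's own key.
  return { baseUrl: "https://api.anthropic.com", apiKey: env.ANTHROPIC_API_KEY };
}
```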

What's next

The incident intelligence system is in active development. Upcoming features:

  • Autopilot mode — AI drafts status page updates and sends notifications autonomously during incidents
  • Log correlation — structured log ingestion to improve root cause specificity
  • Proactive anomaly detection — surface degradation before it becomes an outage
  • Post-mortem templates — export AI-generated reports in standardized formats (Google SRE template, PagerDuty format)

If this approach resonates with how you think about on-call, we'd love for you to try it. Sentinel is free for 20 monitors with no credit card required.

Try AI incident intelligence

20 free monitors, Docker Compose deploy, AI credits included on Business+

Start monitoring for free