The Macro: Being On-Call Is the Worst Part of Every Engineer’s Job
On-call pages at 3 AM are the bane of software engineering. An alert fires. You wake up, open your laptop, and start the investigation: checking dashboards, reading logs, correlating metrics with recent deployments, and trying to figure out what broke. This process takes 30 minutes to several hours, during which your service might be degraded or down.
The investigation process is remarkably consistent across incidents. Check the monitoring dashboard. Look at recent deployments. Read error logs. Trace requests through the system. Compare current metrics to baseline. Identify the root cause. Write and test a fix. Apply it.
This is exactly the kind of structured, information-gathering workflow that AI should be able to handle. The data is all digital. The investigation follows established patterns. The fix often involves reverting a deployment or adjusting a configuration.
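The repeatable steps above can be sketched as a simple investigation loop. This is a hypothetical illustration of the pattern, not IncidentFox's implementation; every name here (`investigate`, `sources`, the step labels) is invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    step: str
    detail: str

def investigate(alert, sources):
    """Run the standard investigation steps against pluggable data sources.

    `sources` maps a step name to a callable that returns a human-readable
    summary for that step. Steps with no configured source are skipped.
    Illustrative sketch only -- not a real incident-tooling API.
    """
    steps = [
        "dashboards",
        "recent_deployments",
        "error_logs",
        "request_traces",
        "baseline_comparison",
    ]
    findings = []
    for step in steps:
        fetch = sources.get(step)
        if fetch is None:
            continue  # no integration wired up for this step
        findings.append(Finding(step, fetch(alert)))
    return findings

# Usage with stubbed data sources:
alert = {"service": "checkout", "metric": "p99_latency"}
sources = {
    "recent_deployments": lambda a: f"{a['service']} deployed 12 min before alert",
    "error_logs": lambda a: "spike in TimeoutError from payment client",
}
report = investigate(alert, sources)
```

The point of the sketch is that the loop is fixed while the data sources vary per company, which is exactly what makes the workflow automatable.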
Yet most incident management tools focus on alerting, communication, and post-incident review rather than the actual investigation step. PagerDuty pages you. Opsgenie routes the alert. Incident.io manages the response process. But none of them investigate the problem for you.
IncidentFox, backed by Y Combinator, is an AI SRE agent that automatically investigates incidents, identifies root causes, and prepares remediation for human approval.
The Micro: Former Roblox Engineers Who Lived the On-Call Life
Jimmy Wei (CEO) was a software engineer at Roblox and previously worked at Meta FAIR on conversational AI. Long Yi (CTO) was an SRE on Roblox’s Stateful Infrastructure team. Both experienced the on-call grind firsthand at a company with massive scale and complexity.
The product integrates with 40+ tools including Datadog, PagerDuty, AWS, Kubernetes, GitHub, and Slack. When an alert fires, IncidentFox automatically begins investigating: querying logs, checking metrics, reviewing recent deployments, and tracing through the system to identify the root cause.
The investigation results are presented with root cause analysis and recommended fix scripts. Critically, remediation requires human approval before execution. This “investigate automatically, fix with approval” model addresses the trust gap that prevents teams from giving AI full autonomy over production systems.
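The approval gate at the heart of this model can be sketched as a small state machine in which the agent may propose a fix but only a human may release it. This is an illustrative sketch of the pattern, not IncidentFox's actual API; all class and method names are invented:

```python
from enum import Enum, auto

class State(Enum):
    PROPOSED = auto()
    APPROVED = auto()
    EXECUTED = auto()

class Remediation:
    """A fix the agent prepares but cannot run without human sign-off.

    Hypothetical sketch of an 'investigate automatically, fix with
    approval' gate -- names are illustrative.
    """
    def __init__(self, description, script):
        self.description = description
        self.script = script  # callable that performs the actual fix
        self.state = State.PROPOSED
        self.approved_by = None

    def approve(self, engineer):
        self.approved_by = engineer
        self.state = State.APPROVED

    def execute(self):
        # The hard gate: execution is impossible until a human approves.
        if self.state is not State.APPROVED:
            raise PermissionError("remediation requires human approval")
        result = self.script()
        self.state = State.EXECUTED
        return result

fix = Remediation("roll back checkout deploy", lambda: "rolled back")
# Calling fix.execute() at this point would raise PermissionError.
fix.approve("on-call engineer")
outcome = fix.execute()
```

Making the gate a hard precondition in code, rather than a convention, is what lets teams hand the investigation to an agent without handing it production access.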
The system auto-learns each customer’s stack without requiring weeks of manual configuration. This is important because every company’s infrastructure is different, and an SRE tool that requires extensive setup before it becomes useful defeats the purpose of automation.
Deployment options include SaaS, on-premises, or self-hosted open source. The open-source option builds community trust and allows teams to inspect and customize the agent’s behavior.
Competitors include Shoreline.io (runbook automation), BigPanda (AIOps event correlation), and various observability platforms that offer some automated analysis. IncidentFox's specific focus on autonomous investigation and human-approved remediation carves out a distinct product category.
The Verdict
IncidentFox is solving the specific problem that makes on-call miserable: the investigation step. If AI can reliably identify root causes and prepare fixes, engineers can approve remediation from bed instead of spending an hour debugging.
At 30 days: for what percentage of incidents does IncidentFox correctly identify the root cause without human intervention?
At 60 days: are engineers trusting and approving IncidentFox’s remediation scripts, or are they still manually verifying?
At 90 days: is IncidentFox reducing mean time to resolution (MTTR) measurably across customer deployments?
I think IncidentFox is building the right product for a universal pain point. Every engineering team with production systems needs incident investigation, and the process is structured enough for AI to handle effectively. The human-approval model is smart because it builds trust incrementally. As the AI proves reliable, teams will gain confidence in its recommendations, and the “approve from bed” workflow becomes the default.