The Macro: Observability Tools Show You the Fire but Do Not Put It Out
The observability market is enormous and still growing. Datadog is a $40 billion company. New Relic, Splunk (now part of Cisco), Grafana Labs, Honeycomb, and Lightstep (now part of ServiceNow) round out the field. Billions of dollars are spent on tools that collect logs, metrics, and traces. These products are genuinely good at showing you that something is wrong. They are bad at telling you why it is wrong and useless at fixing it.
The workflow has not changed in a decade. Alert fires. Engineer wakes up. Opens dashboard. Looks at metrics. Pivots to traces. Reads logs. Correlates timestamps. Opens the codebase. Finds the offending commit. Writes a fix. Gets it reviewed. Deploys. Goes back to sleep. This process takes hours on a good day. On a bad day, it takes the entire on-call shift.
AI should be able to do most of this. The logs are structured data. The traces are structured data. The codebase is indexed. The connection between “this span is slow” and “this function was changed in this PR three days ago” is a reasoning problem, and reasoning is exactly what LLMs have gotten good at. The question is whether anyone can build a product that actually works end-to-end, not just a demo that looks impressive on a three-service toy application.
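The correlation step described above, joining "this span is slow" with "this code changed recently", can be sketched in a few lines. The record shapes below are hypothetical and invented for illustration; they are not TraceRoot's schema or any vendor's actual telemetry format.

```python
from datetime import datetime, timedelta

def suspect_commits(slow_span, commits, window_days=7):
    """Rank recent commits that touched the function behind a slow span.

    `slow_span` and `commits` are hypothetical records used only to
    illustrate the reasoning step: filter commits to a recent window,
    keep those touching the slow function, and order newest-first.
    """
    cutoff = slow_span["observed_at"] - timedelta(days=window_days)
    return sorted(
        (c for c in commits
         if c["committed_at"] >= cutoff
         and slow_span["function"] in c["touched_functions"]),
        key=lambda c: c["committed_at"],
        reverse=True,  # the most recent qualifying change is the top suspect
    )

span = {"function": "checkout.apply_discount",
        "observed_at": datetime(2025, 7, 10)}
commits = [
    {"sha": "a1b2c3", "committed_at": datetime(2025, 7, 7),
     "touched_functions": {"checkout.apply_discount"}},
    {"sha": "d4e5f6", "committed_at": datetime(2025, 6, 1),
     "touched_functions": {"checkout.apply_discount"}},  # outside the window
]
print([c["sha"] for c in suspect_commits(span, commits)])
```

The join itself is trivial; the hard part an LLM adds is deciding which function a span maps to and which of several candidate commits is actually plausible.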
Sentry does error tracking but not root cause analysis. PagerDuty routes alerts but does not investigate them. Rootly and incident.io manage the incident response process but do not debug the code. There is a clear gap between “we know something broke” and “here is the fix.” That gap is where TraceRoot lives.
The Micro: A Meta Research Scientist Who Fixed 300 Bugs and Got Tired of It
Xinwei He is the co-founder and CEO. He was a Research Scientist at Meta, a Founding Engineer at Kumo.AI (a Forbes AI 50 company), and a Stanford CS graduate. He has published in ICLR, NeurIPS, and KDD. He contributed to PyTorch Geometric, which is one of the most widely used graph neural network libraries in the world. Before starting TraceRoot, he and his co-founder fixed over 300 production bugs across Meta and AWS. They did not set out to build a startup. They set out to stop doing the same tedious debugging work over and over.
TraceRoot (Y Combinator Summer 2025) is an open-source AI agent platform that connects to your existing telemetry data and automatically investigates production issues. When something breaks, the agent summarizes the issue, traces through logs, spans, and GitHub context, and assembles everything into a single execution tree. You can see exactly what happened, why it happened, and what code is responsible.
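The "single execution tree" idea can be sketched as a simple assembly pass: attach log lines to their spans, then nest spans under their parents. TraceRoot's internal representation is not public, so the record shapes here are assumptions made for illustration.

```python
def build_execution_tree(spans, logs):
    """Assemble flat span and log records into an execution tree.

    Hypothetical shapes: each span has span_id, parent_id, name;
    each log has span_id and message. Not TraceRoot's real format.
    """
    nodes = {s["span_id"]: {**s, "children": [], "logs": []} for s in spans}
    for log in logs:
        nodes[log["span_id"]]["logs"].append(log["message"])
    roots = []
    for node in nodes.values():
        parent = node.get("parent_id")
        if parent in nodes:
            nodes[parent]["children"].append(node)  # nest under parent span
        else:
            roots.append(node)  # no known parent: this is a root span
    return roots

spans = [
    {"span_id": "A", "parent_id": None, "name": "POST /checkout"},
    {"span_id": "B", "parent_id": "A", "name": "db.query"},
]
logs = [{"span_id": "B", "message": "slow query: 4.2s"}]
tree = build_execution_tree(spans, logs)
print(tree[0]["name"], "->", tree[0]["children"][0]["logs"])
```

Once logs and spans live in one tree, "what happened, why, and which code is responsible" becomes a traversal rather than a tab-hopping exercise.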
The visual trace exploration is particularly well done. You can zoom into log clusters and suspicious spans, which is the kind of interactive debugging experience that existing observability dashboards do not provide. Most dashboards give you a wall of data. TraceRoot gives you a narrative.
The real differentiator is what happens after the investigation. TraceRoot can generate GitHub issues or draft pull requests directly from its analysis. It is not just telling you what broke. It is proposing the fix. That is the leap from observability tool to automated engineer, and it is the leap that makes this company interesting.
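The investigation-to-issue handoff can be sketched as a payload builder. The `analysis` shape is hypothetical; the payload fields (`title`, `body`, `labels`) match GitHub's real "create an issue" REST API, but how TraceRoot formats its own issues is not public.

```python
def draft_issue_payload(analysis):
    """Turn a root-cause analysis into a GitHub issue payload.

    The analysis dict is a hypothetical shape for illustration; the
    returned keys are the ones GitHub's issues API actually accepts.
    """
    body = (
        f"**Root cause:** {analysis['root_cause']}\n\n"
        f"**Suspect commit:** {analysis['suspect_commit']}\n\n"
        f"**Proposed fix:** {analysis['proposed_fix']}"
    )
    return {"title": f"[auto] {analysis['summary']}",
            "body": body,
            "labels": ["auto-triaged"]}

payload = draft_issue_payload({
    "summary": "p99 latency spike on /checkout",
    "root_cause": "Unindexed query introduced in apply_discount",
    "suspect_commit": "a1b2c3",
    "proposed_fix": "Add index on orders.discount_code",
})
print(payload["title"])
```

Drafting the issue is the easy half; drafting a mergeable pull request from the same analysis is where the "automated engineer" claim gets tested.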
The team is two people in San Francisco. The SDK has surpassed 10,000 downloads, which is a solid signal for an open-source developer tool at this stage. The open-source approach is smart because it lowers the adoption barrier and lets engineers evaluate the product without a sales conversation.
The Verdict
TraceRoot is building for a future where production debugging is no longer a human activity. I think that future is closer than most engineering leaders realize. The individual components already work. LLMs can read code. They can reason about traces. They can correlate events across time. The hard part is stitching it all together into something reliable enough that an on-call engineer trusts it at 3 AM.
The open-source strategy gives TraceRoot a distribution advantage over closed-source competitors. Komodor, which does Kubernetes troubleshooting with AI, raised significant funding but requires a sales conversation. Zebrium (now part of ScienceLogic) does ML-based root cause analysis but is not agent-based. TraceRoot’s approach of connecting to existing telemetry rather than replacing it is the right call. Nobody wants to rip out Datadog. Everyone wants Datadog to be smarter.
At thirty days, I want to see the agent’s accuracy on real production incidents, not curated demos. What percentage of root cause analyses are actually correct? At sixty days, the question is whether teams are merging the auto-generated PRs or only reading the investigation summaries. If the PRs are getting merged, this is transformative. If they are getting ignored, TraceRoot is an expensive log viewer. At ninety days, I want to know the retention curve. Do teams that adopt TraceRoot reduce their mean time to resolution? If the MTTR numbers are compelling, enterprise sales become straightforward. The pitch writes itself: “Your engineers spend 40% of their time debugging. We cut that in half.”