The Macro: AI Agents Are Flying Blind
There is a dirty secret in the AI agent boom. Most teams shipping autonomous agents have almost no idea what those agents are doing between input and output. They know what went in. They know what came out. Everything in the middle is a black box wrapped in a prayer.
Traditional observability tools were built for deterministic software. Datadog, New Relic, Grafana, Splunk. They are excellent at telling you that your API latency spiked at 3:47 PM or that your database connection pool is exhausted. They are terrible at telling you why your AI agent decided to book a flight to Cleveland instead of Charlotte. The failure modes are different. The debugging workflow is different. The entire mental model of what “monitoring” means needs to change when your software makes decisions instead of just executing instructions.
The market is starting to recognize this. LangSmith from LangChain was early to the tracing game but is tightly coupled to the LangChain ecosystem, which limits its appeal to teams that have built their own agent frameworks. Braintrust is doing interesting work on evals but is more focused on the testing side than on production monitoring. Helicone tracks costs and latency but does not go deep on behavioral analysis. Arize and Weights & Biases are coming at this from the ML monitoring angle, which is adjacent but not quite the same problem.
What nobody has nailed yet is the full stack: tracing every step an agent takes, evaluating whether those steps were correct, and tracking what all of it costs. That is three separate problems that need to feel like one product. The team that gets this right will own a category.
The Micro: Building the Control Plane for Autonomous AI
Fernstone is building developer-first observability for AI agents. The product covers three core areas: tracing (see every step your agent takes), evals (measure whether those steps were good), and cost tracking (know what you are spending per agent run, per model call, per token).
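To make the tracing idea concrete, here is a minimal sketch of what "see every step your agent takes" means in practice: each step becomes a timed span tied to a run id. This is an illustrative toy, not Fernstone's actual SDK; all names here (`span`, `SPANS`, the attribute keys) are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

# Illustrative span store (hypothetical; a real tool would ship these
# to a backend rather than keep them in a list).
SPANS: list[dict] = []

@contextmanager
def span(run_id: str, name: str, **attrs):
    """Record one timed step of an agent run as a span."""
    record = {"run_id": run_id, "span_id": uuid.uuid4().hex,
              "name": name, "attrs": attrs, "start": time.time()}
    try:
        yield record
    finally:
        record["end"] = time.time()
        SPANS.append(record)

# One agent run nests model calls and tool calls under a root span.
run_id = uuid.uuid4().hex
with span(run_id, "agent_run"):
    with span(run_id, "model_call", model="large-model"):
        pass  # the actual LLM request would go here
    with span(run_id, "tool_call", tool="search_flights"):
        pass  # the actual tool execution would go here

# SPANS now holds three timed records for this run, innermost first.
```

Inner spans close before the root, so the store ends up with the model call, the tool call, and finally the enclosing run — enough structure to reconstruct what the agent did and when.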
The “developer-first” framing matters. A lot of AI observability tools are built for ML engineers or data scientists. Fernstone is aimed at the software engineer who just shipped an agent and needs to debug it at 2 AM when something goes wrong. The interface is designed to feel like the debugging tools developers already know, not like a data science notebook or an ML experiment tracker.
The founding team comes from a Y Combinator Fall 2025 batch. They are based in New York and have been building in the observability space with a focus on making AI systems inspectable and accountable. The product is live at fernstone.io.
What catches my attention is the cost tracking angle. Most teams building AI agents are hemorrhaging money and do not realize it until the bill arrives. A single agent run can involve dozens of model calls, each with different token costs depending on the model and provider. Without granular cost tracking, you cannot optimize. You cannot even budget accurately. This is especially painful for startups where a runaway agent can burn through a month of API credits in an afternoon.
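The arithmetic behind "granular cost tracking" is simple but easy to lose track of at scale: sum input and output token costs over every call in a run. A minimal sketch, using made-up per-million-token prices (not any real provider's rates):

```python
# Hypothetical per-million-token prices by model (illustrative only,
# not real provider rates).
PRICES = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 3.00, "output": 15.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model call."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def run_cost(calls: list[dict]) -> float:
    """Total cost of an agent run: the sum over every model call it made."""
    return sum(call_cost(c["model"], c["input_tokens"], c["output_tokens"])
               for c in calls)

# A single run can fan out into many calls across different models.
calls = [
    {"model": "large-model", "input_tokens": 2_000, "output_tokens": 500},
    {"model": "small-model", "input_tokens": 8_000, "output_tokens": 1_200},
]
print(f"${run_cost(calls):.4f}")  # prints $0.0154
```

Multiply that per-run figure by thousands of runs a day, or by a runaway agent stuck in a retry loop, and the budgeting problem in the paragraph above becomes obvious.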
The eval component is where the real long-term value sits. Tracing tells you what happened. Cost tracking tells you what it cost. Evals tell you whether it was right. That third piece is the hardest to build and the most defensible once you get it right. If Fernstone can help teams define what “correct” looks like for their agents and then automatically flag when behavior drifts, they will become indispensable.
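One way to picture "define what correct looks like, then flag drift" is evals as named predicates over a trace, with drift detected when the aggregate pass rate falls below a baseline. This is my sketch of the concept, not Fernstone's framework; the trace structure and thresholds are assumptions:

```python
from typing import Callable

# An "eval" here is a named predicate over an agent's trace, modeled
# as a list of step dicts (hypothetical structure).
Eval = Callable[[list[dict]], bool]

def used_allowed_tools(trace: list[dict]) -> bool:
    allowed = {"search_flights", "book_flight"}
    return all(step["tool"] in allowed
               for step in trace if step["type"] == "tool_call")

def booked_correct_city(trace: list[dict]) -> bool:
    final = trace[-1]
    return final["type"] == "booking" and final["city"] == "Charlotte"

EVALS: dict[str, Eval] = {
    "allowed_tools": used_allowed_tools,
    "correct_city": booked_correct_city,
}

def pass_rate(traces: list[list[dict]]) -> float:
    """Fraction of traces that satisfy every eval."""
    passes = sum(all(e(t) for e in EVALS.values()) for t in traces)
    return passes / len(traces)

def drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag drift when the pass rate falls more than `tolerance` below baseline."""
    return current < baseline - tolerance
```

The hard part, as the paragraph above notes, is not this mechanism; it is helping teams write the predicates, since "correct" differs for every agent.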
The competitive landscape is fragmented and early. LangSmith has distribution through the LangChain ecosystem but alienates anyone not using LangChain. Braintrust is strong on evals but weaker on production monitoring. The big incumbents like Datadog have barely started thinking about this problem. That fragmentation is Fernstone’s opportunity: build the integrated tool that does tracing, evals, and cost tracking in one place, and you win by being complete while everyone else is partial.
The Verdict
I think Fernstone is going after one of the most important infrastructure problems in AI right now. The shift from copilots to autonomous agents changes everything about how we monitor, debug, and manage AI systems. Someone needs to build the observability layer for this new world, and the incumbents are not moving fast enough.
The risk is timing. The AI agent market is still early, and many teams are in the “move fast and figure out monitoring later” phase. Fernstone needs enough teams to be running agents in production that monitoring becomes urgent, not optional. That tipping point is coming, but it may be six months away.
In 30 days, I want to see how many production agent deployments they are monitoring. Not sandbox projects, not demos, real production workloads. In 60 days, the question is whether their eval framework is flexible enough to handle the wild variety of agent architectures out there. In 90 days, I want to see whether cost tracking alone is enough to drive adoption or whether tracing and evals are the real hook. If teams sign up for cost tracking and stay for evals, that is a great wedge strategy. This is a category that barely exists today and will be enormous in two years. Fernstone is early, which is exactly where you want to be.