The Macro: Verification Has a Speed Problem
Formal verification is one of those ideas that everyone in software agrees is good and almost nobody actually uses. The concept is simple. Instead of testing your code with examples and hoping you covered the edge cases, you mathematically prove that the code does what it is supposed to do. No missed branches. No bugs, at least none the specification allows. Certainty.
In practice, it is agonizingly slow. Writing formally verified code takes five to ten times longer than writing normal code. The tools are arcane. The languages are specialized. Coq, Lean, Isabelle, Dafny. These are not things you pick up in a weekend bootcamp. The talent pool is tiny, concentrated in a handful of academic labs and a few companies like Galois and Formal Methods Inc. that have built businesses around high-assurance software for defense and aerospace.
The result is that formal verification gets used where the cost of failure is catastrophic and the budget is large. Cryptographic libraries. Avionics. Parts of the Linux kernel. The seL4 microkernel. AWS used TLA+ for some of their core distributed systems. These are impressive projects, but they amount to a rounding error in the volume of software written globally.
Meanwhile, the rest of the industry ships code with test suites that cover maybe 60% of branches and hopes for the best. AI-generated code is making this worse, not better. When a model writes code, it writes plausible code. It usually works. But “usually works” and “provably correct” are separated by an enormous gulf, and as more critical systems incorporate AI-generated components, that gulf starts to matter.
The question has always been: can you make verification fast enough and accessible enough that regular engineers use it? Not PhD mathematicians. Not formal methods specialists. Engineers.
The Micro: Verified Code From Robotics Roots
Theorem is a four-person team in San Francisco, out of YC’s Spring 2025 batch. The company was founded by Jason Gross and Rajashree Agrawal.
Jason’s background is directly relevant here. He developed performant, verified code that powers trillions of HTTPS connections in Chrome. That is not a throwaway line on a bio. The cryptographic primitives underlying TLS are exactly the kind of code where formal verification delivers massive value, and Jason has done it at browser scale. He knows both sides of the problem: what verification can do and why it is too slow for most teams to bother with.
The core claim is striking. Theorem is training models that make program verification 10,000 times faster. If that number holds up in real-world usage, it changes the economics of verification completely. A proof effort that once consumed a formal methods specialist for an engineer-year could take an ordinary systems engineer under an hour. That is the difference between “we cannot afford to verify this” and “we cannot afford not to.”
The approach they call “program equivalence driven development” is clever. An engineer writes a reference implementation, the clean, readable, obviously correct version. Theorem’s platform then produces an optimized version and generates a mathematical proof that the two behave identically. You get the performance of hand-tuned code with the correctness guarantees of the simple version. That workflow makes sense to me in a way that most formal methods tooling does not.
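To make the workflow concrete, here is a toy sketch of the idea in Python. This is my own illustration, not Theorem’s tooling: the reference version is obviously correct, the optimized version is faster but harder to audit, and the check at the end only samples a finite domain where a real equivalence proof would cover every input.

```python
def popcount_ref(n: int) -> int:
    """Reference: count set bits by scanning every bit.
    Slow, but correct by inspection."""
    count = 0
    while n:
        count += n & 1
        n >>= 1
    return count

def popcount_opt(n: int) -> int:
    """Optimized: Kernighan's trick clears the lowest set bit each
    iteration, so it loops once per set bit rather than once per bit."""
    count = 0
    while n:
        n &= n - 1  # drop the lowest set bit
        count += 1
    return count

# Finite equivalence check over all 16-bit inputs. This is where a
# verifier earns its keep: it replaces the loop with a proof that
# holds for every integer, not just the ones we happened to try.
assert all(popcount_ref(n) == popcount_opt(n) for n in range(1 << 16))
```

The appeal of the workflow is that the human only has to get the readable version right; the optimized version never has to be trusted directly, because the proof ties it back to the reference.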
The practical results they are citing are concrete. Developers have used the system to find zero-day vulnerabilities in GPU-accelerated code and cryptographic implementations. They have also applied it to code migration, specifically Python to Rust conversions, where proving equivalence between the old and new versions is enormously valuable.
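The migration case follows the same logic. Today, teams validate a port with differential testing: treat the legacy implementation as the spec and check that the rewrite agrees on sampled inputs. A hedged sketch, with hypothetical function names and the ported logic written in Python for illustration (in a real Python-to-Rust migration it would live on the Rust side):

```python
import random

def parse_version_legacy(s: str) -> tuple:
    """Legacy: split-based parsing of 'MAJOR.MINOR.PATCH'."""
    major, minor, patch = s.split(".")
    return (int(major), int(minor), int(patch))

def parse_version_ported(s: str) -> tuple:
    """Rewrite: single-pass parsing, the shape a systems-language
    port might take. Must behave identically to the legacy version."""
    parts, cur = [], 0
    for ch in s:
        if ch == ".":
            parts.append(cur)
            cur = 0
        else:
            cur = cur * 10 + (ord(ch) - ord("0"))
    parts.append(cur)
    if len(parts) != 3:
        raise ValueError(s)
    return tuple(parts)

# Differential test: sample well-formed inputs and demand agreement.
# An equivalence proof would discharge this for ALL inputs at once.
random.seed(0)
for _ in range(1000):
    s = ".".join(str(random.randrange(1000)) for _ in range(3))
    assert parse_version_legacy(s) == parse_version_ported(s)
```

Sampling like this misses exactly the inputs that cause migration bugs; proving the two versions equivalent is what closes that gap, which is why the Python-to-Rust use case is so compelling.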
The team has a waitlist and a careers page, which suggests they are still in controlled rollout rather than general availability. Their blog is active, and they have a presence on GitHub under theorem-labs.
The Verdict
I think Theorem is working on one of the highest-leverage problems in software tooling right now. The timing is right. AI-generated code is proliferating, and the need for correctness guarantees is growing in proportion. If they can deliver on the 10,000x speed improvement in production settings, this becomes a tool that security teams and infrastructure engineers actually reach for.
The competitive question is whether the large model providers will build verification into their own code generation tools. OpenAI and Anthropic are both investing in code correctness. If Claude or GPT can eventually generate verified code natively, the standalone verification tool becomes less necessary. But that future is not here yet, and the technical depth required to do this well is substantial. This is not something you bolt on as a feature.
At 30 days, I would want to see what percentage of verification runs complete without human intervention. The promise is automation, but formal methods have a long history of requiring expert babysitting. At 60 days, the question is whether the developer experience is good enough that someone who has never written a Lean proof can use it productively. At 90 days, I would want to know whether any security team has adopted it as a standard part of their review process.
The market for provably correct software is about to get much larger. Theorem is positioned to be the tool that makes it practical. That is a big bet with a credible team behind it.