The Macro: The GPU Tax Is Bleeding AI Labs Dry
Every AI lab has the same dirty secret. The compute bill is absurd. Training a frontier model costs millions. Even fine-tuning a mid-size model on a decent dataset can run five or six figures. And almost all of that money flows to the same place: NVIDIA. Their CUDA ecosystem has become the default runtime for AI training, and that lock-in means you pay whatever NVIDIA and its cloud partners decide to charge.
The market knows this is a problem. Lambda Labs built a business renting GPU clusters. CoreWeave raised billions to become an NVIDIA-focused cloud provider. Together AI and Modal are attacking the inference side. Anyscale and Ray are trying to make distributed training less painful. But almost all of these companies are built on top of CUDA. They are making NVIDIA GPUs easier to use, not making NVIDIA GPUs less necessary.
That distinction matters. AMD has competitive hardware with its MI300X series. Intel has Gaudi. Cerebras and Graphcore built custom AI chips from scratch. TPUs are genuinely fast for certain workloads, though few teams outside Google train on them. Almost nobody trains at scale on these alternatives because the software ecosystem is not there. CUDA has nearly two decades of library development, community tooling, and institutional knowledge behind it. Switching costs are enormous.
If you could make any hardware run any training workload efficiently, without rewriting code, the entire GPU market would restructure overnight. That is a big if.
The Micro: Three Brothers Who Finished College Before Most People Finish High School
SF Tensor was founded by three brothers: Ben, Luk, and Tom Koska. Luk and Tom both completed their CS degrees at sixteen. Ben has been publishing AI research since high school and focuses on GPU kernel optimization. The four-person team is based in San Francisco, part of the YC Fall 2025 batch.
The product has two pieces. Elastic Cloud automatically finds the cheapest available GPUs across multiple cloud providers and handles spot instance management, so you are not locked into one vendor’s pricing. The bigger bet is automatic kernel optimization, which takes your training code and rewrites the low-level compute kernels to run efficiently on whatever hardware is available. NVIDIA, AMD, TPUs, whatever. No code changes required on your end.
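To make the Elastic Cloud idea concrete, here is a toy sketch of cheapest-offer selection across providers. The provider names and prices are invented for illustration, and the real product presumably also handles spot preemption and job migration, which this sketch ignores:

```python
from dataclasses import dataclass

@dataclass
class GpuOffer:
    provider: str
    gpu_type: str
    hourly_price: float  # USD per GPU-hour (illustrative numbers)
    is_spot: bool        # spot/preemptible capacity

def cheapest_offer(offers, gpu_type, allow_spot=True):
    """Return the lowest-priced offer matching the requested GPU type."""
    candidates = [
        o for o in offers
        if o.gpu_type == gpu_type and (allow_spot or not o.is_spot)
    ]
    return min(candidates, key=lambda o: o.hourly_price, default=None)

# Hypothetical snapshot of the multi-cloud GPU market.
offers = [
    GpuOffer("cloud_a", "H100", 4.50, is_spot=False),
    GpuOffer("cloud_b", "H100", 2.10, is_spot=True),
    GpuOffer("cloud_c", "H100", 3.80, is_spot=False),
]
best = cheapest_offer(offers, "H100")
```

In this snapshot the spot offer from `cloud_b` wins; disallowing spot capacity falls back to the cheapest on-demand option.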
I want to be specific about why the kernel optimization piece is interesting. When you train a model on NVIDIA hardware, your code calls CUDA kernels that have been hand-tuned for that specific GPU architecture. If you want to run the same code on an AMD GPU, you need different kernels. Writing those kernels is specialized work that very few people in the world can do. SF Tensor is building a compiler that does it automatically.
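To see why portability is hard, note that today's frameworks dispatch each operation to a hand-written, per-backend kernel. A toy dispatch table makes the structure visible; the backend names and the naive matmul bodies are stand-ins for the hand-tuned kernels SF Tensor's compiler would have to generate automatically:

```python
# Per-backend kernel registry: today each entry is written by a specialist.
# SF Tensor's bet is a compiler that fills this table automatically.
KERNELS = {}

def register(op, backend):
    def wrap(fn):
        KERNELS[(op, backend)] = fn
        return fn
    return wrap

@register("matmul", "cuda")
def matmul_cuda(a, b):
    # Stand-in for a hand-tuned CUDA kernel (tiling, tensor cores, etc.).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

@register("matmul", "rocm")
def matmul_rocm(a, b):
    # A separate implementation is needed for AMD's ROCm stack, tuned
    # for a different memory hierarchy and wavefront size.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def dispatch(op, backend, *args):
    fn = KERNELS.get((op, backend))
    if fn is None:
        raise NotImplementedError(f"no {op} kernel for {backend}")
    return fn(*args)
```

The pain point is the `NotImplementedError` path: any backend without a full kernel table is effectively unusable, which is exactly the gap that keeps training workloads on NVIDIA.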
The pricing model is clever. Instead of charging per GPU-hour like every other cloud provider, SF Tensor prices based on the savings they deliver. If they cut your compute costs by 80%, they take a percentage of the savings. That aligns incentives nicely, and it means customers have essentially zero downside in trying the product.
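A back-of-the-envelope sketch of how savings-based pricing works out for the customer. The 25% take rate and the dollar figures are illustrative assumptions, not SF Tensor's actual terms:

```python
def savings_based_fee(baseline_cost, optimized_cost, take_rate=0.25):
    """Compute what a customer pays when the vendor takes a share of savings.

    take_rate is the vendor's cut of realized savings (illustrative, not
    SF Tensor's published rate).
    """
    savings = max(baseline_cost - optimized_cost, 0.0)
    fee = take_rate * savings
    total = optimized_cost + fee
    return {
        "savings": savings,
        "fee": fee,
        "customer_pays": total,
        "net_reduction": 1 - total / baseline_cost,
    }

# Example: an 80% compute-cost cut on a $100k training run.
result = savings_based_fee(100_000, 20_000)
```

Under these assumed numbers the customer pays $40k total ($20k compute plus a $20k fee), a net 60% reduction, and if no savings materialize the fee is zero, which is the "zero downside" property.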
They offer two tiers. Tensor Cloud handles experiments and medium-scale training with up to 64 GPUs on the Pro plan. Enterprise Forward-Deployed runs in your own cloud infrastructure with up to 10,000 GPUs, 24/7 support, and custom kernel optimization. That enterprise tier is where the real revenue potential sits.
The team is currently hiring GPU kernel engineers, compiler engineers, and AI compilation researchers. Those job postings tell you where the hard technical problems are. Building a general-purpose kernel optimizer is a compiler engineering challenge, not an AI challenge. The fact that they are hiring for it signals they are taking the hard path rather than wrapping existing tools.
The Verdict
SF Tensor is swinging at one of the biggest structural problems in AI infrastructure. If their kernel optimizer works as advertised, it does not just save money. It fundamentally changes which hardware is viable for AI training. That would be a very big deal for AMD, for custom chip startups, and for every AI lab currently hostage to NVIDIA pricing.
The risk is that this is genuinely hard computer science. Automatic kernel optimization across heterogeneous hardware is a problem that has defeated well-funded research labs. The Koska brothers are clearly talented, but they are competing against NVIDIA’s compiler team, which is one of the best-resourced engineering organizations on the planet.
In thirty days, I want to see benchmark numbers. How close does auto-optimized AMD performance get to hand-tuned CUDA on standard training workloads? In sixty days, the question is whether any serious AI lab has switched production training to non-NVIDIA hardware using SF Tensor. In ninety days, I want to know if the savings-based pricing model is generating meaningful revenue or if enterprise customers are stuck in procurement cycles. Breaking CUDA lock-in would be transformative. The question is whether a four-person startup can do what the entire semiconductor industry has failed to do for a decade.