The Macro: GPUs Are Expensive and Mostly Idle
The AI infrastructure market has an ugly economics problem. GPU servers are expensive to run. An A100 instance on a major cloud provider costs somewhere between $2 and $4 per hour. If your AI model handles bursty traffic, as most do, you’re paying for that GPU during all the hours nobody is using it. If you scale to handle peak load, you’re overpaying 80% of the time. If you scale for average load, your users get timeouts during spikes.
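The back-of-envelope math is worth making concrete. A minimal sketch, using illustrative numbers rather than any provider's actual pricing (the 1.5x per-second premium for serverless billing is an assumption):

```python
# Back-of-envelope: persistent GPU instance vs. serverless billing.
# All rates are illustrative assumptions, not any provider's real pricing.

HOURS_PER_MONTH = 730

def persistent_cost(hourly_rate):
    """A dedicated instance bills 24/7 regardless of traffic."""
    return hourly_rate * HOURS_PER_MONTH

def serverless_cost(hourly_rate, busy_fraction, premium=1.5):
    """Serverless bills only for busy time, typically at a rate premium."""
    return hourly_rate * premium * HOURS_PER_MONTH * busy_fraction

# An A100 at $3/hr, busy 20% of the time (bursty inference traffic):
dedicated = persistent_cost(3.0)          # $2,190/month, 80% of it idle
serverless = serverless_cost(3.0, 0.20)   # $657/month despite a 1.5x premium
print(f"dedicated: ${dedicated:,.0f}/mo, serverless: ${serverless:,.0f}/mo")
```

Even with a substantial per-second rate premium, the serverless bill is a fraction of the dedicated one once utilization drops low enough. The crossover point shifts with the premium and the traffic shape, which is exactly why the pricing comparison discussed later matters.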
This is the exact same problem that serverless compute solved for CPU workloads a decade ago. AWS Lambda didn’t win because it was technically superior to running your own servers. It won because the pricing model matched actual usage patterns. You pay when code runs. You don’t pay when it doesn’t. For workloads with variable traffic, the math is just better.
The GPU version of this problem is harder. CPU functions cold-start in milliseconds. GPU workloads need to load models that can be gigabytes in size. The cold start problem is real and it’s the reason most teams still run persistent GPU instances even when they know the utilization is terrible.
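A rough sense of why GPU cold starts are so much worse: loading model weights dominates, and the arithmetic is just size divided by effective bandwidth. The model sizes and bandwidth figures below are illustrative assumptions:

```python
# Why GPU cold starts hurt: weight loading dominates startup time.
# Sizes and bandwidths are illustrative, not measured figures.

def load_seconds(model_gb, bandwidth_gb_per_sec):
    """Seconds to move model weights at a given effective bandwidth."""
    return model_gb / bandwidth_gb_per_sec

# A ~14 GB checkpoint (roughly a 7B-parameter model in fp16):
print(load_seconds(14, 0.5))  # 28.0 s pulling from network storage at 0.5 GB/s
print(load_seconds(14, 7.0))  # 2.0 s from fast local NVMe at 7 GB/s
```

The spread between those two numbers is the whole optimization space: caching weights close to the GPU, streaming them lazily, or keeping containers warm are all ways of moving from the first case toward the second.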
The competitive field is getting crowded. Modal has built a strong developer experience around serverless GPU compute. Replicate offers a simple API for running open-source models. Banana tried the serverless GPU pitch early. RunPod provides on-demand GPU rental. And of course, the hyperscalers are all building their own serverless AI inference products, though those tend to be locked into their broader cloud ecosystems.
The question for any new entrant is: what’s different enough about your approach to justify switching costs?
The Micro: South African Founders Who Already Had an Exit
Cerebrium is a serverless GPU compute platform for deploying and scaling AI models. You push your model, Cerebrium handles the infrastructure, scaling, and cold start optimization. No server management, no capacity planning, no 3 AM pages when a GPU node goes down.
The founding team is Michael Louis and Jonathan Irwin. Michael is a South African entrepreneur who was previously CTO of OneCart, which was acquired by Walmart’s Massmart division. That’s a meaningful credential because it means he’s built and scaled production infrastructure that a Fortune 1 company deemed worth acquiring. Jonathan has 8+ years as a JavaScript developer in lead roles, with a finance background from the University of Cape Town and data science training from Tilburg University. They came through YC’s Winter 2022 batch and are based in New York.
The South African tech scene doesn’t get enough attention in Silicon Valley, and it produces founders with a particular kind of resourcefulness. Building tech products in markets with infrastructure constraints teaches you to optimize aggressively, which is exactly the skill set you need when your product’s value proposition is “we make GPUs cheaper to use.”
The technical pitch centers on solving the cold start problem. If Cerebrium can get model loading times down to something that feels instant to end users, the serverless model works. If cold starts are noticeable, teams will keep running persistent instances because user experience trumps cost savings every time.
From what I can see, the platform is live and actively serving customers. The website presents Cerebrium as serverless infrastructure specifically designed for AI, which is a clean positioning. They’re not trying to be a general-purpose cloud provider. They’re focused on the specific problem of running AI models without managing servers.
The pricing model matters enormously here, and I’d want to see a direct comparison with Modal and the hyperscaler options. Serverless GPU pricing is tricky because the cold start overhead gets baked into per-request costs. If Cerebrium’s cold starts are long, the effective per-request price goes up even if the sticker price looks competitive.
The Verdict
I think the serverless GPU market is real and growing fast. Every company experimenting with AI models hits the infrastructure wall eventually. Either they’re overpaying for idle GPUs or they’re dealing with the operational complexity of autoscaling GPU clusters. Serverless is the obvious answer, and the question is just which platform executes best.
At 30 days, I’d want to see cold start benchmarks against Modal and Replicate on the same models. That’s the number that determines whether Cerebrium is competitive or an also-ran. Cold start performance is the entire game in serverless GPU.
At 60 days, the question is developer experience. Can a team go from “we have a model” to “it’s running in production” in an afternoon? Modal has set a high bar here with their Python-native workflow. Cerebrium needs to be at least as smooth.
At 90 days, I’d look at cost comparisons on real production workloads. The serverless pitch is that you save money on idle time. But if the per-request cost is high enough, the savings disappear for workloads with consistent traffic. The sweet spot is bursty, variable-load inference, and I’d want to see case studies proving the math works.
The founding team’s background is a genuine strength. Having built and sold production infrastructure before means they understand what enterprise customers actually need, which is reliability and cost predictability, not just a cool developer experience.
Cerebrium is in a good market with strong competition. The winners will be decided by cold start performance, pricing transparency, and developer ergonomics. All three of those are execution problems, not vision problems, and execution is where prior operating experience pays off.