DeepSeek dropped a preview of V4 this week, and the headline numbers are hard to ignore: a 1.6 trillion parameter model and a 1 million token context window shipping as open weights.
That’s the V4-Pro. There’s also a V4-Flash variant at 284 billion parameters, which is more approachable for teams that can’t throw a data center at inference. Both models use a hybrid attention architecture that DeepSeek says cuts compute and memory costs significantly compared to standard transformer setups. The context window alone is the kind of spec that makes people stop scrolling.
Let me explain why 1 million tokens matters, because the number sounds impressive in a vague marketing way but the implications are real. One million tokens is roughly 750,000 words. That’s the entire Lord of the Rings trilogy, with room to spare. It means you can feed a model your entire codebase, your entire document history, a year’s worth of support tickets, and ask a question that requires synthesizing all of it. Most production systems today are still working around context limits, chunking documents, building retrieval pipelines, doing all the plumbing that exists specifically because models forget things after 128,000 or 200,000 tokens. A context window this size doesn’t eliminate that problem entirely, but it makes a huge chunk of it just go away.
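The arithmetic is easy to check. Here's a rough sketch, assuming the common rule of thumb of ~0.75 English words per token (an approximation, not an exact tokenizer measurement):

```python
# Back-of-the-envelope: does a corpus fit in a 1M-token window?
# Assumes ~0.75 English words per token, a common rule of thumb for
# English prose -- real token counts depend on the tokenizer.

WORDS_PER_TOKEN = 0.75

def approx_tokens(word_count: int) -> int:
    """Estimate token count from a word count."""
    return round(word_count / WORDS_PER_TOKEN)

context_window = 1_000_000
lotr_words = 480_000  # the trilogy is roughly 480k words

print(approx_tokens(lotr_words))                     # ~640,000 tokens
print(approx_tokens(lotr_words) <= context_window)   # fits, with room to spare
```

At that ratio, even the full trilogy leaves roughly a third of the window free.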
The architecture choice here is interesting. DeepSeek is using what they’re calling a hybrid attention approach, which mixes different attention mechanisms to handle long contexts without the memory costs ballooning into something unrunnable. Full attention over a million tokens would be computationally catastrophic. The memory requirements scale quadratically with context length under naive attention, which is why most labs cap things where they do. DeepSeek’s engineering on this is genuinely the interesting part of the release, not just the parameter count. Mixture of Experts architectures (the “MoE” in the model family name) have become the standard way to get massive theoretical parameter counts while keeping active compute reasonable. V4-Pro at 1.6T parameters sounds enormous, but MoE models only activate a fraction of those parameters on any given forward pass. The actual inference cost is much lower than the headline number implies.
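The scaling problem is easy to see with numbers. The sketch below computes the memory for a single naive attention score matrix at 1M tokens, and an MoE active-parameter estimate; since DeepSeek hasn't published V4's expert configuration, the active fraction here is borrowed from V3 (37B active of 671B total) as an illustrative assumption:

```python
# Why naive attention at 1M tokens is untenable: the attention score
# matrix alone is O(n^2) in sequence length, per head, per layer.

def attn_matrix_bytes(seq_len: int, bytes_per_el: int = 2) -> int:
    """Memory for one head's n x n score matrix in fp16/bf16."""
    return seq_len * seq_len * bytes_per_el

one_million = 1_000_000
print(f"{attn_matrix_bytes(one_million) / 1e9:,.0f} GB")  # 2,000 GB per head per layer

# MoE: total vs. active parameters. DeepSeek-V3 activated ~37B of its
# 671B total (~5.5%); V4's ratio is unpublished, so this fraction is
# an assumption for illustration only.
total_params = 1.6e12
assumed_active_fraction = 37 / 671
print(f"~{total_params * assumed_active_fraction / 1e9:.0f}B active")
```

Two terabytes for one head's score matrix is why nobody runs full attention at this length, and why the hybrid approach is the part worth scrutinizing.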
DeepSeek has been doing this for a while now. Their V3 series showed that a Chinese AI lab running lean could produce models that competed with the best American frontier labs on benchmarks, often at a fraction of the reported training cost. That pattern has continued through R1 and the various point releases since. V4 looks like another step in that direction, with the context window as the new capability being pushed forward.
Worth being clear about what this release is: a preview. DeepSeek released model weights on Hugging Face under the deepseek-ai organization, and the V4 collection was updated recently with both V4-Pro and V4-Flash available. The collection page already has 498 upvotes, which tells you the developer community showed up immediately. Benchmark numbers, full technical papers, API pricing, and production guidance are presumably coming later.

Open weights matter. A lot.
The distinction between “open source” and “open weights” is one people argue about constantly in AI circles, and it’s a real distinction. Open weights means you can download the model, run it yourself, fine-tune it, build on top of it. It does not necessarily mean you get training code, full data details, or commercial freedom without restrictions. DeepSeek has generally been permissive with their releases compared to some alternatives, which is a big part of why they have the developer following they do. The Hugging Face model hub currently hosts their entire collection going back through DeepSeek-LLM, DeepSeek-Coder, DeepSeek-VL2, and the R1 series. That’s a lot of freely available foundation work.
The practical consequence of open weights at this scale is that self-hosted inference becomes a real option for organizations with the hardware budget. For anyone building on AI right now, the dependency on API providers is a genuine business risk. Providers change pricing. They change terms. They change rate limits. They deprecate models on timelines that don’t match your deployment schedules. Running your own weights eliminates that risk, at the cost of significant infrastructure overhead. V4-Pro at 1.6T parameters is not something you run on a consumer GPU cluster. But the V4-Flash variant at 284B is more tractable, and quantized versions will almost certainly appear within days of the full release.
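How tractable is "tractable"? A rough weight-memory estimate for the 284B variant at common precisions gives a sense of the hardware floor. This counts weights only; KV cache (which grows with that million-token context), activations, and framework overhead add substantially on top:

```python
# Rough weight-only memory footprint for self-hosting a 284B-parameter
# model at common precisions. Ignores KV cache and activation memory,
# which are significant at long context lengths.

PARAMS = 284e9

def weight_gb(params: float, bits_per_param: float) -> float:
    """Gigabytes needed to hold the weights at a given precision."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_gb(PARAMS, bits):,.0f} GB")
# fp16/bf16: ~568 GB, int8: ~284 GB, int4: ~142 GB
```

Even at 4-bit quantization you're looking at multi-GPU territory, but it's server-rack territory rather than data-center territory.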
The team at DeepSeek has built an unusual position in the market. They’re affiliated with the Chinese quantitative trading firm High-Flyer, which funded the initial compute investment, but they publish research and release weights openly in a way that has earned genuine respect from the broader AI research community. Their papers on MoE training efficiency were read carefully by people at every major lab. Whether that openness continues as geopolitical pressure around AI hardware and model access increases is a real question, but for now the releases keep coming.
I’ve been watching the community reaction since the Hugging Face drop and it splits pretty predictably. Researchers are focused on the hybrid attention claims and want the technical paper to validate the context scaling approach. Developers are asking immediately about quantization, about vLLM support, about what inference hardware you actually need to run Flash at reasonable throughput. Enterprise people are asking about licensing. All three groups are asking at the same time in the same Discord servers, which is its own kind of chaos.
The 1 million token context window is going to get a lot of attention in benchmark discussions, but context length benchmarks are notoriously easy to game. The real test is whether the model actually retrieves and uses information from the middle of a million-token context, which is where models typically fall apart. The “lost in the middle” problem is well-documented in the literature: you can technically have a large context window, but the model’s effective attention degrades badly on content that isn’t near the beginning or end of the input. DeepSeek’s hybrid attention claims are specifically aimed at this problem, but I’d want to see independent evals before trusting that it’s solved rather than reduced.
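The standard way to probe this is a needle-in-a-haystack sweep: bury one relevant sentence at varying depths in filler text and check whether the model can retrieve it. Here's a minimal prompt builder illustrating the structure; the filler and needle text are invented for illustration, and a real eval would sweep depths and context lengths and score actual model responses (the inference call is omitted here):

```python
# Minimal needle-in-a-haystack prompt builder, the common technique for
# probing lost-in-the-middle behavior. The filler and needle strings
# are illustrative placeholders, not a real benchmark's content.

FILLER = "The sky was a clear blue that afternoon. "
NEEDLE = "The secret passphrase is 'hybrid-attention'. "

def build_prompt(n_filler: int, depth: float) -> str:
    """Place the needle at `depth` (0.0 = start, 1.0 = end) among n_filler filler sentences."""
    pos = int(n_filler * depth)
    body = FILLER * pos + NEEDLE + FILLER * (n_filler - pos)
    return body + "\nQuestion: what is the secret passphrase?"

# Sweep depths; mid-context placements are where models typically degrade.
prompts = [build_prompt(10_000, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(all(NEEDLE in p for p in prompts))  # True
```

Retrieval accuracy plotted across that depth axis is exactly the chart independent evaluators will produce for V4, and the mid-depth region is the one to watch.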
The fact that they’re releasing two variants is smart product thinking. Flash at 284B is the one most teams will actually try first. Pro at 1.6T is what you benchmark against and aspire to run when you can afford it. Having both available at launch means DeepSeek captures developers at multiple infrastructure levels simultaneously. It’s the same playbook as their V3 release, where different size tiers let different kinds of organizations engage with the work.
On Product Hunt, V4 hit the daily top three shortly after launch, which is a reasonable signal that the developer audience noticed.
The broader picture here is that open-weight frontier models have gotten serious fast. Eighteen months ago the conventional wisdom was that truly frontier capability required closed systems and proprietary infrastructure. That’s looking increasingly questionable. DeepSeek isn’t the only lab pushing on this, but they’re one of the most consistent ones, and V4 continues a streak of releases that have forced people to update their priors about what a non-Big-Tech lab can ship.
For developers building production AI applications right now, the calculus on foundation model selection is genuinely complicated. The Stanford HAI AI Index has been tracking the narrowing gap between open and closed models for the past two years, and DeepSeek’s work shows up prominently in that analysis. V4 is another data point in the same direction. The question isn’t whether open-weight models can compete on capability anymore. The question is whether the inference infrastructure and tooling ecosystem catches up fast enough to make self-hosting practical for mainstream engineering teams, not just the ones with dedicated ML platform engineers.
DeepSeek has been releasing weights steadily since their earliest public models. V4 preview is what happens when a well-resourced lab treats open release as a feature rather than a liability: a hybrid attention architecture designed specifically to make a 1 million token window run on something less than a room full of H100s.