
A week on from the soon-to-be-legendary market meltdown DeepSeek triggered, the noise is subsiding. With more data coming to light, we can finally start to parse out what’s what, away from the money-people’s frenzy and the splashy headlines.
Is the 30x price reduction a fair assessment?
No. Many headlines have cited a $6m training cost for DeepSeek V3. This is wrong. The $6m does not include “costs associated with prior research and ablation experiments on architectures, algorithms and data”. The pre-training run is only a narrow slice of the total cost: it excludes important pieces of the puzzle like R&D and the total cost of ownership (TCO) of the hardware itself, as well as potential subsidies from the Chinese state.
For reference, Claude 3.5 Sonnet cost $10s of millions to train, and if that were the total outlay Anthropic needed, it would not be raising billions from Google and $10s of billions from Amazon.
$500M in CAPEX is a more likely figure according to SemiAnalysis, given the rumours of a 10k-A100 cluster circulating around High-Flyer, DeepSeek’s owner. High-Flyer and DeepSeek today often share resources, both human and computational. There is also talk of 50k H100s potentially smuggled through Malaysia.
Furthermore, comparing training costs between models trained at different times is inherently flawed: training efficiency has been improving non-stop. First movers have always spent more… and it seems that DeepSeek may have “distilled” OpenAI’s models. Newcomers stand on the shoulders of giants; that’s just how science works.
As Cohere co-founder Nick Frosst said in late January, “It’s been clear for some time now that innovating and creating greater efficiencies — rather than just throwing unlimited compute at the problem — will spur the next round of technology breakthroughs. This is a clarifying moment when people are realizing what’s long been obvious”.
What do we know about DeepSeek performance?
We know that R1 is comparable to OpenAI’s o1 from a quality perspective (on some benchmarks, not all), although it lags o3.
We also know that DeepSeek’s models incorporate important breakthroughs, highlighting a path to more cost-effective AI.
Of note: FP8 mixed-precision training, Multi-head Latent Attention (MLA), Multi-Token Prediction (MTP) and auxiliary-loss-free load balancing, all of which increase efficiency.
There are, however, several factors ensuring that DeepSeek’s overall GPU requirements are unlikely to decline:
FP8 mixed-precision training is excellent for large language models (LLMs) because of its efficiency in handling massive datasets and parameter counts. The same cannot be said, however, for more complex tasks that require higher numerical stability, precision, or dynamic range (a minimal quantization sketch follows this list).
Multi-Token Prediction allows DeepSeek to predict multiple tokens per step, improving inference throughput by up to 1.8x and reducing the per-task GPU load in real-time applications like chatbots or coding assistants. However, it comes with challenges such as prediction errors, low acceptance rates, and increased verification complexity, which make it less suitable for applications requiring strict accuracy, fine-grained control, or high context dependency (e.g., code generation, formal logic constructs). A toy draft-and-verify sketch also follows below.
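To make the FP8 trade-off concrete, here is a minimal sketch in plain Python/NumPy of per-tensor scaling into an e4m3-style range followed by coarse mantissa rounding. It illustrates the general technique, not DeepSeek’s training code: the function names and constants are invented, and real FP8 training relies on hardware tensor-core support while keeping master weights and accumulators in higher precision.

```python
import numpy as np

# Illustrative FP8-style (e4m3) quantization with per-tensor scaling.
# Not DeepSeek's implementation: names and constants here are assumptions.

E4M3_MAX = 448.0  # largest magnitude representable in the e4m3 format


def quantize_e4m3_like(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Scale a tensor into the e4m3 range, then keep only ~3 mantissa bits."""
    scale = E4M3_MAX / max(float(np.abs(x).max()), 1e-12)
    x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    mantissa, exponent = np.frexp(x_scaled)        # x = mantissa * 2**exponent
    mantissa_q = np.round(mantissa * 16.0) / 16.0  # coarse mantissa grid
    return np.ldexp(mantissa_q, exponent), scale


def fp8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply quantized tensors, accumulate and de-scale in full precision."""
    a_q, scale_a = quantize_e4m3_like(a)
    b_q, scale_b = quantize_e4m3_like(b)
    return (a_q @ b_q) / (scale_a * scale_b)


rng = np.random.default_rng(0)
a, b = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
err = np.abs(fp8_matmul(a, b) - a @ b).mean()
print(f"mean abs error vs full-precision matmul: {err:.4f}")
```

The error stays small for well-conditioned activations, which is exactly why the coarse format struggles when dynamic range or numerical stability matters more.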
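And here is a toy sketch of the draft-and-verify loop behind multi-token prediction when used to speed up decoding. Again, this is not DeepSeek’s MTP module: the proposer and verifier are random stand-ins, and the 70% acceptance rate is an assumption chosen purely to show how acceptance drives the effective speed-up.

```python
import random

K = 4  # tokens proposed per decoding step (illustrative)


def draft_tokens(context: list[int], k: int = K) -> list[int]:
    """Stand-in for an MTP head proposing k future tokens at once."""
    return [random.randint(0, 999) for _ in range(k)]


def verify_token(context: list[int], token: int) -> bool:
    """Stand-in for the main model checking one proposed token."""
    return random.random() < 0.7  # assumed acceptance rate, purely illustrative


def tokens_per_step(num_tokens: int) -> float:
    """Decode num_tokens and report how many tokens each step yields on average."""
    out: list[int] = []
    steps = 0
    while len(out) < num_tokens:
        steps += 1
        accepted = 0
        for tok in draft_tokens(out):
            if not verify_token(out, tok):
                break  # reject this token and everything drafted after it
            out.append(tok)
            accepted += 1
        if accepted == 0:
            out.append(draft_tokens(out, 1)[0])  # fall back to one token per step
    return len(out) / steps


print(f"effective tokens per step: {tokens_per_step(2000):.2f}")
```

With a 70% acceptance rate this lands at roughly two tokens per step; drive acceptance down, as happens on strict, high-context tasks, and the speed-up evaporates, which is the limitation noted above.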
This market is heated and looking for a reason to sneeze. The fact that this news came from China was a bigger trigger than the technological improvements themselves.
So… are the training improvements real?
Efficiency gains are likely to produce roughly 30% improvements in training, according to ex-Google Tech Lead Daniel Golding. There are a few ways to read this as a positive for the market.
Firstly, we don’t have enough power, capital, data centers, or chips to meet the demand as currently forecasted in the most optimistic cases (80 GW+). Thus, improvements had to happen.
In addition, the market cannot really support 300 kW racks with today’s technology (liquid cooling is great but costly and carries risks). If efficiency improvements slow the curve of increasing power densities, we’ll be very lucky indeed; the alternative is a bunch of rapidly obsolete data center capacity and rapidly increasing per-MW build costs.
A 30% improvement doesn’t mean we need 30% fewer chips or data centers; it means we get 30% more capability out of the same footprint. Goldman Sachs has lamented that AI isn’t delivering enough ROI; well, this increases the gain significantly. There is a recurring fallacy in IT that a 10% efficiency gain means 10% less data center capacity and 10% fewer servers, and it has been repeatedly disproven.
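A back-of-the-envelope illustration of that point (all figures invented for the example, not forecasts): at a fixed budget, a 30% efficiency gain buys more tokens rather than fewer GPUs, and if cheaper tokens grow demand faster than efficiency improves, total compute demand still rises.

```python
# Toy arithmetic: fixed budget, 30% efficiency gain. All numbers are illustrative.
budget_usd = 100_000_000      # assumed annual compute budget
cost_per_m_tokens = 10.0      # assumed $ per million tokens before the gain
efficiency_gain = 0.30

tokens_before = budget_usd / cost_per_m_tokens
tokens_after = budget_usd / (cost_per_m_tokens * (1 - efficiency_gain))
print(f"tokens served at the same spend: +{tokens_after / tokens_before - 1:.0%}")  # ~+43%

# Jevons-style twist: if cheaper tokens double demand (assumed elasticity),
# the compute required still exceeds today's, despite the per-token saving.
demand_growth = 2.0
print(f"compute needed vs. today: x{demand_growth * (1 - efficiency_gain):.2f}")  # x1.40
```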
Wait, are you talking about this Jevons paradox I keep hearing about?
Yes. It’s an over-used model, but a useful one. Any shift toward cheaper, more powerful, and less energy-intensive algorithms has the potential to significantly expand AI adoption and the total addressable market, which could ultimately fuel demand for both large-scale and distributed data center infrastructure.
That, in turn, means that AI companies may be able to achieve very powerful capabilities with far less investment than previously thought. And it suggests that we may soon see a flood of investment into smaller AI start-ups, and much more competition for the giants of Silicon Valley (which, because of the enormous costs of training their models, have mostly been competing with each other until now).
Efficiencies reduce cost per task, but total GPU utilization increases as more tasks, larger models, and broader applications are adopted.
This view has been backed by a few ecosystem players. “Because the value of having a more intelligent system is so high,” wrote Anthropic cofounder Dario Amodei, it “causes companies to spend more, not less, on training models”. Baxtel VP of Sales & Operations Mitch Lenzi concurred, saying that “innovation in AI doesn’t reduce demand — it fuels it. As AI becomes more accessible and cost-effective, the industry will see continued expansion, maintaining the need for high-performance data center infrastructure”.
Speaking of cost per task: what about inferencing?
Training a model is only the beginning: using it also consumes compute and energy, in the same way that building Google is one thing and searching it is another.
The first thing to say here is that the cost of inferencing always needed to scale down for AI to reach mass adoption. In fact, inference costs have been coming down every month; gains in efficiency are a given, and the landscape is not drastically changed for those who understand it. New breakthroughs (likely spurred by geopolitical tensions) can only accelerate what is already underway.
BUT… there are some indications that DeepSeek’s models are less efficient at inference than they let on. The energy saved in training is offset by more compute-intensive techniques for answering questions, and by the long answers those techniques produce.
Preliminary tests on 40 prompts suggest that DeepSeek has a similar energy efficiency to Meta’s models overall, but tends to generate much longer responses and therefore uses 87% more energy.
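To unpack how “similar efficiency” and “87% more energy” can both be true, a quick back-of-the-envelope check (the per-token energy and token counts below are invented; only the ratio matters):

```python
# If per-token energy is roughly the same but responses are ~1.87x longer,
# total energy per answer ends up ~87% higher. Numbers below are illustrative.
energy_per_token_j = 0.5                     # assumed, identical for both models
meta_tokens, deepseek_tokens = 1_000, 1_870  # assumed average response lengths

meta_energy = energy_per_token_j * meta_tokens
deepseek_energy = energy_per_token_j * deepseek_tokens
print(f"extra energy per response: +{deepseek_energy / meta_energy - 1:.0%}")  # +87%
```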
Researcher Sasha Luccioni commented that “If we started adopting this paradigm widely, inference energy usage would skyrocket. If all of the models that are released are more compute intensive and become chain-of-thought, then it completely voids any efficiency gains”.
So… what will change, really?
What will change is the type of data centers built. The move from primarily building training capacity to constructing inference sites has been talked about for some time, and the advances seen here do not significantly accelerate that trend. The industry will move from an 80/20 training-to-inference split in new construction in 2025 to a 20/80 split in 2029.
The biggest risk to the current “AI infrastructure” players is that a distilled version of DeepSeek’s models can run locally at the edge on a high-end workstation. That means a similar model will run on a superphone in roughly two years. If inference moves to the edge because it is “good enough,” we are living in a very different world with very different winners, i.e. the biggest PC and smartphone upgrade cycle we have ever seen.
BUT batching massively lowers costs and more compute increases tokens per second, so inference in the cloud still has a lot of advantages. We need to prepare for more on-device AI linked to data centers, likely at lower densities.
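A toy model of why batching keeps cloud inference attractive (the constants are invented; real numbers depend on model size, hardware and memory bandwidth): the fixed cost of streaming weights through the chip is paid once per forward pass, so the cost per request falls sharply as the batch grows.

```python
# Toy cost model for batched inference. All constants are illustrative assumptions.
WEIGHT_LOAD_MS = 50.0   # fixed cost per forward pass (streaming weights from memory)
PER_REQUEST_MS = 2.0    # marginal compute per extra request in the batch


def time_per_request_ms(batch_size: int) -> float:
    """Amortise the fixed weight-loading cost across the whole batch."""
    return (WEIGHT_LOAD_MS + PER_REQUEST_MS * batch_size) / batch_size


for batch in (1, 8, 64):
    print(f"batch {batch:>2}: {time_per_request_ms(batch):.1f} ms per request")
# batch  1: ~52.0 ms per request
# batch  8:  ~8.2 ms per request
# batch 64:  ~2.8 ms per request
```

An edge device serving a single user never gets that amortisation, which is why a hybrid of on-device and data-center inference looks like the likelier outcome.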
In conclusion…
The biggest winners are the builders.
More efficient compute doesn’t mean you need less compute: it allows the industry to apply more compute at inference time in order to generate a higher level of intelligence and a higher quality of service (crucial for agentic AI, which the industry is turning to). As intelligence gets cheaper, we will throw more brute-force intelligence at every one of the world’s key problems.
DeepSeek’s innovations are real, but they don’t upend AI infrastructure economics. CAPEX investment remains key, inference is moving toward the edge, but cloud inference will continue dominating due to batching advantages. This is an evolution, not a revolution.
This is why a lot of market players have been reassuring. Mark Zuckerberg, during Meta’s latest post-earnings call, said that he continues “to think that investing very heavily in CAPEX & infra. is going to be a strategic advantage over time”. He’s right. Over time.
BUT, there is a chance the “deflator” of better models outweighs the increased usage in the short term. There is a pertinent case study: DWDM (dense wavelength-division multiplexing) massively increased fiber supply. And so, while the Jevons paradox case was 100% correct in the long run, 97% of the fiber laid in 2001 was unlit. Most of that fiber is lit today, but the Jevons paradox (overhyped right now) can be right in the long run while, in the short run, the companies involved are entirely divorced from reality.
Confusing? Yes. But let’s be honest, that’s the point. Each day brings new offerings, opportunities, challenges and products within the AI sphere. We have a ringside seat to a technology that is constantly changing, evolving and challenging the way we work and interact with technology. This is a blessing.
Good luck out there.