Assessing Reactions to Chinese AI Startup DeepSeek and the Firm's Advanced Open Source/Weight AI Models
Complex issues around access to training and inference hardware also put a spotlight on hardware scaling and US export controls.
Note: This is a complex issue with many angles, as we will see below, and an evolving story. This is a preliminary snapshot of some of the key issues around the emergence of DeepSeek, its position in the Chinese AI space, and the implications for US-China competition in AI. It is the first of a series of posts on this issue that will be released in the coming weeks, as I learn more about DeepSeek, talk to people in the sector in China and outside, and continue to deepen my understanding of both the technical details of the evolving AI model training landscape in China and the geopolitical ramifications of innovative efforts by Chinese companies to overcome restrictions on access to advanced GPUs. The story also involves US efforts, via projects such as Stargate, to develop massive AI training and inference capacity, both in the US and internationally, as part of the US-China competition. The DeepSeek story is now officially roiling Silicon Valley, with open source/weight model leader Meta particularly concerned—Meta has set up several “war rooms”1 to take apart DeepSeek’s models. Let the games begin…
Statement attributed to Nvidia spokesperson on Monday:
“DeepSeek is an excellent AI advancement and a perfect example of Test Time Scaling. DeepSeek’s work illustrates how new models can be created using that technique, leveraging widely-available models and compute that is fully export control compliant. Inference requires significant numbers of NVIDIA GPUs and high-performance networking. We now have three scaling laws: pre-training and post-training, which continue, and new test-time scaling.”
The most interesting developments in the AI sector over the past several weeks involve the “emergence” of Chinese startup DeepSeek and the reactions from different quarters, both to the performance of the firm’s V3 and new R1 reasoning models and to DeepSeek’s papers asserting that it trained them with far fewer, and far less capable, GPUs than those available to leading western model developers. For a good analysis of the R1 model, see Tony Peng’s piece on Recode. Much has been written over the past few weeks about the origins of DeepSeek and what its approach to model development portends for China’s AI sector. Tony sums the current situation up well:
“The big picture is the Chinese AI labs are narrowing the gap with their U.S. counterparts. They are good at rapidly adopting and building on proven technologies. Once a direction in AI research is validated as effective, they can efficiently implement it and use their engineering strengths to deliver competitive or even superior results while using fewer resources.”
Tony also notes the increasing numbers of Chinese AI companies developing reasoning models that are comparable to leading western models for some benchmarks: “Moonshot AI, a $3.3 billion-valued Alibaba-backed AI startup, introduced Kimi k1.5, a multimodal reasoning LLM that also matches o1 on math tasks.”
The really interesting part of the equation is that we could be in a situation where Chinese open source models see high rates of adoption by developers around the world, including in the US—a situation that is unprecedented and was never envisaged by the designers of US export controls. Here the export controls would have contributed not only to significant innovation by Chinese AI developers but also to the uptake of a Chinese model by US researchers and companies, something the US policy toolbox has no easy tool to prevent happening at scale. We are in the early innings of the open source/weight contest. While DeepSeek’s ability to continue training and innovating on successor R2 and R3 models remains unclear, at a minimum DeepSeek has thrown down quite a challenge to the AI community and the US government.
“DeepSeek’s reasoning model is one of the most amazing and impressive breakthroughs I’ve ever seen—and as open source, a profound gift to the world.” —Venture capitalist and tech investor Marc Andreessen, last week
The new reasoning models such as OpenAI’s o1 (I use the pro mode version), DeepSeek R1 and Kimi k1.5 are all part of a paradigm shift known as test-time compute. Test-time compute refers to the computational resources and processing effort required for a machine learning model to generate predictions (i.e., perform inference) once it has already been trained. In other words, it’s the cost—measured in time, memory, or FLOPs—of running the model on new, unseen inputs. This approach appears to be where DeepSeek is particularly adept, reducing the costs of inference, which will become a big differentiator for open-source models as the industry overall shifts the mix of compute from training towards inference as more applications are deployed.
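To make the definition concrete, a common back-of-envelope rule is that a transformer decoder spends roughly two FLOPs per active parameter per generated token. Here is a minimal sketch using that rule of thumb (the 37B active / 671B total parameter counts for DeepSeek-V3/R1 come from DeepSeek's papers; everything else is illustrative, not any lab's actual accounting):

```python
# Back-of-envelope sketch of test-time compute, using the ~2 FLOPs per
# active parameter per token rule of thumb for transformer decoders.
ACTIVE_PARAMS = 37e9  # DeepSeek-V3/R1 activate ~37B of 671B total parameters per token
FLOPS_PER_TOKEN = 2 * ACTIVE_PARAMS

def inference_flops(output_tokens: int) -> float:
    """Approximate FLOPs to generate output_tokens tokens (ignores attention/KV-cache costs)."""
    return FLOPS_PER_TOKEN * output_tokens

# A short answer versus a long reasoning trace: test-time scaling in action.
for tokens in (200, 10_000):
    print(f"{tokens:>6} tokens -> ~{inference_flops(tokens) / 1e12:.0f} TFLOPs")
```

The point of the reasoning paradigm is the second case: a long chain of thought buys better answers by spending far more compute at inference time rather than at training time, which is why inference efficiency becomes the key cost lever.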
The emergence of DeepSeek has thus been intriguing and has invited focused analysis of these issues. However, much like Stargate, the attention is not new: last summer, with the firm’s V2 model, DeepSeek was already attracting notice from people who closely watch China’s dynamic AI space. A US AI company executive noted last June that V2 “came close to matching Meta’s latest Llama 3 model but with lower pricing. Its price is about 100th the cost of OpenAI’s GPT-4 and a fifth of Anthropic’s Claude 3 Haiku.” Stargate, likewise, has been under development since early last year and did not commence with the Trump press conference last week.
As I have written, the emergence of DeepSeek as a significant player in the AI space has been surprising to Chinese bureaucrats in Beijing eager to see Chinese companies remain competitive in the face of unprecedented US export controls and the recent AI Diffusion Rule. It has been equally surprising to the CEOs of OpenAI and Meta at venues like Davos, where the issue dominated discussions among the gathered financial and tech elites. The idea that a hedge fund staffed with quants could pivot to pursuing artificial general intelligence (AGI)—employing talented software and hardware engineers with access to GPU clusters (we’ll get to which ones)—and quickly produce models performing near the top of benchmark scoreboards is certainly something to ponder. However, particularly in the age of Stargate, the issue is more complex than some have rushed to judge. (To be fair, some observers have suggested that DeepSeek’s spin-off from Hangzhou-based hedge fund High-Flyer Capital Management was also driven by issues such as calls for more controls over automated quantitative trading and a Chinese government crackdown on quants last year.)
“We believe AGI is the violent beauty of model x data x computing power. Embark on a ‘deep quest’ with us on the journey towards AGI!”—Job recruitment advertisement for DeepSeek
One important view, well articulated recently by Angela Zhang, was that DeepSeek’s claims to have trained its V3 model on 2048 H800 Nvidia GPUs showed that US export controls have accelerated innovation by Chinese companies, enabling them to catch up to western leaders. This is clearly a complex issue to unpack and make strong assertions about. Let’s begin by looking at the basis for DeepSeek’s claims and the apparent success of its models—including V3 and the newest R1 model, both open source/weight under the MIT license—in storming parts of the AI world, including rapid uptake in research and other areas where open source/weight models have been gaining traction. The numbers for R1 on well-known benchmarks, for example, are impressive, and account for what appears to be broad uptake of this model—now considered among the best, if not the best, of open source/weight models.
The Technical Report in which DeepSeek discusses its training approach to V3 is important to read before jumping to conclusions. In the paper, DeepSeek lays out the various pieces of the model architecture it sought to pull together to achieve “efficient inference and cost-effective training.” These include a Mixture-of-Experts (MoE) design using Multi-head Latent Attention (MLA), the DeepSeekMoE architecture, Supervised Fine-Tuning and Reinforcement Learning, and more. The authors claim that they used only 2.788M H800 GPU hours for full training, at a cost of $5.576 million. Of particular interest was the ability of the model developers to optimize cross-node and intra-node communications using InfiniBand and NVLink, two different technologies for interconnecting GPUs and distributing compute loads.
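The headline cost figure is simple arithmetic from the paper's own inputs; the report assumes a rental price of $2 per H800 GPU-hour:

```python
# Reproducing the headline training cost from the DeepSeek-V3 technical report.
gpu_hours = 2.788e6   # total H800 GPU-hours reported for full training
rental_rate = 2.00    # USD per GPU-hour, the rental price assumed in the paper
print(f"${gpu_hours * rental_rate / 1e6:.3f}M")  # -> $5.576M
```

Note that, as the paper itself states, this figure covers only the final training run, not prior research, ablation experiments, data, or staff.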
Here is a small sample from the paper: “To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic.”
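A minimal sketch of why that dispatch cap matters (the hidden size, FP8 payload, and eight routed experts per token are taken from the V3 paper; the routing model below is my own simplification, not the paper's kernel design): bounding the number of destination nodes bounds the InfiniBand bytes each token can generate.

```python
# Sketch: capping token dispatch to at most NODE_CAP nodes bounds InfiniBand traffic.
HIDDEN = 7168       # DeepSeek-V3 hidden dimension
BYTES_PER_ELEM = 1  # FP8 activations -> one byte per element
NODE_CAP = 4        # the dispatch limit described in the paper

def ib_bytes_per_token(destination_nodes: set[int], node_cap: int) -> int:
    """Cross-node bytes for one token: one activation payload per distinct destination node."""
    return min(len(destination_nodes), node_cap) * HIDDEN * BYTES_PER_ELEM

# Worst case: a token's 8 routed experts land on 8 different nodes.
print(ib_bytes_per_token(set(range(8)), node_cap=8))  # uncapped: 57,344 bytes
print(ib_bytes_per_token(set(range(8)), node_cap=4))  # capped:   28,672 bytes
```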
Clearly, for example, the hardware engineers at DeepSeek understand GPU-to-GPU communications much better than the officials at the Commerce Department who wrote the rules on GPU-to-GPU transfer rates that prompted Nvidia to redesign the H100 into the H800. And just as clearly, good hardware and software engineers innovating in the real world trump writers of export control regulations trying to draw lines around how technology is accessed and developed.
The V3 paper at the least suggests that hardware engineers can overcome things like export control-mandated reductions in transfer rates that US government bureaucrats believed would slow down China’s ability to develop advanced or frontier models:
Although the H800 (and similar “export-compliant” GPUs) officially cut GPU-to-GPU bandwidth below the H100 thresholds, numerous architectural and software-level strategies can mitigate or even circumvent the resulting bottlenecks in real-world AI/HPC workloads. By carefully choreographing how data moves among GPUs (compression, scheduling, partial topologies, caching optimizations), designers can enable performance comparable to (or in certain workflows exceeding) the standard H100—even though the nominal link specification is lower.
This is presumably the core thesis of the alphaXiv article on DeepSeek-V3 (arXiv:2412.19437): that through creative connectivity optimizations and advanced HPC/AI software stacks, these “compliance-limited” GPUs can come surprisingly close to the flagship, unrestricted H100 in practical usage. (Thanks to ChatGPT o1 pro mode for an assist here.)
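To make the mitigation point concrete, here is a toy calculation using the two bandwidth figures quoted above (the payload size is illustrative, not a benchmark): on a capped link, halving the bytes on the wire, for example by moving activations from 16-bit to 8-bit precision, recovers much of the lost throughput.

```python
# Toy model: transfer time on a bandwidth-capped link, with and without compression.
def transfer_seconds(gigabytes: float, link_gb_per_s: float) -> float:
    return gigabytes / link_gb_per_s

payload = 10.0     # GB to move between GPUs (illustrative)
nvlink_bw = 160.0  # GB/s, the NVLink figure cited in the V3 paper
ib_bw = 50.0       # GB/s, the IB figure cited in the V3 paper

print(f"NVLink:    {transfer_seconds(payload, nvlink_bw):.2f}s")
print(f"IB (FP16): {transfer_seconds(payload, ib_bw):.2f}s")
print(f"IB (FP8):  {transfer_seconds(payload / 2, ib_bw):.2f}s")  # half the bytes on the wire
```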
This illustrates just one of the ways that export controls have impacted innovation cycles within the Chinese AI model development ecosystem, particularly at DeepSeek and likely at other major players like ByteDance and Huawei, companies with deep benches of hardware and software engineering experience. This does not necessarily mean that export controls are not “effective.” A full assessment of that issue requires a broader understanding of what the goal of the controls actually is, what the scope of the collateral damage to US and allied firms and the industry has been and will be, and how the controls have galvanized broader efforts within China’s semiconductor industry to collaborate and innovate faster. I have tried to tackle this issue here and here and in other articles.
What is Going on with DeepSeek and Access to GPUs: A Tale of Three or More Clusters, Training vs Inference, and API Hosting
The other and perhaps more interesting response to DeepSeek V3 and R1—and claims of reduced numbers of GPUs used—came in the form of statements from industry commentators and Scale.ai CEO Alexandr Wang questioning DeepSeek’s training cluster and cost claims, asserting that the firm actually had access to 50,000 H100 GPUs, in addition to a cluster of A100 GPUs. Wang claimed at Davos that DeepSeek and High-Flyer could not talk about these other clusters because of the sensitivity of the export control issue.
“When it comes to the Chinese accessing NVIDIA's advanced GPUs, the reality is yes and no. You know the Chinese labs, they have more H100s than, than people think. My understanding is that DeepSeek has about fifty thousand H100s….they can't talk about obviously because it is against the export controls that United States has put in place….they have more chips than other people expect."—Scale.ai CEO Alexandr Wang
So we now have at least three, and likely more, clusters at play here. Let’s look at each one:
2048 Nvidia H800s. DeepSeek claims to have used this to train V3.
10,000 Nvidia A100s. This is a High-Flyer cluster used for financial modelling, and would not likely be optimized for AI model training. The A100s were acquired before the October 2022 US export controls on advanced GPUs were released. This was the second GPU cluster built by High-Flyer.
50,000 Nvidia H100s. This is the cluster some observers claim, without evidence, that DeepSeek has access to. While it is unclear where this nice round number comes from, some may be working back from the inference compute needed to serve the R1 model and extrapolating the number of GPUs required to support existing levels of demand. But as the V3 and R1 papers indicate, DeepSeek has used approaches like MLA and distillation to reduce inference costs significantly, so any attempt to work backwards to a GPU count needs to account for how DeepSeek’s many efficiency measures change the number of GPUs required compared to other advanced models (see the sketch after this list). The distinction between training and inference GPU requirements is also at play here. It remains unclear exactly what Wang is claiming about the notional H100s and how they are allegedly being used.
X number of AMD Instinct MI200/MI300 GPUs. In an early January press release, GPU maker AMD highlighted its “long term collaboration” with DeepSeek and the integration of DeepSeek-V3 with AMD Instinct GPUs. This adds another wrinkle to the training and inference issue with respect to DeepSeek.
Huawei Ascend 910B/C GPUs. Like many other Chinese firms, DeepSeek is almost certainly working with Huawei’s advanced GPUs, likely primarily for inference, where the firm’s engineers, proficient in programming hardware directly, will be able to improve the performance of these domestically available GPUs. This is likely to be part of a broader effort to develop a Chinese version of Nvidia’s CUDA acceleration suite.2 It is likely that Huawei was able to use a third-party company to manufacture Ascend GPU dies at global foundry leader TSMC before this was discovered by TechInsights as part of a teardown of a Huawei Atlas 310T3. It is likely that Huawei was able to obtain several million Ascend 910B dies, which were then packaged with high bandwidth memory (HBM) from Samsung and/or SK Hynix in China. Should DeepSeek want to use Huawei hardware, then, there are not (or are not likely to be) any supply limitations in the short to medium term.
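As flagged in the H100 entry above, here is a sketch of the back-of-envelope logic behind any attempt to infer a GPU count from serving load. Every input is an illustrative assumption rather than a measured DeepSeek figure, which is precisely the point: the answer swings by an order of magnitude depending on the per-token cost you assume.

```python
# Sketch: working backwards from an assumed inference load to a GPU count.
# All inputs are illustrative assumptions, not measured DeepSeek figures.

def gpus_needed(tokens_per_sec: float, flops_per_token: float,
                gpu_peak_tflops: float, utilization: float) -> float:
    """GPUs required to sustain a token throughput at a given hardware efficiency."""
    demand = tokens_per_sec * flops_per_token              # FLOP/s required
    supply_per_gpu = gpu_peak_tflops * 1e12 * utilization  # usable FLOP/s per GPU
    return demand / supply_per_gpu

load = 2e6  # assumed aggregate tokens served per second

# Dense-style costing (all 671B params) vs. MoE/MLA-style costing (37B active):
for label, params in (("dense-style", 671e9), ("MoE-efficient", 37e9)):
    n = gpus_needed(load, 2 * params, gpu_peak_tflops=990, utilization=0.3)
    print(f"{label:>14}: ~{n:,.0f} GPUs")
```

Under these assumptions the dense-style estimate lands near 9,000 GPUs and the efficiency-adjusted one near 500, which is why efficiency claims and GPU-count claims cannot be evaluated separately.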
The full scope of DeepSeek’s collaboration with AMD remains unclear, as does whether DeepSeek also has access to a large cluster of AMD GPUs for training, for example. The collaborative effort appears to be focused on inference, and the press release notes that:
Leveraging AMD ROCm™ software and AMD Instinct™ GPU accelerators across key stages of DeepSeek-V3 development further strengthens a long-standing collaboration with AMD and commitment to an open software approach for AI. Scalable infrastructure from AMD enables developers to build powerful visual reasoning and understanding applications.
In addition, AMD and DeepSeek are working with the team behind the SGLang inference framework to deploy DeepSeek models for inference. To add further confusion around the issue of DeepSeek’s access to GPUs, a December X post on this collaboration thanked DataCrunch, a hosting firm, for contributing GPU resources to the effort. The post noted: “The SGLang and DeepSeek teams worked together to support DeepSeek V3 FP8 on NVIDIA and AMD GPUs from day one. SGLang has supported MLA and DP attention optimizations for several months, making it one of the top open-source engines for running DeepSeek models.”
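For context on what day-one support looks like in practice, SGLang exposes a launchable server with an OpenAI-compatible API. A minimal sketch follows; the launch command, default port, and model path reflect SGLang's public documentation at the time of writing, so treat the exact invocation as indicative rather than authoritative:

```python
# Querying a locally hosted DeepSeek model through SGLang's OpenAI-compatible
# endpoint. The server would be launched separately, along the lines of:
#
#   python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
#       --tp 8 --trust-remote-code
#
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Why does MoE reduce inference cost?"}],
)
print(resp.choices[0].message.content)
```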
Clearly DeepSeek has a well-developed strategy for collaborating with multiple GPU companies—apparently much more closely with AMD than Nvidia—and is hedging its bets by partnering with AMD to enable wider use of DeepSeek models for inference. DeepSeek’s initial reliance on Nvidia GPUs for training creates a single-vendor dependency, which carries risks, particularly given US government focus on Nvidia GPUs and exports to Chinese end users. There are any number of competitive and innovation-related reasons for partnering with AMD:
Memory bandwidth and capacity. While DeepSeek’s software optimizations for NVLink and InfiniBand are Nvidia-specific, AMD’s Instinct GPUs and Infinity Fabric interconnect provide alternative features that could align with DeepSeek’s requirements. AMD’s GPUs (e.g., the MI250X and MI300) offer HBM2e/3 memory with 1.6 TB/s+ bandwidth and up to 128 GB capacity, which is crucial for large-scale AI model training and inference. This matches or even exceeds the memory performance of Nvidia H800 GPUs, making AMD an attractive partner for inference-heavy tasks.

Compute performance. AMD Instinct GPUs provide competitive FP16/FP32 compute performance, particularly in the dense matrix multiplications central to AI training.

Unified architecture with MI300. AMD’s MI300 integrates CPU and GPU cores with unified memory, which could simplify DeepSeek’s architecture for specific workloads and reduce interconnect dependency.

An open software stack. DeepSeek may also want to invest in AMD’s ROCm (Radeon Open Compute) software stack, which offers an open and flexible alternative to Nvidia’s proprietary CUDA environment. ROCm allows deep-level software customizations similar to those DeepSeek has already performed with Nvidia’s NVLink and InfiniBand optimizations. By leveraging ROCm, DeepSeek could optimize AMD hardware for its specific workloads, achieving high performance without being locked into Nvidia’s ecosystem, and could diversify its infrastructure to reduce the impact of potential export restrictions or hardware shortages (see the sketch below).
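One concrete reason ROCm lowers switching costs, as noted in the last item above: PyTorch's ROCm builds expose the same torch.cuda device API as the Nvidia builds, so high-level model code typically runs on AMD Instinct GPUs unchanged. (Kernel- and interconnect-level optimizations like DeepSeek's NVLink work are a different story.) A minimal illustration:

```python
# The same high-level PyTorch code targets Nvidia (CUDA) or AMD (ROCm) GPUs:
# ROCm builds reuse the torch.cuda namespace, so "cuda" below means "whatever
# accelerator this PyTorch build supports." Vendor-specific interconnect
# tuning, by contrast, does not port this way.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
y = x @ x  # dispatched to cuBLAS on Nvidia, hipBLAS/rocBLAS on AMD
print(device, y.shape)
```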
DeepSeek might see partnering with AMD as part of a broader strategy to develop proprietary AI hardware expertise. By working closely with AMD, DeepSeek could gain insights into GPU architecture and software optimization, potentially positioning itself to develop custom hardware solutions in the future. This could also be about aligning with emerging global standards: as AMD continues to grow its presence in AI and high-performance computing, collaborating ensures that DeepSeek stays at the forefront of emerging technologies, including alternative interconnects, memory architectures, and AI accelerators.
The involvement of DataCrunch and hosting firm Baseten is also interesting, and suggests that DeepSeek models will be hosted for inference outside of China on advanced GPU hardware, a situation apparently not covered by US export controls. And the plot thickens further: in the X post, the DeepSeek/AMD/SGLang team offers “special thanks to Meituan’s Search & Recommend Platform Team, Baseten’s Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.” Meituan, a food delivery company, has extensive experience in AI model development, including distributed training.4 Meituan has numerous in-house GPU clusters, including Nvidia A100s acquired before US controls. Baseten is a startup model hosting platform5 that now offers APIs for DeepSeek, Alibaba, and Meta’s Llama open source/weight models, and many others.
We are in uncharted waters here. The availability of a diverse and global ecosystem of GPU clusters of various sizes for training and inference, and the increasing reluctance of Chinese AI developers to fully disclose their access to the most advanced controlled GPUs for training, will make it difficult to fully assess how advanced frontier models are being developed in China. That said, if companies such as DeepSeek continue to publish detailed papers on model development and to open source/weight these models, with the apparent approval of the Chinese government for now, it should still be possible to assess levels of access to advanced GPUs.
“We always wanted to carry out larger-scale experiments, so we’ve always aimed to deploy as much computational power as possible. We wanted to find a paradigm that can fully describe the entire financial market.”—DeepSeek CEO and founder Liang Wenfeng, speaking to Chinese tech site 36Kr last year
Proponents of US Export Controls Aghast at Claims that DeepSeek is a Game Changer
In addition to discussions around the issue of access to GPUs for training models such as DeepSeek V2, V3, and R1, some observers have been critical of the claims by Zhang about the impact of US export controls on China’s AI development ecosystem. The arguments here need to be unpacked.
One is that the DeepSeek situation does not diminish the need for export controls to prevent China from developing advanced AI. This line of argument, always questionable from a geopolitical point of view, now sounds even more tortured. The claim is that despite DeepSeek’s apparent success in driving down the costs of training and inference, export controls are still needed to continue to slow the ability of Chinese companies to develop more advanced models. But as Alvin Graylin and I, among others, have argued, there has to be a balance between controlling China’s access to advanced AI hardware for potential military use and denying Chinese society the many benefits brought by advanced AI. Miles Brundage of the Frontier Model Forum recently captured this policy dilemma nicely:
“There are multiple reasons why the U.S. has an interest in slowing down Chinese AI development. However, to be clear, this doesn’t mean we shouldn’t have a policy vision that allows China to grow their economy and have beneficial uses of AI. We don’t necessarily need to choose between letting Nvidia sell whatever they want and completely cutting off China. There should probably be something more nuanced with more fine-grained controls.”
But this nuance is seldom articulated by the cheerleaders of US export control policy from various think tanks—and now the CEOs of some leading AI companies that are aligning with the “China AI threat” narrative prevalent within the DC Beltway. Drawing moving lines around technology levels for GPUs, controlling the export of semiconductor manufacturing tools from the US, Japan, and the Netherlands to slow China down on AI, and putting the US in a position to control where advanced AI model development is done globally via the AI Diffusion Rule issued in early January would appear to be far from the “more fine-grained controls” called for by Brundage. And all of these of course create substantial collateral damage to US technology companies, global supply chains, and efforts to develop a global framework to control advanced models and the valid security risks they will pose.
Couple this with key elements of what I have called the “Sullivan Doctrine”, in which former National Security Advisor Jake Sullivan in fall 2022 stressed that the US would attempt to maintain an absolute rather than a sliding technology advantage over China in key sectors such as advanced compute, and, from Beijing’s perspective, US export controls are nothing less than an effort to constrain China’s economic development and deny China the benefits of advanced AI applications. Hence the talking points, frequently ticked off in interviews and podcasts and on display at Davos last week, that the US must “win” the AI race with China—meaning that proponents of export controls must doggedly stick to their view that these measures will eventually be effective in slowing China, despite growing evidence that the situation is actually much more complex and fluid. This body of evidence includes but is not limited to the case of DeepSeek and the progress by Huawei and SMIC in producing advanced semiconductors, including GPUs. One US official told me recently that the goal of the controls is just to “throw some sand in the gears.”
Thus far, my observation is that most discussions around the significance of DeepSeek miss two major issues:
1) The success of a company like DeepSeek in developing very advanced models should reinforce the argument that Alvin and I have made that collaboration between the US and China on the real risks of advanced AI models leveraged by malicious actors is critical. Indeed, it is now even more vital, given the movement towards open source/weight models and the closing gap between the capabilities of proprietary and open models.
2) While experienced analysts of AI and AI safety such as Brundage have called for more nuance in export controls to ensure that the benefits of AI can accrue to societies, including China’s, few commentators have attempted to assess the costs and benefits of getting to a more nuanced implementation of export controls, or whether it is even possible. For example, when asked about the end game of the sweeping controls on semiconductors and tools that began in October 2022, based on fears of future AI gains in China, US officials appear to have no concrete answers. Reflecting what was likely significant disagreement within the Biden administration on the utility of the controls, outgoing Secretary of Commerce Gina Raimondo noted in late December 2024 (somewhat surprisingly, given the role of her Department) that, “Trying to hold China back is a fool’s errand.” Proponents of existing or future export controls may want to ponder this comment. At a minimum, while we do not have the full picture yet and there are many variables at play, in the short term the DeepSeek story would seem to reinforce Raimondo’s point.
In the next look at this rapidly evolving issue, I will take a closer look at DeepSeek’s partnerships and deployments of its models, and at how to think about DeepSeek and Alibaba’s reasoning models in comparison to leading western models in the age of agentic AI—which will doubtless be a huge part of the AI development scene in 2025.
Meta has reportedly established several specialized groups of researchers whose task is to understand how DeepSeek designs its models and to leverage this analysis to improve the Llama set of models. Meta has been slow to release its Llama 4 model, but in the wake of the furor over DeepSeek, officials are hinting that the model is coming within the first quarter. For more details, see this excellent recent piece: Meta Scrambles After Chinese AI Equals Its Own, Upending Silicon Valley.
This is another complex issue, tied up with developers’ willingness to switch development environments and performance issues up and down the stack. Last summer, Kendra Schaeffer and I wrote on this topic: Indeed, discussions with Chinese AI industry observers over recent months reveal a widespread belief that Huawei is the only company with a chance to develop as an alternative to Nvidia for China. Some believe this process will take time but that Huawei will eventually develop a hardware-software stack that creates the kind of synergies that Nvidia’s close coupling of GPUs and software development environments provides to programmers [via CUDA, etc.]. However, the current industry consensus is that Huawei’s products are not remotely competitive with Nvidia or AMD in terms of hardware-software synergy. Indeed, despite the fact that comparisons of raw technical performance parameters between Nvidia A100 chips and Huawei Ascend 910B chips seem to indicate the chips are relatively comparable, the numbers do not necessarily reflect performance in real-world applications.
According to TechInsights, the Ascend 910B was manufactured on the TSMC N7HPC node process. TSMC’s N7HPC node is a specialized 7nm FinFET process geared toward high-performance computing silicon, offering enhanced speed, robust metal stacks, and wider operating voltages compared to standard N7. By catering to HPC workloads’ higher clock-speed and reliability demands, N7HPC is the go-to choice for cutting-edge CPU, GPU, and AI accelerator designs requiring maximum throughput.
Meituan (美团) is one of China’s largest and most diverse e-commerce platforms for local services, offering everything from food delivery and ride-hailing to hotel booking and travel. Given its vast operational scale—millions of daily orders, hundreds of thousands of merchants, and complex real-time logistics—Meituan has invested heavily in advanced AI and algorithmic development to optimize nearly every facet of its ecosystem.
While Meituan does not publish a detailed list of GPU models used in production, it is safe to say the Search & Recommend Platform Team—and Meituan’s broader AI organization—primarily leverages NVIDIA data-center GPUs such as the V100, A100, or, where export rules apply, the A800/H800 variants. This aligns with standard industry practice among Chinese tech giants handling large-scale AI workloads. Meituan may also experiment with AMD Instinct solutions, but public information strongly suggests NVIDIA’s GPU ecosystem remains central to Meituan’s AI training infrastructure.
Baseten provides a platform for deploying machine learning models (including large language models, computer vision models, etc.) in a production-ready environment. Most of Baseten’s hosted offering runs on major public cloud providers. While Baseten’s documentation does not always explicitly state which cloud(s) power their service, it is very common for startups to use AWS, GCP, or Azure behind the scenes. Overall, Baseten’s approach resembles other “serverless” or “managed MLOps” solutions, using big cloud providers under the hood and providing a unified platform for model hosting, application building, and real-time AI serving.