Tracking the GPUs: DeepSeek's training and inference journey
The firm's access to and use of advanced AI hardware is a microcosm of US-China tech competition on AI and semiconductors, and has set China's domestic AI ecosystem 'on fire'
This is the latest in a series of deep dives on the DeepSeek phenomenon. In this episode, I will focus on the training and inference hardware that DeepSeek has used, is using, and will use in the future, and what it tells us about US-China technology competition, particularly across the AI stack.
As last week ended, media reports—almost certainly based on deliberate leaks from the Commerce Department or other US government sources—suggested that the US government was investigating transfers of GPUs via Singapore that could have ended up in China and been used by DeepSeek or other Chinese startups. As I noted in an earlier post, there had also been some speculation on X that the US State Department was cooperating with TSMC and Taiwan authorities to track shipments of Nvidia GPU dies to China. One key thing to remember here: Singapore is more a billing hub for multinationals than necessarily a shipping hub for GPUs, so the Singapore-related GPU sales figures have to be read in that light.
These rumors align with the unverified assertions from some quarters that DeepSeek actually has had access to tens of thousands of GPUs. These assertions continue to make the rounds, with one technology-focused organization calculating that the actual cost of the hardware DeepSeek used for training was in excess of $1 billion. The ever-credible Glenn Luk has a very nice takedown of this analysis here, which basically explains why a hedge fund like High-Flyer Capital Management, with somewhere near $8 billion in AUM, did not have a capex budget anywhere near $1 billion to buy all the alleged “Hoppers.” That and CEO Liang Wenfeng’s complaints about the firm’s access to advanced GPUs suggest that those pushing the line that there are 50,000 H100s sitting around in a secret location near Hangzhou have other motivations. Nvidia has also stated that the GPUs DeepSeek used for training were compliant with US export controls. No one I have spoken with who understands the industry well believes the claims. They appear to originate from contractors in China whose access to concrete information is unverified and whose analysis appears to be based primarily, if not entirely, on speculation. As one industry veteran put it to me:
I am wondering: if DeepSeek indeed had 50,000 H100s, why would they spend so much time on optimizing hardware engineering? This does not make sense at all. The core innovation is based on the fact that they relied on H800s to train their models, and the key effort was addressing the GPU-to-GPU interconnectivity bandwidth issue, to overcome the shortcomings of the H800.
The 50,000 H100s claim also plays nicely into the claim that the timing of the release of DeepSeek’s R1 model was deliberate. A long-time China political analyst made this claim on a recent panel that I was on, so it is clearly still making the rounds, and some appear to find it credible. Tech investor Kevin Xu had a nice takedown of this and some other conspiracy theories here:
So if you want to really understand why DeepSeek does what it does and open source everything, start there. It’s not a political statement, not to troll Stargate or Trump inauguration, or to help their quant fund’s shorts on NVDA (though if that were the case, it’d be quite brilliant and savage).
In a previous note, I focused on the options DeepSeek had for accessing GPUs for training. Since that note, there has been much discussion about the complexity of the GPU supply chain, particularly the issue of Nvidia’s knowledge of the ultimate end users of its advanced GPUs. This is clearly a complex issue, because the GPUs that Nvidia contracts global foundry leader TSMC to build at its Taiwan-based fabs are not simply shipped directly to end users. The bulk are shipped to other locations in Asia, including China, to be packaged and eventually integrated into systems such as AI servers and PCs, via a host of intermediaries including Super Micro, Foxconn, Dell, HPE, Lenovo, and many other smaller players. In addition, many smaller cloud service providers operating all around the world, including in China, leverage access to hyperscalers and offer their own interfaces to a variety of hardware.
Whether DeepSeek was able to purchase directly, or gain access indirectly to, restricted or more advanced GPUs to train its models is thus a complex issue, and one which the US government is actively attempting to run to ground. The question of which hardware was used to train DeepSeek's models will remain important to fully understand. The best guess is that engineers had access to a somewhat larger cluster for training and validation—likely no more than 10,000 H800s—plus the optimized cluster of 2,048 H800s mentioned in the V3 paper, essentially functioning as H100s due to GPU-to-GPU transfer optimizations. In the meantime, focus has shifted to the major spread of the DeepSeek R1 model all over the globe, including on US cloud services providers, and perhaps most importantly in China. Let’s turn to the implications of this trend for inference.
So far, most major US cloud providers, including Amazon and Microsoft Azure, have announced they are hosting the R1 model, as they typically do with advanced open-source/open-weight models. This means that developers do not have to access DeepSeek’s API in China if there is concern about query data being stored on servers in China. Companies concerned about proprietary data are generally concerned about these types of issues in any case, and this is not unique to APIs located in China. Users of the DeepSeek smartphone app on iOS or Android, however, are accessing inference servers in China. Just how is DeepSeek expanding its capability to support inference for its advanced models? This is a complicated issue that also dovetails with China’s domestic semiconductor industry capability, as mentioned in my initial post on DeepSeek.
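To make this concrete, here is a minimal sketch of how a developer might query a hosted R1 endpoint outside China using the standard OpenAI-compatible Python client. The base URL and model identifier below are placeholders, since each hosting provider exposes its own endpoint and model name; the point is simply that nothing in the client code needs to touch servers in China.

```python
from openai import OpenAI

# Placeholder endpoint and model name: substitute whichever provider is
# hosting the open R1 weights (for example, a US cloud service) and the
# model identifier it advertises.
client = OpenAI(
    base_url="https://your-hosting-provider.example/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-r1",  # provider-specific identifier; a placeholder here
    messages=[{"role": "user", "content": "Summarize the GPU export control debate in two sentences."}],
)
print(response.choices[0].message.content)
```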
First, of course, DeepSeek’s V3 and R1 models are already running on Nvidia hardware, likely a variety of A100, A800, H100, H800, H20, and even B200 GPUs, depending on where the models are hosted and which company is hosting them. There is no provision in the existing export control regime around advanced GPUs to prevent any company from running inference using DeepSeek on any specific hardware. The current situation in China with respect to advanced GPUs is complex, as I have written over the past year. All of the above GPUs are available in varying quantities, with some acquired by Chinese firms before export controls were revised. Their availability depends on multiple factors, including how contracts were fulfilled, which systems integrators were involved, the success of third-party efforts to traffic restricted GPUs, and the identity of the end users in cases where systems were transferred in potential violation of US export control laws. Tracking down every last Nvidia, Intel, or AMD GPU that has ended up in China over the past two to three years is, to paraphrase former Secretary of Commerce Raimondo, a fool’s errand.
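Because the weights are open, running them locally is also straightforward on whatever Nvidia hardware is at hand. The sketch below assumes the Hugging Face transformers and accelerate libraries and loads one of the smaller distilled R1 checkpoints; the same code runs unchanged whether the underlying device is an A100, H100, H800, or H20.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: a distilled R1 checkpoint small enough for a single GPU.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" (via accelerate) places the weights on whichever CUDA
# device happens to be present; the model code is hardware-agnostic.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("What is 17 * 24?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```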
Even more interesting as we look forward is the sudden interest in DeepSeek’s models running on domestic Chinese hardware. This is going to be the most important factor in determining whether Chinese firms developing advanced AI models will be able to keep pace—and at what handicap—with western firms with full access to the latest hardware from Nvidia, AMD, and Intel, as well as from inference players such as Groq, Cerebras, and a number of other smaller but very capable developers of specialized AI hardware.
As I have noted, in China there are two sources for advanced GPUs at present, both involving Huawei:
Huawei’s partnership with domestic foundry leader SMIC to produce Ascend 910B server accelerators with capable GPU support. These are typically packaged as part of Huawei’s Atlas server offerings, with multiple Ascends included in various configurations. Huawei and SMIC’s ability to produce Ascend 910-series semiconductors using SMIC’s 7nm processes has of course been hampered by US controls on semiconductor manufacturing tools. For an in-depth look at how Huawei, SMIC, and the rest of the Chinese semiconductor industry are responding to US controls, see here.
Huawei’s use of third-party companies to manufacture advanced Ascend designs from HiSilicon at TSMC. The full extent of this effort remains unclear, but it has likely enabled Huawei to obtain at least 3 million Ascend 910B dies manufactured using TSMC’s 7nm HPC process, which is optimized for advanced compute applications such as GPUs. The full extent of the availability of Ascend 910Bs and 910Cs, the latter being a pairing of two 910B dies with high-bandwidth memory (HBM), remains unclear, but it is likely to be in the low millions, depending on die yields, packaging yields, and a host of other factors.
There are many claims being made about Huawei’s Ascend GPUs and efforts to optimize them for inference on DeepSeek’s models. It appears clear that a team from Huawei is working with DeepSeek on this, according to multiple sources and individuals I have spoken with in China. DeepSeek has now claimed that the performance of Huawei’s Ascend 910C chip reaches 60% of Nvidia’s H100 for running DeepSeek models. What we are seeing is the emergence of Advanced AI Team China.
Here is what Huawei claims. The firm is working with SiliconFlow on the hosting effort for DeepSeek models:
Per a report, the Chinese tech giant has partnered with SiliconFlow to make new DeepSeek models available to consumers via Ascend Cloud service. The Ascend solution will evaluate resources including various hardware techs, self-developed clusters, AI modules, and accelerator cards of Huawei Cloud unit.
SiliconFlow is a Chinese AI startup based in Beijing. It focuses on developing high-performance and cost-effective solutions for large-scale AI model inference.
However, there are also some reports that DeepSeek could be gearing up to train its next AI model using 32,000 Huawei Ascend 910Cs. While Huawei might struggle to meet demand, it could now prioritize DeepSeek as part of the Advanced AI Team China approach—though there are certainly other capable AI teams at Chinese companies including Alibaba, Tencent, Baidu, Bytedance, Moonshot, Minimax, and others.
It is the software development environment that will be key for China’s AI sector
With all the focus on GPUs, interconnectivity, etc., the real issue that DeepSeek raises is how the firm’s emergence will contribute to the development of a fully Chinese AI stack—and who will build it. The answer is looking increasingly like Huawei, with a little help from its friends like DeepSeek, but with all hands on deck in the China AI sector. Already, there is discussion of the ability of DeepSeek engineers to do the low-level programming required to optimize both training and inference runs on Huawei hardware. This is a complex issue: more than a decade of development in the US around hardware optimization, first for gaming and now for generative AI, has meant that developers have focused on Nvidia’s hardware and software stack, centered on key pieces such as CUDA. Any effort to build a Chinese AI stack will be tough, given this. But US export controls have clearly provided just the incentive needed for Chinese firms and developers to come together and develop a new approach that does not rely on western technology. There is no putting this additional genie back in the bottle. Young Chinese STEM graduates and engineers without extensive specialized training were able to master much of the low-level hardware programming that will be needed here, with DeepSeek showing the way. This trend will continue.
It will not be easy, but given the pressure to work collaboratively, almost certainly with government support and encouragement, the combination of Huawei, DeepSeek, and other startups and players such as SiliconFlow makes the prospects for a workable and capable Chinese AI stack look significantly brighter than when I wrote about this last summer.
Here are some of the key elements of this:
Development environments. Huawei’s MindSpore has already made significant inroads in China in competition with TensorFlow and PyTorch. It provides a high-level API (Python-based) for building and training models, similar in spirit to PyTorch or TensorFlow. In practice, most AI developers using Nvidia hardware leverage frameworks like TensorFlow and PyTorch that hide CUDA behind the scenes. MindSpore aims to offer a similar ease of use for Huawei’s Ascend 9xx server NPUs/GPUs, with Python APIs and automatic differentiation. CUDA is a lower-level parallel computing platform (primarily in C/C++) for programming GPUs, and cuDNN is a GPU-optimized library of neural network primitives used by frameworks to achieve high performance. This means MindSpore acts as a complete framework for model development, whereas CUDA/cuDNN are part of the under-the-hood infrastructure in Nvidia’s ecosystem.
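For readers who have not seen MindSpore, here is a minimal training-step sketch in the MindSpore 2.x functional style; the tiny network and hyperparameters are arbitrary, the exact APIs vary somewhat across MindSpore versions, and the constructs shown (nn.Cell, nn.Dense, ms.value_and_grad) are the rough equivalents of PyTorch’s nn.Module, nn.Linear, and autograd.

```python
import mindspore as ms
import mindspore.nn as nn

# Target Huawei Ascend hardware; see the backend note below for alternatives.
ms.set_context(device_target="Ascend")

class TinyNet(nn.Cell):
    """A small MLP defined with MindSpore's Cell API (analogous to nn.Module)."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Dense(784, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Dense(128, 10)

    def construct(self, x):          # construct() plays the role of forward()
        return self.fc2(self.relu(self.fc1(x)))

net = TinyNet()
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.Adam(net.trainable_params(), learning_rate=1e-3)

def forward_fn(data, label):
    return loss_fn(net(data), label)

# Automatic differentiation: returns loss and gradients w.r.t. the parameters.
grad_fn = ms.value_and_grad(forward_fn, None, optimizer.parameters)

def train_step(data, label):
    loss, grads = grad_fn(data, label)
    optimizer(grads)                  # apply the update
    return loss
```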
API-level compatibility. MindSpore can also target Nvidia GPUs or CPUs by switching backend – for example, MindSpore has a GPU backend supporting CUDA 10/11 and cuDNN 8. However, its primary optimization focus is Huawei’s Ascend NPUs. By contrast, CUDA is exclusive to Nvidia GPUs and is the foundation for all major frameworks on that hardware.
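In principle, switching backends is a one-line context change, with the rest of the model code left untouched; a sketch (exact flags vary across MindSpore versions):

```python
import mindspore as ms

# The same MindSpore model and training code can target different hardware
# by changing the execution context.
ms.set_context(device_target="Ascend")   # Huawei Ascend NPUs, MindSpore's primary target
# ms.set_context(device_target="GPU")    # Nvidia GPUs via the CUDA/cuDNN backend
# ms.set_context(device_target="CPU")    # CPU fallback for debugging
```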
While Huawei’s approach is to provide an end-to-end platform (framework + silicon), Nvidia’s approach relies on a broad ecosystem. Developers can choose TensorFlow, PyTorch, JAX, etc., all of which use CUDA/cuDNN under the hood. This difference means MindSpore’s out-of-the-box performance on Ascend is directly managed by Huawei’s stack, whereas on Nvidia, framework developers and Nvidia’s library engineers collaboratively ensure performance. Huawei claims MindSpore is highly optimized for Ascend GPUs, exploiting the hardware innovations in these AI processors.
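The Nvidia-side counterpart looks like this: a few lines of ordinary PyTorch in which CUDA, cuBLAS, and cuDNN never appear explicitly, because the framework dispatches to them under the hood whenever an Nvidia GPU is available.

```python
import torch
import torch.nn as nn

# Framework-level code only: no explicit CUDA. PyTorch routes these ops to
# cuBLAS/cuDNN kernels when a CUDA-capable GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
x = torch.randn(32, 784, device=device)
logits = model(x)
print(logits.shape)  # torch.Size([32, 10])
```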
Overall, Nvidia’s GPUs provide more seamless multi-GPU training integration (thanks to NVLink/NVSwitch and software like NCCL), which can be critical for cutting-edge research that demands tight coupling of software and hardware and GPU-to-GPU communication. Huawei’s Ascend chips are designed to scale out using standard networking and offer built-in capabilities to do so, which is more of a commodity approach: flexible but potentially less performant at small scale synchronization. In scenarios where massive parallelism is needed and Nvidia GPUs are restricted—such as in some Chinese data centers—Huawei’s solution can fill the gap, albeit with slightly different cluster design considerations.
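As an illustration of how tightly the Nvidia stack couples multi-GPU training to its interconnect, here is a bare-bones PyTorch DistributedDataParallel sketch using the NCCL backend, launched with torchrun; NCCL rides on NVLink/NVSwitch within a node and on the cluster fabric across nodes. The model and data are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")   # NCCL handles GPU-to-GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(64, 1024, device=f"cuda:{local_rank}")
    loss = model(x).sum()
    loss.backward()        # gradients are all-reduced across GPUs via NCCL
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

# Launch (single node, 8 GPUs): torchrun --nproc_per_node=8 ddp_sketch.py
```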
The development of a domestic AI hardware and software stack will be an extended and complex process. Clearly, DeepSeek has played a major role in galvanizing the Chinese AI sector to get behind a broader and longer-term effort to innovate in and around the constraints the sector finds itself under. This week another Chinese GPU leader, Moore Threads, pledged support for DeepSeek and a domestic AI stack solution on its KUAE intelligent computing cluster, including through distributed deployment of V3 and R1. Chinese firms are likely to use innovative solutions such as distributed training and inference to overcome some of the limitations imposed by restricted access to the most advanced western hardware. Moore Threads General Manager Zhang Jianzhong, formerly Nvidia’s China GM, made some glowing comments about DeepSeek’s importance:
“[Moore Threads] will pay tribute to DeepSeek by using locally made GPUs to set China’s AI ecosystem on fire….” Moore Threads stressed that DeepSeek’s V3 and R1 models have provided “inspiration” for developers.
Other players in the Chinese AI and advanced compute ecosystem are piling on. Hygon Information Technology this week pledged to use its computing clusters to support DeepSeek’s V3 and R1 models. Other inference-focused semiconductor design firms like Cambricon are also bullish on DeepSeek.
I will be examining these developments in more detail in subsequent posts, along with the evolving US government response to DeepSeek in the early days of the Trump administration, which could see measures such as new controls that would capture Nvidia’s H20 GPUs.