The DeepSeek Effect: What DeepSeek means outside of China
Western tech leaders embrace DeepSeek despite government skepticism, bans, and conspiracy theories. This is not the "shunning" of Chinese AI open source models you were looking for...
It seems hard to believe that the release of DeepSeek’s R1 model was just over a month ago, given the whirlwind of events that have followed it. As I noted in my last article, the DeepSeek Effect is now sweeping through China, well beyond the DeepSeek “Sputnik” Moment.
But outside China, the DeepSeek Effect also continues, albeit with some important differences that need to be explored in more depth. This will be the subject of this post.
Before writing this, I ran several versions of the DeepSeek models available on Hugging Face and GitHub on my home RTX 4090 GPU-based PC; the 4090, of course, is banned for export to China. An observation: while smaller models with 3B parameters, for example, ran fine, larger models such as R1 and a 16B-parameter MoE model exceeded available memory capacity and would require more advanced GPUs with more VRAM, such as the H100.
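For readers who want to try this themselves, below is a minimal sketch of loading one of the small distilled checkpoints locally with the Hugging Face transformers library; the specific model ID, dtype, and generation settings are illustrative choices, not a prescription.

```python
# Minimal sketch: run a small distilled DeepSeek model locally with
# Hugging Face transformers. Assumes a CUDA GPU with enough VRAM for the
# 1.5B distilled checkpoint (the full R1 MoE is far too large for a 4090)
# and the accelerate package for automatic device placement.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small distilled variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # half precision where the GPU supports it
    device_map="auto",    # place weights on the available GPU(s)
)

prompt = "Explain, step by step, why the sky appears blue."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```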
Banning the app for government use will do nothing to slow the spread of the DeepSeek API
Despite some media reports suggesting China was embracing DeepSeek while Western countries were “shunning” it, uptake of the firm’s AI models has in fact been broad. They are now available across all major Western cloud service providers and open source repositories (GitHub, Hugging Face), and are being used by major Western application developers that leverage open source/weight models, such as Perplexity.
Currently the DeepSeek app has been removed from app stores in South Korea, Italy, and a handful of other countries. As of February 25, there has been no effort to ban or restrict the use of DeepSeek models hosted on thousands of servers and home PCs in the US or elsewhere. The Trump administration appears set to release an executive order related to DeepSeek soon, which could require removal of the app from the Google and Apple app stores; it remains unclear what other actions the administration plans to take on DeepSeek, given its open source/weight distribution and the lack of any clear authority to restrict the distribution of these types of models.
Current bans on the DeepSeek smartphone app include the following:
South Korea. Temporarily unavailable in Google and Apple app stores, but available on website.
Australia. Banned on federal government systems.
Taiwan. Ban applies to government agencies.
Italy. The Italian data protection authority has asked DeepSeek to limit how it processes user data. The outcome of this effort remains unclear. According to the authority, “contrary to what was found by the Authority, the companies (DeepSeek Beijing and Hangzhou) declared that they do not operate in Italy and that European legislation does not apply to them.”
US states including Virginia, Texas, and New York. Bans apply to use of the app by state employees.
US government agencies, including NASA and the Department of the Navy. Bans apply to the app and website for their personnel.
US Congress. A bipartisan congressional bill, the No DeepSeek on Government Devices Act, will soon be introduced to ban DeepSeek software from government devices. In addition, last week Sen. Tom Cotton (R-AR) sent a letter to the Office of Management and Budget calling on the Trump administration to ban DeepSeek on federal government devices. Seemingly without any basis in fact or technological reality about how chatbots or APIs work, or any understanding of what they are used for, Cotton warned that using Chinese AI tools on government platforms “will almost certainly provide the CCP with U.S. government data as well as advance information on our nation’s policies.”
These actions are separate from the Trump administration’s investigations, underway at the National Security Council and Commerce Department, into DeepSeek’s potential use of export-controlled technology. See my previous posts on the issue of which GPUs DeepSeek has most likely had access to over the past year.
One of the key unresolved issues, and the subject of much debate in Washington DC, is how to control model weights (the new AI Diffusion Rule proposes controlling only proprietary weights) and what to do about open source models/weights, given where DeepSeek is showing up on benchmark leaderboards. The open source model/weight issue is even more complex given the relatively free flow of information within the open source community, and even among practitioners at companies pursuing closed models and other peers in the field. DeepSeek CEO Liang Wenfeng, for example, was able to fly to the US and meet with OpenAI engineers to talk shop; in the current climate of US-China relations, with growing concern about both the hardware and software capabilities of firms across the Chinese AI stack, this type of technical cross-fertilization could come under pressure. It remains unclear what actions the Trump administration will take beyond possible bans on downloads of the DeepSeek app, but more extreme measures, such as the Hawley bill calling for decoupling of the US and Chinese AI sectors, could resurface.
“Providing cloud services is not our main goal. Our aim is still to achieve A.G.I.”
“Innovation starts with confidence — and we often see that more from young people.”
—DeepSeek CEO and founder Liang Wenfeng
Open source industry embrace is wide and deep
In the meantime, DeepSeek’s R1 model has seen huge interest from developers on sites such as Hugging Face and GitHub. On Hugging Face, R1 has over 10.1K likes and has rocketed up to become the most popular model among the 1.5 million users on the site, all in less than two months.
Other companies with applications that leverage open source/weight models have also embraced DeepSeek, including AI search major Perplexity. Perplexity has integrated the DeepSeek R1 model, allowing its Pro users to access its advanced reasoning capabilities within the Perplexity search platform, providing deeper research options with source-backed answers. Importantly, Perplexity hosts DeepSeek R1 on its own US and European servers, mitigating concerns about data privacy and potential censorship associated with the model’s Chinese origins. Some have criticized Perplexity’s decision, calling DeepSeek R1 a “state-backed model,” a demonstrably false characterization.
Perplexity CEO Aravind Srinivas has been an outspoken proponent of DeepSeek. Perplexity is now using a customized version of R1 for its Deep Research offering. Srinivas confirmed the use of R1 in a post on X, noting that “[we] can easily enable something like Deep Research at 10-100x lower pricing, using a custom version of R1.” Because R1 is open source, Perplexity has been able to build its own version on top of the original model; in late February Perplexity introduced its own open-source version of R1, called R1 1776, which “has been post-trained to provide uncensored, unbiased, and factual information.” Users (such as the author) who download DeepSeek models and run them locally do not get censored results, of course.
In addition, DeepSeek models are currently hosted on major Western cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, allowing developers to access and utilize DeepSeek AI technology on those services. Microsoft CEO Satya Nadella was quick to endorse the use of DeepSeek’s models in late January. Nadella highlighted the firm’s rapid decision to put DeepSeek’s latest AI models on its developer platforms, Azure AI Foundry and GitHub, adding that the R1 model had been through “automated red teaming, content safety integration, and security scanning.” Nadella stressed that customers will soon be able to run DeepSeek’s models locally on Microsoft’s AI PCs. Microsoft and Nadella appeared to be well aware of DeepSeek prior to the January 20 release of the R1 model that rocked the stock market, and moved quickly to make R1 available across Microsoft’s key platforms after it underwent internal red teaming and testing.
More mid-tier AI datacenter and open-source platform hosting companies are also adding support for DeepSeek models, including together.ai. Organizations and developers looking to build, train, or deploy AI models such as DeepSeek in the cloud can choose from a spectrum of offerings. The choice largely depends on the desired level of control (IaaS vs. PaaS vs. SaaS), the need for hosted open-source models, and specific hardware requirements (e.g., GPU/TPU availability). Each category serves different user needs, from raw compute power for advanced teams to fully managed “click-to-deploy” AI solutions for those who want to focus only on higher-level development.
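For illustration, here is a hedged sketch of what calling a hosted DeepSeek model through an OpenAI-compatible endpoint typically looks like; the base URL and model identifier below are assumptions and should be replaced with whatever your provider actually exposes.

```python
# Illustrative sketch: query a hosted DeepSeek model via an OpenAI-compatible
# API. The base_url and model ID are placeholders for a provider such as
# together.ai; check your provider's docs for the exact values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # example hosted endpoint (assumption)
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",          # provider-specific model ID (assumption)
    messages=[{"role": "user", "content": "Summarize the trade-offs of MoE models."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```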
DeepSeek keeps innovating and releasing more details about training
DeepSeek continues to release new details on its models and new innovations such as FlashMLA, an efficient Multi-head Latent Attention (MLA) decoding kernel for Hopper GPUs, optimized for variable-length sequences, which reaches up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on the Nvidia H800 SXM5 using CUDA 12.6. These types of optimizations come out of DeepSeek’s use of a cluster of export-controlled H800s, purchased before that GPU was controlled; this cluster carried the bulk of DeepSeek’s model training in the absence of access to much larger clusters of more advanced GPUs such as the H100.
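To make the MLA idea concrete, here is a toy PyTorch sketch of latent-attention decoding: only a small latent vector per token is cached and keys/values are reconstructed on the fly, which is what shrinks the memory traffic FlashMLA optimizes. The dimensions are made up, and this is an illustration of the concept, not DeepSeek’s hand-written Hopper kernel.

```python
import torch

# Toy multi-head latent attention (MLA) decode step: cache a compressed
# latent per token instead of full per-head keys and values.
d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64

W_dkv = torch.randn(d_model, d_latent) * 0.02          # down-projection (what gets cached)
W_uk = torch.randn(d_latent, n_heads * d_head) * 0.02  # up-projection to keys
W_uv = torch.randn(d_latent, n_heads * d_head) * 0.02  # up-projection to values
W_q = torch.randn(d_model, n_heads * d_head) * 0.02

def decode_step(x_t, latent_cache):
    """One decoding step; x_t is the current token's hidden state [1, d_model]."""
    latent_cache = torch.cat([latent_cache, x_t @ W_dkv], dim=0)  # [T, d_latent] cached
    k = (latent_cache @ W_uk).view(-1, n_heads, d_head)           # reconstruct keys [T, H, Dh]
    v = (latent_cache @ W_uv).view(-1, n_heads, d_head)           # reconstruct values
    q = (x_t @ W_q).view(n_heads, d_head)                         # query for the new token
    attn = torch.softmax(torch.einsum('hd,thd->ht', q, k) / d_head ** 0.5, dim=-1)
    out = torch.einsum('ht,thd->hd', attn, v).reshape(1, -1)      # [1, H * Dh]
    return out, latent_cache

latent_cache = torch.zeros(0, d_latent)
for _ in range(4):                           # decode four tokens
    out, latent_cache = decode_step(torch.randn(1, d_model), latent_cache)
print(out.shape, latent_cache.shape)         # [1, 1024], [4, 64]
```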
In fact, the FlashMLA release was just one part of an Open Source Week push by DeepSeek to release more details about its innovations via repos on GitHub. Repos for open source models are online code repositories, typically hosted on platforms like GitHub, where developers publicly share the code and details of their machine learning models, allowing anyone to access, use, modify, and contribute to them; in essence, a repo is a place where you can find and download pre-trained AI models that are freely available to use and improve upon.
DeepSeek’s announcement of the repo releases on GitHub: Starting this week, Feb 24, 2025, we’ll open-source 5 repos – one daily drop – not because we've made grand claims, but simply as developers sharing our small-but-sincere progress with full transparency.
These are humble building blocks of our online service: documented, deployed and battle-tested in production. No vaporware, just sincere code that moved our tiny yet ambitious dream forward.
Why? Because every line shared becomes collective momentum that accelerates the journey. Daily unlocks begin soon. No ivory towers - just pure garage-energy and community-driven innovation 🔧
Stay tuned – let’s geek out in the open together.
The second release of Open Source Week, when coupled with FlashMLA, may be the most interesting and impactful. The release of DeepEP is significant, as it provides new information on how DeepSeek actually trained its V3 and R1 models and what types of hardware and GPU-to-GPU communications optimizations it used (still no sign of the 50K “hoppers”). DeepEP is a communication library tailored for Mixture-of-Experts (MoE) and expert parallelism (EP). It provides high-throughput and low-latency all-to-all GPU kernels, also known as MoE dispatch and combine, and supports low-precision operations, including FP8. What does this mean? It provides details of how DeepSeek optimized critical elements of its cluster of Nvidia H800 GPUs to train its advanced models.
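As a rough intuition for what “dispatch and combine” means, here is a single-device PyTorch toy: tokens are routed to their top-k experts, each expert computes on its share, and the results are scattered back and weighted. On a real cluster this grouping and scattering becomes the all-to-all communication between GPUs that DeepEP accelerates; the shapes and expert count below are invented for illustration.

```python
import torch

# Toy MoE dispatch/combine on one device (illustration only, not DeepEP).
n_tokens, d_model, n_experts, top_k = 16, 32, 4, 2
x = torch.randn(n_tokens, d_model)
router = torch.randn(d_model, n_experts) * 0.02
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

# Router picks top-k experts per token.
probs = torch.softmax(x @ router, dim=-1)
weights, expert_ids = torch.topk(probs, top_k, dim=-1)       # [n_tokens, top_k]

out = torch.zeros_like(x)
for e in range(n_experts):
    # Dispatch: gather the tokens routed to expert e (across GPUs this is the
    # all-to-all "send" step).
    token_idx, slot_idx = (expert_ids == e).nonzero(as_tuple=True)
    if token_idx.numel() == 0:
        continue
    y = experts[e](x[token_idx])                              # expert computes on its shard
    # Combine: weight by the router score and scatter back to the owning tokens.
    out.index_add_(0, token_idx, y * weights[token_idx, slot_idx].unsqueeze(-1))

print(out.shape)  # torch.Size([16, 32])
```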
DeepSeek FlashMLA centers on memory management and latency-reducing optimizations, while DeepEP (“Expert Parallelism”) addresses advanced parallel strategies and workload orchestration. Together, they significantly improve GPU performance in AI and HPC tasks by automating many of the intricate, low-level optimizations that were traditionally handled by hand‑tuned CUDA kernels.1
Looking ahead: R2 looming on the horizon
As we were going to press, there were indications that DeepSeek may be considering moving up the release date of its next R2 reasoning model, originally likely slated for May. The factors behind this are complex, but an important one is that the other major labs continue to release more advanced models, such as Grok 3 and Anthropic’s release this week of Claude Sonnet 3.7. These models are doing very well on specific benchmarks such as coding, and DeepSeek will likely want R2 to show significant improvements on coding benchmarks relative to some of the other leading models.
The addition of DeepSeek to the leading-lab model release cycle comes with some complexity. As I have noted, the firm is not attempting to compete commercially with leading US and foreign players such as OpenAI, Anthropic, Google, Cohere, Mistral, and others. DeepSeek’s business model is clearly evolving, but CEO Liang Wenfeng remains focused on AGI, not on building the firm to the size of the other players or on supporting commercial deployments of its models. As I noted in my last post here, the spread of DeepSeek’s models within China is being driven by a combination of organized bottom-up uptake and signals from Beijing, after Liang’s meetings with Premier Li Qiang and President Xi, that the firm is a new type of national champion.
Within this context, DeepSeek will need to adjust the timetable for the release of its new model, as the DeepSeek Effect has raised expectations about the firm’s ability to continue to innovate and to lead within China’s own AI ecosystem. International acclaim and the embrace of DeepSeek by the open source community are additional benefits that neither Beijing nor DeepSeek will want to see diminished in the near term. But the competition here is about capability, not commercial success. In addition, the DeepSeek Effect has galvanized the tech sector in general and raised confidence among Chinese companies, investors, and consumers around domestic technology capabilities. This heightened morale is likely to spur consumer-led economic growth in ways that Beijing has been unable to achieve through top-down policy changes. Hence the release of R2 is critical, as it will demonstrate that the DeepSeek Effect is not a one-off, short-term phenomenon.
The release of R2 is likely to coincide to some degree with measures by the Trump administration and other governments seeking to restrict the use of DeepSeek’s models in some contexts. The release, though, will again be welcomed by the open source community, particularly in the wake of the Open Source Week releases, which have demonstrated that DeepSeek is a serious player in this space and not a flash in the pan. The key question mark, which I addressed initially here, is what advanced compute hardware DeepSeek is training R2 on and, perhaps more importantly, what hardware it will train R3 on. Further extending the optimizations DeepSeek released this week, along with others the firm’s engineers have in the pipeline, is likely to be sufficient to boost the capabilities of R2 to match or exceed the other leading models on some benchmarks. But these optimizations are not likely to be sufficient to allow DeepSeek to keep pace with other major labs sporting clusters of tens of thousands of GPUs, particularly as Blackwell-based megaclusters come online later in the year.
Here there are a number of paths forward for DeepSeek:
Build a bigger training cluster using whatever Nvidia GPUs the firm is able to obtain, or collaborate with another firm that already has a larger cluster of Nvidia hardware. DeepSeek clearly prefers to use its own hardware cluster and give its engineers unfettered access to it. It would need to change its approach to model development with either of these two choices.
Work with domestic players, including Huawei, cloud providers, systems integrators such as Merit Interactive (headquartered in Hangzhou), and start-ups Infinigence AI and SiliconFlow2, for access to Nvidia and/or Huawei GPUs, or some mix of the two, to build training capacity. There are rumors that Huawei will provide 32,000 Ascend 910Cs to DeepSeek for training, but it is not clear that DeepSeek will go down this road yet; the performance of the Ascend 910C, particularly on GPU-to-GPU communication and other interconnectivity, may not be sufficient, though Huawei’s significant expertise in building infrastructure could help here. The challenge is that building large clusters of advanced GPUs for training is very complex, and not many companies are capable of doing this at scale. Some of the collaboration between Huawei, SiliconFlow, and others is focused squarely on inference.
It seems likely that DeepSeek will be compelled by a variety of factors to change the way it develops advanced models, working with outside AI datacenter providers and other key players to continue to iterate on model training and inference capabilities while driving down costs. While the embrace of DeepSeek outside of China is likely to continue with R2 and beyond, the fervor of this embrace could change if the US government decides to weigh in heavily. Much will also depend on where US-China relations more broadly go in the coming months. For now, the DeepSeek Effect continues apace, and the spread of Chinese open source models around the world, particularly in countries with vibrant open source communities, is set to expand.
1. FlashMLA (Multi-head Latent Attention decoding kernel)
Shared Memory Optimization
Adaptive Tiling: Dynamically adjusts tile sizes (e.g., in matrix multiplication) to maximize shared-memory utilization and reduce global memory accesses.
Latency Hiding: Leverages asynchronous data transfers (via CUDA streams) to overlap memory staging with kernel computation, cutting stall time (see the sketch after this list).
Register Pressure Management
Compiler-Assisted Tuning: Identifies segments with high register usage and automatically balances register allocation to minimize spilling.
Inline Vectorization: Uses warp-level intrinsics and vectorized instructions to process multiple data elements in single operations where feasible.
Bandwidth Reduction
Mixed Precision: Adopts reduced-precision data types (FP16 or TF32) in certain operations, minimizing overall load/store overhead.
Data Compression: Integrates optional compression kernels for frequently accessed tensors, lowering memory bandwidth usage in global memory.
Cache Utilization
Coalesced Loads: Reorders data access to align with GPU memory layout, ensuring that warp accesses are coalesced and fewer L2 cache lines are touched.
Hierarchical Caching: Distributes data across shared memory, L1 cache, and L2 cache based on usage patterns, tailoring caching policies to each layer.
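As a concrete example of the latency-hiding item above, here is a generic PyTorch sketch that overlaps host-to-device copies with compute using a separate CUDA stream; this illustrates the general technique, not FlashMLA’s actual kernels, and the sizes are arbitrary.

```python
import torch

# Toy latency hiding: stage the next batch onto the GPU while the current
# batch is being processed. Requires a CUDA GPU.
assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()

host_batches = [torch.randn(4096, 4096, pin_memory=True) for _ in range(4)]
weight = torch.randn(4096, 4096, device="cuda")
device_batches = [None] * len(host_batches)

# Pre-stage the first batch on the copy stream.
with torch.cuda.stream(copy_stream):
    device_batches[0] = host_batches[0].to("cuda", non_blocking=True)

results = []
for i in range(len(host_batches)):
    torch.cuda.current_stream().wait_stream(copy_stream)      # batch i is ready
    if i + 1 < len(host_batches):
        with torch.cuda.stream(copy_stream):                  # start the next copy early
            device_batches[i + 1] = host_batches[i + 1].to("cuda", non_blocking=True)
    results.append(device_batches[i] @ weight)                # compute overlaps the copy

torch.cuda.synchronize()
print(len(results), results[0].shape)
```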
2. DeepEP (Expert Parallelism)
Expert Parallelism focuses on advanced strategies for optimally dividing and orchestrating GPU workloads, especially in the context of AI and HPC pipelines:
Multi-Operator Fusion
Kernel Chaining: Merges consecutive GPU operations (e.g., activation + batch norm) into a single kernel to reduce kernel launch overhead and intermediate writes.
Dataflow Graph Analysis: Uses an internal intermediate representation (IR) to detect possible fusions or inlining opportunities automatically.
Adaptive Workload Partitioning
Thread Block Sizing: Dynamically picks the number of threads per block to match GPU SM occupancy and resource availability, minimizing over- or under-utilization.
Load Balancing: Implements concurrency scheduling methods to ensure that tasks are evenly distributed among GPU resources, improving throughput on diverse workloads.
Tensor Core Utilization
Mixed-Precision MatMul: Maps matrix multiplication operations to specialized Tensor Cores (if available) to accelerate dense linear algebra in AI pipelines.
Partitioning for Throughput: Splits large matrices into sub-blocks sized optimally for Tensor Core performance, maximizing concurrency at the warp level.
Optimized Memory Access
Strided Access Handling: Analyzes patterns with irregular or strided memory accesses (e.g., convolution filters) to minimize warp divergence and cache miss rates.
Prefetch & Async Copy: Uses CUDA’s async copy features to pre-stage data into shared memory, overlapping data transfers with compute to hide latency.
Profiling-Driven Auto-Tuning
Microbenchmarking: Runs small kernels to discover ideal launch configurations (thread block size, shared memory usage) specific to each GPU generation (see the sketch after this list).
Dynamic Re-Configuration: At runtime, DeepEP can adjust or switch kernel implementations in response to changing problem sizes or hardware conditions.
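And as a concrete example of the microbenchmarking item above, here is a generic sketch of the benchmark-and-select pattern in PyTorch: time a few candidate configurations and keep the fastest. The workload and candidate sizes are invented, and this is not DeepEP’s actual tuner.

```python
import time
import torch

# Toy profiling-driven selection: benchmark candidate row-tile sizes for a
# blocked matmul and pick the fastest. Requires a CUDA GPU.
def bench_matmul(tile: int) -> float:
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for i in range(0, a.shape[0], tile):       # process the rows tile by tile
        _ = a[i:i + tile] @ b
    torch.cuda.synchronize()
    return time.perf_counter() - start

candidates = [128, 256, 512, 1024]
timings = {tile: bench_matmul(tile) for tile in candidates}
best = min(timings, key=timings.get)
print(f"selected tile size: {best}, timings: {timings}")
```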
These emerging Chinese AI datacenter operators are similar to so-called neocloud companies in the US, including CoreWeave, Lambda, and Vultr, which manage and rent out access to GPU-based computing clusters.