- A new technique called DualPipe appears to be the key to DeepSeek’s success
- One expert describes it as an on-GPU virtual DPU that maximizes bandwidth efficiency
- While DeepSeek has used Nvidia GPUs exclusively, one wonders how AMD’s Instinct would fare
China’s DeepSeek AI chatbot has stunned the tech industry, representing a credible alternative to OpenAI’s ChatGPT at a fraction of the cost.
A recent paper revealed DeepSeek V3 was trained on a cluster of 2,048 Nvidia H800 GPUs – crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). It reportedly required 2.79 million GPU-hours for pretraining and fine-tuning on 14.8 trillion tokens, and cost – according to calculations made by The Next Platform – a mere $5.58 million.
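As a quick sanity check, those two figures are consistent with a flat rate of $2 per GPU-hour – the tiny sketch below simply divides one reported number by the other (assuming the estimate covers GPU time only; whether that matches The Next Platform’s exact methodology is our assumption).

```python
# Back-of-the-envelope check using only the figures reported above;
# assumes the cost estimate is a flat per-GPU-hour rental rate.
gpu_hours = 2.79e6                                  # reported GPU-hours
cost_usd = 5.58e6                                   # reported cost in USD
print(f"Implied rate: ${cost_usd / gpu_hours:.2f} per GPU-hour")   # -> $2.00
```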
But exactly how DeepSeek’s developers managed this feat is likely down to a clever hack.
A virtual DPU on the GPU itself
First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, comprises a total of 671 billion parameters, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational costs while maintaining high performance and accuracy – which you’ll see if you try it.
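To make that concrete, here is a minimal sketch of top-k expert routing in Python – a generic illustration of how an MoE model activates only a few experts per token, not DeepSeek’s actual code; the expert count, dimensions, and gating function are simplified assumptions.

```python
import numpy as np

# Toy Mixture-of-Experts routing: score every expert for a token, but only
# run the top-k of them. The sizes and gating function here are illustrative
# assumptions, not DeepSeek-V3's real configuration.
def moe_forward(token, experts, gate_weights, k=2):
    scores = gate_weights @ token              # gating network scores each expert
    top_k = np.argsort(scores)[-k:]            # keep only the k best-scoring experts
    probs = np.exp(scores[top_k])
    probs /= probs.sum()                       # softmax over the selected experts only
    # Only these k experts do any work; every other expert stays idle for this token.
    return sum(p * experts[i](token) for p, i in zip(probs, top_k))

# Usage: 8 tiny "experts", only 2 of which run per token.
dim = 16
experts = [(lambda x, W=np.random.randn(dim, dim): W @ x) for _ in range(8)]
gate_weights = np.random.randn(8, dim)
output = moe_forward(np.random.randn(dim), experts, gate_weights)
```

In DeepSeek-V3’s case the ratio is far more dramatic: only around 5.5% of the model’s parameters (37 billion out of 671 billion) are active for any given token.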
It’s easy to be skeptical of DeepSeek and the claims made about its training, but the paper reveals some of the magic the developers came up with to get the most out of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.
According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reduces latency, and optimizes data movement across GPUs. By efficiently managing communication, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (Streaming Multiprocessors) between computation and communication, preventing data transfer bottlenecks as the model scales.
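The scheduling idea can be shown with a deliberately simple Python toy – a sketch of the principle of hiding communication behind computation, not DeepSeek’s implementation, which works at the level of CUDA kernels and SM allocation rather than Python threads.

```python
import concurrent.futures
import time

# Toy illustration of overlapping computation with communication.
# The 10 ms sleeps stand in for a compute kernel and a cross-GPU transfer;
# the real DualPipe schedule operates on GPU streams and SMs, not threads.
def compute(chunk):
    time.sleep(0.01)

def communicate(chunk):
    time.sleep(0.01)

chunks = list(range(8))

# Naive schedule: compute, then transfer, strictly one after the other.
start = time.time()
for c in chunks:
    compute(c)
    communicate(c)
naive = time.time() - start

# Overlapped schedule: while chunk i is still in flight, chunk i+1 is computing.
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    in_flight = None
    for c in chunks:
        compute(c)
        if in_flight is not None:
            in_flight.result()          # wait for the previous transfer to finish
        in_flight = pool.submit(communicate, c)
    in_flight.result()
overlapped = time.time() - start

print(f"naive: {naive:.3f}s  overlapped: {overlapped:.3f}s")
```

With compute and transfer taking comparable time, the overlapped schedule finishes in roughly half the time of the naive one – exactly the kind of idle time (the pipeline bubble) DualPipe is designed to squeeze out.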
A commenter on The Next Platform describes DualPipe as “essentially creating a virtual DPU on the GPU itself to handle all-to-all communication,” which highlights its role in optimizing data transfer efficiency.
The paper goes into further detail: “In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink.”
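In plain terms, “dispatching” means sending each token to whichever GPU hosts the expert it was routed to, and “combining” means gathering the expert outputs back into the original token order. A rough Python sketch of those two steps follows – purely illustrative, with the GPU count, expert placement, and routing invented for demonstration rather than taken from DeepSeek’s kernels.

```python
import numpy as np

# Toy version of the MoE "dispatch" and "combine" steps. Plain Python dicts
# stand in for the IB/NVLink transfers that the customized all-to-all kernels
# perform; the layout below is an assumption for illustration only.
NUM_GPUS = 4
EXPERTS_PER_GPU = 2     # 8 experts spread evenly across 4 GPUs

def dispatch(tokens, expert_ids):
    """Group tokens by the GPU hosting their selected expert (the 'dispatch' half)."""
    buckets = {gpu: [] for gpu in range(NUM_GPUS)}
    for idx, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid // EXPERTS_PER_GPU].append((idx, tok, eid))
    return buckets

def combine(processed, num_tokens):
    """Put expert outputs back into the original token order (the 'combine' half)."""
    out = [None] * num_tokens
    for idx, result in processed:
        out[idx] = result
    return out

# Usage: six tokens, each routed to one of the eight experts.
tokens = [np.random.randn(4) for _ in range(6)]
expert_ids = [0, 3, 5, 1, 7, 2]
buckets = dispatch(tokens, expert_ids)
processed = [(idx, tok * 2.0)                 # stand-in for the expert's actual computation
             for items in buckets.values()
             for idx, tok, eid in items]
outputs = combine(processed, len(tokens))
```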