- A new approach called DualPipe appears to be the key to DeepSeek's success
- One expert describes it as a virtual DPU on the GPU that maximizes the effective use of bandwidth
- While DeepSeek used NVIDIA GPUs only, one wonders how AMD Instinct accelerators would fare
China's DeepSeek chatbot has stunned the technology industry, presenting a credible alternative to OpenAI's ChatGPT at a fraction of the cost.
A recent paper revealed that DeepSeek V3 was trained on a cluster of 2,048 NVIDIA H800 GPUs – hobbled H100s (we can only imagine how it would perform on AMD Instinct accelerators!). Pre-training and fine-tuning on 14.8 trillion tokens reportedly required 2.79 million GPU hours and cost – according to calculations by The Next Platform – only $5.58 million.
Exactly how DeepSeek's developers pulled off this feat likely comes down to a clever hack.
A virtual DPU on the GPU itself
First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, has 671 billion parameters in total, of which only 37 billion are activated for any given token prediction. This selective activation massively reduces compute costs while retaining high performance and accuracy – as you will see if you try it.
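To make "selective activation" concrete, here is a minimal sketch of top-k expert routing, the mechanism behind MoE models. The expert count, dimensions, and function names are illustrative only and are not DeepSeek's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts = 8   # toy value; DeepSeek-V3 uses far more routed experts
top_k = 2         # experts activated per token
hidden_dim = 16

def route_token(token, gate_weights):
    """Pick the top-k experts for one token; return their indices and mixing weights."""
    logits = token @ gate_weights               # one gating score per expert
    top = np.argsort(logits)[-top_k:]           # indices of the k highest scores
    exp = np.exp(logits[top] - logits[top].max())
    return top, exp / exp.sum()                 # softmax over the selected experts

token = rng.standard_normal(hidden_dim)
gate = rng.standard_normal((hidden_dim, num_experts))
experts, weights = route_token(token, gate)

# Only top_k of num_experts run for this token: that is why the per-token
# compute cost tracks the activated parameters, not the full parameter count.
```

Scaled up, this is why a 671-billion-parameter model can charge only ~37 billion parameters' worth of compute per token.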
It is easy to be skeptical about DeepSeek and the claims made about its training, but the paper reveals some of the tricks the developers came up with to make the most of the hobbled hardware they had to work with. These include the DualPipe algorithm for efficient pipeline parallelism.
According to information published by DeepSeek, DualPipe overlaps forward and backward computation, reducing latency and optimizing data movement across GPUs. By managing communication efficiently, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (streaming multiprocessors) between computation and communication, preventing data-transfer bottlenecks as the model scales.
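A back-of-the-envelope model shows why overlapping computation with communication matters. This is not DualPipe itself, just a toy timing sketch with hypothetical per-microbatch costs:

```python
# Hypothetical per-microbatch timings (milliseconds), for illustration only.
compute_ms = [4.0, 4.0, 4.0, 4.0]   # time each microbatch spends computing
comm_ms    = [3.0, 3.0, 3.0, 3.0]   # time each microbatch's results spend in transit

# Serial schedule: every transfer stalls the GPU, so costs simply add up.
serial_total = sum(compute_ms) + sum(comm_ms)

# Overlapped schedule: while microbatch i computes, microbatch i-1's transfer
# is in flight. When each transfer fits under the next compute step, only the
# final transfer is exposed on the critical path.
overlapped_total = sum(compute_ms) + comm_ms[-1]
```

Here the serial schedule takes 28 ms versus 19 ms overlapped; the hidden transfers are exactly the "pipeline bubbles" that DualPipe is reported to squeeze out.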
A commenter on The Next Platform describes DualPipe as "essentially creating a virtual DPU on the GPU itself to handle all-to-all communication", which highlights its role in optimizing data-transfer efficiency.
The paper goes into more detail: "In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster."
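The "dispatching and combining" the quote refers to is the all-to-all pattern of MoE training: each token is sent to the device hosting its chosen expert, processed, then gathered back. Here is a single-process sketch of that pattern; the bucketing stand-in for a real multi-node exchange, the dummy "expert" op, and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

num_gpus = 4
tokens = rng.standard_normal((8, 16))             # 8 tokens, hidden dim 16
assignments = rng.integers(0, num_gpus, size=8)   # destination "GPU" per token

# Dispatch: bucket each token by destination (an all-to-all exchange
# across nodes in the real setting).
buckets = {g: [] for g in range(num_gpus)}
for idx, gpu in enumerate(assignments):
    buckets[gpu].append(idx)

# Each "GPU" applies its expert; a simple scaling stands in for the expert FFN.
processed = {idx: tokens[idx] * 2.0
             for g in range(num_gpus) for idx in buckets[g]}

# Combine: gather results back into the original token order.
output = np.stack([processed[i] for i in range(len(tokens))])
```

The quoted passage says DeepSeek hand-tuned the kernels doing this exchange so that only a small, fixed number of SMs handle communication, leaving the rest free for computation.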