- The Nvidia H800 was launched in March 2023 and is a cut-down version of the H100
- It is also much slower than Nvidia's H200 and AMD's Instinct range
- These artificial constraints have forced DeepSeek's engineers to innovate
It has largely been assumed that the United States would remain the undisputed global AI superpower, particularly after President Donald Trump's recent announcement of Project Stargate, a $500 billion initiative to strengthen AI infrastructure across the United States. This week, however, saw a seismic shift with the arrival of China's DeepSeek. Developed at a fraction of the cost of its American rivals, DeepSeek seemingly came out of nowhere and made such an impact that it wiped around $1 trillion from the market value of US tech companies, with Nvidia the biggest casualty.
Naturally, anything developed in China tends to be shrouded in secrecy, but a technical paper published a few days before the chat model stunned AI observers gives us an insight into the technology powering the Chinese ChatGPT equivalent.
In 2022, the United States blocked the export of advanced Nvidia GPUs to China to tighten control over critical AI technology, and has since imposed further restrictions, but that obviously hasn't stopped DeepSeek. According to the paper, the company trained its V3 model on a cluster of 2,048 Nvidia H800 GPUs – crippled versions of the H100.
Cheap training
The H800 was launched in March 2023 to comply with US export restrictions to China, and features 80GB of HBM3 memory with 2TB/s of bandwidth.
It lags behind the newer H200, which offers 141GB of HBM3e memory and 4.8TB/s of bandwidth, and AMD's Instinct MI325X, which tops both with 256GB of HBM3e memory and 6TB/s of bandwidth.
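To put those memory-bandwidth figures side by side, here is a minimal sketch (the numbers are the ones quoted above; the dictionary and ratio calculation are purely illustrative):

```python
# Quoted peak memory bandwidth per GPU, in TB/s
bandwidth = {"H800": 2.0, "H200": 4.8, "MI325X": 6.0}

# Express each part as a multiple of the H800's bandwidth
for gpu, bw in bandwidth.items():
    ratio = bw / bandwidth["H800"]
    print(f"{gpu}: {bw} TB/s ({ratio:.1f}x the H800)")
```

On raw memory bandwidth alone, the H200 delivers 2.4x and the MI325X 3x what DeepSeek's H800s could manage.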
Each node in the cluster DeepSeek trained on houses eight GPUs connected by NVLink and NVSwitch for intra-node communication, while InfiniBand interconnects handle communication between nodes. The H800 has lower NVLink bandwidth than the H100, which naturally affects multi-GPU communication performance.
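The practical consequence of that two-tier topology is that the fabric a message travels over depends on whether the two GPUs share a node. A hypothetical sketch (the function name and rank numbering are assumptions, not DeepSeek's code):

```python
GPUS_PER_NODE = 8  # each node houses 8 GPUs, per the paper

def comm_fabric(rank_a: int, rank_b: int) -> str:
    """Return which interconnect carries traffic between two GPU ranks.

    GPUs on the same node talk over NVLink/NVSwitch; GPUs on
    different nodes must cross the InfiniBand fabric.
    """
    same_node = rank_a // GPUS_PER_NODE == rank_b // GPUS_PER_NODE
    return "NVLink/NVSwitch" if same_node else "InfiniBand"

print(comm_fabric(0, 7))   # ranks 0-7 share node 0
print(comm_fabric(7, 8))   # rank 8 sits on node 1
```

This is why parallelism strategies try to keep the chattiest communication (e.g. tensor-parallel traffic) inside a node, reserving the slower inter-node links for less frequent exchanges.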
DeepSeek-V3 required a total of 2.79 million GPU hours for pre-training and fine-tuning on 14.8 trillion tokens, using a combination of pipeline and data parallelism, memory optimizations, and innovative quantization techniques.
The Next Platform, which did a deep dive into how DeepSeek works, says: "At a price of $2 per GPU hour – and we have no idea if that is actually the prevailing price in China – then it cost a mere $5.58 million to train V3."
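That headline figure is simple arithmetic on the numbers above, and easy to verify (the $2/hour rate is The Next Platform's assumption, not a confirmed Chinese rental price):

```python
gpu_hours = 2.79e6       # total pre-training + fine-tuning GPU hours, per the paper
rate_usd_per_hour = 2.0  # assumed rental rate; actual Chinese pricing is unknown

cost = gpu_hours * rate_usd_per_hour
print(f"Estimated training cost: ${cost / 1e6:.2f} million")  # $5.58 million
```

For comparison, that is orders of magnitude below the hundreds of millions of dollars widely reported for frontier models trained on much larger H100 clusters, which is precisely why the figure rattled the market.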