AMD Achieves 75x Faster Performance for MI355X on DeepSeek V4 Pro with ROCm

Generative AI is widely regarded as the decade’s defining technology shift, often compared to the advent of the Internet. Companies are investing heavily in artificial intelligence, with record spending anticipated in 2026. While many have yet to see tangible profits from their AI efforts, others are turning the trend into significant growth. NVIDIA is the prime example, having substantially increased its revenue thanks to its GPUs, and AMD and Intel have also seen AI-driven growth. Recently, AMD’s AI engineers announced a remarkable result: using ROCm, they raised the performance of the MI355X accelerator on DeepSeek V4 Pro by a factor of 75 in just 14 days.

DeepSeek has been at the forefront of open-source generative text AI since it released its R1 model in 2025, which showed that it could rival American AI models. This suggests that U.S. restrictions on access to chips and chipmaking machinery did not ultimately halt China’s progress. There may have been a temporary pause, but China’s momentum now looks unstoppable. Indeed, the latest iteration, DeepSeek V4, has emerged as the world’s most advanced open-source AI model.

AMD Engineers Achieve 75x Speedup for MI355X on DeepSeek V4 Pro with ROCm in Just 14 Days

Upon DeepSeek V4’s unveiling, NVIDIA moved quickly, offering day-one support for the model on its Blackwell Ultra GPUs. With this hardware, NVIDIA achieved 3,500 tokens per second per GPU. AMD, determined not to be outdone, faced initial performance challenges. However, a dedicated team of AMD engineers has accomplished what seemed improbable: a 75-fold performance improvement in just two weeks.

The accompanying graph illustrates this progression. On April 25, 2026, the MI355X performed poorly with DeepSeek V4 Pro, managing only about 80 tokens per second per GPU. By May 2, 2026, the team had reached over 500 tokens per second per GPU, and by May 4 it had surpassed 600. Just four days later, on May 8, performance surged to approximately 1,500 tokens per second per GPU.
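The progression is easier to follow as ratios against the first data point. The short Python sketch below tabulates the figures quoted in this article; the values are approximate ("over 500" and "surpassed 600" are treated as 500 and 600), and the names `milestones` and `speedups` are illustrative only. Computed this way, the quoted points yield a gain of roughly 19x between April 25 and May 8, so the headline 75x figure presumably refers to a baseline or operating point that is not among the numbers quoted here.

```python
from datetime import date

# Approximate throughput figures quoted in the article
# (tokens per second per GPU). "Over 500" and "surpassed 600"
# are entered as 500 and 600, so those ratios are lower bounds.
milestones = [
    (date(2026, 4, 25), 80),
    (date(2026, 5, 2), 500),
    (date(2026, 5, 4), 600),
    (date(2026, 5, 8), 1500),
]


def speedups(points):
    """Return (day, days elapsed, throughput, ratio to first point) for each milestone."""
    start_day, baseline = points[0]
    return [
        (day, (day - start_day).days, tps, tps / baseline)
        for day, tps in points
    ]


for day, elapsed, tps, ratio in speedups(milestones):
    print(f"{day.isoformat()}  day {elapsed:>2}  {tps:>5} tok/s/GPU  {ratio:5.2f}x")
```

The largest single jump in the quoted data is the last one: from about 600 to about 1,500 tokens per second per GPU in four days.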

NVIDIA Blackwell Ultra Remains Over Twice as Fast as AMD Even with Improvements

According to SemiAnalysis, the gains are attributed to the combination of mHC operations and the fusion of the RoPE and Hadamard transformations, which reduce CPU overhead and improve HBM utilization. Despite this performance boost, achieved in record time, AMD’s solution still trails NVIDIA’s. Specifically, Blackwell Ultra GPUs deliver more than double the tokens per second. To match NVIDIA’s GB200 node, AMD would need to quintuple its current performance.
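The "more than double" claim can be checked against the per-GPU figures quoted earlier in the article. The following is a minimal sketch under that assumption: the constants are the article's approximate numbers, and `gap_factor` is a hypothetical helper name. The GB200 comparison is left out because the article gives no throughput figure for that node.

```python
# Per-GPU throughput figures quoted in the article (tokens per second).
BLACKWELL_ULTRA_TPS = 3500  # NVIDIA Blackwell Ultra, as reported
MI355X_TPS = 1500           # approximate MI355X figure for May 8, 2026


def gap_factor(target_tps, current_tps):
    """Return the multiplier current_tps needs in order to reach target_tps."""
    return target_tps / current_tps


factor = gap_factor(BLACKWELL_ULTRA_TPS, MI355X_TPS)
print(f"Blackwell Ultra delivers {factor:.2f}x the MI355X throughput per GPU")
```

On these numbers the ratio is about 2.3, consistent with the "more than double" description.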

Nevertheless, considering the advancements made in just two weeks, it is plausible that future optimizations and breakthroughs will further enhance the MI355X’s capabilities, potentially enabling it to compete at the GB200’s level for certain tasks.