NVIDIA Blackwell Destroys MLPerf Training 6.0 Benchmarks
Every major breakthrough in generative AI begins with a grueling, resource-intensive training run. The architecture under the hood determines how fast your team can iterate, the maximum parameter scale you can target, and whether your multi-week cluster run finishes or crashes midway. We have been watching the hardware landscape closely, and NVIDIA's latest MLPerf Training 6.0 performance proves that Blackwell is currently unmatched when it comes to raw, production-grade scaling for next-gen models.
Summary
NVIDIA completely swept the latest peer-reviewed MLPerf Training 6.0 industry benchmarks released on June 16, 2026. The tech giant's Blackwell platform captured the fastest time to train across every single category in the testing suite. Crucially, NVIDIA was the only hardware vendor to submit results for all seven benchmarks in the lineup.
The latest MLPerf testing round introduced two highly anticipated Mixture-of-Experts (MoE) pretraining workloads. These additions, DeepSeek-V3 (671B parameters) and GPT-OSS (20B parameters), reflect the industry’s broad architectural shift toward MoE layouts. NVIDIA tackled these complex workloads using its unified rack-scale compute environments.
┌────────────────────────────────────────────────────────┐
│ MLPerf Training 6.0 Winners │
├───────────────────────────┬────────────────────────────┤
│ DeepSeek-V3 (671B MoE) │ CoreWeave (GB300 NVL72) │
│ │ Time: 2.02 minutes │
├───────────────────────────┼────────────────────────────┤
│ Llama 3.1 (405B Dense) │ MS Azure (GB200 NVL72) │
│ │ Time: 7.07 minutes │
└───────────────────────────┴────────────────────────────┘
The benchmark submissions utilized both the NVIDIA GB200 NVL72 and the newer, higher-density GB300 NVL72 systems. Within these liquid-cooled racks, fifth-generation NVLink Switches pool all 72 onboard GPUs into a single, high-bandwidth memory fabric. This hardware configuration allows the entire rack to process massive neural networks as if it were one giant GPU.
Scale was a major focal point for this round of evaluation. Partnering with major cloud providers, NVIDIA scaled a Blackwell cluster up to 8,192 interconnected GPUs running the DeepSeek-V3 workload. CoreWeave hit a blindingly fast 2.02-minute time to train on DeepSeek-V3 using GB300 NVL72 systems connected via Spectrum-X Ethernet networking. Meanwhile, Microsoft Azure powered through the dense Llama 3.1 405B model in just 7.07 minutes on a cluster of 8,192 GB200 GPUs.
Remarks
Let's cut through the marketing noise: this MLPerf sweep is an absolute win for the developer ecosystem, but it cements an aggressive monopoly. The fact that NVIDIA was the only platform to submit data across all seven benchmarks shows a stark lack of viable silicon alternatives willing to compete publicly in high-end frontier training. Competitors are playing catch-up while NVIDIA is already iterating on its own architecture.
We predict that over the next twelve months, the combination of NVFP4 precision training and ultra-fast NVLink fabrics will make dense models an endangered species for anything over 100 billion parameters. Engineering teams will almost exclusively favor sparse MoE setups because the hardware handles the complex routing overhead seamlessly.
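For readers wondering what "routing overhead" refers to, here is a minimal sketch of generic top-k expert routing in plain Python. It is an illustration only: the function names (`softmax`, `route_top_k`), the four-expert setup, and the choice of k=2 are our own assumptions, not the router used by DeepSeek-V3, GPT-OSS, or any MLPerf submission.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs whose weights are renormalized
    to sum to 1. Hypothetical helper for illustration only.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token scored against four hypothetical experts.
assignments = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
for expert, weight in assignments:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for a given token, which is why a sparse model can carry far more total parameters than it activates per step. The cost is that tokens must be shuffled between the GPUs that host different experts, and that traffic is where a fast interconnect matters.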
The most revealing metric in this benchmark report is the generation-over-generation leap. The Blackwell Ultra architecture found in the GB300 NVL72 system delivered up to a 1.6x training speedup compared directly to the base GB200 NVL72 system at an identical node scale.
GB200 NVL72 [Baseline Performance] ──► 1.0x
GB300 NVL72 [Blackwell Ultra Power] ─► 1.6x Faster
This massive efficiency gain stems from an elevated power ceiling and optimized thermal thresholds that allow the silicon to maintain peak clock speeds during heavy MoE routing phases. NVIDIA isn't just beating its rivals; it is outperforming its own cutting-edge product lines at an astonishing pace.
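To put the 1.6x figure in wall-clock terms, the short sketch below converts a speedup factor into the fraction of training time saved. It assumes the speedup is a ratio of time-to-train at the same scale, which is our reading of the result; the function name is our own.

```python
def time_reduction_from_speedup(speedup: float) -> float:
    """Fraction of wall-clock time saved for a given speedup factor.

    Assumes speedup = old_time / new_time, so new_time = old_time / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

reduction = time_reduction_from_speedup(1.6)
print(f"A 1.6x speedup cuts time to train by {reduction:.1%}")
```

Under that assumption, a 1.6x speedup works out to roughly 37.5 percent less time on the same job, since 1 - 1/1.6 = 0.375.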
| Workload Model | Architecture Type | NVIDIA Blackwell Time to Train | Nearest Alternative Time to Train |
| --- | --- | --- | --- |
| DeepSeek-V3 (671B) | Mixture-of-Experts (MoE) | 2.02 minutes (GB300 NVL72) | No Submission |
| GPT-OSS (20B) | Mixture-of-Experts (MoE) | 7.43 minutes (GB200 NVL72) | No Submission |
| Llama 3.1 (405B) | Dense Large Language Model | 7.07 minutes (GB200 NVL72) | No Submission |
| Llama 3.1 (8B) | Dense Large Language Model | 4.45 minutes (GB200 NVL72) | 58.63 minutes |
| FLUX.1 | Text-to-Image Generation | 17.10 minutes (GB200 NVL72) | 74.44 minutes |
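For the two workloads in the table above that list an alternative result, the gap can be expressed as a simple ratio of the listed times. The sketch below does that arithmetic using only the table's numbers; the variable and function names are our own. Keep in mind that the table does not state the system scale behind the alternative submissions, so these ratios compare time to train, not per-chip performance.

```python
# (NVIDIA minutes, nearest alternative minutes), copied from the table above.
results = {
    "Llama 3.1 (8B)": (4.45, 58.63),
    "FLUX.1": (17.10, 74.44),
}

def time_ratio(fast_minutes: float, slow_minutes: float) -> float:
    """How many times longer the slower run took than the faster one."""
    return slow_minutes / fast_minutes

ratios = {name: time_ratio(fast, slow) for name, (fast, slow) in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: alternative took {ratio:.1f}x as long")
```

That comes to roughly a 13x gap on Llama 3.1 8B and a bit over 4x on FLUX.1.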
NVIDIA continues to set an unreachable pace for high-end AI infrastructure. The MLPerf Training 6.0 results confirm that if you are building at the absolute frontier of AI, Blackwell is currently the only game in town for massive scale. While competitors struggle to put up comparable public numbers, Team Green is already optimizing its next hardware iterations. We will continue to track these core infrastructure shifts closely as cloud service providers deploy these massive clusters to public developer networks.