NVIDIA Nemotron 3 Ultra

NVIDIA Nemotron 3 Ultra

NVIDIA's largest open-weight model. Released June 4, 2026, announced at Computex 2026. 550B total parameters, 55B active per token via hybrid Mamba-Transformer mixture-of-experts architecture with 90% sparsity. Scores 48 on the Artificial Analysis Intelligence Index — highest among US open-weights models at launch, though behind the Chinese open-weights frontier (Kimi K2.6 at 54) and closed models (Opus 4.8 at 61).

Architecture

Hybrid Mamba-Transformer MoE. 550B total / 55B active parameters. The Mamba layers reduce memory overhead at long contexts; Transformer layers handle attention-heavy tasks. The 90% sparsity ratio keeps inference cost comparable to a 55B dense model.

Benchmarks

Benchmark Score
Artificial Analysis Intelligence Index 48
PinchBench (agent productivity) 91%
ProfBench Search 56%
Ruler at 1M tokens (long-context) 95%

Leads all US open-weights models on the Intelligence Index at launch. Approximately 30% lower inference cost than leading alternatives on comparable coding tasks per NVIDIA internal comparison.

Context Window

1M tokens.

Weight Formats

BF16 (evaluated version) and NVFP4 (planned for improved inference throughput).

Speed

Over 300 tokens/second on pre-release DeepInfra endpoints. NVIDIA positions Ultra in a "cost-efficiency frontier" combining intelligence score with output speed.

Training

Open weights with published training recipes. Partial training data released alongside weights. Full training data composition not disclosed for Ultra (Nemotron 3 Nano's pre-training corpus was 2.5T tokens as a reference point).

Availability

  • Hugging Face (open weights)
  • ModelScope
  • OpenRouter (multi-model API)
  • NVIDIA NIM microservice (build.nvidia.com, pay-per-token, no infrastructure required)

Pricing

Pay-per-token via NVIDIA NIM. Exact rates not yet published; approximately 30% less per inference than leading alternatives per NVIDIA benchmarking.

Strengths

  • Highest US open-weights intelligence at launch: 48 on Artificial Analysis index, well ahead of Gemma 4 31B (39) and Nemotron 3 Super (36).
  • Fast inference: 300+ tokens/sec on early endpoints — atypical for a 550B-class model.
  • Open recipes: Training code and partial data released, useful for fine-tuning and research replication.
  • Long context: 1M token window with 95% Ruler score — strong retrieval at max context.
  • NIM deployment: Managed API with no GPU provisioning required.

Weaknesses

  • China-led open-weights models (e.g. Kimi K2.6 at 54) still score higher on the Intelligence Index.
  • NVFP4 format not available at launch — BF16 only initially, requiring more VRAM for self-hosted inference.
  • Pricing details not fully published at launch.
  • Context window confirmed at 1M tokens but per-token cost at that length not disclosed.

Use Cases

Best for: agentic coding pipelines, long-document analysis, enterprise deployments that require open-weights with commercial support via NIM. Good fit for ollama or custom inference stacks that need a high-capability model with published training details.

Not ideal for: tasks requiring the absolute top intelligence score (use a closed model or Kimi K2.6), or users needing a small fast model (see gemma-4).

Related

deepseek-r1 · gemma-4 · ollama · agentic-workflows · evals

Sources