[AMD] [MI300X] minimaxm3-fp8-mi300x-vllm: enable AITER kernels for MXFP8 on MI300X#1808

Open
JohnQinAMD wants to merge 1 commit into main from minimaxm3-mi300x-aiter-tuning

Conversation

@JohnQinAMD JohnQinAMD commented Jun 16, 2026

Collaborator

Enable AITER kernels for MiniMax-M3 MXFP8 on MI300X/gfx942 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they're left at their defaults. VLLM_ROCM_USE_AITER_MHA=0 keeps attention on TRITON_ATTN because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also sets AMD-recommended, numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (raises RCCL channels above the ~32–64 default for TP8), GPU_MAX_HW_QUEUES=2 (caps HIP streams below the default of 4). All changes are kernel-selection / runtime only.
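For readers who want to try the same settings outside the recipe, the environment described above could be sketched roughly as follows. The variable names and values come from this description; the overridable `${VAR:-default}` form for the two tuning knobs is an assumption, and the actual lines in minimaxm3_fp8_mi300x.sh may differ.

```shell
# Sketch only: mirrors the settings named in the PR description,
# not the literal contents of minimaxm3_fp8_mi300x.sh.

# Master AITER toggle; the per-component flags stay at their defaults.
export VLLM_ROCM_USE_AITER=1
# Keep attention on TRITON_ATTN (no calibrated q/prob scales in the checkpoint).
export VLLM_ROCM_USE_AITER_MHA=0

# Runtime knobs described as numerically inert.
export TORCH_BLAS_PREFER_HIPBLASLT=1
# Assumption: allow a caller to override the two tuning values.
export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
```

Using overridable defaults would let a sweep vary the channel or queue count without editing the script.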

Measured uplift (8×MI300X, 1k1k random sweep, total tok/s/gpu)

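| conc | before | after | Δ |
|------|--------|-------|--------|
| 256 | 782.7 | 856.1 | +9.4% |
| 128 | 598.9 | 637.0 | +6.4% |
| 64 | 365.1 | 392.0 | +7.4% |
| 32 | 295.6 | 327.4 | +10.8% |
| 16 | 203.1 | 216.5 | +6.6% |
| 8 | 127.6 | 136.6 | +7.1% |
| 4 | 80.1 | 84.6 | +5.6% |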

conc 1–2 unchanged (latency-bound). GSM8K exact-match holds at ~0.95 (kernel-selection change only).
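The Δ column is the relative change in total tok/s/gpu, (after − before) / before. A small helper to recompute it from the reported numbers might look like the sketch below; `uplift_pct` is a hypothetical name used only for illustration.

```shell
# Recompute the relative uplift from a before/after pair.
uplift_pct() {
  # $1 = before (tok/s/gpu), $2 = after (tok/s/gpu)
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.1f%%\n", (a - b) / b * 100 }'
}

uplift_pct 782.7 856.1   # conc 256
uplift_pct 295.6 327.4   # conc 32
uplift_pct 80.1 84.6     # conc 4
```

These three calls reproduce the +9.4%, +10.8%, and +5.6% entries reported for those concurrencies.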

Notes

  • Image is unchanged by design (vllm/vllm-openai-rocm:minimax-m3). AITER kernels are already compiled into the ROCm image; these env vars only select them at runtime, so nothing needs to be baked into the image.
  • Opened from an upstream branch (not a fork) so CI can run.

Supersedes #1804.


Note

Low Risk
Benchmark-only env var changes on AMD inference; attention stays on TRITON_ATTN and reported accuracy is unchanged.

Overview
Turns on AITER for the MiniMax-M3 MXFP8 MI300X vLLM recipe by exporting VLLM_ROCM_USE_AITER=1 in minimaxm3_fp8_mi300x.sh, so decode GEMMs and fused MoE use the ROCm image’s AITER paths instead of generic kernels. VLLM_ROCM_USE_AITER_MHA=0 keeps TRITON_ATTN because the MXFP8 checkpoint has no calibrated q/prob scales for ROCm FP8 attention.

Also sets MI300X runtime env defaults: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS (default 112), and GPU_MAX_HW_QUEUES (default 2). No container change—runtime kernel selection only.

Documents the measured +5.6% to +10.8% throughput uplift on 8×MI300X 1k1k sweeps (conc ≥4) and stable GSM8K (~0.95) in perf-changelog.yaml for minimaxm3-fp8-mi300x-vllm.

Reviewed by Cursor Bugbot for commit 883fbd0.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get PR approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@JohnQinAMD

Collaborator Author

@functionstackx This supersedes #1804. I switched to this PR, opened from an upstream branch, so that CI is triggered.

@functionstackx

Collaborator

@claude can u rebase this PR

@Klaud-Cold

Klaud-Cold commented Jun 16, 2026

Collaborator

Claude finished @functionstackx's task in 41s


Rebase Request

  • Understand request
  • Attempt to update PR branch with latest from main

There are merge conflicts between this branch and main, so I'm unable to automatically update the branch. This needs to be resolved manually.

To rebase locally and resolve conflicts:

git checkout minimaxm3-mi300x-aiter-tuning
git fetch origin main
git rebase origin/main
# Resolve any conflicts in the flagged files, then:
git rebase --continue
git push --force-with-lease origin minimaxm3-mi300x-aiter-tuning

Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master
toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR,
_RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so
they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and
is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8
checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended numerically-inert MI300X runtime knobs:
TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (RCCL channels, raised
above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (HIP streams, capped
below the default of 4). All changes are kernel-selection/runtime only;
GSM8K holds ~0.95.

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across
conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Gong Zheng <zgong@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the minimaxm3-mi300x-aiter-tuning branch from 9def276 to 883fbd0 Compare June 16, 2026 20:04
@github-actions

Contributor

@ZhengGong-amd


The benchmark sweep didn't cover all points because the job landed on a self-hosted runner named mi300x-tw_. The workflow builds the launch script path from the runner name (bash ./runners/launch_${RUNNER_NAME%%_}.sh), so it tried to run runners/launch_mi300x-tw.sh — which doesn't exist in the repo (only launch_mi300x-amds.sh is provided). That step fails with No such file or directory, so the sweep aborts before finishing.

My PR only touches the recipe / config / perf-changelog.yaml — it doesn't touch runners/, the workflows, or the runner pool. The fix belongs to whoever owns the CI runners: either add runners/launch_mi300x-tw.sh, or remove the mi300x-tw runner so MI300X jobs only run on mi300x-amds.
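To illustrate the failure mode, here is a minimal sketch of the path derivation as described above, plus an existence check that would surface the problem before the sweep starts. `launch_script_for` is a hypothetical helper, not something in the workflow, and the real workflow step may differ from this.

```shell
# Hypothetical helper: build the launch script path from a runner name,
# using the same %%_ suffix removal quoted from the workflow.
launch_script_for() {
  # ${1%%_} drops a trailing underscore from the runner name.
  printf './runners/launch_%s.sh\n' "${1%%_}"
}

script="$(launch_script_for "mi300x-tw_")"
echo "$script"

# Report clearly when the derived script is not in the checkout.
if [ ! -f "$script" ]; then
  echo "no launch script for this runner: $script" >&2
fi
```

A check like this does not fix the missing script, but it would make the cause obvious in the job log instead of a bare "No such file or directory".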

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27644643517
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27644643517

@functionstackx

Collaborator

@JohnQinAMD a bunch of failures on PR validation, can u take a look?
