[AMD] [MI300X] minimaxm3-fp8-mi300x-vllm: enable AITER kernels for MXFP8 on MI300X by JohnQinAMD · Pull Request #1808 · SemiAnalysisAI/InferenceX

JohnQinAMD · 2026-06-16T15:45:36Z

Enable AITER kernels for MiniMax-M3 MXFP8 on MI300X/gfx942 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they're left at their defaults. VLLM_ROCM_USE_AITER_MHA=0 keeps attention on TRITON_ATTN because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also sets AMD-recommended, numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (raises RCCL channels above the ~32–64 default for TP8), GPU_MAX_HW_QUEUES=2 (caps HIP streams below the default of 4). All changes are kernel-selection / runtime only.

Measured uplift (8×MI300X, 1k1k random sweep, total tok/s/gpu)

conc	before	after	Δ
256	782.7	856.1	+9.4%
128	598.9	637.0	+6.4%
64	365.1	392.0	+7.4%
32	295.6	327.4	+10.8%
16	203.1	216.5	+6.6%
8	127.6	136.6	+7.1%
4	80.1	84.6	+5.6%

conc 1–2 unchanged (latency-bound). GSM8K exact-match holds at ~0.95 (kernel-selection change only).
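
As a quick sanity check, the Δ column can be recomputed from the before and after columns as (after − before) / before. The snippet below is a minimal sketch; the `uplift` helper is a hypothetical name introduced here for illustration and is not part of the PR.

```shell
# Recompute the uplift percentage from the table's before/after columns.
# uplift <before> <after> prints the relative change with one decimal place.
uplift() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.1f%%\n", (a - b) / b * 100 }'
}

uplift 782.7 856.1   # conc 256
uplift 295.6 327.4   # conc 32
uplift 80.1 84.6     # conc 4
```

The three calls correspond to the conc 256, 32, and 4 rows of the table.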

Notes

Image is unchanged by design (vllm/vllm-openai-rocm:minimax-m3). AITER kernels are already compiled into the ROCm image; these env vars only select them at runtime, so nothing needs to be baked into the image.
Opened from an upstream branch (not a fork) so CI can run.

Supersedes #1804.

Note

Low Risk
Benchmark-only env var changes on AMD inference; attention stays on TRITON_ATTN and reported accuracy is unchanged.

Overview
Turns on AITER for the MiniMax-M3 MXFP8 MI300X vLLM recipe by exporting VLLM_ROCM_USE_AITER=1 in minimaxm3_fp8_mi300x.sh, so decode GEMMs and fused MoE use the ROCm image’s AITER paths instead of generic kernels. VLLM_ROCM_USE_AITER_MHA=0 keeps TRITON_ATTN because the MXFP8 checkpoint has no calibrated q/prob scales for ROCm FP8 attention.

Also sets MI300X runtime env defaults: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS (default 112), and GPU_MAX_HW_QUEUES (default 2). No container change—runtime kernel selection only.
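
The environment block described above might look roughly like the following. This is a sketch assembled from the values stated in this PR, not a copy of minimaxm3_fp8_mi300x.sh; in particular, the `${VAR:-default}` form for the two overridable knobs is an assumption based on the "(default 112)" and "(default 2)" wording.

```shell
# Sketch only: values come from the PR description; the real recipe may differ.

# Master AITER toggle. The per-component flags (_MOE, _LINEAR, _RMSNORM,
# _FP8BMM) are left at their defaults, which are gated behind this flag.
export VLLM_ROCM_USE_AITER=1

# Keep attention on TRITON_ATTN: the MXFP8 checkpoint lacks calibrated
# q/prob scales for ROCm FP8 attention.
export VLLM_ROCM_USE_AITER_MHA=0

# MI300X runtime knobs described as numerically inert.
export TORCH_BLAS_PREFER_HIPBLASLT=1
# Assumed pattern: honor a caller-provided value, else fall back to the default.
export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
```

With this pattern, a caller could override a knob for a single run (for example, `NCCL_MIN_NCHANNELS=64` before invoking the script) without editing the recipe.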

Documents measured ~5–10% throughput uplift on 8×MI300X 1k1k sweeps (conc ≥4) and stable GSM8K in perf-changelog.yaml for minimaxm3-fp8-mi300x-vllm.

Reviewed by Cursor Bugbot for commit 883fbd0. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-16T15:45:57Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

JohnQinAMD · 2026-06-16T16:26:38Z

@functionstackx This supersedes #1804; I switched to this PR, opened from an upstream branch, so that CI triggers.

functionstackx · 2026-06-16T20:00:59Z

@claude can u rebase this PR

Klaud-Cold · 2026-06-16T20:01:32Z

Claude finished @functionstackx's task in 41s.

Rebase Request

Understand request
Attempt to update PR branch with latest from main

There are merge conflicts between this branch and main, so I'm unable to automatically update the branch. This needs to be resolved manually.

To rebase locally and resolve conflicts:

git checkout minimaxm3-mi300x-aiter-tuning
git fetch origin main
git rebase origin/main
# Resolve any conflicts in the flagged files, then:
git rebase --continue
git push --force-with-lease origin minimaxm3-mi300x-aiter-tuning

Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (RCCL channels, raised above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (HIP streams, capped below the default of 4). All changes are kernel-selection/runtime only; GSM8K holds ~0.95.

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Gong Zheng <zgong@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-16T21:06:43Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27644643517
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27644643517

ZhengGong-amd · 2026-06-17T03:14:47Z

The benchmark sweep didn't cover all points because the job landed on a self-hosted runner named mi300x-tw_. The workflow builds the launch script path from the runner name (bash ./runners/launch_${RUNNER_NAME%%_}.sh), so it tried to run runners/launch_mi300x-tw.sh — which doesn't exist in the repo (only launch_mi300x-amds.sh is provided). That step fails with No such file or directory, so the sweep aborts before finishing.
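
The path construction described above can be reproduced in isolation. The following is a minimal sketch that uses the runner name and expansion exactly as quoted in this comment; the existence check is a hypothetical guard added for illustration and is not something the workflow is known to contain.

```shell
# Runner name and expansion as quoted in the comment above.
RUNNER_NAME="mi300x-tw_"

# ${VAR%%pattern} removes the longest suffix matching the pattern,
# so the trailing underscore is dropped before the path is built.
launch_script="./runners/launch_${RUNNER_NAME%%_}.sh"
echo "resolved launch script: ${launch_script}"

# Hypothetical guard: report a missing script clearly instead of letting
# bash fail later with "No such file or directory".
if [ ! -f "${launch_script}" ]; then
  echo "no launch script for runner '${RUNNER_NAME}': ${launch_script}" >&2
fi
```

A guard of this kind would not fix the missing script, but it would make the cause visible at the point where the path is resolved.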

My PR only touches the recipe / config / perf-changelog.yaml — it doesn't touch runners/, the workflows, or the runner pool. The fix belongs to whoever owns the CI runners: either add runners/launch_mi300x-tw.sh, or remove the mi300x-tw runner so MI300X jobs only run on mi300x-amds.


functionstackx · 2026-06-17T03:28:01Z

@JohnQinAMD a bunch of failures on PR validation, can u take a look?

JohnQinAMD requested a review from a team June 16, 2026 15:45

github-project-automation Bot added this to InferenceMAX Board Jun 16, 2026

JohnQinAMD force-pushed the minimaxm3-mi300x-aiter-tuning branch from d3a617f to 7c1dc0a Compare June 16, 2026 15:47

JohnQinAMD mentioned this pull request Jun 16, 2026

[AMD] minimaxm3-fp8-mi300x-vllm: enable AITER kernels + safe ROCm knobs #1804

Closed

JohnQinAMD force-pushed the minimaxm3-mi300x-aiter-tuning branch from 7c1dc0a to 9def276 Compare June 16, 2026 16:24

functionstackx added the full-sweep-enabled label Jun 16, 2026

functionstackx force-pushed the minimaxm3-mi300x-aiter-tuning branch from 9def276 to 883fbd0 Compare June 16, 2026 20:04

JohnQinAMD wants to merge 1 commit into main from minimaxm3-mi300x-aiter-tuning

4 participants
