[AMD] [MI300X] minimaxm3-fp8-mi300x-vllm: enable AITER kernels for MXFP8 on MI300X #1808
JohnQinAMD wants to merge 1 commit into
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
d3a617f to 7c1dc0a
7c1dc0a to 9def276
|
@functionstackx This supersedes #1804. I switched to this PR, which uses an upstream branch, to trigger CI. |
|
@claude can u rebase this PR |
|
Claude finished @functionstackx's task in 41s
Rebase Request
There are merge conflicts between this branch and the base branch. To rebase locally and resolve conflicts:
git checkout minimaxm3-mi300x-aiter-tuning
git fetch origin main
git rebase origin/main
# Resolve any conflicts in the flagged files, then:
git rebase --continue
git push --force-with-lease origin minimaxm3-mi300x-aiter-tuning |
Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended, numerically inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (RCCL channels, raised above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (HIP streams, capped below the default of 4).

All changes are kernel-selection/runtime only; GSM8K holds ~0.95.

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Gong Zheng <zgong@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9def276 to 883fbd0
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27644643517 |
|
The benchmark sweep didn't cover all points because the job landed on a self-hosted runner named mi300x-tw_. The workflow builds the launch script path from the runner name (bash ./runners/launch_${RUNNER_NAME%%_}.sh), so it tried to run runners/launch_mi300x-tw.sh — which doesn't exist in the repo (only launch_mi300x-amds.sh is provided). That step fails with No such file or directory, so the sweep aborts before finishing. My PR only touches the recipe / config / perf-changelog.yaml — it doesn't touch runners/, the workflows, or the runner pool. The fix belongs to whoever owns the CI runners: either add runners/launch_mi300x-tw.sh, or remove the mi300x-tw runner so MI300X jobs only run on mi300x-amds.
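
The path construction described above can be reproduced in isolation. This is a minimal sketch, not the actual workflow step: it assumes the runner name is exactly mi300x-tw_ as quoted in the comment, and launch_script is a hypothetical variable name used only for illustration.

```shell
#!/usr/bin/env bash
# Runner name as quoted in the comment above (assumed to be the full value).
RUNNER_NAME="mi300x-tw_"

# ${VAR%%pattern} strips the longest suffix matching the pattern;
# with the pattern "_" that is just the trailing underscore.
launch_script="./runners/launch_${RUNNER_NAME%%_}.sh"
echo "$launch_script"   # prints ./runners/launch_mi300x-tw.sh

# Check for the script up front so a missing file is reported clearly.
if [ -f "$launch_script" ]; then
  echo "launch script found"
else
  echo "launch script missing: $launch_script"
fi
```

A guard like the final check would let the workflow report an unsupported runner explicitly instead of failing with No such file or directory partway through the sweep.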
|
|
@JohnQinAMD a bunch of failures on PR validation, can u take a look? |
Enable AITER kernels for MiniMax-M3 MXFP8 on MI300X/gfx942 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they're left at their defaults. VLLM_ROCM_USE_AITER_MHA=0 keeps attention on TRITON_ATTN because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also sets AMD-recommended, numerically inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (raises RCCL channels above the ~32–64 default for TP8), GPU_MAX_HW_QUEUES=2 (caps HIP streams below the default of 4). All changes are kernel-selection / runtime only.

Measured uplift (8×MI300X, 1k1k random sweep, total tok/s/gpu)
conc 1–2 unchanged (latency-bound). GSM8K exact-match holds at ~0.95 (kernel-selection change only).

Notes
vllm/vllm-openai-rocm:minimax-m3). AITER kernels are already compiled into the ROCm image; these env vars only select them at runtime, so nothing needs to be baked into the image.

Supersedes #1804.
Note
Low Risk
Benchmark-only env var changes on AMD inference; attention stays on TRITON_ATTN and reported accuracy is unchanged.
Overview
Turns on AITER for the MiniMax-M3 MXFP8 MI300X vLLM recipe by exporting VLLM_ROCM_USE_AITER=1 in minimaxm3_fp8_mi300x.sh, so decode GEMMs and fused MoE use the ROCm image's AITER paths instead of generic kernels. VLLM_ROCM_USE_AITER_MHA=0 keeps TRITON_ATTN because the MXFP8 checkpoint has no calibrated q/prob scales for ROCm FP8 attention.

Also sets MI300X runtime env defaults: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS (default 112), and GPU_MAX_HW_QUEUES (default 2). No container change; runtime kernel selection only.

Documents measured ~5–10% throughput uplift on 8×MI300X 1k1k sweeps (conc ≥4) and stable GSM8K in perf-changelog.yaml for minimaxm3-fp8-mi300x-vllm.

Reviewed by Cursor Bugbot for commit 883fbd0. Bugbot is set up for automated code reviews on this repo.
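
For reference, the environment described in this summary might look like the following in a recipe script. This is a sketch only: the variable names and values come from the PR text, but the overridable ${VAR:-default} form for the two tunables is an assumption inferred from the "default 112" / "default 2" wording, and the actual contents of minimaxm3_fp8_mi300x.sh are not shown in this thread.

```shell
#!/usr/bin/env bash
# Master AITER toggle; the per-component flags stay at their defaults.
export VLLM_ROCM_USE_AITER=1

# Keep attention on TRITON_ATTN: the MXFP8 checkpoint has no calibrated
# q/prob scales for ROCm FP8 attention.
export VLLM_ROCM_USE_AITER_MHA=0

# MI300X runtime knobs named in the PR description.
export TORCH_BLAS_PREFER_HIPBLASLT=1

# Assumed pattern: honor a caller-supplied value, else use the PR's default.
export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"

# Print the effective settings so a benchmark log records them.
env | grep -E '^(VLLM_ROCM_USE_AITER|TORCH_BLAS_PREFER_HIPBLASLT|NCCL_MIN_NCHANNELS|GPU_MAX_HW_QUEUES)' | sort
```

Writing the tunables in the overridable form keeps the recipe's defaults while letting a sweep change a channel or queue count from the calling environment without editing the script.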