[AMD] minimaxm3-fp8-mi300x-vllm: enable AITER kernels + safe ROCm knobs #1804
JohnQinAMD wants to merge 1 commit
Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (RCCL channels, raised above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (HIP streams, capped below the default of 4).

All changes are kernel-selection/runtime only; GSM8K holds ~0.95.

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Cursor <cursoragent@cursor.com>
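The gating described above can be sketched in shell. This only illustrates the behaviour stated in the description (a component kernel is active only when both the master flag and its own flag are on); it is not vLLM's implementation. The helper name `aiter_component_enabled` is hypothetical, and treating an unset master flag as off is an assumption.

```shell
# Hypothetical helper mirroring the gating described in the PR text.
# Not vLLM code: a component counts as active only when the master
# toggle AND the component's own flag are both "1".
aiter_component_enabled() {
  # Assumption: an unset master toggle means AITER is off.
  _master="${VLLM_ROCM_USE_AITER:-0}"
  # Component flags default to on, per the PR description.
  eval "_component=\${VLLM_ROCM_USE_AITER_$1:-1}"
  [ "$_master" = "1" ] && [ "$_component" = "1" ]
}

# The two settings this PR exports.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0

for c in MOE LINEAR RMSNORM FP8BMM MHA; do
  if aiter_component_enabled "$c"; then
    echo "$c: AITER"
  else
    echo "$c: fallback"
  fi
done
```

With the two exports above, only MHA reports the fallback path, which matches the intent of keeping attention on TRITON_ATTN.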
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0

export TORCH_BLAS_PREFER_HIPBLASLT=1
export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
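The last two exports use the `${VAR:-default}` form, so a value already present in the caller's environment takes precedence, while the AITER and hipBLASLt settings are assigned unconditionally. A minimal sketch of that difference follows; the wrapper name `apply_rccl_defaults` and the override value 64 are illustrative only and do not come from the PR.

```shell
# Hypothetical wrapper reproducing the two defaulted exports from the diff.
apply_rccl_defaults() {
  export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
  export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
}

# Case 1: nothing is preset, so the recipe defaults apply.
unset NCCL_MIN_NCHANNELS GPU_MAX_HW_QUEUES
apply_rccl_defaults
echo "defaults: channels=$NCCL_MIN_NCHANNELS queues=$GPU_MAX_HW_QUEUES"

# Case 2: the caller presets one knob; it is kept and the other falls back.
unset NCCL_MIN_NCHANNELS GPU_MAX_HW_QUEUES
export NCCL_MIN_NCHANNELS=64   # illustrative value, not a recommendation
apply_rccl_defaults
echo "override: channels=$NCCL_MIN_NCHANNELS queues=$GPU_MAX_HW_QUEUES"
```

Note that `:-` also substitutes the default when the variable is set but empty, so exporting an empty string does not disable the knob.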
@JohnQinAMD thank you for the PR, can u please update the image in amd-master.yaml to include these changes & do an upstream branch instead of forked branch so that we can kick off CI?
@JohnQinAMD can u also update https://github.com/vllm-project/recipes/tree/main with these new env vars
@functionstackx, an upstream branch has been created; I will close this PR and switch to #1808 to trigger the CI test.
The image tag in this PR currently uses the same default image tag as amd-master.yaml on the main branch.
Updating the vLLM recipes via vllm-project/recipes#556.
Closing this PR and switching to #1808 to trigger the CI test.
Note
Low Risk
Benchmark-only env exports and changelog; kernel/runtime selection with reported stable GSM8K, no auth or data-path changes.
Overview
MiniMax-M3 MXFP8 MI300X fixed-sequence vLLM recipe now exports VLLM_ROCM_USE_AITER=1 so decode GEMMs and fused MoE use AITER instead of generic ROCm kernels, with VLLM_ROCM_USE_AITER_MHA=0 so attention stays on TRITON_ATTN (MXFP8 lacks calibrated FP8 attention scales).
Also sets numerically inert MI300X knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS default 112, and GPU_MAX_HW_QUEUES default 2.
perf-changelog.yaml documents the change for minimaxm3-fp8-mi300x-vllm, including measured 1k1k throughput uplift at conc 4–256 and unchanged GSM8K (~0.95).
Reviewed by Cursor Bugbot for commit 5cbf877.
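Because the recipe depends on one specific flag combination (AITER on, AITER attention off), a benchmark script could verify it before launching the server. The following is a sketch under that assumption; `check_recipe_env` is a hypothetical helper and is not part of this PR or of vLLM.

```shell
# Hypothetical preflight check for the flag combination in this recipe.
# Returns non-zero if either flag differs from the value the PR exports.
check_recipe_env() {
  _status=0
  if [ "${VLLM_ROCM_USE_AITER:-}" != "1" ]; then
    echo "expected VLLM_ROCM_USE_AITER=1" >&2
    _status=1
  fi
  if [ "${VLLM_ROCM_USE_AITER_MHA:-}" != "0" ]; then
    echo "expected VLLM_ROCM_USE_AITER_MHA=0 (attention on TRITON_ATTN)" >&2
    _status=1
  fi
  return "$_status"
}

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0
if check_recipe_env; then
  echo "recipe environment looks consistent"
fi
```

The check is deliberately strict about the MHA flag: the description says that flag defaults to True, so leaving it unset would not keep attention on TRITON_ATTN.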