[AMD] minimaxm3-fp8-mi300x-vllm: enable AITER kernels + safe ROCm knobs #1804

Closed
JohnQinAMD wants to merge 1 commit into SemiAnalysisAI:main from ZhengGong-amd:minimaxm3-mi300x-aiter-tuning

Conversation

@JohnQinAMD JohnQinAMD commented Jun 16, 2026

Collaborator

Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set numerically-inert runtime knobs (hipBLASLt preference, NCCL channels, HW queues). All changes are kernel-selection/runtime only; GSM8K holds ~0.95.

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across conc 4..256; conc 1-2 unchanged (latency-bound).


Note

Low Risk
Benchmark-only env exports and changelog; kernel/runtime selection with reported stable GSM8K, no auth or data-path changes.

Overview
MiniMax-M3 MXFP8 MI300X fixed-sequence vLLM recipe now exports VLLM_ROCM_USE_AITER=1 so decode GEMMs and fused MoE use AITER instead of generic ROCm kernels, with VLLM_ROCM_USE_AITER_MHA=0 so attention stays on TRITON_ATTN (MXFP8 lacks calibrated FP8 attention scales).

Also sets numerically inert MI300X knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS default 112, and GPU_MAX_HW_QUEUES default 2.
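The "default" wording above refers to the `${VAR:-default}` shell expansion used in the recipe's exports: a value already set by the caller is kept, and the recipe value applies only when the variable is unset or empty. The sketch below illustrates that behaviour; the function name `apply_recipe_env` and the two capture variables are hypothetical and exist only for this example, while the five exports mirror the lines added by the PR.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): applies the env settings
# that the recipe exports before launching the benchmark.
apply_recipe_env() {
    # Fixed settings: always overwritten by the recipe.
    export VLLM_ROCM_USE_AITER=1
    export VLLM_ROCM_USE_AITER_MHA=0
    export TORCH_BLAS_PREFER_HIPBLASLT=1
    # Overridable settings: ${VAR:-default} keeps a non-empty caller value.
    export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
    export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
}

# Case 1: nothing preset, so both recipe defaults apply.
default_run=$(
    unset NCCL_MIN_NCHANNELS GPU_MAX_HW_QUEUES
    apply_recipe_env
    echo "$NCCL_MIN_NCHANNELS $GPU_MAX_HW_QUEUES"
)

# Case 2: the caller presets one knob; only the other falls back.
override_run=$(
    unset GPU_MAX_HW_QUEUES
    NCCL_MIN_NCHANNELS=64
    apply_recipe_env
    echo "$NCCL_MIN_NCHANNELS $GPU_MAX_HW_QUEUES"
)

echo "defaults: $default_run"
echo "override: $override_run"
```

With this pattern a single run can sweep `NCCL_MIN_NCHANNELS` or `GPU_MAX_HW_QUEUES` from the calling environment without editing the recipe, whereas the three fixed exports always win over whatever the caller set.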

perf-changelog.yaml documents the change for minimaxm3-fp8-mi300x-vllm, including measured 1k1k throughput uplift at conc 4–256 and unchanged GSM8K (~0.95).

Reviewed by Cursor Bugbot for commit 5cbf877.

Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master
toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR,
_RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so
they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and
is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8
checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended numerically-inert MI300X runtime knobs:
TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (RCCL channels, raised
above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (HIP streams, capped
below the default of 4). All changes are kernel-selection/runtime only;
GSM8K holds ~0.95.

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across
conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Cursor <cursoragent@cursor.com>
@JohnQinAMD JohnQinAMD requested a review from a team June 16, 2026 06:58
@JohnQinAMD JohnQinAMD changed the title from "minimaxm3-fp8-mi300x-vllm: enable AITER kernels + safe ROCm knobs" to "[AMD] minimaxm3-fp8-mi300x-vllm: enable AITER kernels + safe ROCm knobs" Jun 16, 2026
Comment on lines +37 to +42
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0

export TORCH_BLAS_PREFER_HIPBLASLT=1
export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"

Collaborator

@JohnQinAMD thank you for the PR, can u please update the image in amd-master.yaml to include these changes & do an upstream branch instead of forked branch so that we can kick off CI?

Collaborator

@JohnQinAMD can u also update https://github.com/vllm-project/recipes/tree/main with these new env vars

Collaborator Author

> @JohnQinAMD thank you for the PR, can u please update the image in amd-master.yaml to include these changes & do an upstream branch instead of forked branch so that we can kick off CI?

@functionstackx, an upstream branch has been created; I will close this PR and switch to #1808 to trigger the CI test.

The image tag in this PR currently uses the default image tag from amd-master.yaml on the main branch.

> @JohnQinAMD can u also update https://github.com/vllm-project/recipes/tree/main with these new env vars

Updating the vLLM recipes via vllm-project/recipes#556.

Collaborator Author

Closing this PR and switching to #1808 to trigger the CI test.
