[DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 7f61 #1709
seungrokj wants to merge 32 commits into
…r mi355x models Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
    $ASYNC_SCHEDULING_ARGS
    "${PREFIX_CACHE_ARGS[@]}"
    "${OFFLOAD_ARGS[@]}"
    )
vLLM uses wrong model
High Severity
The vLLM command serves "$MODEL" and omits --served-model-name, while the script downloads weights into MODEL_PATH and build_replay_cmd sends --model $MODEL to aiperf. That breaks the usual MODEL_PATH + served-name pairing used by sibling agentic scripts and can fail when MODEL is a Hub id but weights live under MODEL_PATH.
Reviewed by Cursor Bugbot for commit 01cc2af.
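A minimal sketch of the MODEL_PATH plus served-name pairing the comment refers to, assuming MODEL holds the Hub id that clients send and MODEL_PATH optionally points at a local weights directory. `build_serve_args` is a hypothetical helper for illustration, not a function from this PR, and the example values are made up.

```shell
#!/usr/bin/env bash
# Sketch only: build_serve_args is a hypothetical helper, not code from the PR.

build_serve_args() {
    local model="$1"            # Hub id, also the name the benchmark client sends
    local model_path="${2:-}"   # optional local directory holding downloaded weights
    # Serve from the prepared directory when one exists, otherwise from the Hub id.
    local target="${model_path:-$model}"
    # --served-model-name keeps the OpenAI-facing name equal to MODEL either way.
    SERVE_ARGS=(vllm serve "$target" --served-model-name "$model")
}

# Example values are placeholders, not taken from the PR.
build_serve_args "org/model" "/data/models/model"
printf '%s\n' "${SERVE_ARGS[*]}"
```

Keeping the arguments in an array lets the launcher append the existing `"${PREFIX_CACHE_ARGS[@]}"` style arrays without re-quoting.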
    --mem-fraction-static 0.8 \
    --context-length $MAX_MODEL_LEN \
    "${CACHE_ARGS[@]}" \
    "${WARMUP_ARGS[@]}" \
SGLang ignores MODEL_PATH
Medium Severity
SGLang is started with --model-path $MODEL and no --served-model-name, after the script may download into MODEL_PATH. Matrix jobs that set a local MODEL_PATH can still point the server at the Hub id, and the OpenAI model name may not match MODEL used by aiperf.
Reviewed by Cursor Bugbot for commit 01cc2af.
    cd LMCache
    pip install -r requirements/build.txt
    CXX=hipcc BUILD_WITH_HIP=1 pip install -e . --no-build-isolation
    cd ..
LMCache clone not idempotent
Medium Severity
The lmcache path runs git clone https://github.com/LMCache/LMCache.git unconditionally. With set -e, a second run in the same working directory exits when LMCache already exists, so lmcache agentic jobs fail on retry or reuse of the job cwd.
Reviewed by Cursor Bugbot for commit 01cc2af.
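One way to make the clone step safe to re-run is to skip it when a checkout already exists. The sketch below assumes the job's working directory may be reused between attempts. `clone_once` is a hypothetical helper name, not something defined in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: clone_once is a hypothetical helper, not code from the PR.

clone_once() {
    local url="$1" dir="$2"
    if [ -d "$dir/.git" ]; then
        # A previous run already cloned the repo, so reuse it instead of failing.
        echo "reusing existing checkout: $dir"
    else
        git clone "$url" "$dir"
    fi
}
```

A launcher would call `clone_once https://github.com/LMCache/LMCache.git LMCache` before the existing `cd LMCache` step, so a retry under `set -e` no longer exits at the clone.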
Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onfig Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    python3 -m sglang.launch_server \
    --attention-backend aiter \
    --model-path $MODEL \
Server ignores MODEL_PATH
Medium Severity
Weights are downloaded into MODEL_PATH when the workflow sets that directory, but SGLang is started with --model-path $MODEL (Hub id) instead of MODEL_PATH. The server may load a different cache path than the one prepared for the job.
Reviewed by Cursor Bugbot for commit 32f5007.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    # ---- Resolve traces and install deps ----------------------------------------
    # https://huggingface.co/datasets/semianalysisai/cc-traces-weka-with-subagents-060826
    export WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_060826
DSv4 atom uncapped traces
Medium Severity
This new DSv4 ATOM agentic script sets WEKA_LOADER_OVERRIDE to the uncapped 060826 trace set, while peer MI355X agentic scripts in the same PR use 060226_256k to avoid ~1M-token traces that are rejected and skew sweeps.
Reviewed by Cursor Bugbot for commit 351e729.
Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>
…nalysisAI/InferenceX into amd/agentx-v0.4_rebase0611
    $ASYNC_SCHEDULING_ARGS
    "${PREFIX_CACHE_ARGS[@]}"
    "${OFFLOAD_ARGS[@]}"
    )
MiniMax FP8 launcher regressed
High Severity
The MI355X MiniMax FP8 agentic launcher was replaced with a Kimi-style vLLM recipe. Existing minimaxm2.5-fp8-mi355x-vllm-agentic jobs (TP4/EP4, offloading=cpu) lose the prior --max-model-len, ROCM_AITER_UNIFIED_ATTN backend, MODEL_PATH-based serve, and SimpleCPU offload wiring they depended on.
Reviewed by Cursor Bugbot for commit faba18f.
Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>
… config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cripts and master yaml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    --cuda-graph-max-bs "$PER_ENGINE_MAX_RUNNING" \
    --disable-radix-cache \
    --attention-backend dsv4 \
    --max-running-requests ${CONC} \
DP max-running requests wrong
Medium Severity
When DP_ATTENTION=true, the script computes PER_ENGINE_MAX_RUNNING as CONC/TP for per-engine limits, but the server is started with --max-running-requests ${CONC}. Each DP engine may accept too many sequences versus the harness load-balancing assumption.
Reviewed by Cursor Bugbot for commit 76d90e0.
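If the intent is a per-engine limit, the value passed to `--max-running-requests` could be derived the same way as PER_ENGINE_MAX_RUNNING. The sketch below assumes the CONC/TP split described in the comment. `per_engine_max_running` is a hypothetical helper, and the floor of one request per engine is an added assumption, not behavior confirmed by the PR.

```shell
#!/usr/bin/env bash
# Sketch only: per_engine_max_running is a hypothetical helper, not code from the PR.
# Assumes tp is a positive integer.

per_engine_max_running() {
    local conc="$1" tp="$2" dp_attention="${3:-false}"
    if [ "$dp_attention" = "true" ]; then
        # With DP attention, split the total concurrency across engines.
        local per_engine=$(( conc / tp ))
        # Assumption: never drop below one running request per engine.
        if (( per_engine < 1 )); then
            per_engine=1
        fi
        echo "$per_engine"
    else
        # Without DP attention, the single engine takes the full concurrency.
        echo "$conc"
    fi
}

per_engine_max_running 64 8 true
```

Whether the server flag is interpreted per engine or globally should be checked against the SGLang version in the image before changing the launcher.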
…ript Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    python3 -m sglang.launch_server \
    --model-path "$MODEL_PATH" --served-model-name "$MODEL" \
    sglang serve \
    --model-path $MODEL \
Wrong model path for serve
Medium Severity
The script downloads weights into MODEL_PATH when set, but sglang serve uses --model-path $MODEL (Hub id) instead of "$MODEL_PATH". Runs that pre-stage a local directory can ignore the prepared path and rely on a different cache location.
Reviewed by Cursor Bugbot for commit 4ebc4e2.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c script Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: thshan@amd.com <thshan@amd.com@mia1-p01-g07.mia.tensorwave.lan> Co-authored-by: Cursor <cursoragent@cursor.com>
067484f to a9e1304
    # mia1-p01-g11: docker.sock permissions denied (cluster-cleanup step fails)
    # Both have been root-caused via #1431/#1432/#1440/#1441/#1443 sweep failures.
    salloc --partition=$PARTITION --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=500 --no-shell --job-name="$RUNNER_NAME"
    salloc --partition=$PARTITION --exclude=mia1-p01-g09,mia1-p01-g11,mia1-p01-g37 --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=500 --no-shell --job-name="$RUNNER_NAME"
Multinode agentic scripts never invoked
High Severity
The multinode branch always runs benchmarks/multi_node/${EXP}_${PRECISION}_mi355x_${FRAMEWORK}.sh and ignores SCENARIO_SUBDIR / IS_AGENTIC, while new disagg agentic entrypoints live under benchmarks/multi_node/agentic/. Agentic HiCache sweeps therefore hit the fixed-seq wrapper, skip trace_replay.sh, and ignore YAML offloading / duration.
Reviewed by Cursor Bugbot for commit a9e1304.
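A sketch of how the multinode branch could honor SCENARIO_SUBDIR when picking the wrapper script. It assumes SCENARIO_SUBDIR is empty for fixed-seq runs and `agentic` for agentic runs, and that the agentic wrappers follow the same file naming as the fixed-seq ones. Neither assumption is confirmed by the PR, and `resolve_bench_script` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_bench_script is a hypothetical helper, not code from the PR.

resolve_bench_script() {
    local exp="$1" precision="$2" framework="$3" subdir="${4:-}"
    local base="benchmarks/multi_node"
    if [ -n "$subdir" ]; then
        # Assumption: agentic wrappers live in a subdirectory named by SCENARIO_SUBDIR.
        base="$base/${subdir%/}"
    fi
    echo "$base/${exp}_${precision}_mi355x_${framework}.sh"
}

resolve_bench_script dsr1 fp4 sglang-disagg agentic
resolve_bench_script dsr1 fp4 sglang-disagg
```

The launcher could additionally check that the resolved file exists and fail early, so a missing agentic wrapper does not silently fall back to the fixed-seq one.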
    multinode: false
    framework: sglang-disagg
    multinode: true
    disagg: true
Disagg agentic uses wrong runner
High Severity
dsr1-fp4-mi355x-sglang-disagg-agentic-hicache sets runner: mi355x while sibling PD-disagg entries (including dsv4-fp4-mi355x-sglang-disagg-agentic-hicache) use runner: mi355x-disagg. Multinode jobs use runs-on: ${{ inputs.runner }}, so the DSR1 agentic disagg matrix likely schedules on the wrong runner class.
Reviewed by Cursor Bugbot for commit a9e1304.
    # node budget. Lower TP configs use higher ratios to maintain adequate
    # host token capacity without exceeding DRAM limits.
    if [ "$TP" -ge 8 ]; then
    DEFAULT_HICACHE_RATIO=2
DSv4 HiCache ratio inconsistent
Low Severity
Both DSv4 MI355X agentic SGLang launchers share the same HiCache comment for TP≥8, but dsv4_fp4_mi355x_sglang.sh defaults DEFAULT_HICACHE_RATIO to 2 while dsv4_fp4_mi355x.sh uses 8, so identical YAML sweeps get different CPU tier sizing.
Reviewed by Cursor Bugbot for commit a9e1304.
    TOTAL_CPU_DRAM_GB=2500
    #TODO: fix
    TOTAL_CPU_DRAM_GB=3000
    TOTAL_CPU_DRAM_PARTITION_GB="${TOTAL_CPU_DRAM_PARTITION_GB:-$((TOTAL_CPU_DRAM_GB / (8 / TP)))}"
CPU offload divide by zero
Medium Severity
TOTAL_CPU_DRAM_PARTITION_GB uses $((TOTAL_CPU_DRAM_GB / (8 / TP))). For TP greater than 8, bash evaluates 8 / TP as 0 and arithmetic expansion errors on divide-by-zero.
Reviewed by Cursor Bugbot for commit a9e1304.
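The zero divisor can be avoided by multiplying before dividing. The sketch below assumes an 8-GPU node and caps the share at the node total. Both the cap and the helper name `cpu_dram_partition_gb` are illustrative assumptions rather than the PR's behavior.

```shell
#!/usr/bin/env bash
# Sketch only: cpu_dram_partition_gb is a hypothetical helper, not code from the PR.

cpu_dram_partition_gb() {
    local total_gb="$1" tp="$2" gpus_per_node="${3:-8}"
    if (( tp <= 0 )); then
        echo "TP must be positive, got: $tp" >&2
        return 1
    fi
    # Multiply first: (8 / TP) truncates to 0 for TP > 8, but total * TP / 8 never
    # divides by zero.
    local share=$(( total_gb * tp / gpus_per_node ))
    # Assumption: a job never gets more than the node's total host DRAM.
    if (( share > total_gb )); then
        share="$total_gb"
    fi
    echo "$share"
}

cpu_dram_partition_gb 3000 4
cpu_dram_partition_gb 3000 16
```

For TP values that divide 8 (1, 2, 4, 8) this matches the original expression with a 3000 GB total. For other values the results differ, because the original truncates `8 / TP` before dividing (TP=3 gives 1500 there and 1125 here).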
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
    decode_dp_ranks=$DECODE_TP_SIZE
    MORI_MAX_DISPATCH_TOKENS_DECODE=$((BENCH_MAX_CONC_VALUE / decode_dp_ranks))
    MORI_MOE_MAX_INPUT_TOKENS_DECODE=$((MORI_MAX_DISPATCH_TOKENS_DECODE * decode_dp_ranks * 7 / 10))
    # MORI_MOE_MAX_INPUT_TOKENS_DECODE=$((MORI_MAX_DISPATCH_TOKENS_DECODE * decode_dp_ranks * 7 / 10))
Disagg MoE token overrides removed
Medium Severity
This change comments out assignments that set MORI_MOE_MAX_INPUT_TOKENS_PREFILL and MORI_MOE_MAX_INPUT_TOKENS_DECODE for DP+EP and MTP decode paths, while launch commands still conditionally export those variables. Disagg sweeps that relied on the computed caps may run with unset MoE input limits.
Reviewed by Cursor Bugbot for commit c21ad06.
    fi
    set +x
    PREFILL_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_PREFILL} ${PREFILL_SDMA_ENV} ${PREFILL_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_MAX_DISPATCH_TOKENS_PREFILL} python3 -m sglang.launch_server \
    PREFILL_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_PREFILL} ${PREFILL_SDMA_ENV} ${PREFILL_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK_PREFILL:-${MORI_MAX_DISPATCH_TOKENS_PREFILL}} MORI_IO_SQ_BACKOFF_TIMEOUT_US=${MORI_IO_SQ_BACKOFF_TIMEOUT_US} MORI_IO_QP_MAX_SEND_WR=${MORI_IO_QP_MAX_SEND_WR} ${LAUNCH_PREFIX:-} python3 -m sglang.launch_server \
Server ignores resolved MODEL_PATH
Medium Severity
job.slurm now resolves and exports a canonical MODEL_PATH (caller path, hf_dir, or MODEL_DIR/MODEL_NAME), but server_sglang.sh still launches with --model-path $MODEL_DIR/$MODEL_NAME. When the resolved path differs from that join, prefill/decode can fail to load weights or load from the wrong directory.
Reviewed by Cursor Bugbot for commit c7f269e.
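A sketch of how the server script could prefer the resolved MODEL_PATH and fall back to the MODEL_DIR/MODEL_NAME join only when it is unset. It assumes weights live in a local directory on the node. The existence check and the helper name `effective_model_path` are illustrative additions, not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch only: effective_model_path is a hypothetical helper, not code from the PR.

effective_model_path() {
    local resolved="${1:-}" model_dir="$2" model_name="$3"
    # Prefer the path resolved upstream; otherwise use the legacy join.
    local path="${resolved:-$model_dir/$model_name}"
    # Assumption: weights are a local directory, so fail fast when it is missing.
    if [ ! -d "$path" ]; then
        echo "model directory not found: $path" >&2
        return 1
    fi
    echo "$path"
}
```

The launch command would then pass `--model-path "$(effective_model_path "${MODEL_PATH:-}" "$MODEL_DIR" "$MODEL_NAME")"` for both prefill and decode, so the two roles cannot drift apart.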
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
c7f269e to e37fbc2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    fi
    set +x
    DECODE_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_DECODE} ${DECODE_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_MAX_DISPATCH_TOKENS_DECODE} python3 -m sglang.launch_server \
    DECODE_CMD="SGLANG_MORI_COMBINE_DTYPE=${MORI_COMBINE_DTYPE_DECODE} ${DECODE_MORI_MOE_ENV} SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=${MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK_DECODE:-${MORI_MAX_DISPATCH_TOKENS_DECODE}} MORI_IO_SQ_BACKOFF_TIMEOUT_US=${MORI_IO_SQ_BACKOFF_TIMEOUT_US} MORI_IO_QP_MAX_SEND_WR=${MORI_IO_QP_MAX_SEND_WR} ${LAUNCH_PREFIX:-} python3 -m sglang.launch_server \
Custom all-reduce flag unused
Medium Severity
DISABLE_CUSTOM_ALL_REDUCE is threaded into the container from job.slurm, and the DSR1 disagg agentic recipe defaults it to 1 for an Aiter fault workaround, but prefill/decode launch commands never append --disable-custom-all-reduce.
Reviewed by Cursor Bugbot for commit b5626fb.
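A sketch of wiring the flag through, assuming DISABLE_CUSTOM_ALL_REDUCE is `1` when the workaround is wanted. `all_reduce_args` and the EXTRA_ARGS array are hypothetical names for illustration, not code from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: all_reduce_args and EXTRA_ARGS are hypothetical names, not from the PR.

all_reduce_args() {
    local disable="${1:-0}"
    EXTRA_ARGS=()
    if [ "$disable" = "1" ]; then
        # Append the server flag only when the workaround is requested.
        EXTRA_ARGS+=(--disable-custom-all-reduce)
    fi
}

all_reduce_args "${DISABLE_CUSTOM_ALL_REDUCE:-0}"
echo "extra args: ${#EXTRA_ARGS[@]}"
```

Because PREFILL_CMD and DECODE_CMD are built as strings, the same condition would have to be applied to both so the prefill and decode servers agree.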
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
    multinode: false
    framework: sglang-disagg
    multinode: true
    disagg: true
Multinode agentic scripts not selected
High Severity
New disaggregated agentic YAML entries and benchmarks/multi_node/agentic/* wrappers are added, but launch_mi355x-amds.sh still invokes benchmarks/multi_node/${SCRIPT_NAME} and never agentic/ or IS_AGENTIC. Those jobs run the fixed-seq disagg script without trace replay, HiCache env, or DURATION/OFFLOADING wiring.
Reviewed by Cursor Bugbot for commit f10f456.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 19 total unresolved issues (including 18 from previous reviews).
Reviewed by Cursor Bugbot for commit e753830.
    exit 1
    fi

    echo "$JOB_ID"
Agentic disagg scripts not invoked
High Severity
New multinode agentic wrappers live under benchmarks/multi_node/agentic/, but launch_mi355x-amds.sh still runs benchmarks/multi_node/${EXP_NAME%%_*}_…_${FRAMEWORK}.sh and never uses workflow SCENARIO_SUBDIR. Disagg agentic YAML entries therefore hit the fixed-seq wrapper, which does not export DURATION, OFFLOADING, or HiCache tunables the agentic scripts set.
Reviewed by Cursor Bugbot for commit e753830.
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Theresa Shan <theresa.shan@amd.com>


Summary
- qwen3.5-fp4-mi355x-sglang-agentic-hicache config: SGLang agentic-coding sweep with and without hicache offloading (TP2, EP1)
- minimaxm2.5-fp4-mi355x-vllm-agentic-lmcache config: vLLM agentic-coding sweep with lmcache
- minimaxm2.5_fp4_mi355x.sh, qwen3.5_fp4_mi355x.sh
- glm5.1_fp4_mi355x.sh, kimik2.5_fp4_mi355x.sh, minimaxm2.5_fp8_mi355x.sh, qwen3.5_fp8_mi355x.sh
- launch_mi355x-amds.sh

Test plan
🤖 Generated with Claude Code
Note
Medium Risk
Changes are benchmark/CI configuration and launch scripts rather than production services, but they alter expensive cluster sweep matrices, container pins (including version downgrades), and complex KV offload paths (HiCache/Mooncake, LMCache on ROCm) where misconfiguration can waste long SLURM jobs or skew cross-hardware comparisons.
Overview
AMD master CI (amd-master.yaml) is reshaped for an image-bump validation pass: several fixed-seq search spaces are simplified, some top-level agentic blocks are commented out or split into sibling -agentic entries (older images, dedicated conc-list/offloading grids), and SGLang/vLLM/Atom images are bumped or pinned (including vLLM downgrades on some Kimi/MiniMax entries and digest-pinned nightlies for disagg). New or expanded coverage includes DSv4 (single-node vLLM comments, new dsv4-fp4-mi355x-sglang-disagg PD topologies, agentic variants), agentic-hicache disagg stubs, and net-new agentic recipes (e.g. Qwen3.5 hicache, DSv4 vLLM/sglang agentic).

CI workflow now passes offloading from the agentic matrix into disaggregated sweep jobs (run-sweep.yml).

Multi-node AMD harness gains agentic PD launchers (dsr1/dsv4 sglang-disagg under agentic/), a fixed-seq dsv4_fp4_mi355x_sglang-disagg.sh, and substantial amd_utils changes: HiCache/Mooncake settings via a bind-mounted hicache_mc.env, trace_replay.sh for agentic runs on PD clusters, DSv4 bench --dsv4 framing, DeepSeek-V4-Pro in models.yaml + env.sh, ep_flags decoupled from DP flags in models.yaml, agentic/OFFLOADING/DURATION env threading in job.slurm, and HiCache/Mooncake startup in server_sglang.sh (including optional Mooncake master).

Single-node agentic scripts are added or rewritten for DSv4 (SGLang/Atom), GLM5.1, Kimi/MiniMax FP4 (LMCache build-from-source, larger CPU pools), with HiCache ratio/size tuning and trace corpus overrides (WEKA_LOADER_OVERRIDE).

Reviewed by Cursor Bugbot for commit e36bf75. Bugbot is set up for automated code reviews on this repo.