Fix for https://github.com/sgl-project/sglang/issues/22072#1806
Fix for https://github.com/sgl-project/sglang/issues/22072#1806davzhuAMD wants to merge 1 commit into
Conversation
| if [[ -n "${_ibdev_sorted}" ]]; then | ||
| echo "[ibdev-auto] $(hostname -s): IBDEVICES (orig) = ${_ibdev_orig}" | ||
| echo "[ibdev-auto] $(hostname -s): IBDEVICES (sorted) = ${_ibdev_sorted}" | ||
| IBDEVICES="${_ibdev_sorted}" |
There was a problem hiding this comment.
IB sort drops device list
Medium Severity
When RoCE GID sorting finds any matching bnxt_re device, it replaces IBDEVICES with only sysfs devices whose GID index 3 has a non-zero subnet, ignoring the launcher’s original list. Active HCAs without that GID, or non-bnxt_re names, are dropped, which can under-provision MoRI/NCCL and cause init hangs.
Reviewed by Cursor Bugbot for commit 7821a3a. Configure here.
… mentioned in #22072.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.
| PREFILL_IP="$(resolve_ip "${PREFILL_SSH}")" | ||
| DECODE1_IP="$(resolve_ip "${DECODE_SSH_1}")" | ||
| DECODE2_IP="$(resolve_ip "${DECODE_SSH_2}")" | ||
| fi |
There was a problem hiding this comment.
Partial IP env vars overwritten
Medium Severity
When only one of PREFILL_IP, DECODE_IP (or DECODE1_IP / DECODE2_IP in the 1P+2D launcher) is unset, the script still re-resolves every address and overwrites any IPs the operator already exported.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.
| PREFILL_EP="${PREFILL_EP:-1}" | ||
| PREFILL_DP_ATTN="${PREFILL_DP_ATTN:-false}" | ||
| DECODE_TP="${DECODE_TP:-8}" | ||
| DECODE_EP="${DECODE_EP:-1}" |
There was a problem hiding this comment.
TP16 launcher wrong default TP
High Severity
This script always starts three nodes for a two-node TP16 decode worker, but DECODE_TP defaults to 8, so DECODE_NODES_PER_WORKER stays 1 unless the caller overrides it—rank 2 is not part of the intended TP16 worker layout.
Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.
| export REBUILD_LIBBNXT_IN_CONTAINER=1 | ||
| export PATH_TO_BNXT_TAR_PACKAGE=/workspace/driver/libbnxt_re-237.1.137.0.tar.gz | ||
|
|
||
| export PREFILL_NODE="45.63.71.103" |
There was a problem hiding this comment.
The start_test_*.sh scripts hardcode many cluster-specific values (real node IPs like 45.63.71.103 / 137.220.60.12 and Docker image tags) directly in a public repo. Maybe we could parameterize these (e.g. via env vars) instead of committing concrete cluster details.
There was a problem hiding this comment.
Thanks for your comments. These test scripts themselves (including the new start_test_, run_1p, and start_sglang*) don't necessarily need to make it into the final repo, but I added them to illustrate how we investigated and reproduced the original issue on our side. If it makes things clearer, I can remove the helper scripts from the commit and attach them somewhere else instead, leaving just the actual fixes in this PR.
| # IPADDRS order: prefill_ip,decode1_ip,decode2_ip. NNODES=3, xP=1, yD=1. | ||
| # | ||
| # Prerequisites: | ||
| # - Passwordless SSH from this machine to both nodes |
There was a problem hiding this comment.
Should this say 3 nodes? This script launches 1 prefill + 2 decode nodes (NNODES=3), but the comments/header still say "both nodes", which were copy-pasted from the single-decode launcher and read as if there are only 2 nodes total.


Fix some SGLang MORI disagg tests hanging on MI325X.
Uses derived helper scripts from https://github.com/JohnQinAMD/InferenceX/tree/sglang-pd-mi300-mi325x to aid investigation and repro.
To validate, update the PREFILL_MODEL_HOST_DIR, DECODE_MODEL_HOST_DIR, PREFILL_NODE, and DECODE_NODE IPs per test in these scripts, then run
Thanks to Akash Dhaka who investigated this issue with me.
Note
Medium Risk
Changes multi-node RDMA/NCCL defaults and privileged Docker launch paths used for benchmarks; mis-tuned NCCL or IBDEVICES ordering could still hang or mis-route traffic on non-bnxt clusters unless overrides are set.
Overview
Addresses SGLang MORI prefill–decode hangs on MI300X/MI325X (e.g. sglang#22072) with RoCE/
bnxt_reand multi-node decode fixes, plus Slurm-free SSH + Docker orchestration to reproduce the five validation scenarios.Networking / collectives:
env.shnow defaults NCCLNCCL_IB_GID_INDEX,NCCL_IB_TC, andNCCL_IB_SLfor inter-node TP on Broadcom RoCE (avoids wrong GID and dropped handshake)._disagg_ssh_remote_inner.shre-sortsIBDEVICESper node by RoCEv2 GID subnet so rank→HCA mapping is consistent across hosts (needed for multi-node decode TP16).SGLang server tuning:
server_sglang.shclamps decode max-running-requests to at leastdp_rankswhen DP+EP are on (prevents cuda-graphcapture_bscollapsing to[0]at low bench concurrency) and derives Mori dispatch/MoE token limits from the adjusted value. DeepSeek-R1-0528 prefillmem_fraction_staticis raised 0.8 → 0.9 inmodels.yaml.New launch path: Root scripts
run_1p1d_*,run_1p2d_*, andrun_1p1d_tp16_*SSH to GPU nodes and run Docker viascripts/_disagg_container_entry.sh(optional libbnxt rebuild and MoRI install) into existingamd_utils/server.sh. Helpers includedetect_ibdevices_bnxt.sh,rebuild_bnxt.sh,install_mori_in_container.sh, anddocs/SGLANG_PD_MI300_MI325X.md.start_test_1.sh–start_test_5.shwire cluster-specific nodes/images for the repro matrix.Reviewed by Cursor Bugbot for commit c8a0e03. Bugbot is set up for automated code reviews on this repo. Configure here.