Skip to content

Fix for https://github.com/sgl-project/sglang/issues/22072#1806

Open
davzhuAMD wants to merge 1 commit into
SemiAnalysisAI:mainfrom
davzhuAMD:main
Open

Fix for https://github.com/sgl-project/sglang/issues/22072#1806
davzhuAMD wants to merge 1 commit into
SemiAnalysisAI:mainfrom
davzhuAMD:main

Conversation

@davzhuAMD

@davzhuAMD davzhuAMD commented Jun 16, 2026

Copy link
Copy Markdown

Fix some SGLang MORI disagg tests hanging on MI325X.

Uses derived helper scripts from https://github.com/JohnQinAMD/InferenceX/tree/sglang-pd-mi300-mi325x to aid investigation and repro.

To validate, update the PREFILL_MODEL_HOST_DIR, DECODE_MODEL_HOST_DIR, PREFILL_NODE, and DECODE_NODE IPs per test in these scripts, then run

  • start_test_1.sh
  • start_test_2.sh
  • start_test_3.sh
  • start_test_4.sh
  • start_test_5.sh

Thanks to Akash Dhaka who investigated this issue with me.


Note

Medium Risk
Changes multi-node RDMA/NCCL defaults and privileged Docker launch paths used for benchmarks; mis-tuned NCCL or IBDEVICES ordering could still hang or mis-route traffic on non-bnxt clusters unless overrides are set.

Overview
Addresses SGLang MORI prefill–decode hangs on MI300X/MI325X (e.g. sglang#22072) with RoCE/bnxt_re and multi-node decode fixes, plus Slurm-free SSH + Docker orchestration to reproduce the five validation scenarios.

Networking / collectives: env.sh now defaults NCCL NCCL_IB_GID_INDEX, NCCL_IB_TC, and NCCL_IB_SL for inter-node TP on Broadcom RoCE (avoids wrong GID and dropped handshake). _disagg_ssh_remote_inner.sh re-sorts IBDEVICES per node by RoCEv2 GID subnet so rank→HCA mapping is consistent across hosts (needed for multi-node decode TP16).

SGLang server tuning: server_sglang.sh clamps decode max-running-requests to at least dp_ranks when DP+EP are on (prevents cuda-graph capture_bs collapsing to [0] at low bench concurrency) and derives Mori dispatch/MoE token limits from the adjusted value. DeepSeek-R1-0528 prefill mem_fraction_static is raised 0.8 → 0.9 in models.yaml.

New launch path: Root scripts run_1p1d_*, run_1p2d_*, and run_1p1d_tp16_* SSH to GPU nodes and run Docker via scripts/_disagg_container_entry.sh (optional libbnxt rebuild and MoRI install) into existing amd_utils/server.sh. Helpers include detect_ibdevices_bnxt.sh, rebuild_bnxt.sh, install_mori_in_container.sh, and docs/SGLANG_PD_MI300_MI325X.md. start_test_1.shstart_test_5.sh wire cluster-specific nodes/images for the repro matrix.

Reviewed by Cursor Bugbot for commit c8a0e03. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread utils/find_reusable_sweep_run.py
Comment thread benchmarks/multi_node/amd_utils/server.sh
Comment thread run_1p1d_tp16_sglang_mi300_mi325x.sh
Comment thread runners/launch_gb300-nv.sh
if [[ -n "${_ibdev_sorted}" ]]; then
echo "[ibdev-auto] $(hostname -s): IBDEVICES (orig) = ${_ibdev_orig}"
echo "[ibdev-auto] $(hostname -s): IBDEVICES (sorted) = ${_ibdev_sorted}"
IBDEVICES="${_ibdev_sorted}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IB sort drops device list

Medium Severity

When RoCE GID sorting finds any matching bnxt_re device, it replaces IBDEVICES with only sysfs devices whose GID index 3 has a non-zero subnet, ignoring the launcher’s original list. Active HCAs without that GID, or non-bnxt_re names, are dropped, which can under-provision MoRI/NCCL and cause init hangs.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7821a3a. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.

PREFILL_IP="$(resolve_ip "${PREFILL_SSH}")"
DECODE1_IP="$(resolve_ip "${DECODE_SSH_1}")"
DECODE2_IP="$(resolve_ip "${DECODE_SSH_2}")"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial IP env vars overwritten

Medium Severity

When only one of PREFILL_IP, DECODE_IP (or DECODE1_IP / DECODE2_IP in the 1P+2D launcher) is unset, the script still re-resolves every address and overwrites any IPs the operator already exported.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.

PREFILL_EP="${PREFILL_EP:-1}"
PREFILL_DP_ATTN="${PREFILL_DP_ATTN:-false}"
DECODE_TP="${DECODE_TP:-8}"
DECODE_EP="${DECODE_EP:-1}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TP16 launcher wrong default TP

High Severity

This script always starts three nodes for a two-node TP16 decode worker, but DECODE_TP defaults to 8, so DECODE_NODES_PER_WORKER stays 1 unless the caller overrides it—rank 2 is not part of the intended TP16 worker layout.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c8a0e03. Configure here.

Comment thread start_test_1.sh
export REBUILD_LIBBNXT_IN_CONTAINER=1
export PATH_TO_BNXT_TAR_PACKAGE=/workspace/driver/libbnxt_re-237.1.137.0.tar.gz

export PREFILL_NODE="45.63.71.103"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The start_test_*.sh scripts hardcode many cluster-specific values (real node IPs like 45.63.71.103 / 137.220.60.12 and Docker image tags) directly in a public repo. Maybe we could parameterize these (e.g. via env vars) instead of committing concrete cluster details.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. These test scripts themselves (including the new start_test_, run_1p, and start_sglang*) don't necessarily need to make it into the final repo, but I added them to illustrate how we investigated and reproduced the original issue on our side. If it makes things clearer, I can remove the helper scripts from the commit and attach them somewhere else instead, leaving just the actual fixes in this PR.

# IPADDRS order: prefill_ip,decode1_ip,decode2_ip. NNODES=3, xP=1, yD=1.
#
# Prerequisites:
# - Passwordless SSH from this machine to both nodes

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say 3 nodes? This script launches 1 prefill + 2 decode nodes (NNODES=3), but the comments/header still say "both nodes", which were copy-pasted from the single-decode launcher and read as if there are only 2 nodes total.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants