
feat(RL): add RL support for verl #1298

Open
shihaobai wants to merge 247 commits into main from rl_verl_rebase_main


@shihaobai

Collaborator

No description provided.

shihaobai and others added 30 commits May 26, 2026 09:05
…rward

abort was previously fanned out from the master httpserver to the slave
httpservers over zmq, so every node's local shm had is_aborted=True set before
the router's MIN-allreduce agreed on it. now rank 0 is the single source of
truth: the router broadcasts aborted_req_mask from rank 0 (broadcast(src=0)),
and slaves write is_aborted back to their local shm so recycle_resource_loop
still observes a consistent state.
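
The flow described above can be sketched as a small pure-Python simulation. This is not the code in this PR: `RankState`, `broadcast_from_rank0`, and `sync_aborts` are hypothetical names, the collective is simulated with plain lists instead of a real process group, and treating is_aborted as a one-way flag on slaves is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RankState:
    """Hypothetical stand-in for one rank's local shm request table."""
    rank: int
    is_aborted: Dict[int, bool] = field(default_factory=dict)


def broadcast_from_rank0(values_per_rank: List[List[int]]) -> List[List[int]]:
    """Simulate broadcast(src=0): every rank ends up with rank 0's value."""
    return [list(values_per_rank[0]) for _ in values_per_rank]


def sync_aborts(states: List[RankState], req_ids: List[int]) -> List[int]:
    """Propagate rank 0's abort decisions to every rank's local state."""
    # Each rank builds a 0/1 mask from its own local view of the requests.
    local_masks = [
        [int(s.is_aborted.get(r, False)) for r in req_ids] for s in states
    ]
    # Only rank 0's mask survives the broadcast; it is the source of truth.
    agreed = broadcast_from_rank0(local_masks)
    # Every rank writes the agreed flags back to its local table.
    # Assumption: the flag is one-way, so it is only ever raised here.
    for state, mask in zip(states, agreed):
        for req_id, flag in zip(req_ids, mask):
            state.is_aborted[req_id] = state.is_aborted.get(req_id, False) or bool(flag)
    return agreed[0]


# Rank 0 has seen an abort for request 1; the slaves have seen nothing yet.
states = [RankState(0, {1: True, 2: False}), RankState(1), RankState(2)]
mask = sync_aborts(states, [1, 2])
```

After the call, every simulated rank reports request 1 as aborted and request 2 as live, which is the consistency property recycle_resource_loop relies on.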

side effects:
- disable_abort gate is removed; is_disconnected -> abort now works in
  multinode tp dp=1 mode.
- PortLocker no longer locks nccl_port on slave ranks (only rank 0 binds the
  TCPStore listener), which fixes single-machine multi-node tp testing.
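
The PortLocker change in the second bullet can be pictured as a per-rank choice of which ports to reserve. This is only a sketch: `ports_to_lock` and its parameters are hypothetical names, not PortLocker's real interface, and the port numbers are arbitrary.

```python
from typing import List


def ports_to_lock(node_rank: int, service_ports: List[int], nccl_port: int) -> List[int]:
    """Return the ports a node should reserve before startup (hypothetical helper)."""
    ports = list(service_ports)
    # Only rank 0 binds the TCPStore listener, so only rank 0 reserves nccl_port.
    if node_rank == 0:
        ports.append(nccl_port)
    return ports


# Two "nodes" on one machine share nccl_port but use their own service ports.
rank0_ports = ports_to_lock(0, [8000, 8001], nccl_port=28765)
rank1_ports = ports_to_lock(1, [9000, 9001], nccl_port=28765)

# With the slave skipping nccl_port, no port is claimed by both nodes.
overlap = set(rank0_ports) & set(rank1_ports)
```

If both ranks reserved nccl_port, the two sets would collide on a single machine, which is the situation the bullet says is now avoided.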

adds test/test_api/test_abort_chaos.py covering abort_all on N concurrent
streams and random per-stream disconnect chaos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 participants