perf: async prefetch of next segment's params during compute #1626
Open
fszontagh wants to merge 1 commit into
Summary
When `--stream-layers` runs a multi-segment plan, each merged segment currently copies its params host-to-device (H2D) before compute, then waits for the copy to finish. The GPU sits idle during the H2D.
This PR overlaps the two: while segment N's kernel runs on the runtime backend's stream, segment N+1's params are copied to a new pending buffer (`ggml_backend_tensor_copy` on `cudaStreamPerThread`). At the next iteration, `offload_partial_params` recognizes the prefetched signature and adopts the pending buffer in place of a second H2D.
Per-segment wallclock drops toward `max(H2D, compute)` instead of `H2D + compute`. Falls back to sync if the pending allocation fails.
Related
Continuation of #1576, #1598, #1601, #1611, #1612.
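For readers who want the control flow from the Summary in one place, here is a minimal host-only sketch of the prefetch-and-adopt loop. It is not the code in this PR: `Segment`, `DeviceBuffer`, `upload`, `compute`, and `run_plan` are hypothetical stand-ins, and `std::async` plays the role of the second stream so the example runs without a GPU. An integer `signature` is assumed to identify a segment's param set.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a plan segment and its device-side param buffer.
struct Segment      { int signature; std::vector<float> host_params; };
struct DeviceBuffer { int signature = -1; std::vector<float> data; };

// Stand-in for the H2D copy of one segment's params.
DeviceBuffer upload(const Segment& s) { return DeviceBuffer{s.signature, s.host_params}; }

// Stand-in for the segment's compute: just sums the params.
float compute(const DeviceBuffer& b) {
    float sum = 0.0f;
    for (float v : b.data) sum += v;
    return sum;
}

struct RunStats { float result; int sync_uploads; int adopted_prefetches; };

RunStats run_plan(const std::vector<Segment>& plan) {
    RunStats st{0.0f, 0, 0};
    std::future<DeviceBuffer> pending;  // prefetch in flight for the next segment, if any

    for (std::size_t i = 0; i < plan.size(); ++i) {
        DeviceBuffer active;
        bool adopted = false;

        // Adopt the pending buffer when its signature matches this segment.
        if (pending.valid()) {
            DeviceBuffer pre = pending.get();
            if (pre.signature == plan[i].signature) {
                active = std::move(pre);
                adopted = true;
                ++st.adopted_prefetches;
            }
        }
        // Otherwise take the synchronous path (first segment, mismatch, or failed prefetch).
        if (!adopted) {
            active = upload(plan[i]);
            ++st.sync_uploads;
        }

        // Start copying segment i+1 before computing segment i, so the two overlap.
        if (i + 1 < plan.size()) {
            const Segment* next = &plan[i + 1];
            try {
                pending = std::async(std::launch::async, [next] { return upload(*next); });
            } catch (const std::system_error&) {
                // Could not start the prefetch: the next iteration uploads synchronously.
            }
        }

        assert(active.signature == plan[i].signature);
        st.result += compute(active);
    }
    return st;
}
```

With a three-segment plan, only the first segment takes the synchronous path and the other two adopt their prefetched buffers. A one-segment plan never starts a prefetch, which matches the SDXL case noted under Numbers.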
Numbers
RTX 3060 12 GB, `--offload-to-cpu --stream-layers --max-vram -1`:
SDXL is a 1-segment plan, so prefetch is a no-op; the small win comes from `compute_async + synchronize` having less host overhead than synchronous `compute`. Z-Image hits the 9-segment streaming path and gets the full overlap.
Checklist