perf: async prefetch of next segment's params during compute by fszontagh · Pull Request #1626 · leejet/stable-diffusion.cpp

fszontagh · 2026-06-09T21:55:29Z

Summary

When --stream-layers runs a multi-segment plan, today each merged segment H2Ds its params before compute, then waits. GPU sits idle during the H2D.

This PR overlaps them: while segment N's kernel runs on the runtime backend's stream, segment N+1's params are copied to a new pending buffer (ggml_backend_tensor_copy on cudaStreamPerThread). At the next iteration offload_partial_params recognizes the prefetched signature and adopts the pending buffer in place of a second H2D.

Per-segment wallclock drops toward max(H2D, compute) instead of H2D + compute. Falls back to sync if the pending allocation fails.

Numbers

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

Workload	Before	After
SDXL bf16 1152x896 batch=2 8 steps	20.9 s	18.7 s
Z-Image bf16 1024x688 batch=2 9 steps	82 s	54.7 s

SDXL is a 1-segment plan so prefetch is a no-op; small win is from compute_async + synchronize having less host overhead than synchronous compute. Z-Image hits the 9-segment streaming path and gets the full overlap.

Checklist

I have read and confirmed this PR follows the contribution guidelines.

perf: async prefetch of next segment's params during compute

815e38c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: async prefetch of next segment's params during compute#1626

perf: async prefetch of next segment's params during compute#1626
fszontagh wants to merge 1 commit into
leejet:masterfrom
fszontagh:perf/async-prefetch

fszontagh commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fszontagh commented Jun 9, 2026

Summary

Related

Numbers

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant