Experimental performance family (default-off): load-balance infrastructure, active-box windowing, block-structured AMR, hybrid WENO/Riemann sensors #1628
Draft
sbryngelson wants to merge 112 commits into
…, correct growth comment)
…inal-review fixes)
…ay); num_patches=1 behavior-identical
…ion merge, per-slot advance (SP12a)
…sambiguate from IC patches; golden values unchanged
…moment prolongation, per-block bubble state
…ation on the fine level
…iblock+subcycle, bubbles+subcycle+regrid)
…hibit (diffuse-interface c/f normal inconsistency; 3 attempts diagnosed)
… closure, per-block reactions, multi-rank temperature-ghost exchange (diffusion gated)
…docs into PR branch
…the c/f boundary; lift the diffusion gate
…ck moment realizability
…c/polydisperse bubbles now supported
…olydisperse, SP18) + AMR docs
…, realizability-preserving prolongation
…rkers/ghost points, body-driven tagging
…e-point stencil not decomposition-exact there)
# Conflicts: # src/simulation/m_ibm.fpp
… (fold regression)
Summary
This PR adds an opt-in, default-off family of performance features and the measurement
infrastructure they rest on. With all flags at their defaults the only touched production
path is s_mpi_decompose_computational_domain, refactored to compute its equal split through the new
m_box module (byte-identical; covered by the existing suite).
Load-balance infrastructure (common + sim):
m_box: t_box + partition arithmetic; shared by the decomposer, AMR, and the weighted splitter.
m_load_weight + load_weight_wrt: per-cell load-weight field (active-box, EL-bubble, IB, phase-change Newton-iteration contributors) with field output and a per-rank imbalance metric.
m_sfc_partition + sfc_partition_wrt: Morton-SFC tile ordering and chains-on-chains balanced partition, reported as a predicted-imbalance diagnostic.
m_load_balance + load_balance: experimental weighted static Cartesian decomposition at init (requires parallel_io), with a min-cells feasibility floor and, when AMR is on, fine-work-aware weighting with a deterministic feasibility clamp.
m_rank_timing + rank_time_wrt: per-rank compute-time diagnostic (halo exchange excluded; device-synced on GPU).
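The Morton ordering, chains-on-chains split, and per-rank imbalance metric listed above can be illustrated with a small sketch. This is not the code in m_sfc_partition or m_load_weight: the function names, the max-over-mean definition of imbalance, and the greedy prefix-sum cut rule are all assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def morton2d(i, j, bits=16):
    """Interleave the bits of tile indices (i, j) into one Morton (Z-order) key."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b) | ((j >> b) & 1) << (2 * b + 1)
    return key


def chain_partition(weights, nparts):
    """Cut an ordered chain of tile weights into nparts contiguous pieces.

    Hypothetical greedy rule: place cut k where the prefix sum first reaches
    k/nparts of the total. Piece k is weights[cuts[k]:cuts[k + 1]].
    Assumes len(weights) >= nparts.
    """
    prefix = list(accumulate(weights))
    total = prefix[-1]
    cuts = [0]
    for k in range(1, nparts):
        idx = bisect_left(prefix, total * k / nparts) + 1
        idx = max(idx, cuts[-1] + 1)                      # every piece gets >= 1 tile
        idx = min(idx, len(weights) - (nparts - k))       # leave tiles for later pieces
        cuts.append(idx)
    cuts.append(len(weights))
    return cuts


def piece_loads(weights, cuts):
    """Total weight assigned to each piece."""
    return [sum(weights[a:b]) for a, b in zip(cuts, cuts[1:])]


def imbalance(rank_loads):
    """Max-over-mean load ratio; 1.0 means perfectly balanced (assumed definition)."""
    return max(rank_loads) / (sum(rank_loads) / len(rank_loads))
```

In use, tiles would be sorted by `morton2d` key, their weights passed to `chain_partition`, and `imbalance(piece_loads(...))` reported as the predicted imbalance.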
Active-box windowing (sim):
m_active_box + active_box: restricts reconstruction/Riemann/RK windows to a light-cone-grown box around non-ambient flow; a debug tripwire guards under-growth. Golden-tested (ECABA006) to stay a strict subset while matching the full-domain solution.
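The light-cone growth idea can be sketched in one dimension. The sketch assumes that a signal crosses at most `cfl` cells per time step, so a box refreshed every `steps_between_updates` steps must be grown by that travel distance plus the reconstruction stencil radius. All names, the tolerance test, and the growth formula are hypothetical and are not taken from m_active_box.

```python
import math


def grown_active_box(values, ambient, tol, cfl, steps_between_updates, stencil_radius):
    """Return inclusive (lo, hi) cell indices of the grown active box.

    Returns None when every cell is within tol of the ambient state.
    """
    active = [i for i, v in enumerate(values) if abs(v - ambient) > tol]
    if not active:
        return None
    # Farthest a disturbance can travel before the next update, in cells,
    # plus the stencil radius so reconstruction at the box edge sees valid data.
    growth = math.ceil(cfl * steps_between_updates) + stencil_radius
    lo = max(0, active[0] - growth)
    hi = min(len(values) - 1, active[-1] + growth)
    return lo, hi
```

A tripwire in this picture would check, before each box update, that no non-ambient cell lies outside `(lo, hi)`.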
Block-structured AMR (sim):
m_amr + m_amr_registers + amr: a two-level 2:1 refined hierarchy with conservative restriction / conservative-linear prolongation, per-stage flux registers with Berger–Colella refluxing, gradient-based dynamic regrid (amr_regrid_int, amr_tag_eps, amr_buf), optional dt/2 subcycling (amr_subcycle), multi-rank operation with a mirror-decomposed fine level (patches may span rank boundaries; fine halo exchange; distributed flux registers; rank-local regrid), and GPU builds (device-resident fine fields and registers, on-device ghost fill/RK/restriction). Requires WENO, SSP-RK3, model_eqns=2, single fluid (checker-enforced).
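The three conservative transfer operations named above (restriction, conservative-linear prolongation, refluxing) can be sketched in 1D for a 2:1 ratio with uniform cells. This is an illustration under stated assumptions, not the m_amr implementation: the minmod slope limiter, the single forward-Euler step in `reflux` (the PR uses per-stage registers with SSP-RK3), and all function names are hypothetical.

```python
def restrict(fine):
    """Conservative restriction: each coarse cell is the mean of its two children."""
    return [0.5 * (fine[2 * i] + fine[2 * i + 1]) for i in range(len(fine) // 2)]


def minmod(a, b):
    """Return the smaller-magnitude argument, or zero if the signs differ."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def prolong(coarse):
    """Conservative-linear prolongation: children are parent -/+ slope/4.

    The two children always average back to the parent value, so the
    coarse-level total is preserved.
    """
    n = len(coarse)
    fine = []
    for i, c in enumerate(coarse):
        left = c - coarse[i - 1] if i > 0 else 0.0
        right = coarse[i + 1] - c if i < n - 1 else 0.0
        slope = minmod(left, right)
        fine.extend([c - 0.25 * slope, c + 0.25 * slope])
    return fine


def reflux(u_coarse, coarse_flux, fine_flux_avg, dt, dx, side):
    """Berger-Colella style correction of a coarse cell next to a fine patch.

    The coarse update used coarse_flux on the shared face; this swaps in the
    time-averaged fine flux. side = +1 if the shared face is the cell's right
    face, -1 if it is the left face.
    """
    return u_coarse + side * (dt / dx) * (coarse_flux - fine_flux_avg)
```

Applying `restrict(prolong(c))` returns `c`, which is the property that lets the fine level overwrite covered coarse cells without changing the conserved total.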
Hybrid reconstruction/flux sensors (sim):
hybrid_weno (+ hybrid_weno_eps): linear-optimal reconstruction in smooth cells, full WENO only at flagged discontinuities (Jameson-type density+pressure sensor, stencil-dilated, halo-aware).
hybrid_riemann (+ hybrid_smooth_flux): cheap central/Rusanov flux in smooth cells, full HLLC at discontinuities (5- and 6-equation blocks).
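A Jameson-type sensor with stencil dilation can be sketched as follows. The normalized second-difference formula is the textbook form of the sensor; whether the PR uses exactly this normalization is an assumption, as are the function names and the `eps` and `dilate` arguments (here `eps` plays the role that hybrid_weno_eps presumably has). Halo handling is omitted.

```python
def jameson_sensor(q, tiny=1e-30):
    """Normalized second difference |q[i+1] - 2q[i] + q[i-1]| / (|q[i+1]| + 2|q[i]| + |q[i-1]|)."""
    n = len(q)
    phi = [0.0] * n
    for i in range(1, n - 1):
        num = abs(q[i + 1] - 2.0 * q[i] + q[i - 1])
        den = abs(q[i + 1]) + 2.0 * abs(q[i]) + abs(q[i - 1]) + tiny
        phi[i] = num / den
    return phi


def flag_cells(rho, p, eps, dilate):
    """Flag cells where either sensor exceeds eps, then widen each flag by `dilate` cells.

    Flagged cells would take the full nonlinear scheme (WENO / HLLC);
    unflagged cells would take the cheap linear or central path.
    """
    raw = [max(a, b) > eps for a, b in zip(jameson_sensor(rho), jameson_sensor(p))]
    n = len(raw)
    flagged = [False] * n
    for i, hit in enumerate(raw):
        if hit:
            for j in range(max(0, i - dilate), min(n, i + dilate + 1)):
                flagged[j] = True
    return flagged
```

Dilation matters because a reconstruction stencil centered on a smooth cell can still reach across a nearby discontinuity; widening the flag by the stencil radius keeps the linear scheme away from it.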
Motivation
Measured rank imbalance on heterogeneous-cost workloads (bubbles, IB, phase change)
motivates first-class measurement tools; the active box and hybrid sensors give direct
speedups on localized-flow / mostly-smooth cases; AMR concentrates resolution where the
flow needs it, and the load-balance coupling keeps the refined work spread across ranks.
Testing
5ECBB926 (AMR static patch), 1CBACEB5 (AMR dynamic regrid), 852CCB81 (AMR subcycling), ECABA006 (active_box 3D strict-subset).
pass; load_balance + amr np=2 end-to-end smoke produces the analytically predicted weighted offsets and completes; amr np=2 spanning-patch run completes.
executable) verified locally (see PR checks for the full matrix).
case_validator entries, case.md docs, and module_categories are included.
Known-untested configurations
Delegated to CI: Cray ftn, Intel ifx, AMD flang, OpenMP target offload, single/mixed
precision. Hybrid WENO/Riemann ship without a dedicated golden case (flagging for
reviewer judgment; the sensors are default-off and checker-guarded).
Review guide
The 75 commits are arc-ordered and cleanly arc-separable — reviewing by arc is much
easier than by file:
2760da7d…2bb5fdc4 active-box (11)
bbf6b2a9…14b837c6 load-weight field + contributors (8)
0161fac0…2795e266 SFC partition diagnostic (6)
6df9c1f0…c43c02a5 weighted decomposition (load_balance) (8)
95398eb3…cc7882d1 rank timing (4)
21c60ffa…5082b535 hybrid WENO/Riemann (10)
74b58771…de244407 m_box refactor + validation hygiene (4)
352f564e…03b59516 AMR: static hierarchy → restriction/prolongation → fine advance → refluxing → regrid → subcycling → multi-rank → GPU → mirror decomposition → load-balance coupling (20)
a1a7e3ad merge of upstream/master (num_procs_x/y/z promotion adopted from Fix periodic ib issues #1618)
Addendum: features added after the initial draft
mpp_lim required for num_fluids > 1; shock–material-interface demo validated. Known bounded limitation: alpha-sum deviation up to ~5.7e-3 at coarse cells historically hosting a patch face during shock crossing (non-growing, mpp_lim-damped; the volume-fraction K-term is deliberately not refluxed because it is non-conservative).
Further additions
num_procs required (np-flexible restart is future work).
viscous prohibition lifted; viscous stress/work refluxed through the existing registers (enters rhs as a flux_src_n face-flux difference, same form as advective flux) so coarse/fine boundaries match total flux; energy conservation 0.0, accuracy triplet coarse 2.49e-4 ≫ two-level 6.89e-5 ≈ fine 5.04e-5. A fine-ghost-coordinate bug (viscous gradient using stale coarse dx at the fine subdomain/patch edge; invisible to WENO, which uses only interior dx) was found by an np=2 exactness probe and fixed; the fine viscous seam is now byte-exact across ranks. Residual: a bounded (~1e-6) np-dependence remains only at the coarse/fine patch boundary from prolongation-derived ghost gradients (AMR's inherently-approximate coupling zone); the density-gradient tagger senses shear poorly (buffered/static patch recommended; error-estimator taggers are future work).
Multi-block AMR + terminology
amr_max_blocks (default 4; N fixed-size slots, ~N× device memory; compute efficiency is the goal, memory efficiency a follow-up), amr_cluster_eff (default 0.7). Fine blocks stay ≥ buff_size apart ⇒ no fine–fine coupling; all existing per-block machinery (multi-rank, GPU, subcycle, viscous, multi-fluid) loops over the block list unchanged.
amr_block_beg/end, amr_max_blocks), disambiguated from MFC's initial-condition patch_icpp. (Draft-stage rename; golden values unchanged.)
Euler-Euler bubbles under AMR (SP13)
q_cons) are refluxed by the existing register machinery; prolongation is realizability-preserving (radius moment nR > 0 maintained across coarse→fine, analogous to the multi-fluid volume-fraction closure). Validated: conservation defects ~1e-15 through refluxed+subcycled+regridding advance, moments stay realizable, AMR beats the coarse solution, np=1==np=2 element-exact. Non-polytropic / QBMM / polydisperse / Lagrangian bubbles remain explicitly gated (future work; non-polytropic additionally needs per-block pb/mv handling).
Phase-change (relax) under AMR (SP15)
relax) runs on each fine block before restriction (a new s_amr_relax_fine), so the refined solution is properly relaxed. Cell-local: no reflux, no c/f coupling. Machine-precision conservation, free-stream preserved, np=1==np=2 bit-exact. Config: model_eqns=2, relax=T, num_fluids>1, mpp_lim=T.
Validation hardening (blind spots closed)
Chemistry under AMR (SP16) + surface-tension limitation
Further physics rungs (SP17–SP20)
flux_src is captured by the coarse/fine flux registers (mirroring the viscous path) and refluxed, and the temperature ghost is exchanged at rank seams (the same broadening the reactions fix required). Removed the diffusion prohibit; np=1==np=2 element-exact, machine-precision conservation.
nb ≥ 1) configs, with a per-block moment-realizability floor applied to all positive moments on prolongation. Conservation is machine-precision for polytropic; the non-polytropic source-term model carries a ~7e-10 defect that is np-invariant (identical np=1/np=2: a model property, not an AMR decomposition leak). Removed the polytropic/monodisperse gates.
q_cons and is injected piecewise-constant at prolongation so every fine/ghost child inherits a realizable moment set (variance c20 > 0), keeping the inversion NaN-free; moments reflux/restrict on the standard conservative path. np=2 element-exact, conservation ~1e-15. Non-polytropic QBMM stays gated (its pb/mv quadrature side-state is a global array the fine advance would corrupt through the swap).
amr_regrid_int = 0) is now resolved on the refined level: each fine block carries its own fine-grid IB markers/ghost points computed from the geometry, and the fine advance applies the ghost-cell IB correction per RK stage. The IB forcing is non-conservative by construction at the body (ghost-cell method), while the flux reflux still conserves to machine precision away from it. Moving/multi-body/STL/dynamic-regrid-with-IB remain gated. A body straddling a rank seam is rejected at startup (the fine-IB image-point stencil across the seam is not yet decomposition-exact) rather than silently producing a small surface error; keep the body within a single rank's subdomain.
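The two realizability-preserving prolongation strategies mentioned for the bubble moments (a positivity floor on prolonged moments, and piecewise-constant injection for QBMM) can be sketched in 1D. This is a hypothetical illustration, not the PR's routine: the names are invented, and the slope clamp shown here is one possible way to keep children above a floor while preserving the parent mean. The source only states that a floor is applied.

```python
def inject_constant(coarse):
    """Piecewise-constant 2:1 injection: both children copy the parent.

    Any moment set that is realizable on the parent is trivially realizable
    on the children, since the whole set is copied unchanged.
    """
    return [c for c in coarse for _ in range(2)]


def prolong_positive(coarse, slopes, floor):
    """Linear 2:1 prolongation with the slope clamped so children stay >= floor.

    Children are parent -/+ delta, so their mean is always the parent value.
    A parent already below the floor is injected unchanged (delta = 0).
    """
    fine = []
    for c, s in zip(coarse, slopes):
        max_delta = max(c - floor, 0.0)                  # largest offset that respects the floor
        delta = max(-max_delta, min(max_delta, 0.25 * s))
        fine.extend([c - delta, c + delta])
    return fine
```

Clamping the offset instead of clipping the child value is what keeps the operation conservative: clipping one child alone would change the pair's mean.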