Poc collector design#438
Conversation
Seed the decision log and add the POC brief + collector design under docs/design/ so the implementation and design go up together. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the per-sample max-aggregation engine with the collect ->
buffer -> aggregate-once pipeline end to end:
- monitor_process now drives an Aggregator: each sample buffers raw
readings; at each report interval it aggregates once and writes a
usage record {timestamp, num_samples, measurements{...}}.
- execution_summary is rewritten from the run-level accumulator with
new measurement keys (peak_ps_rss_total, ave_ps_rss_total,
ps_cpu_seconds, peak_cgroup_rss_peak, peak_psutil_pss_total); the old
rss/vsz/pmem/pcpu summary keys are gone.
- New `--measurements` selection flag (default: all available keys;
DUCT_MEASUREMENTS env); unknown/unavailable keys fail fast and clean.
- Schema bump 0.2.2 -> 0.3.0: LS_FIELD_CHOICES and the ls --fields
default track the renamed summary keys; EXECUTION_SUMMARY_FORMAT
updated. plot/ls/pprint logic untouched (out of scope).
- Remove the now-dead _sampling.py and Sample/Averages/ProcessStats;
the unsupported-OS guard moves to _collectors. Old Sample-based tests
replaced by the pipeline unit tests; e2e/execution/schema tests follow
the new record/summary shape.
tox: lint, typing, and 377 tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
add_sample now returns whether a sample was buffered. When ps finds no processes the sample is dropped (not buffered, not counted) so a trailing empty reading cannot skew a delta/total or inflate num_samples. Reporting no longer hinges on a nonzero ps pid count, so a selection of only non-ps (single-read) collectors still produces records. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update resource-collectors.md interface sketch to the built signatures and add an "Implementation status (POC)" section documenting the deliberate deviations (units in cores not percent, ps_cmd, dropped pmem/vsz/pcpu, on-disk shapes, out-of-scope consumers). Record the remaining implementation decisions and the brief's out-of-scope interface-fit notes (io / proc / presets / differential). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #438 +/- ##
==========================================
- Coverage 91.08% 86.94% -4.14%
==========================================
Files 15 16 +1
Lines 1245 1425 +180
Branches 170 211 +41
==========================================
+ Hits 1134 1239 +105
- Misses 77 135 +58
- Partials 34 51 +17 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Right now Proposal: have the psutil collector emit |
|
Counter totals: sum per-pid deltas, not delta-of-sum.
The robust form is sum-of-per-pid-deltas rather than delta-of-sum: accumulate each pid's own Δcputime (last − seed, per pid), so a process that exits still contributes its final delta instead of vanishing. |
|
Stamp the monotonic timestamp per
Fix is small and keeps collectors pure: stamp |
| # Measurement-derived keys always present in execution_summary, so info.json | ||
| # schema does not depend on which collectors happened to be available. | ||
| SUMMARY_KEYS = ( | ||
| "peak_ps_rss_total", |
There was a problem hiding this comment.
i think lets go with ps_peak_rss_total, ie the tool should be first so its easy to .startswith()
|
Batching the rest of the review into one comment. Like the inline comments, these are working notes for the next implementation — not threads that need resolving on this POC. 1. Restore the dropped per-pid measurements:
|
POC — not requesting review yet
Mentioning for visibility; I'm still folding a review pass into the design. When I do ask for feedback, it'll be for direction on the short list of decisions below — not a code review.
Problem: duct's resource stats aren't trustworthy enough for their main job (e.g. sizing SLURM
--mem/--cpusfrom measured runs):ps -o pcpuis a lifetime average so per-pid sums overshoot (#399), and summedrssdouble-counts shared pages.I explored alternative samplers in #415 (cgroup hybrid) and #423 (measurement-time delta-pcpu), and landed a best-effort plot-time correction for existing logs in #424. This PR combines those contexts into a design doc (
docs/design/resource-collectors.md), plus a one-shot agent implementation of that doc — pushed prior to human review. My inline comments flag the agent's known missteps (e.g. dropping fields not in the example) so the next implementation doesn't repeat them; they aren't review threads that need resolving here.Goals:
Known-intentional gaps (follow-ups, not bugs): the schema is replaced wholesale (0.2.2 → 0.3.0, no back-compat fields), and
ls/plot/pprintare not reworked — they break on new-format logs (their tests pass only against old fixtures). Backwards compat and consumer updates are the explicit next chunk.Decisions that will need sign-off (happy to chat about any of these)
max. Is this the shape the real implementation should be sliced into (as several much smaller PRs)?pcpu(the bug), collect cumulativecputimeinstead and derive the rate from it — a user-visible semantic change to the headline CPU number.--measurements ps_rss,cgroup_mem_peak,...) with collectors internal, rather than picking a named sampler bundle.