Poc collector design by asmacdo · Pull Request #438 · con/duct

asmacdo · 2026-06-25T22:20:35Z

POC — not requesting review yet

Mentioning for visibility; I'm still folding a review pass into the design. When I do ask for feedback, it'll be for direction on the short list of decisions below — not a code review.

Problem: duct's resource stats aren't trustworthy enough for their main job (e.g. sizing SLURM --mem/--cpus from measured runs): ps -o pcpu is a lifetime average so per-pid sums overshoot (#399), and summed rss double-counts shared pages.

I explored alternative samplers in #415 (cgroup hybrid) and #423 (measurement-time delta-pcpu), and landed a best-effort plot-time correction for existing logs in #424. This PR combines those contexts into a design doc (docs/design/resource-collectors.md), plus a one-shot agent implementation of that doc — pushed prior to human review. My inline comments flag the agent's known missteps (e.g. dropping fields not in the example) so the next implementation doesn't repeat them; they aren't review threads that need resolving here.

Goals:

stress-test the design doc against real code
stretch the pattern beyond ps: psutil is implemented; io and /proc are analyzed only — we think they should fit, but that's unverified
provide a reference POC to try out
a starting point for the real implementation, which should be broken into much smaller PRs

Known-intentional gaps (follow-ups, not bugs): the schema is replaced wholesale (0.2.2 → 0.3.0, no back-compat fields), and ls/plot/pprint are not reworked — they break on new-format logs (their tests pass only against old fixtures). Backwards compat and consumer updates are the explicit next chunk.

Decisions that will need sign-off (happy to chat about any of these)

The architecture: buffer raw readings and aggregate once per report, behind a collector/measurement interface — replacing the streaming per-sample max. Is this the shape the real implementation should be sliced into (as several much smaller PRs)?
Schema policy: semantics live in field names — a new source/method is a new field, never a new meaning under an old name. And freeze-vs-rename: keep existing field names forever and only add new ones cleanly, or accept a one-time breaking rename to honest names?
The CPU story: stop collecting lifetime pcpu (the bug), collect cumulative cputime instead and derive the rate from it — a user-visible semantic change to the headline CPU number.
The user surface: selection by flat measurement keys (--measurements ps_rss,cgroup_mem_peak,...) with collectors internal, rather than picking a named sampler bundle.

Seed the decision log and add the POC brief + collector design under docs/design/ so the implementation and design go up together. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the per-sample max-aggregation engine with the collect -> buffer -> aggregate-once pipeline end to end: - monitor_process now drives an Aggregator: each sample buffers raw readings; at each report interval it aggregates once and writes a usage record {timestamp, num_samples, measurements{...}}. - execution_summary is rewritten from the run-level accumulator with new measurement keys (peak_ps_rss_total, ave_ps_rss_total, ps_cpu_seconds, peak_cgroup_rss_peak, peak_psutil_pss_total); the old rss/vsz/pmem/pcpu summary keys are gone. - New `--measurements` selection flag (default: all available keys; DUCT_MEASUREMENTS env); unknown/unavailable keys fail fast and clean. - Schema bump 0.2.2 -> 0.3.0: LS_FIELD_CHOICES and the ls --fields default track the renamed summary keys; EXECUTION_SUMMARY_FORMAT updated. plot/ls/pprint logic untouched (out of scope). - Remove the now-dead _sampling.py and Sample/Averages/ProcessStats; the unsupported-OS guard moves to _collectors. Old Sample-based tests replaced by the pipeline unit tests; e2e/execution/schema tests follow the new record/summary shape. tox: lint, typing, and 377 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

add_sample now returns whether a sample was buffered. When ps finds no processes the sample is dropped (not buffered, not counted) so a trailing empty reading cannot skew a delta/total or inflate num_samples. Reporting no longer hinges on a nonzero ps pid count, so a selection of only non-ps (single-read) collectors still produces records. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Update resource-collectors.md interface sketch to the built signatures and add an "Implementation status (POC)" section documenting the deliberate deviations (units in cores not percent, ps_cmd, dropped pmem/vsz/pcpu, on-disk shapes, out-of-scope consumers). Record the remaining implementation decisions and the brief's out-of-scope interface-fit notes (io / proc / presets / differential). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-25T22:22:23Z

Codecov Report

❌ Patch coverage is 79.62466% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.94%. Comparing base (cc80272) to head (a063824).

Files with missing lines	Patch %	Lines
src/con_duct/_collectors.py	66.13%	53 Missing and 11 partials ⚠️
src/con_duct/_aggregation.py	95.94%	2 Missing and 4 partials ⚠️
src/con_duct/_duct_main.py	66.66%	3 Missing and 1 partial ⚠️
src/con_duct/_tracker.py	90.47%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #438      +/-   ##
==========================================
- Coverage   91.08%   86.94%   -4.14%     
==========================================
  Files          15       16       +1     
  Lines        1245     1425     +180     
  Branches      170      211      +41     
==========================================
+ Hits         1134     1239     +105     
- Misses         77      135      +58     
- Partials       34       51      +17

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

asmacdo · 2026-06-25T22:50:44Z

psutil_pdcpu: use psutil's cpu_percent() instead of re-deriving the rate from raw cpu_times().

Right now PsutilCollector reads raw cpu_times() and runs it through our own derive=rate (_per_pid_rate in _aggregation.py). But our rate over psutil's cpu_times computes the same thing psutil.Process.cpu_percent() already does (Δcputime/Δt) — the numbers come out ~identical. So re-deriving it ourselves buys nothing numerically, while cpu_percent() gives us psutil's battle-tested, maintained implementation for free — a real reason to value psutil as a distinct source rather than a redundant one.

Proposal: have the psutil collector emit cpu_percent() directly. It's a contained exception to "collectors are pure" — the collector holds a Process cache to prime cpu_percent, but the measurement stays a clean reduce=max with no derive (the aggregator just maxes the per-sample values). One alignment point: cpu_percent() is in percent (100 = one core), so normalize it to whatever units ps_pdcpu settles on (see the cores-vs-percent question).

asmacdo · 2026-06-25T22:50:45Z

Counter totals: sum per-pid deltas, not delta-of-sum.

ps_cpu_seconds (the reduce=delta single path in _aggregation._single) diffs the summed cumulative cputime across the window. When a pid that had accrued cputime exits mid-window, the summed total drops below the seed, the delta goes negative, and it's clamped to 0 — silently undercounting CPU for that window.

The robust form is sum-of-per-pid-deltas rather than delta-of-sum: accumulate each pid's own Δcputime (last − seed, per pid), so a process that exits still contributes its final delta instead of vanishing. delta-of-sum is only correct while the process set is stable; sum-of-per-pid-deltas has no such constraint.

asmacdo · 2026-06-25T22:50:46Z

Stamp the monotonic timestamp per collect() call, not once per sample.

Aggregator.add_sample() takes one time.monotonic() at the top and applies it to every collector's reading for that sample. With more than one per-sample collector (e.g. ps + psutil), they're collect()-ed sequentially, so the later collector's reading is actually taken ~one-ps-call after the timestamp it's stamped with. For rates (Δvalue/Δt) that introduces jitter / a systematic offset between a value and its paired time.

Fix is small and keeps collectors pure: stamp monotonic() per collect() call (right before each), instead of once per sample. The per-sample wall timestamp on the record can stay shared — only the Δt-bearing monotonic needs to be per-collector.

asmacdo · 2026-06-25T23:12:08Z

+# Measurement-derived keys always present in execution_summary, so info.json
+# schema does not depend on which collectors happened to be available.
+SUMMARY_KEYS = (
+    "peak_ps_rss_total",


i think lets go with ps_peak_rss_total, ie the tool should be first so its easy to .startswith()

asmacdo · 2026-07-01T22:25:09Z

Batching the rest of the review into one comment. Like the inline comments, these are working notes for the next implementation — not threads that need resolving on this POC.

1. Restore the dropped per-pid measurements: `pmem`, `vsz`, `etime`, `stat`

The new ps collector trimmed -o to pid,rss,cputime,cmd, dropping pmem, vsz, etime, and stat entirely. These all come from the same single ps call, so restoring them is just extra -o columns + measurement entries — no added cost. Proposed keys:

ps_vsz (per_pid, max) + ps_vsz_total (single, max) — virtual memory size
ps_pmem (per_pid, max) + ps_pmem_total (single, max) — %MEM (ps_pmem_total is a sum of percentages, matching the old total_pmem)
ps_etime (per_pid, last) — per-process elapsed wall time (not summable, so per_pid only)
ps_stat (per_pid) — process state codes, including Z (zombie)

ps_stat is the one that needs new machinery: it's categorical, and the old code aggregated it as a Counter of state codes over the window (so you could see e.g. how many samples a pid spent zombie/uninterruptible). The current Reduce enum (max/delta/last) can't express that — it needs a categorical/tally reduce (a histogram of observed states, or a set). ps_stat is the worked example for that reduce type. (ps_cmd-as-string is a half-step toward non-numeric, but doesn't cover the tally case.)

Restoring ps_etime also fixes the seedless-newborn gap: a pid's first reading has no previous value to diff against, so ps_pdcpu is currently None for it. With etime available, the seedless case can fall back to cputime / etime — the lifetime-average rate, which for a pid born within the window is exactly its rate since birth. Better than emitting nothing.

Not restored, on purpose: pcpu (lifetime %CPU) — it's replaced by ps_pdcpu, since the lifetime average is the bug being fixed. rss/cmd are already kept.

2. Ratify the `pdcpu` units: percent, not cores

ps_pdcpu is now a generic rate = cputime-seconds per wall-second — i.e. cores (1.0 = one core) — where the old pcpu was percent (100 = one core). That's a user-visible semantic change worth ratifying explicitly rather than letting it slip in unannounced.

Cores is arguably the more honest unit, but percent should win here: keeping percent avoids a needless breaking change to the one CPU number people already read, and it matches what psutil's cpu_percent() emits. Recommend: keep the generic rate as the internal derive (future io/proc rates reuse it) and scale ×100 at the measurement edge so the emitted pdcpu stays percent.

3. cgroup v1 path is unverified — and v1 is the production target

The cgroup collector resolves both v2 (memory.peak) and v1 (memory.max_usage_in_bytes) paths, but only v2 was exercised: the POC container is v2, and the decision log notes the v1 path is code-only/untested. The intended production target — the SLURM cluster (RHEL8) — is cgroup v1, so the v1 reader is the one that actually matters there, and it's currently unverified.

Note that discovery differs on v1 too, not just the filenames: v2 is one 0::/path line in /proc/self/cgroup under a unified mount, while v1 has one N:controller:/path line per controller with separate per-controller mounts (and on SLURM the controllers land in different scopes — e.g. memory is job-scoped while cpuacct is not). So both halves — finding the right cgroup and reading it — need verification on real v1 hardware before relying on it.

4. Design refinements this review pass surfaced

cgroup_rss_peak is misnamed — it isn't RSS. cgroup memory (v2 memory.peak / v1 memory.max_usage_in_bytes) is charged memory including reclaimable page cache, a genuinely different quantity from process RSS — they can diverge a lot on IO-heavy jobs. Overstating is the safe direction for --mem, but the name must not say rss. Rename to cgroup_mem_peak. The first field named under the semantics-in-names discipline shouldn't itself be an example of the overloading that discipline exists to prevent.
The design doc needs a "Schema and compatibility" section. The most contested surface is currently one clause ("namespacing keeps sources from colliding"). The position worth stating so it can be reacted to: semantics live in field names, not in an info.json tag (a tag is ignore-by-default — a consumer that skips it silently treats cgroup numbers as ps numbers); consumers need a documented preference order (e.g. cgroup_mem_peak if present, else ps_rss_total); ls/plot/pp must be method-aware and never silently collapse differently-sourced columns; and freeze-vs-rename of the existing fields is the open back-compat call.
Present-iff-measured, everywhere. usage.jsonl records carry only selected+available keys, but execution_summary always emits the full key set with nulls (chosen to keep test_schema deterministic across environments). That puts the two files on opposite conventions. Pick the principle over the test: a field is present iff that exact measurement was taken, in both files; make the test environment-aware instead. Related: record the resolved selection + collector availability in info.json as provenance — the audit trail a large sweep needs regardless.
Name the value-type axis. Three data points are discovering a dimension one exception at a time: ps_cmd is a string label with reduce=last (decision log flags it "mild stretch, revisit"); ps_stat, if restored, needs a categorical/tally reduce; everything else is a numeric gauge or counter. Naming it in the model — roughly gauge / counter / label / categorical — buys legal-reduce checking (a label can't max; a counter shouldn't be read as a level) and makes the tally machinery the fourth type instead of a special case.
Liveness doesn't belong in the aggregator. add_sample() hardcodes if name == "ps" as the session-liveness signal — a tracker concern expressed as a collector-name comparison inside the aggregation layer, and a psutil-only selection silently loses liveness detection. Next implementation: liveness is a capability any per-pid collector can report ("session empty"), or it stays in the tracker entirely.
State a units convention. rss is converted to bytes at collect and cputime to seconds (good); pdcpu units are in point 2 above — but the design doc never states the rule. A unit attribute on Measurement (or a unit-in-name convention) is cheap now and pays off when the plot/ls rework needs axis labels and users need to not misread KiB/bytes/percent/cores.

asmacdo and others added 4 commits June 25, 2026 15:44

docs: add collector POC design docs and decision log

e4446ab

Seed the decision log and add the POC brief + collector design under docs/design/ so the implementation and design go up together. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

asmacdo commented Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Poc collector design#438

Poc collector design#438
asmacdo wants to merge 4 commits into
con:mainfrom
asmacdo:poc-collector-design

asmacdo commented Jun 25, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 25, 2026

Uh oh!

asmacdo commented Jun 25, 2026

Uh oh!

asmacdo commented Jun 25, 2026

Uh oh!

asmacdo commented Jun 25, 2026

Uh oh!

asmacdo Jun 25, 2026

Uh oh!

asmacdo commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

asmacdo commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

POC — not requesting review yet

Decisions that will need sign-off (happy to chat about any of these)

Uh oh!

codecov Bot commented Jun 25, 2026

Codecov Report

Uh oh!

asmacdo commented Jun 25, 2026

Uh oh!

asmacdo commented Jun 25, 2026

Uh oh!

asmacdo commented Jun 25, 2026

Uh oh!

asmacdo Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

asmacdo commented Jul 1, 2026

1. Restore the dropped per-pid measurements: pmem, vsz, etime, stat

2. Ratify the pdcpu units: percent, not cores

3. cgroup v1 path is unverified — and v1 is the production target

4. Design refinements this review pass surfaced

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

asmacdo commented Jun 25, 2026 •

edited

Loading

1. Restore the dropped per-pid measurements: `pmem`, `vsz`, `etime`, `stat`

2. Ratify the `pdcpu` units: percent, not cores