Skip to content

Poc collector design#438

Draft
asmacdo wants to merge 4 commits into
con:mainfrom
asmacdo:poc-collector-design
Draft

Poc collector design#438
asmacdo wants to merge 4 commits into
con:mainfrom
asmacdo:poc-collector-design

Conversation

@asmacdo

@asmacdo asmacdo commented Jun 25, 2026

Copy link
Copy Markdown
Member

POC — not requesting review yet

Mentioning for visibility; I'm still folding a review pass into the design. When I do ask for feedback, it'll be for direction on the short list of decisions below — not a code review.

Problem: duct's resource stats aren't trustworthy enough for their main job (e.g. sizing SLURM --mem/--cpus from measured runs): ps -o pcpu is a lifetime average so per-pid sums overshoot (#399), and summed rss double-counts shared pages.

I explored alternative samplers in #415 (cgroup hybrid) and #423 (measurement-time delta-pcpu), and landed a best-effort plot-time correction for existing logs in #424. This PR combines those contexts into a design doc (docs/design/resource-collectors.md), plus a one-shot agent implementation of that doc — pushed prior to human review. My inline comments flag the agent's known missteps (e.g. dropping fields not in the example) so the next implementation doesn't repeat them; they aren't review threads that need resolving here.

Goals:

  • stress-test the design doc against real code
  • stretch the pattern beyond ps: psutil is implemented; io and /proc are analyzed only — we think they should fit, but that's unverified
  • provide a reference POC to try out
  • a starting point for the real implementation, which should be broken into much smaller PRs

Known-intentional gaps (follow-ups, not bugs): the schema is replaced wholesale (0.2.2 → 0.3.0, no back-compat fields), and ls/plot/pprint are not reworked — they break on new-format logs (their tests pass only against old fixtures). Backwards compat and consumer updates are the explicit next chunk.

Decisions that will need sign-off (happy to chat about any of these)

  1. The architecture: buffer raw readings and aggregate once per report, behind a collector/measurement interface — replacing the streaming per-sample max. Is this the shape the real implementation should be sliced into (as several much smaller PRs)?
  2. Schema policy: semantics live in field names — a new source/method is a new field, never a new meaning under an old name. And freeze-vs-rename: keep existing field names forever and only add new ones cleanly, or accept a one-time breaking rename to honest names?
  3. The CPU story: stop collecting lifetime pcpu (the bug), collect cumulative cputime instead and derive the rate from it — a user-visible semantic change to the headline CPU number.
  4. The user surface: selection by flat measurement keys (--measurements ps_rss,cgroup_mem_peak,...) with collectors internal, rather than picking a named sampler bundle.

asmacdo and others added 4 commits June 25, 2026 15:44
Seed the decision log and add the POC brief + collector design under
docs/design/ so the implementation and design go up together.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the per-sample max-aggregation engine with the collect ->
buffer -> aggregate-once pipeline end to end:

- monitor_process now drives an Aggregator: each sample buffers raw
  readings; at each report interval it aggregates once and writes a
  usage record {timestamp, num_samples, measurements{...}}.
- execution_summary is rewritten from the run-level accumulator with
  new measurement keys (peak_ps_rss_total, ave_ps_rss_total,
  ps_cpu_seconds, peak_cgroup_rss_peak, peak_psutil_pss_total); the old
  rss/vsz/pmem/pcpu summary keys are gone.
- New `--measurements` selection flag (default: all available keys;
  DUCT_MEASUREMENTS env); unknown/unavailable keys fail fast and clean.
- Schema bump 0.2.2 -> 0.3.0: LS_FIELD_CHOICES and the ls --fields
  default track the renamed summary keys; EXECUTION_SUMMARY_FORMAT
  updated. plot/ls/pprint logic untouched (out of scope).
- Remove the now-dead _sampling.py and Sample/Averages/ProcessStats;
  the unsupported-OS guard moves to _collectors. Old Sample-based tests
  replaced by the pipeline unit tests; e2e/execution/schema tests follow
  the new record/summary shape.

tox: lint, typing, and 377 tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
add_sample now returns whether a sample was buffered. When ps finds no
processes the sample is dropped (not buffered, not counted) so a trailing
empty reading cannot skew a delta/total or inflate num_samples. Reporting
no longer hinges on a nonzero ps pid count, so a selection of only non-ps
(single-read) collectors still produces records.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update resource-collectors.md interface sketch to the built signatures
and add an "Implementation status (POC)" section documenting the
deliberate deviations (units in cores not percent, ps_cmd, dropped
pmem/vsz/pcpu, on-disk shapes, out-of-scope consumers). Record the
remaining implementation decisions and the brief's out-of-scope
interface-fit notes (io / proc / presets / differential).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 79.62466% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.94%. Comparing base (cc80272) to head (a063824).

Files with missing lines Patch % Lines
src/con_duct/_collectors.py 66.13% 53 Missing and 11 partials ⚠️
src/con_duct/_aggregation.py 95.94% 2 Missing and 4 partials ⚠️
src/con_duct/_duct_main.py 66.66% 3 Missing and 1 partial ⚠️
src/con_duct/_tracker.py 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #438      +/-   ##
==========================================
- Coverage   91.08%   86.94%   -4.14%     
==========================================
  Files          15       16       +1     
  Lines        1245     1425     +180     
  Branches      170      211      +41     
==========================================
+ Hits         1134     1239     +105     
- Misses         77      135      +58     
- Partials       34       51      +17     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@asmacdo

asmacdo commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

psutil_pdcpu: use psutil's cpu_percent() instead of re-deriving the rate from raw cpu_times().

Right now PsutilCollector reads raw cpu_times() and runs it through our own derive=rate (_per_pid_rate in _aggregation.py). But our rate over psutil's cpu_times computes the same thing psutil.Process.cpu_percent() already does (Δcputime/Δt) — the numbers come out ~identical. So re-deriving it ourselves buys nothing numerically, while cpu_percent() gives us psutil's battle-tested, maintained implementation for free — a real reason to value psutil as a distinct source rather than a redundant one.

Proposal: have the psutil collector emit cpu_percent() directly. It's a contained exception to "collectors are pure" — the collector holds a Process cache to prime cpu_percent, but the measurement stays a clean reduce=max with no derive (the aggregator just maxes the per-sample values). One alignment point: cpu_percent() is in percent (100 = one core), so normalize it to whatever units ps_pdcpu settles on (see the cores-vs-percent question).

@asmacdo

asmacdo commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

Counter totals: sum per-pid deltas, not delta-of-sum.

ps_cpu_seconds (the reduce=delta single path in _aggregation._single) diffs the summed cumulative cputime across the window. When a pid that had accrued cputime exits mid-window, the summed total drops below the seed, the delta goes negative, and it's clamped to 0 — silently undercounting CPU for that window.

The robust form is sum-of-per-pid-deltas rather than delta-of-sum: accumulate each pid's own Δcputime (last − seed, per pid), so a process that exits still contributes its final delta instead of vanishing. delta-of-sum is only correct while the process set is stable; sum-of-per-pid-deltas has no such constraint.

@asmacdo

asmacdo commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

Stamp the monotonic timestamp per collect() call, not once per sample.

Aggregator.add_sample() takes one time.monotonic() at the top and applies it to every collector's reading for that sample. With more than one per-sample collector (e.g. ps + psutil), they're collect()-ed sequentially, so the later collector's reading is actually taken ~one-ps-call after the timestamp it's stamped with. For rates (Δvalue/Δt) that introduces jitter / a systematic offset between a value and its paired time.

Fix is small and keeps collectors pure: stamp monotonic() per collect() call (right before each), instead of once per sample. The per-sample wall timestamp on the record can stay shared — only the Δt-bearing monotonic needs to be per-collector.

# Measurement-derived keys always present in execution_summary, so info.json
# schema does not depend on which collectors happened to be available.
SUMMARY_KEYS = (
"peak_ps_rss_total",

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think lets go with ps_peak_rss_total, ie the tool should be first so its easy to .startswith()

@asmacdo

asmacdo commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Batching the rest of the review into one comment. Like the inline comments, these are working notes for the next implementation — not threads that need resolving on this POC.

1. Restore the dropped per-pid measurements: pmem, vsz, etime, stat

The new ps collector trimmed -o to pid,rss,cputime,cmd, dropping pmem, vsz, etime, and stat entirely. These all come from the same single ps call, so restoring them is just extra -o columns + measurement entries — no added cost. Proposed keys:

  • ps_vsz (per_pid, max) + ps_vsz_total (single, max) — virtual memory size
  • ps_pmem (per_pid, max) + ps_pmem_total (single, max) — %MEM (ps_pmem_total is a sum of percentages, matching the old total_pmem)
  • ps_etime (per_pid, last) — per-process elapsed wall time (not summable, so per_pid only)
  • ps_stat (per_pid) — process state codes, including Z (zombie)

ps_stat is the one that needs new machinery: it's categorical, and the old code aggregated it as a Counter of state codes over the window (so you could see e.g. how many samples a pid spent zombie/uninterruptible). The current Reduce enum (max/delta/last) can't express that — it needs a categorical/tally reduce (a histogram of observed states, or a set). ps_stat is the worked example for that reduce type. (ps_cmd-as-string is a half-step toward non-numeric, but doesn't cover the tally case.)

Restoring ps_etime also fixes the seedless-newborn gap: a pid's first reading has no previous value to diff against, so ps_pdcpu is currently None for it. With etime available, the seedless case can fall back to cputime / etime — the lifetime-average rate, which for a pid born within the window is exactly its rate since birth. Better than emitting nothing.

Not restored, on purpose: pcpu (lifetime %CPU) — it's replaced by ps_pdcpu, since the lifetime average is the bug being fixed. rss/cmd are already kept.

2. Ratify the pdcpu units: percent, not cores

ps_pdcpu is now a generic rate = cputime-seconds per wall-second — i.e. cores (1.0 = one core) — where the old pcpu was percent (100 = one core). That's a user-visible semantic change worth ratifying explicitly rather than letting it slip in unannounced.

Cores is arguably the more honest unit, but percent should win here: keeping percent avoids a needless breaking change to the one CPU number people already read, and it matches what psutil's cpu_percent() emits. Recommend: keep the generic rate as the internal derive (future io/proc rates reuse it) and scale ×100 at the measurement edge so the emitted pdcpu stays percent.

3. cgroup v1 path is unverified — and v1 is the production target

The cgroup collector resolves both v2 (memory.peak) and v1 (memory.max_usage_in_bytes) paths, but only v2 was exercised: the POC container is v2, and the decision log notes the v1 path is code-only/untested. The intended production target — the SLURM cluster (RHEL8) — is cgroup v1, so the v1 reader is the one that actually matters there, and it's currently unverified.

Note that discovery differs on v1 too, not just the filenames: v2 is one 0::/path line in /proc/self/cgroup under a unified mount, while v1 has one N:controller:/path line per controller with separate per-controller mounts (and on SLURM the controllers land in different scopes — e.g. memory is job-scoped while cpuacct is not). So both halves — finding the right cgroup and reading it — need verification on real v1 hardware before relying on it.

4. Design refinements this review pass surfaced

  • cgroup_rss_peak is misnamed — it isn't RSS. cgroup memory (v2 memory.peak / v1 memory.max_usage_in_bytes) is charged memory including reclaimable page cache, a genuinely different quantity from process RSS — they can diverge a lot on IO-heavy jobs. Overstating is the safe direction for --mem, but the name must not say rss. Rename to cgroup_mem_peak. The first field named under the semantics-in-names discipline shouldn't itself be an example of the overloading that discipline exists to prevent.

  • The design doc needs a "Schema and compatibility" section. The most contested surface is currently one clause ("namespacing keeps sources from colliding"). The position worth stating so it can be reacted to: semantics live in field names, not in an info.json tag (a tag is ignore-by-default — a consumer that skips it silently treats cgroup numbers as ps numbers); consumers need a documented preference order (e.g. cgroup_mem_peak if present, else ps_rss_total); ls/plot/pp must be method-aware and never silently collapse differently-sourced columns; and freeze-vs-rename of the existing fields is the open back-compat call.

  • Present-iff-measured, everywhere. usage.jsonl records carry only selected+available keys, but execution_summary always emits the full key set with nulls (chosen to keep test_schema deterministic across environments). That puts the two files on opposite conventions. Pick the principle over the test: a field is present iff that exact measurement was taken, in both files; make the test environment-aware instead. Related: record the resolved selection + collector availability in info.json as provenance — the audit trail a large sweep needs regardless.

  • Name the value-type axis. Three data points are discovering a dimension one exception at a time: ps_cmd is a string label with reduce=last (decision log flags it "mild stretch, revisit"); ps_stat, if restored, needs a categorical/tally reduce; everything else is a numeric gauge or counter. Naming it in the model — roughly gauge / counter / label / categorical — buys legal-reduce checking (a label can't max; a counter shouldn't be read as a level) and makes the tally machinery the fourth type instead of a special case.

  • Liveness doesn't belong in the aggregator. add_sample() hardcodes if name == "ps" as the session-liveness signal — a tracker concern expressed as a collector-name comparison inside the aggregation layer, and a psutil-only selection silently loses liveness detection. Next implementation: liveness is a capability any per-pid collector can report ("session empty"), or it stays in the tracker entirely.

  • State a units convention. rss is converted to bytes at collect and cputime to seconds (good); pdcpu units are in point 2 above — but the design doc never states the rule. A unit attribute on Measurement (or a unit-in-name convention) is cheap now and pays off when the plot/ls rework needs axis labels and users need to not misread KiB/bytes/percent/cores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant