[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5
[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5varunursekar wants to merge 3 commits into
Conversation
- Mode B (runner.py): `HarborRunner`, an `EvalStrategy` that — for each candidate — runs a *nested* `harbor run` of the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into vero `SampleResult`s. One Harbor task = one sample; inference is fully delegated, scoring comes from Harbor's verifier. - The compiler (build/): `vero harbor build` renders a `BuildConfig` into a runnable Harbor task directory — a Docker Compose environment (optimizer workbench `main` + the eval sidecar + three volumes), two Dockerfiles, instruction.md, tests/test.sh, and the seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Supports Mode A (local dataset/scorer) and Mode B (a registry or local Harbor benchmark, passed through to the HarborConfig). - `.gitignore`: un-ignore src/vero/harbor/build/ (the repo's `build/` rule was hiding the compiler package). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| # Execute | ||
| # ------------------------------------------------------------------ | ||
|
|
||
| def _build_command( |
There was a problem hiding this comment.
HarborConfig.n_attempts / max_retries are typed fields, which signals they are honored, but _build_command never emits flags for them — so every nested run uses harbor defaults. On a budget-graded run that is a real cost/correctness bug. Either map them to the real harbor run flags here, or drop the fields. (Confirms Greptile's P1.)
| continue | ||
| task_name = data.get("task_name") | ||
| if task_name: | ||
| trials[task_name] = data |
There was a problem hiding this comment.
trials[task_name] = data is last-write-wins over an unordered rglob; with retries or a reused jobs_dir, a failing attempt can clobber a passing one, feeding a wrong score into auto_best. Collect all matches and pick deterministically (latest mtime / highest attempt index), and/or namespace jobs_dir per attempt. (Confirms Greptile's P1.)
| values = [float(v) for v in rewards.values()] | ||
| return sum(values) / len(values) if values else 0.0 | ||
|
|
||
| def _existing(self, params: EvaluationParameters, sample_id: int) -> SampleResult | None: |
There was a problem hiding this comment.
_existing treats a persisted error sample as 'done', so a sample whose nested run failed once is permanently skipped on resume and silently degrades the candidate's score. Either don't persist error samples, or make the pending/skip predicate treat error results as not-done.
| ["tar", "xf", "-", "--strip-components", str(strip)], | ||
| cwd=dest, stdin=archive.stdout, check=True, | ||
| ) | ||
| archive.wait() |
There was a problem hiding this comment.
archive.wait()'s return code is discarded; if git archive fails, tar can still exit 0 on a truncated stream and you commit a near-empty baseline whose SHA gets baked into both images. Assert archive.wait() == 0 (and reap the Popen on the tar-failure path). (Confirms Greptile.)
| environment: | ||
| VERO_HOME_DIR: "/opt/vero_home" | ||
| {% for secret in secrets %} | ||
| {{ secret }}: "${{ '{' }}{{ secret }}{{ '}' }}" |
There was a problem hiding this comment.
This renders SECRET: "${SECRET}", resolved from the run host's env, not anything the compiler validates. An unset var interpolates to empty and the sidecar comes up credential-less, failing every eval opaquely. Consider failing the build / emitting a required-secrets manifest when a declared secret is missing. Note test_rendered_files_parse only asserts the literal ${...} string, which hides this.
| # ...except locked paths (e.g. the scorer): root-owned + unwritable. | ||
| if [ -e "/work/agent/{{ p }}" ]; then | ||
| chown -R root:root "/work/agent/{{ p }}" | ||
| chmod -R a-w "/work/agent/{{ p }}" |
There was a problem hiding this comment.
This chown root:root + chmod -R a-w applies to the working tree under /work/agent, but the verifier never executes the working tree: _transfer_commit fetches git blobs and the evaluator checks the commit out into a fresh temp copy. The agent can git add -f a modified scorer and commit it regardless of working-tree perms. So read_only_paths is not a tamper control — please document it as advisory-only and move scorer-provenance enforcement into the sidecar (see the Mode A scorer-provenance comment on #4).
…mkdtemp cleanup Mode B runner + build-compiler robustness (review findings on PR #5): - runner: emit --n-attempts / --max-retries (the typed HarborConfig fields were silently dropped); pick the best trial per task deterministically instead of last-write-wins over an unordered rglob; treat a persisted error sample as not-done so a transient failure is re-run on resume. - compiler: assert git archive exited 0 (and reap it) so a truncated stream cannot bake a near-empty baseline; validate declared secrets at build time and render compose secrets with a fail-fast guard so an unset host var fails loudly instead of producing a credential-less sidecar. - seed.sh: document that read_only_paths is advisory only and scorer provenance is enforced sidecar-side. - compiler: stage the dataset in a cleaned-up TemporaryDirectory (Greptile: the mkdtemp scratch dir was leaking, datasets can be gigabytes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(harbor): nested-run flags/dedup/resume, archive + secret guards [fixes 3/4 compiler]
Draft · Stack 3 of 4 — targets
harbor-2-sidecar.HarborRunner, anEvalStrategythat — per candidate — runs a nestedharbor runof the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into veroSampleResults. One Harbor task = one sample; inference is delegated, scoring comes from Harbor.vero harbor buildrenders aBuildConfiginto a runnable Harbor task — a Docker Compose env (optimizer workbench + eval sidecar + 3 volumes), two Dockerfiles,instruction.md,tests/test.sh, seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar'sServeConfig. Mode A and Mode B.Review focus: the nested-run + collation, and the trust-boundary plumbing baked into the generated compose (volumes,
root:600token, secrets → sidecar only). Also un-ignoressrc/vero/harbor/build/(the repo'sbuild/rule was hiding the package).Stack: [1/4] core → [2/4] sidecar → this → [4/4] docs.
🤖 Generated with Claude Code
Greptile Summary
This PR implements Mode B (
HarborRunner) — a nestedharbor runeval strategy that collates trialresult.jsonfiles intoSampleResults — and thevero harbor buildcompiler that renders aBuildConfiginto a runnable Harbor task directory (Docker Compose environment, two Dockerfiles, templates, and baked dataset/baseline/vero source). The stack-2 issues flagged in previous reviews (git archive exit-code,mkdtempleak, last-write-wins on retries, and silentn_attempts/max_retriesdrop) are all addressed in this revision.HarborRunner(runner.py): nestedharbor runexecution with resume logic, best-trial selection via_trial_rank(clean+rewards > any rewards > latest timestamp), and gap-filling collation that converts missing trials to errorSampleResults.build/compiler.py+ templates):compile_taskmaterialisesBuildConfiginto a deterministic task directory; trust-boundary wiring (secrets → sidecar only via${VAR:?}compose guards,root:600admin token, advisoryread_only_pathsinseed.sh) is clearly documented and tested.Confidence Score: 5/5
Safe to merge; all previously-flagged defects are resolved and the remaining findings are minor edge-case gaps that do not affect the normal build or evaluation path.
All previously-flagged defects (git archive exit-code check, temp-dir leak, last-write-wins on retried tasks, silent n_attempts/max_retries drop) are resolved and covered by new tests in TestReviewFixes. The remaining findings are a missing Mode A task validation that fails loudly at sidecar startup, a TOML single-quote edge case requiring deliberately unusual formatting, and a shell-quoting gap in seed.sh.j2 unlikely with realistic file paths. None affect the happy path.
vero/src/vero/harbor/build/compiler.py for the Mode A task validation gap and the single-quote TOML regex; vero/src/vero/harbor/build/templates/seed.sh.j2 for the read_only_paths quoting.
Important Files Changed
_build_commandnow includes--n-attempts/--max-retries,_load_trialsuses_trial_rankfor deterministic best-trial selection, and_is_donecorrectly re-runs persisted error samples. Logic is sound and well-tested.config.taskin Mode A and single-quoted TOML path handling in_rewrite_vero_source_pathare minor new gaps.from_fileresolves relative paths against the config directory. Well-structured.read_only_pathsvalues are interpolated into double-quoted shell strings without escaping.${VAR:?}fail-fast guards, mounts token volume read-only on main, and gates main startup on the sidecar health check.Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant Caller as EvalFramework participant HR as HarborRunner participant FS as Filesystem (jobs_dir) participant Harbor as harbor CLI (nested run) participant Sidecar as eval-sidecar (per task) participant Store as vero SampleResult store Caller->>HR: produce_sample_results(workspace, params, result_dir) HR->>HR: _task_names_for(params) HR->>Store: _is_done? (filter pending) HR->>Harbor: uv run harbor run --n-attempts N --max-retries R -i task1 ... --jobs-dir jobs_dir loop per task Harbor->>Sidecar: run task Sidecar-->>FS: write result.json end Harbor-->>HR: exit code (non-zero is non-fatal) HR->>FS: _load_trials (rglob, pick best by _trial_rank) loop per (sample_id, task_name) HR->>Store: _is_done? skip if succeeded HR->>HR: _sample_result(trial, sample_id) HR->>Store: save_sample_result end%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant Caller as EvalFramework participant HR as HarborRunner participant FS as Filesystem (jobs_dir) participant Harbor as harbor CLI (nested run) participant Sidecar as eval-sidecar (per task) participant Store as vero SampleResult store Caller->>HR: produce_sample_results(workspace, params, result_dir) HR->>HR: _task_names_for(params) HR->>Store: _is_done? (filter pending) HR->>Harbor: uv run harbor run --n-attempts N --max-retries R -i task1 ... --jobs-dir jobs_dir loop per task Harbor->>Sidecar: run task Sidecar-->>FS: write result.json end Harbor-->>HR: exit code (non-zero is non-fatal) HR->>FS: _load_trials (rglob, pick best by _trial_rank) loop per (sample_id, task_name) HR->>Store: _is_done? skip if succeeded HR->>HR: _sample_result(trial, sample_id) HR->>Store: save_sample_result endComments Outside Diff (2)
vero/src/vero/harbor/runner.py, line 739-752 (link)n_attemptsandmax_retriessilently dropped from harbor runHarborConfigexposesn_attemptsandmax_retriesas typed fields (notextra_argspass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in_build_command, so everyharbor runinvocation uses whatever the harbor defaults are regardless of what the caller configured. A user who setsmax_retries=0to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.Prompt To Fix With AI
vero/src/vero/harbor/runner.py, line 799-813 (link)task_namewhen harbor retries a taskrglob("result.json")visits all result files across all timestamp directories underjobs_dir. If harbor'sn_attemptsormax_retriescauses the same task to produce multiple trial entries with the sametask_name,trials[task_name] = datasimply overwrites with whichever filerglobhappens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.Prompt To Fix With AI
Reviews (2): Last reviewed commit: "Merge pull request #9 from scaleapi/harb..." | Re-trigger Greptile