[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler by varunursekar · Pull Request #5 · scaleapi/vero

varunursekar · 2026-06-24T18:12:44Z

Draft · Stack 3 of 4 — targets harbor-2-sidecar.

Mode B (runner.py): HarborRunner, an EvalStrategy that — per candidate — runs a nested harbor run of the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into vero SampleResults. One Harbor task = one sample; inference is delegated, scoring comes from Harbor.
The compiler (build/): vero harbor build renders a BuildConfig into a runnable Harbor task — a Docker Compose env (optimizer workbench + eval sidecar + 3 volumes), two Dockerfiles, instruction.md, tests/test.sh, seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Mode A and Mode B.

Review focus: the nested-run + collation, and the trust-boundary plumbing baked into the generated compose (volumes, root:600 token, secrets → sidecar only). Also un-ignores src/vero/harbor/build/ (the repo's build/ rule was hiding the package).

Stack: [1/4] core → [2/4] sidecar → this → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR implements Mode B (HarborRunner) — a nested harbor run eval strategy that collates trial result.json files into SampleResults — and the vero harbor build compiler that renders a BuildConfig into a runnable Harbor task directory (Docker Compose environment, two Dockerfiles, templates, and baked dataset/baseline/vero source). The stack-2 issues flagged in previous reviews (git archive exit-code, mkdtemp leak, last-write-wins on retries, and silent n_attempts/max_retries drop) are all addressed in this revision.

HarborRunner (runner.py): nested harbor run execution with resume logic, best-trial selection via _trial_rank (clean+rewards > any rewards > latest timestamp), and gap-filling collation that converts missing trials to error SampleResults.
Compiler (build/compiler.py + templates): compile_task materialises BuildConfig into a deterministic task directory; trust-boundary wiring (secrets → sidecar only via ${VAR:?} compose guards, root:600 admin token, advisory read_only_paths in seed.sh) is clearly documented and tested.

Confidence Score: 5/5

Safe to merge; all previously-flagged defects are resolved and the remaining findings are minor edge-case gaps that do not affect the normal build or evaluation path.

All previously-flagged defects (git archive exit-code check, temp-dir leak, last-write-wins on retried tasks, silent n_attempts/max_retries drop) are resolved and covered by new tests in TestReviewFixes. The remaining findings are a missing Mode A task validation that fails loudly at sidecar startup, a TOML single-quote edge case requiring deliberately unusual formatting, and a shell-quoting gap in seed.sh.j2 unlikely with realistic file paths. None affect the happy path.

vero/src/vero/harbor/build/compiler.py for the Mode A task validation gap and the single-quote TOML regex; vero/src/vero/harbor/build/templates/seed.sh.j2 for the read_only_paths quoting.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/runner.py	Mode B runner: `_build_command` now includes `--n-attempts`/`--max-retries`, `_load_trials` uses `_trial_rank` for deterministic best-trial selection, and `_is_done` correctly re-runs persisted error samples. Logic is sound and well-tested.
vero/src/vero/harbor/build/compiler.py	Compiler for Harbor task dirs. Previously flagged issues (git archive returncode check, TemporaryDirectory cleanup) are fixed. Missing compile-time validation for `config.task` in Mode A and single-quoted TOML path handling in `_rewrite_vero_source_path` are minor new gaps.
vero/src/vero/harbor/build/config.py	BuildConfig Pydantic model with sensible defaults; `from_file` resolves relative paths against the config directory. Well-structured.
vero/src/vero/harbor/build/templates/seed.sh.j2	Generates the seed script; trust-boundary advisory note is accurate and tested. `read_only_paths` values are interpolated into double-quoted shell strings without escaping.
vero/src/vero/harbor/build/templates/docker-compose.yaml.j2	Compose template correctly isolates secrets to the sidecar via `${VAR:?}` fail-fast guards, mounts token volume read-only on main, and gates main startup on the sidecar health check.
vero/tests/test_harbor_build.py	Good coverage: structure, ServeConfig round-trip, TOML/YAML parsing, vero path rewriting, shared baseline SHA, git archive failure, missing-secret gate, and advisory read-only documentation.
vero/tests/test_harbor_runner.py	Thorough runner tests: command flags, reward extraction, collation with missing trials, resume skipping done samples, and the TestReviewFixes class verifying both previously-flagged issues are resolved.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller as EvalFramework
    participant HR as HarborRunner
    participant FS as Filesystem (jobs_dir)
    participant Harbor as harbor CLI (nested run)
    participant Sidecar as eval-sidecar (per task)
    participant Store as vero SampleResult store

    Caller->>HR: produce_sample_results(workspace, params, result_dir)
    HR->>HR: _task_names_for(params)
    HR->>Store: _is_done? (filter pending)
    HR->>Harbor: uv run harbor run --n-attempts N --max-retries R -i task1 ... --jobs-dir jobs_dir
    loop per task
        Harbor->>Sidecar: run task
        Sidecar-->>FS: write result.json
    end
    Harbor-->>HR: exit code (non-zero is non-fatal)
    HR->>FS: _load_trials (rglob, pick best by _trial_rank)
    loop per (sample_id, task_name)
        HR->>Store: _is_done? skip if succeeded
        HR->>HR: _sample_result(trial, sample_id)
        HR->>Store: save_sample_result
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller as EvalFramework
    participant HR as HarborRunner
    participant FS as Filesystem (jobs_dir)
    participant Harbor as harbor CLI (nested run)
    participant Sidecar as eval-sidecar (per task)
    participant Store as vero SampleResult store

    Caller->>HR: produce_sample_results(workspace, params, result_dir)
    HR->>HR: _task_names_for(params)
    HR->>Store: _is_done? (filter pending)
    HR->>Harbor: uv run harbor run --n-attempts N --max-retries R -i task1 ... --jobs-dir jobs_dir
    loop per task
        Harbor->>Sidecar: run task
        Sidecar-->>FS: write result.json
    end
    Harbor-->>HR: exit code (non-zero is non-fatal)
    HR->>FS: _load_trials (rglob, pick best by _trial_rank)
    loop per (sample_id, task_name)
        HR->>Store: _is_done? skip if succeeded
        HR->>HR: _sample_result(trial, sample_id)
        HR->>Store: save_sample_result
    end

Comments Outside Diff (2)

vero/src/vero/harbor/runner.py, line 739-752 (link)

n_attempts and max_retries silently dropped from harbor run

HarborConfig exposes n_attempts and max_retries as typed fields (not extra_args pass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in _build_command, so every harbor run invocation uses whatever the harbor defaults are regardless of what the caller configured. A user who sets max_retries=0 to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.

Prompt To Fix With AI

This is a comment left during a code review.
Path: vero/src/vero/harbor/runner.py
Line: 739-752

Comment:
**`n_attempts` and `max_retries` silently dropped from harbor run**

`HarborConfig` exposes `n_attempts` and `max_retries` as typed fields (not `extra_args` pass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in `_build_command`, so every `harbor run` invocation uses whatever the harbor defaults are regardless of what the caller configured. A user who sets `max_retries=0` to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.

How can I resolve this? If you propose a fix, please make it concise.

vero/src/vero/harbor/runner.py, line 799-813 (link)

Last-write-wins on duplicate task_name when harbor retries a task

rglob("result.json") visits all result files across all timestamp directories under jobs_dir. If harbor's n_attempts or max_retries causes the same task to produce multiple trial entries with the same task_name, trials[task_name] = data simply overwrites with whichever file rglob happens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.

Prompt To Fix With AI

This is a comment left during a code review.
Path: vero/src/vero/harbor/runner.py
Line: 799-813

Comment:
**Last-write-wins on duplicate `task_name` when harbor retries a task**

`rglob("result.json")` visits all result files across all timestamp directories under `jobs_dir`. If harbor's `n_attempts` or `max_retries` causes the same task to produce multiple trial entries with the same `task_name`, `trials[task_name] = data` simply overwrites with whichever file `rglob` happens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (2): Last reviewed commit: "Merge pull request #9 from scaleapi/harb..." | Re-trigger Greptile}

- Mode B (runner.py): `HarborRunner`, an `EvalStrategy` that — for each candidate — runs a *nested* `harbor run` of the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into vero `SampleResult`s. One Harbor task = one sample; inference is fully delegated, scoring comes from Harbor's verifier. - The compiler (build/): `vero harbor build` renders a `BuildConfig` into a runnable Harbor task directory — a Docker Compose environment (optimizer workbench `main` + the eval sidecar + three volumes), two Dockerfiles, instruction.md, tests/test.sh, and the seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Supports Mode A (local dataset/scorer) and Mode B (a registry or local Harbor benchmark, passed through to the HarborConfig). - `.gitignore`: un-ignore src/vero/harbor/build/ (the repo's `build/` rule was hiding the compiler package). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale · 2026-06-30T08:34:12Z

+    # Execute
+    # ------------------------------------------------------------------
+
+    def _build_command(


HarborConfig.n_attempts / max_retries are typed fields, which signals they are honored, but _build_command never emits flags for them — so every nested run uses harbor defaults. On a budget-graded run that is a real cost/correctness bug. Either map them to the real harbor run flags here, or drop the fields. (Confirms Greptile's P1.)

shehabyasser-scale · 2026-06-30T08:34:19Z

+                continue
+            task_name = data.get("task_name")
+            if task_name:
+                trials[task_name] = data


trials[task_name] = data is last-write-wins over an unordered rglob; with retries or a reused jobs_dir, a failing attempt can clobber a passing one, feeding a wrong score into auto_best. Collect all matches and pick deterministically (latest mtime / highest attempt index), and/or namespace jobs_dir per attempt. (Confirms Greptile's P1.)

shehabyasser-scale · 2026-06-30T08:34:26Z

+        values = [float(v) for v in rewards.values()]
+        return sum(values) / len(values) if values else 0.0
+
+    def _existing(self, params: EvaluationParameters, sample_id: int) -> SampleResult | None:


_existing treats a persisted error sample as 'done', so a sample whose nested run failed once is permanently skipped on resume and silently degrades the candidate's score. Either don't persist error samples, or make the pending/skip predicate treat error results as not-done.

shehabyasser-scale · 2026-06-30T08:34:32Z

+            ["tar", "xf", "-", "--strip-components", str(strip)],
+            cwd=dest, stdin=archive.stdout, check=True,
+        )
+        archive.wait()


archive.wait()'s return code is discarded; if git archive fails, tar can still exit 0 on a truncated stream and you commit a near-empty baseline whose SHA gets baked into both images. Assert archive.wait() == 0 (and reap the Popen on the tar-failure path). (Confirms Greptile.)

shehabyasser-scale · 2026-06-30T08:34:39Z

+    environment:
+      VERO_HOME_DIR: "/opt/vero_home"
+{% for secret in secrets %}
+      {{ secret }}: "${{ '{' }}{{ secret }}{{ '}' }}"


This renders SECRET: "${SECRET}", resolved from the run host's env, not anything the compiler validates. An unset var interpolates to empty and the sidecar comes up credential-less, failing every eval opaquely. Consider failing the build / emitting a required-secrets manifest when a declared secret is missing. Note test_rendered_files_parse only asserts the literal ${...} string, which hides this.

shehabyasser-scale · 2026-06-30T08:35:52Z

+# ...except locked paths (e.g. the scorer): root-owned + unwritable.
+if [ -e "/work/agent/{{ p }}" ]; then
+  chown -R root:root "/work/agent/{{ p }}"
+  chmod -R a-w "/work/agent/{{ p }}"


This chown root:root + chmod -R a-w applies to the working tree under /work/agent, but the verifier never executes the working tree: _transfer_commit fetches git blobs and the evaluator checks the commit out into a fresh temp copy. The agent can git add -f a modified scorer and commit it regardless of working-tree perms. So read_only_paths is not a tamper control — please document it as advisory-only and move scorer-provenance enforcement into the sidecar (see the Mode A scorer-provenance comment on #4).

…mkdtemp cleanup Mode B runner + build-compiler robustness (review findings on PR #5): - runner: emit --n-attempts / --max-retries (the typed HarborConfig fields were silently dropped); pick the best trial per task deterministically instead of last-write-wins over an unordered rglob; treat a persisted error sample as not-done so a transient failure is re-run on resume. - compiler: assert git archive exited 0 (and reap it) so a truncated stream cannot bake a near-empty baseline; validate declared secrets at build time and render compose secrets with a fail-fast guard so an unset host var fails loudly instead of producing a credential-less sidecar. - seed.sh: document that read_only_paths is advisory only and scorer provenance is enforced sidecar-side. - compiler: stage the dataset in a cleaned-up TemporaryDirectory (Greptile: the mkdtemp scratch dir was leaking, datasets can be gigabytes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(harbor): nested-run flags/dedup/resume, archive + secret guards [fixes 3/4 compiler]

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:15

varunursekar marked this pull request as ready for review June 24, 2026 18:22

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread vero/src/vero/harbor/build/compiler.py

Comment thread vero/src/vero/harbor/build/compiler.py

shehabyasser-scale mentioned this pull request Jun 30, 2026

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A) #4

Open

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

[Harbor 4/4] architecture docs, tutorial, and the GAIA example #6

Open

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

fix(harbor): nested-run flags/dedup/resume, archive + secret guards [fixes 3/4 compiler] #9

Merged

Merge pull request #9 from scaleapi/harbor-3-compiler-fixes

05cdeea

fix(harbor): nested-run flags/dedup/resume, archive + secret guards [fixes 3/4 compiler]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5

[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5
varunursekar wants to merge 3 commits into
harbor-2-sidecarfrom
harbor-3-compiler

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

varunursekar commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

Uh oh!

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading