[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A) by varunursekar · Pull Request #4 · scaleapi/vero

varunursekar · 2026-06-24T18:12:23Z

Draft · Stack 2 of 4 — targets harbor-1-core (review [1/4] first; its diff is the base here).

The evaluation engine run in a sidecar container, plus the trust boundary that makes a run leaderboard-gradeable. This is the security-critical PR — a focused/security review is worthwhile here.

EvaluationSidecar (server.py): commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) + tier-gated result write-routing across the agent-readable and admin volumes.
Verifier (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
Admin token (auth.py): per-trial, written root:600 — unreadable by the de-privileged optimizer, readable by the verifier (root, shared mode).
FastAPI (app.py): /eval /submit /status (agent; metered, redacted) and /finalize (verifier; token-gated). vero harbor serve assembles engine+sidecar+verifier from a ServeConfig.
CLI clients (cli.py) + HarborConfig/partition helpers so the package imports cleanly (build/run light up in [3/4]).

Stack: [1/4] core → this → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the Harbor eval sidecar and verifier flow. The main changes are:

FastAPI endpoints for eval, submit, status, and token-gated finalize.
Sidecar commit transfer from the agent repo and tier-based result projection.
Admin token creation and CLI clients for agent and verifier commands.
Verifier commit selection and hidden-split reward scoring.
Serve-time assembly for engine, sidecar, verifier, budgets, and config.

Confidence Score: 4/5

The Harbor sidecar flow is close, but the eval path needs a hidden-split access gate before merge.

The review covers the new API, sidecar, verifier, auth, CLI, and tests, and the remaining concern is a concrete access-control gap in the eval path.

vero/src/vero/harbor/server.py

T-Rex Logs

What T-Rex did

Reproduced gate hidden evals by running a focused pytest against the EvaluationSidecar.evaluate path with a non-admin request for a hidden but budgeted split, showing engine.evaluate was called once and result_path remained None.
Inspected the API-auth probe by reviewing before and after artifacts; the before log captured the probe command, cwd, base git SHA, import failures for vero.harbor.app and vero.harbor.cli, and exit code 0, while the after log captured the head git SHA, HTTP status codes/messages and bodies for /status and /finalize, plus CLI finalize headers and reward.json contents.
Observed verifier functionality progress: the before artifact shows a base import failure for vero.harbor, and the after artifact shows evaluate_admin calls for submit_commit on hidden/secret targets with rewards and a final_reward of 0.88.
Validated sidecar routing: the base run failed with ModuleNotFoundError: No module named 'vero.harbor.server', and the head run captured fetch args with hooks disabled and file://, with resolved_sha matching expected_agent_sha and object_in_sidecar_rc 0, plus aggregate response keys and agent volume contents.
Validated serve assembly: the base run failed with ModuleNotFoundError No module named 'vero.harbor.serve' (exit code 2); the head run succeeded with token length/file match, mode 0o600, sidecar paths matching configured volumes, verifier/target configuration, ledger remaining, and a fail-closed ValueError due to missing task_project (exit code 0). The serve_assembly_probe.py script was used for both runs.

_{Ran code and verified through T-Rex}

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/harbor/server.py:70-73
**Gate hidden evals**

`tier_for_split()` now treats unlisted splits as `no_access`, and `/status` reports those budgeted-but-unlisted splits as not agent-evaluable. But this path still transfers and scores the commit before `_route_results()` suppresses only the result files. If a config includes a budget for a hidden or unlisted split, the agent can call `/eval` and receive `mean_score`, `n_samples`, and budget data in the returned `EvalSummary`. Please reject non-admin evals whose tier is `no_access` before calling `engine.evaluate()`, or ensure those splits cannot be present in the agent budget allowlist.

_{Reviews (2): Last reviewed commit: "Merge pull request #8 from scaleapi/harb..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

The evaluation engine, run in a sidecar container, with the trust boundary that makes an optimization run leaderboard-gradeable: - `EvaluationSidecar` (server.py): agent-facing handlers — commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) and tier-gated write-routing of results across the agent-readable and admin volumes. - `Verifier` (verifier.py): commit selection (submit | auto_best) + hidden-split scoring. - Per-trial admin token (auth.py), written root:600 so the optimizer (de-privileged) cannot read it; only the verifier (root, shared mode) can. - FastAPI surface (app.py): /eval, /submit, /status for the agent (metered, redacted); /finalize for the verifier (token-gated). `vero harbor serve` (serve.py) assembles the engine + sidecar + verifier from a ServeConfig and runs it under uvicorn. - `vero harbor` CLI clients (cli.py): serve | eval | submit | status | finalize (build/run land with the compiler). HarborConfig + the Mode-B dataset partition helpers (config.py, dataset.py) are included so the harbor package imports cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale · 2026-06-30T08:33:30Z

+
+    # Mode A
+    task: str | None = None
+    task_project: str | None = None


Security, the headline one. In Mode A this defaults to None, so the Evaluator runs uv in the agent's own project and loads the @task.evaluation() scorer from the agent's committed code. Verifier.finalize → evaluate_admin then scores the hidden split with that agent-controlled scorer, so a committed scorer that returns 1.0 wins the leaderboard. The token/volume/object-copy machinery secures transport, not scorer provenance. For a gradeable boundary the verifier must score with a sidecar-baked task project: require task_project/inner_task for Mode A and reject configs where the scorer would resolve from the agent repo. This is the default posture and the single most important gap in the stack.

shehabyasser-scale · 2026-06-30T08:33:37Z

+        tok = generate_token()
+        p = write_admin_token(tmp_path / "t", tok)
+        assert read_admin_token(p) == tok
+        assert (p.stat().st_mode & 0o777) == 0o600


For the security-critical PR, the OS-level claims are asserted only as a 0o600 mode bit plus happy-path mocks. Please add adversarial coverage: (1) agent /eval on a no_access split → rejected end-to-end; (2) an integration test that finalize does not execute an agent-supplied scorer (the Mode A provenance gap); (3) a uid-drop test that agent.user cannot read the token file.

shehabyasser-scale · 2026-06-30T08:33:44Z

+    workspace = await GitWorkspace.create(config.repo_path)
+
+    budget = BudgetLedger(
+        [SplitBudget(**b) for b in config.budgets],


build_components rebuilds the BudgetLedger from config.budgets on every start, and the ledger writes persist_path on each mutation but never reads it back. So a sidecar restart mid-trial silently resets spent budget — metering is not crash-safe despite the durable-ledger framing. Add a load-if-exists path that seeds from the persisted ledger when the file exists, and from config.budgets only when it does not.

shehabyasser-scale · 2026-06-30T08:33:52Z

+        """Best candidate by recorded score on the selection split (excludes baseline)."""
+        if self.engine.db is None:
+            raise NoCandidateError("auto_best mode but no experiment database.")
+        df = self.engine.db.get_experiments_df(fill_score=default_minimum_score)


auto_best ranks candidates solely by the recorded mean_score from engine.db.get_experiments_df, but that DB is written by the agent's metered /eval calls and the verifier shares the same in-process engine/DB. The only admin-independent scoring at finalize is evaluate_admin on the test target — after selection, and only on the chosen winner. So a candidate that records an inflated selection score selects itself. (Mode B in #5 sharpens this: the recorded selection score is the reward a nested run of the candidate emitted, which the optimizer controls.) Before trusting the ranking, re-score under admin: re-run evaluate_admin on the selection split for the top-K and re-rank, or at minimum re-score the winner and fail closed on deviation.

shehabyasser-scale · 2026-06-30T08:35:45Z

+    for sa in split_accesses:
+        if sa.split == split:
+            return sa.access
+    return SplitAccessLevel.viewable


This fails open, the wrong default for a trust boundary. tier_for_split is the gate _route_results uses to decide whether to write full per-sample SampleResults (labels included) to the agent volume; an author who budgets a split but forgets to add it to split_accesses lands here and gets viewable → ground truth on the agent-readable volume. Default to the most restrictive tier (no_access) and update the docstring, which currently advertises the fail-open as intended. If an unlisted split should ever be visible, make that explicit and logged.

…ers, budget reload Trust-boundary hardening for the eval sidecar (review findings on PR #4): - protocol: tier_for_split fails CLOSED (unlisted split -> no_access, with a warning), instead of fail-open to viewable. - serve: Mode A now requires task_project. build_components refuses to start a Mode A config without it, so the hidden-split/admin scorer is always loaded from the sidecar-baked task project, never the agent's committed repo (a committed scorer returning 1.0 can no longer win the reward). - verifier: auto_best no longer trusts the agent-influenced recorded score. It shortlists by recorded score, then re-runs evaluate_admin (the trusted scorer) on the selection split for the top-K and ranks by the admin score; fails closed. - serve: reload the persisted budget ledger on startup so a sidecar restart does not reset spent budget to full. - tests: adversarial coverage (no_access -> 400, finalize does not run an agent-supplied scorer via a cheating-agent repo, admin token 0o600/unreadable). Also folds in the valid Greptile findings on this PR: clear stale per-sample result files on re-eval, type reward_mode as Literal[submit, auto_best], and filter auto_best candidates by selection dataset id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]

greptile-apps · 2026-07-01T15:53:43Z

+    async def evaluate(self, req: EvalRequest, *, admin: bool = False) -> EvalSummary:
+        sha = await self._transfer_commit(req.commit)
+        exp = await self.engine.evaluate(replace(req, commit=sha), admin=admin)
+        result_path = self._route_results(exp, admin=admin)


Gate hidden evals

tier_for_split() now treats unlisted splits as no_access, and /status reports those budgeted-but-unlisted splits as not agent-evaluable. But this path still transfers and scores the commit before _route_results() suppresses only the result files. If a config includes a budget for a hidden or unlisted split, the agent can call /eval and receive mean_score, n_samples, and budget data in the returned EvalSummary. Please reject non-admin evals whose tier is no_access before calling engine.evaluate(), or ensure those splits cannot be present in the agent budget allowlist.

Artifacts

Repro: focused pytest exercising non-admin hidden split EvaluationSidecar.evaluate behavior

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing engine.evaluate call and leaked EvalSummary fields

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/server.py Line: 70-73 Comment: **Gate hidden evals** `tier_for_split()` now treats unlisted splits as `no_access`, and `/status` reports those budgeted-but-unlisted splits as not agent-evaluable. But this path still transfers and scores the commit before `_route_results()` suppresses only the result files. If a config includes a budget for a hidden or unlisted split, the agent can call `/eval` and receive `mean_score`, `n_samples`, and budget data in the returned `EvalSummary`. Please reject non-admin evals whose tier is `no_access` before calling `engine.evaluate()`, or ensure those splits cannot be present in the agent budget allowlist. How can I resolve this? If you propose a fix, please make it concise.

shehabyasser-scale · 2026-07-02T09:34:07Z

Empirical finding from a live Mode B smoke run (GAIA example, nested Modal evals, combined stack #3-#6 + fix PRs #7-#9):

With reward_mode: auto_best and total_run_budget: 1, the optimizer (claude-code / haiku) spent its only eval on the seeded baseline commit ("measure the baseline first", a natural strategy), then committed its improvements with no budget left. finalize returned:

409 {"error":"auto_best mode but no candidate experiments on split 'train'."}

because the selector excludes base_commit from the candidate pool, and the outer Harbor trial died with RewardFileNotFoundError, i.e. an exception, not a score.

Two thoughts:

The base_commit exclusion itself is right (it stops "do nothing" from winning selection). But "the optimizer produced no measurable candidate" is a legitimate outcome of an optimization run, and its honest value is reward 0.0, not a missing-reward exception. As-is, harness-level aggregation counts these trials as errors rather than zeros, and with small budgets a reasonable agent strategy triggers it. Suggestion: when the pool is empty after the exclusion, write reward.json with 0.0 (plus e.g. "note": "no_candidates") instead of 409.
The generated instruction should warn the optimizer that evaluating the unmodified baseline consumes budget without creating a candidate. It currently doesn't, so the agent walks into this blind.

(Fix PR #8 keeps the exclusion unchanged; this is a separate follow-up decision. Happy to implement either piece.)

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:17

varunursekar marked this pull request as ready for review June 24, 2026 18:21

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread vero/src/vero/harbor/server.py

Comment thread vero/src/vero/harbor/verifier.py Outdated

Comment thread vero/src/vero/harbor/serve.py Outdated

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

[Harbor 4/4] architecture docs, tutorial, and the GAIA example #6

Open

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler #5

Open

This was referenced Jun 30, 2026

fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar] #8

Merged

fix(harbor): fail-closed split access, budget/tier reconciliation, off-loop ledger flush [fixes 1/4 core] #7

Merged

Merge pull request #8 from scaleapi/harbor-2-sidecar-fixes

767570e

fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]

greptile-apps Bot reviewed Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4
varunursekar wants to merge 3 commits into
harbor-1-corefrom
harbor-2-sidecar

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

greptile-apps Bot Jul 1, 2026

Uh oh!

shehabyasser-scale commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

varunursekar commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

T-Rex Logs

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading