Skip to content

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4

Open
varunursekar wants to merge 3 commits into
harbor-1-corefrom
harbor-2-sidecar
Open

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4
varunursekar wants to merge 3 commits into
harbor-1-corefrom
harbor-2-sidecar

Conversation

@varunursekar

@varunursekar varunursekar commented Jun 24, 2026

Copy link
Copy Markdown

Draft · Stack 2 of 4 — targets harbor-1-core (review [1/4] first; its diff is the base here).

The evaluation engine run in a sidecar container, plus the trust boundary that makes a run leaderboard-gradeable. This is the security-critical PR — a focused/security review is worthwhile here.

  • EvaluationSidecar (server.py): commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) + tier-gated result write-routing across the agent-readable and admin volumes.
  • Verifier (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
  • Admin token (auth.py): per-trial, written root:600 — unreadable by the de-privileged optimizer, readable by the verifier (root, shared mode).
  • FastAPI (app.py): /eval /submit /status (agent; metered, redacted) and /finalize (verifier; token-gated). vero harbor serve assembles engine+sidecar+verifier from a ServeConfig.
  • CLI clients (cli.py) + HarborConfig/partition helpers so the package imports cleanly (build/run light up in [3/4]).

Stack: [1/4] core → this → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the Harbor eval sidecar and verifier flow. The main changes are:

  • FastAPI endpoints for eval, submit, status, and token-gated finalize.
  • Sidecar commit transfer from the agent repo and tier-based result projection.
  • Admin token creation and CLI clients for agent and verifier commands.
  • Verifier commit selection and hidden-split reward scoring.
  • Serve-time assembly for engine, sidecar, verifier, budgets, and config.

Confidence Score: 4/5

The Harbor sidecar flow is close, but the eval path needs a hidden-split access gate before merge.

The review covers the new API, sidecar, verifier, auth, CLI, and tests, and the remaining concern is a concrete access-control gap in the eval path.

vero/src/vero/harbor/server.py

T-Rex T-Rex Logs

What T-Rex did

  • Reproduced gate hidden evals by running a focused pytest against the EvaluationSidecar.evaluate path with a non-admin request for a hidden but budgeted split, showing engine.evaluate was called once and result_path remained None.
  • Inspected the API-auth probe by reviewing before and after artifacts; the before log captured the probe command, cwd, base git SHA, import failures for vero.harbor.app and vero.harbor.cli, and exit code 0, while the after log captured the head git SHA, HTTP status codes/messages and bodies for /status and /finalize, plus CLI finalize headers and reward.json contents.
  • Observed verifier functionality progress: the before artifact shows a base import failure for vero.harbor, and the after artifact shows evaluate_admin calls for submit_commit on hidden/secret targets with rewards and a final_reward of 0.88.
  • Validated sidecar routing: the base run failed with ModuleNotFoundError: No module named 'vero.harbor.server', and the head run captured fetch args with hooks disabled and file://, with resolved_sha matching expected_agent_sha and object_in_sidecar_rc 0, plus aggregate response keys and agent volume contents.
  • Validated serve assembly: the base run failed with ModuleNotFoundError No module named 'vero.harbor.serve' (exit code 2); the head run succeeded with token length/file match, mode 0o600, sidecar paths matching configured volumes, verifier/target configuration, ledger remaining, and a fail-closed ValueError due to missing task_project (exit code 0). The serve_assembly_probe.py script was used for both runs.

View all artifacts

T-Rex Ran code and verified through T-Rex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/harbor/server.py:70-73
**Gate hidden evals**

`tier_for_split()` now treats unlisted splits as `no_access`, and `/status` reports those budgeted-but-unlisted splits as not agent-evaluable. But this path still transfers and scores the commit before `_route_results()` suppresses only the result files. If a config includes a budget for a hidden or unlisted split, the agent can call `/eval` and receive `mean_score`, `n_samples`, and budget data in the returned `EvalSummary`. Please reject non-admin evals whose tier is `no_access` before calling `engine.evaluate()`, or ensure those splits cannot be present in the agent budget allowlist.

Reviews (2): Last reviewed commit: "Merge pull request #8 from scaleapi/harb..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

The evaluation engine, run in a sidecar container, with the trust boundary that
makes an optimization run leaderboard-gradeable:

- `EvaluationSidecar` (server.py): agent-facing handlers — commit transfer from the
  untrusted agent repo (git fetch, hooks off, object copy) and tier-gated write-routing
  of results across the agent-readable and admin volumes.
- `Verifier` (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
- Per-trial admin token (auth.py), written root:600 so the optimizer (de-privileged)
  cannot read it; only the verifier (root, shared mode) can.
- FastAPI surface (app.py): /eval, /submit, /status for the agent (metered, redacted);
  /finalize for the verifier (token-gated). `vero harbor serve` (serve.py) assembles the
  engine + sidecar + verifier from a ServeConfig and runs it under uvicorn.
- `vero harbor` CLI clients (cli.py): serve | eval | submit | status | finalize (build/run
  land with the compiler). HarborConfig + the Mode-B dataset partition helpers (config.py,
  dataset.py) are included so the harbor package imports cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@varunursekar varunursekar requested a review from a team June 24, 2026 18:17
@varunursekar varunursekar marked this pull request as ready for review June 24, 2026 18:21
Comment thread vero/src/vero/harbor/server.py
Comment thread vero/src/vero/harbor/verifier.py Outdated
Comment thread vero/src/vero/harbor/serve.py Outdated

# Mode A
task: str | None = None
task_project: str | None = None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security, the headline one. In Mode A this defaults to None, so the Evaluator runs uv in the agent's own project and loads the @task.evaluation() scorer from the agent's committed code. Verifier.finalizeevaluate_admin then scores the hidden split with that agent-controlled scorer, so a committed scorer that returns 1.0 wins the leaderboard. The token/volume/object-copy machinery secures transport, not scorer provenance. For a gradeable boundary the verifier must score with a sidecar-baked task project: require task_project/inner_task for Mode A and reject configs where the scorer would resolve from the agent repo. This is the default posture and the single most important gap in the stack.

tok = generate_token()
p = write_admin_token(tmp_path / "t", tok)
assert read_admin_token(p) == tok
assert (p.stat().st_mode & 0o777) == 0o600

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the security-critical PR, the OS-level claims are asserted only as a 0o600 mode bit plus happy-path mocks. Please add adversarial coverage: (1) agent /eval on a no_access split → rejected end-to-end; (2) an integration test that finalize does not execute an agent-supplied scorer (the Mode A provenance gap); (3) a uid-drop test that agent.user cannot read the token file.

Comment thread vero/src/vero/harbor/serve.py Outdated
workspace = await GitWorkspace.create(config.repo_path)

budget = BudgetLedger(
[SplitBudget(**b) for b in config.budgets],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_components rebuilds the BudgetLedger from config.budgets on every start, and the ledger writes persist_path on each mutation but never reads it back. So a sidecar restart mid-trial silently resets spent budget — metering is not crash-safe despite the durable-ledger framing. Add a load-if-exists path that seeds from the persisted ledger when the file exists, and from config.budgets only when it does not.

"""Best candidate by recorded score on the selection split (excludes baseline)."""
if self.engine.db is None:
raise NoCandidateError("auto_best mode but no experiment database.")
df = self.engine.db.get_experiments_df(fill_score=default_minimum_score)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto_best ranks candidates solely by the recorded mean_score from engine.db.get_experiments_df, but that DB is written by the agent's metered /eval calls and the verifier shares the same in-process engine/DB. The only admin-independent scoring at finalize is evaluate_admin on the test target — after selection, and only on the chosen winner. So a candidate that records an inflated selection score selects itself. (Mode B in #5 sharpens this: the recorded selection score is the reward a nested run of the candidate emitted, which the optimizer controls.) Before trusting the ranking, re-score under admin: re-run evaluate_admin on the selection split for the top-K and re-rank, or at minimum re-score the winner and fail closed on deviation.

Comment thread vero/src/vero/harbor/protocol.py Outdated
for sa in split_accesses:
if sa.split == split:
return sa.access
return SplitAccessLevel.viewable

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails open, the wrong default for a trust boundary. tier_for_split is the gate _route_results uses to decide whether to write full per-sample SampleResults (labels included) to the agent volume; an author who budgets a split but forgets to add it to split_accesses lands here and gets viewable → ground truth on the agent-readable volume. Default to the most restrictive tier (no_access) and update the docstring, which currently advertises the fail-open as intended. If an unlisted split should ever be visible, make that explicit and logged.

…ers, budget reload

Trust-boundary hardening for the eval sidecar (review findings on PR #4):

- protocol: tier_for_split fails CLOSED (unlisted split -> no_access, with a
  warning), instead of fail-open to viewable.
- serve: Mode A now requires task_project. build_components refuses to start a
  Mode A config without it, so the hidden-split/admin scorer is always loaded
  from the sidecar-baked task project, never the agent's committed repo (a
  committed scorer returning 1.0 can no longer win the reward).
- verifier: auto_best no longer trusts the agent-influenced recorded score. It
  shortlists by recorded score, then re-runs evaluate_admin (the trusted scorer)
  on the selection split for the top-K and ranks by the admin score; fails closed.
- serve: reload the persisted budget ledger on startup so a sidecar restart does
  not reset spent budget to full.
- tests: adversarial coverage (no_access -> 400, finalize does not run an
  agent-supplied scorer via a cheating-agent repo, admin token 0o600/unreadable).

Also folds in the valid Greptile findings on this PR: clear stale per-sample
result files on re-eval, type reward_mode as Literal[submit, auto_best], and
filter auto_best candidates by selection dataset id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]
Comment on lines +70 to +73
async def evaluate(self, req: EvalRequest, *, admin: bool = False) -> EvalSummary:
sha = await self._transfer_commit(req.commit)
exp = await self.engine.evaluate(replace(req, commit=sha), admin=admin)
result_path = self._route_results(exp, admin=admin)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Gate hidden evals

tier_for_split() now treats unlisted splits as no_access, and /status reports those budgeted-but-unlisted splits as not agent-evaluable. But this path still transfers and scores the commit before _route_results() suppresses only the result files. If a config includes a budget for a hidden or unlisted split, the agent can call /eval and receive mean_score, n_samples, and budget data in the returned EvalSummary. Please reject non-admin evals whose tier is no_access before calling engine.evaluate(), or ensure those splits cannot be present in the agent budget allowlist.

Artifacts

Repro: focused pytest exercising non-admin hidden split EvaluationSidecar.evaluate behavior

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing engine.evaluate call and leaked EvalSummary fields

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/server.py
Line: 70-73

Comment:
**Gate hidden evals**

`tier_for_split()` now treats unlisted splits as `no_access`, and `/status` reports those budgeted-but-unlisted splits as not agent-evaluable. But this path still transfers and scores the commit before `_route_results()` suppresses only the result files. If a config includes a budget for a hidden or unlisted split, the agent can call `/eval` and receive `mean_score`, `n_samples`, and budget data in the returned `EvalSummary`. Please reject non-admin evals whose tier is `no_access` before calling `engine.evaluate()`, or ensure those splits cannot be present in the agent budget allowlist.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

@shehabyasser-scale

Copy link
Copy Markdown
Collaborator

Empirical finding from a live Mode B smoke run (GAIA example, nested Modal evals, combined stack #3-#6 + fix PRs #7-#9):

With reward_mode: auto_best and total_run_budget: 1, the optimizer (claude-code / haiku) spent its only eval on the seeded baseline commit ("measure the baseline first", a natural strategy), then committed its improvements with no budget left. finalize returned:

409 {"error":"auto_best mode but no candidate experiments on split 'train'."}

because the selector excludes base_commit from the candidate pool, and the outer Harbor trial died with RewardFileNotFoundError, i.e. an exception, not a score.

Two thoughts:

  1. The base_commit exclusion itself is right (it stops "do nothing" from winning selection). But "the optimizer produced no measurable candidate" is a legitimate outcome of an optimization run, and its honest value is reward 0.0, not a missing-reward exception. As-is, harness-level aggregation counts these trials as errors rather than zeros, and with small budgets a reasonable agent strategy triggers it. Suggestion: when the pool is empty after the exclusion, write reward.json with 0.0 (plus e.g. "note": "no_candidates") instead of 409.

  2. The generated instruction should warn the optimizer that evaluating the unmodified baseline consumes budget without creating a candidate. It currently doesn't, so the agent walks into this blind.

(Fix PR #8 keeps the exclusion unchanged; this is a separate follow-up decision. Happy to implement either piece.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants