Skip to content

fix(harbor): floor rewards on no-candidate outcome instead of erroring#11

Open
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecar-fixesfrom
harbor-2-sidecar-autobest-fallback
Open

fix(harbor): floor rewards on no-candidate outcome instead of erroring#11
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecar-fixesfrom
harbor-2-sidecar-autobest-fallback

Conversation

@shehabyasser-scale

@shehabyasser-scale shehabyasser-scale commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #8 (harbor-2-sidecar-fixes). Implements the fix proposed in the live-run finding on #4.

Problem (found empirically, not in review)

In a live Mode B smoke run of examples/gaia-optimization (outer claude-code trial, nested Modal evals), the optimizer spent its whole total_run_budget: 1 on evaluating the seeded baseline commit ("measure the baseline first" is a natural agent strategy). auto_best selection excludes base_commit from the candidate pool, so finalize returned:

409 {"error":"auto_best mode but no candidate experiments on split 'train'."}

and the outer Harbor trial died with RewardFileNotFoundError: an exception, not a score. Harness-level aggregation counts these trials as errors rather than zeros, and with small budgets a perfectly reasonable agent walks into it.

Fix

"The optimizer produced no scorable candidate" is a legitimate outcome of an optimization run, and its honest value is the floor score:

  • Verifier.finalize() now catches NoCandidateError and writes {reward_key: default_minimum_score} (0.0) for every configured target, with a warning log. Applies to both auto_best (no non-baseline experiments on the selection split) and submit (agent never submitted).
  • A missing experiment database is reclassified from NoCandidateError to RuntimeError: that is sidecar misconfiguration, and it must surface as an error rather than silently zeroing every trial. This is the deliberate line between agent outcomes (floor) and infra failures (raise).
  • The base_commit exclusion itself is unchanged; it correctly stops "do nothing" from winning selection.

Behavior change

Scenario Before After
auto_best, only baseline evaluated 409 -> RewardFileNotFoundError (trial errors) reward.json with 0.0 per target
auto_best, no experiments at all 409 -> trial errors reward.json with 0.0 per target
submit mode, never submitted 409 -> trial errors reward.json with 0.0 per target
experiment DB missing 409 -> trial errors RuntimeError (500; loud infra failure)
candidates exist normal selection unchanged

Tests

tests/test_harbor_verifier.py: baseline-only pool floors, empty experiment table floors, submit-no-submission floors (multi-target), missing DB still raises, candidates-present regression guard. 19 pass.

A follow-up (separate, compiler-side) adds an instruction-template warning that baseline evals consume budget without creating candidates.

🤖 Generated with Claude Code

Greptile Summary

This PR changes finalize() to treat "no scorable candidate" as a legitimate run outcome rather than a hard error: when _select_commit raises NoCandidateError, every configured target is now written to reward.json at default_minimum_score (0.0) with a warning, preventing the outer Harbor harness from counting the trial as an error.

  • Verifier.finalize() wraps _select_commit in a try/except NoCandidateError block and returns a floor-score dict, with no downstream admin scoring triggered.
  • A missing experiment database is reclassified from NoCandidateError to RuntimeError so broken-sidecar misconfiguration still surfaces as a loud error rather than silently zeroing trials.
  • Five new tests cover baseline-only pool, empty DB, missing DB (must raise), submit-no-submission (multi-target), and a regression guard that the normal candidate path is unaffected.

Confidence Score: 4/5

Safe to merge; the logic change is narrow and well-tested, with the only open item being a trivial unused import in the test file.

The verifier change is well-scoped: the except NoCandidateError block is correctly typed (it does not accidentally swallow the new RuntimeError path because NoCandidateError is a subclass, not the parent), the floor-score dict comprehension is straightforward, and all five new test scenarios validate the intended boundary. The one minor gap is the leftover NoCandidateError import in the test file.

The unused NoCandidateError import in vero/tests/test_harbor_verifier.py is the only item worth a second look.

Important Files Changed

Filename Overview
vero/src/vero/harbor/verifier.py finalize() now catches NoCandidateError and returns floor rewards; missing-DB branch correctly reclassified to RuntimeError so it bypasses the catch and propagates as an infra error
vero/tests/test_harbor_verifier.py Good coverage of all new paths; NoCandidateError is now an unused import since the one test that asserted on it was replaced

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[finalize called] --> B[_select_commit]
    B --> C{reward_mode}
    C -->|submit| D[_submitted_commit]
    C -->|auto_best| E[_best_from_db]
    D -->|submission.json missing or no commit| F[raise NoCandidateError]
    D -->|commit found| G[sha]
    E -->|engine.db is None| H[raise RuntimeError - sidecar misconfiguration]
    E -->|df empty or missing column| F
    E -->|all rows are base_commit| F
    E -->|candidates found| I[admin re-score shortlist]
    I --> G
    F --> J[catch NoCandidateError in finalize]
    J --> K[log warning - return floor scores for all targets]
    H --> L[propagates uncaught outside finalize]
    G --> M[evaluate_admin per target]
    M --> N[return reward dict]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[finalize called] --> B[_select_commit]
    B --> C{reward_mode}
    C -->|submit| D[_submitted_commit]
    C -->|auto_best| E[_best_from_db]
    D -->|submission.json missing or no commit| F[raise NoCandidateError]
    D -->|commit found| G[sha]
    E -->|engine.db is None| H[raise RuntimeError - sidecar misconfiguration]
    E -->|df empty or missing column| F
    E -->|all rows are base_commit| F
    E -->|candidates found| I[admin re-score shortlist]
    I --> G
    F --> J[catch NoCandidateError in finalize]
    J --> K[log warning - return floor scores for all targets]
    H --> L[propagates uncaught outside finalize]
    G --> M[evaluate_admin per target]
    M --> N[return reward dict]
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/tests/test_harbor_verifier.py:9
`NoCandidateError` is imported but unused after the old `pytest.raises(NoCandidateError)` test was replaced.

```suggestion
from vero.harbor.verifier import VerificationTarget, Verifier
```

Reviews (1): Last reviewed commit: "fix(harbor): floor rewards on no-candida..." | Re-trigger Greptile

Found in a live Mode B smoke run (GAIA example): with a small budget, an
optimizer that spends every eval on the seeded baseline leaves an empty
candidate pool (auto_best excludes base_commit from selection), and
finalize returned 409 "no candidate experiments", killing the outer
Harbor trial with RewardFileNotFoundError. "The optimizer produced no
scorable candidate" is an outcome of an optimization run, not an
infrastructure failure; its honest value is the floor score.

- Verifier.finalize() catches NoCandidateError and returns
  {reward_key: default_minimum_score} for every target, with a warning
  log. Applies to both auto_best (no non-baseline experiments) and
  submit (no submission.json / empty commit).
- A missing experiment database is reclassified as RuntimeError:
  that is sidecar misconfiguration, and it must surface as an error
  rather than silently zeroing every trial.
- Tests: baseline-only pool, empty experiment table, and submit-mode
  no-submission all floor to 0.0 without spending admin evals; missing
  DB still raises; candidates-present path unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant