fix(harbor): floor rewards on no-candidate outcome instead of erroring by shehabyasser-scale · Pull Request #11 · scaleapi/vero

shehabyasser-scale · 2026-07-02T10:33:20Z

Stacked on #8 (harbor-2-sidecar-fixes). Implements the fix proposed in the live-run finding on #4.

Problem (found empirically, not in review)

In a live Mode B smoke run of examples/gaia-optimization (outer claude-code trial, nested Modal evals), the optimizer spent its whole total_run_budget: 1 on evaluating the seeded baseline commit ("measure the baseline first" is a natural agent strategy). auto_best selection excludes base_commit from the candidate pool, so finalize returned:

409 {"error":"auto_best mode but no candidate experiments on split 'train'."}

and the outer Harbor trial died with RewardFileNotFoundError: an exception, not a score. Harness-level aggregation counts these trials as errors rather than zeros, and with small budgets a perfectly reasonable agent walks into it.
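To make the failure mode concrete, here is a minimal sketch of a selection step that excludes the baseline commit. It is not the Vero implementation: the `Experiment` record and the `best_candidate` helper are hypothetical names introduced for illustration, and only `NoCandidateError`, `base_commit`, and the error message come from the PR text.

```python
from dataclasses import dataclass


class NoCandidateError(Exception):
    """Raised when selection finds nothing to score (name taken from the PR)."""


@dataclass(frozen=True)
class Experiment:  # hypothetical record shape, for illustration only
    commit: str
    split: str
    score: float


def best_candidate(experiments, base_commit, split="train"):
    """Pick the best-scoring non-baseline commit on the selection split."""
    # The baseline is excluded so that "do nothing" can never win selection.
    pool = [e for e in experiments if e.split == split and e.commit != base_commit]
    if not pool:
        raise NoCandidateError(
            f"auto_best mode but no candidate experiments on split '{split}'."
        )
    return max(pool, key=lambda e: e.score).commit


# A budget of 1 spent on the seeded baseline leaves the candidate pool empty.
baseline_only = [Experiment(commit="base", split="train", score=0.4)]
try:
    best_candidate(baseline_only, base_commit="base")
    outcome = "selected"
except NoCandidateError as exc:
    outcome = str(exc)
```

With a single baseline evaluation the pool is empty after filtering, so selection raises even though the agent did nothing unreasonable.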

Fix

"The optimizer produced no scorable candidate" is a legitimate outcome of an optimization run, and its honest value is the floor score:

- Verifier.finalize() now catches NoCandidateError and writes {reward_key: default_minimum_score} (0.0) for every configured target, with a warning log. Applies to both auto_best (no non-baseline experiments on the selection split) and submit (agent never submitted).
- A missing experiment database is reclassified from NoCandidateError to RuntimeError: that is sidecar misconfiguration, and it must surface as an error rather than silently zeroing every trial. This is the deliberate line between agent outcomes (floor) and infra failures (raise).
- The base_commit exclusion itself is unchanged; it correctly stops "do nothing" from winning selection.
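The floor-on-no-candidate behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the code in vero/src/vero/harbor/verifier.py: the free function `finalize`, its parameters, and the `fake_score` and `no_candidate` helpers are hypothetical stand-ins. Only `NoCandidateError`, `default_minimum_score`, and the reward.json file name come from the PR.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("verifier_sketch")


class NoCandidateError(Exception):
    """Agent outcome: the run produced nothing scorable."""


def finalize(select_commit, score_commit, reward_keys, reward_path,
             default_minimum_score=0.0):
    """Write one reward per target, flooring when there is no candidate."""
    try:
        sha = select_commit()
    except NoCandidateError as exc:
        # No candidate is a legitimate outcome: floor every target and skip scoring.
        logger.warning("no scorable candidate (%s); writing floor rewards", exc)
        rewards = {key: default_minimum_score for key in reward_keys}
    else:
        rewards = {key: score_commit(sha, key) for key in reward_keys}
    Path(reward_path).write_text(json.dumps(rewards))
    return rewards


def no_candidate():
    raise NoCandidateError("no candidate experiments on split 'train'")


def fake_score(sha, key):  # hypothetical scorer standing in for admin evaluation
    return 0.75


with tempfile.TemporaryDirectory() as tmp:
    reward_file = Path(tmp) / "reward.json"
    floored = finalize(no_candidate, fake_score, ["accuracy", "f1"], reward_file)
    on_disk = json.loads(reward_file.read_text())
    scored = finalize(lambda: "abc123", fake_score, ["accuracy", "f1"], reward_file)
```

The point of the sketch is that the reward file is always written when the failure is an agent outcome, so the outer harness reads a score rather than hitting a missing file.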

Behavior change

| Scenario | Before | After |
| --- | --- | --- |
| auto_best, only baseline evaluated | 409 -> `RewardFileNotFoundError` (trial errors) | `reward.json` with 0.0 per target |
| auto_best, no experiments at all | 409 -> trial errors | `reward.json` with 0.0 per target |
| submit mode, never submitted | 409 -> trial errors | `reward.json` with 0.0 per target |
| experiment DB missing | 409 -> trial errors | `RuntimeError` (500; loud infra failure) |
| candidates exist | normal selection | unchanged |

Tests

tests/test_harbor_verifier.py: baseline-only pool floors, empty experiment table floors, submit-no-submission floors (multi-target), missing DB still raises, candidates-present regression guard. 19 pass.

A follow-up (separate, compiler-side) adds an instruction-template warning that baseline evals consume budget without creating candidates.

🤖 Generated with Claude Code

Greptile Summary

This PR changes finalize() to treat "no scorable candidate" as a legitimate run outcome rather than a hard error: when _select_commit raises NoCandidateError, every configured target is now written to reward.json at default_minimum_score (0.0) with a warning, preventing the outer Harbor harness from counting the trial as an error.

- Verifier.finalize() wraps _select_commit in a try/except NoCandidateError block and returns a floor-score dict, with no downstream admin scoring triggered.
- A missing experiment database is reclassified from NoCandidateError to RuntimeError so broken-sidecar misconfiguration still surfaces as a loud error rather than silently zeroing trials.
- Five new tests cover baseline-only pool, empty DB, missing DB (must raise), submit-no-submission (multi-target), and a regression guard that the normal candidate path is unaffected.

Confidence Score: 4/5

Safe to merge; the logic change is narrow and well-tested, with the only open item being a trivial unused import in the test file.

The verifier change is well-scoped: the except NoCandidateError block is correctly typed (it does not swallow the new RuntimeError path, because an except clause matches only the named exception class and its subclasses, and RuntimeError is not a subclass of NoCandidateError), the floor-score dict comprehension is straightforward, and all five new test scenarios validate the intended boundary. The one minor gap is the leftover NoCandidateError import in the test file.
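The exception-matching argument can be checked with a small standalone sketch. The base class of `NoCandidateError` is an assumption here (the PR does not state it), and `classify`, `raise_no_candidate`, and `raise_missing_db` are hypothetical helpers. The conclusion holds for any user-defined `NoCandidateError`, since a builtin such as `RuntimeError` cannot inherit from it.

```python
class NoCandidateError(Exception):
    """Agent outcome. Deriving from Exception is an assumption for this sketch."""


def classify(select_commit):
    """Return 'floor' for agent outcomes; let anything else propagate."""
    try:
        select_commit()
    except NoCandidateError:
        return "floor"
    return "scored"


def raise_no_candidate():
    raise NoCandidateError("no candidate")


def raise_missing_db():
    raise RuntimeError("experiment database missing")


# except clauses match the named class and its subclasses only.
catchable = issubclass(RuntimeError, NoCandidateError)

agent_outcome = classify(raise_no_candidate)
try:
    classify(raise_missing_db)
    infra_outcome = "swallowed"
except RuntimeError:
    infra_outcome = "raised"
```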

The unused NoCandidateError import in vero/tests/test_harbor_verifier.py is the only item worth a second look.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/verifier.py | finalize() now catches NoCandidateError and returns floor rewards; missing-DB branch correctly reclassified to RuntimeError so it bypasses the catch and propagates as an infra error |
| vero/tests/test_harbor_verifier.py | Good coverage of all new paths; NoCandidateError is now an unused import since the one test that asserted on it was replaced |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[finalize called] --> B[_select_commit]
    B --> C{reward_mode}
    C -->|submit| D[_submitted_commit]
    C -->|auto_best| E[_best_from_db]
    D -->|submission.json missing or no commit| F[raise NoCandidateError]
    D -->|commit found| G[sha]
    E -->|engine.db is None| H[raise RuntimeError - sidecar misconfiguration]
    E -->|df empty or missing column| F
    E -->|all rows are base_commit| F
    E -->|candidates found| I[admin re-score shortlist]
    I --> G
    F --> J[catch NoCandidateError in finalize]
    J --> K[log warning - return floor scores for all targets]
    H --> L[propagates uncaught outside finalize]
    G --> M[evaluate_admin per target]
    M --> N[return reward dict]

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/tests/test_harbor_verifier.py:9
`NoCandidateError` is imported but unused after the old `pytest.raises(NoCandidateError)` test was replaced.

```suggestion
from vero.harbor.verifier import VerificationTarget, Verifier
```


Found in a live Mode B smoke run (GAIA example): with a small budget, an optimizer that spends every eval on the seeded baseline leaves an empty candidate pool (auto_best excludes base_commit from selection), and finalize returned 409 "no candidate experiments", killing the outer Harbor trial with RewardFileNotFoundError.

"The optimizer produced no scorable candidate" is an outcome of an optimization run, not an infrastructure failure; its honest value is the floor score.

- Verifier.finalize() catches NoCandidateError and returns {reward_key: default_minimum_score} for every target, with a warning log. Applies to both auto_best (no non-baseline experiments) and submit (no submission.json / empty commit).
- A missing experiment database is reclassified as RuntimeError: that is sidecar misconfiguration, and it must surface as an error rather than silently zeroing every trial.
- Tests: baseline-only pool, empty experiment table, and submit-mode no-submission all floor to 0.0 without spending admin evals; missing DB still raises; candidates-present path unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale mentioned this pull request Jul 2, 2026

docs(harbor): warn in the task instruction that baseline evals create no candidate #12

Open

shehabyasser-scale wants to merge 1 commit into harbor-2-sidecar-fixes from harbor-2-sidecar-autobest-fallback
