fix(harbor): floor rewards on no-candidate outcome instead of erroring#11
Open
shehabyasser-scale wants to merge 1 commit into
Open
fix(harbor): floor rewards on no-candidate outcome instead of erroring#11shehabyasser-scale wants to merge 1 commit into
shehabyasser-scale wants to merge 1 commit into
Conversation
Found in a live Mode B smoke run (GAIA example): with a small budget, an
optimizer that spends every eval on the seeded baseline leaves an empty
candidate pool (auto_best excludes base_commit from selection), and
finalize returned 409 "no candidate experiments", killing the outer
Harbor trial with RewardFileNotFoundError. "The optimizer produced no
scorable candidate" is an outcome of an optimization run, not an
infrastructure failure; its honest value is the floor score.
- Verifier.finalize() catches NoCandidateError and returns
{reward_key: default_minimum_score} for every target, with a warning
log. Applies to both auto_best (no non-baseline experiments) and
submit (no submission.json / empty commit).
- A missing experiment database is reclassified as RuntimeError:
that is sidecar misconfiguration, and it must surface as an error
rather than silently zeroing every trial.
- Tests: baseline-only pool, empty experiment table, and submit-mode
no-submission all floor to 0.0 without spending admin evals; missing
DB still raises; candidates-present path unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #8 (
harbor-2-sidecar-fixes). Implements the fix proposed in the live-run finding on #4.Problem (found empirically, not in review)
In a live Mode B smoke run of
examples/gaia-optimization(outer claude-code trial, nested Modal evals), the optimizer spent its wholetotal_run_budget: 1on evaluating the seeded baseline commit ("measure the baseline first" is a natural agent strategy).auto_bestselection excludesbase_commitfrom the candidate pool, sofinalizereturned:and the outer Harbor trial died with
RewardFileNotFoundError: an exception, not a score. Harness-level aggregation counts these trials as errors rather than zeros, and with small budgets a perfectly reasonable agent walks into it.Fix
"The optimizer produced no scorable candidate" is a legitimate outcome of an optimization run, and its honest value is the floor score:
Verifier.finalize()now catchesNoCandidateErrorand writes{reward_key: default_minimum_score}(0.0) for every configured target, with a warning log. Applies to bothauto_best(no non-baseline experiments on the selection split) andsubmit(agent never submitted).NoCandidateErrortoRuntimeError: that is sidecar misconfiguration, and it must surface as an error rather than silently zeroing every trial. This is the deliberate line between agent outcomes (floor) and infra failures (raise).base_commitexclusion itself is unchanged; it correctly stops "do nothing" from winning selection.Behavior change
RewardFileNotFoundError(trial errors)reward.jsonwith 0.0 per targetreward.jsonwith 0.0 per targetreward.jsonwith 0.0 per targetRuntimeError(500; loud infra failure)Tests
tests/test_harbor_verifier.py: baseline-only pool floors, empty experiment table floors, submit-no-submission floors (multi-target), missing DB still raises, candidates-present regression guard. 19 pass.A follow-up (separate, compiler-side) adds an instruction-template warning that baseline evals consume budget without creating candidates.
🤖 Generated with Claude Code
Greptile Summary
This PR changes
finalize()to treat "no scorable candidate" as a legitimate run outcome rather than a hard error: when_select_commitraisesNoCandidateError, every configured target is now written toreward.jsonatdefault_minimum_score(0.0) with a warning, preventing the outer Harbor harness from counting the trial as an error.Verifier.finalize()wraps_select_commitin atry/except NoCandidateErrorblock and returns a floor-score dict, with no downstream admin scoring triggered.NoCandidateErrortoRuntimeErrorso broken-sidecar misconfiguration still surfaces as a loud error rather than silently zeroing trials.Confidence Score: 4/5
Safe to merge; the logic change is narrow and well-tested, with the only open item being a trivial unused import in the test file.
The verifier change is well-scoped: the except NoCandidateError block is correctly typed (it does not accidentally swallow the new RuntimeError path because NoCandidateError is a subclass, not the parent), the floor-score dict comprehension is straightforward, and all five new test scenarios validate the intended boundary. The one minor gap is the leftover NoCandidateError import in the test file.
The unused NoCandidateError import in vero/tests/test_harbor_verifier.py is the only item worth a second look.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[finalize called] --> B[_select_commit] B --> C{reward_mode} C -->|submit| D[_submitted_commit] C -->|auto_best| E[_best_from_db] D -->|submission.json missing or no commit| F[raise NoCandidateError] D -->|commit found| G[sha] E -->|engine.db is None| H[raise RuntimeError - sidecar misconfiguration] E -->|df empty or missing column| F E -->|all rows are base_commit| F E -->|candidates found| I[admin re-score shortlist] I --> G F --> J[catch NoCandidateError in finalize] J --> K[log warning - return floor scores for all targets] H --> L[propagates uncaught outside finalize] G --> M[evaluate_admin per target] M --> N[return reward dict]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[finalize called] --> B[_select_commit] B --> C{reward_mode} C -->|submit| D[_submitted_commit] C -->|auto_best| E[_best_from_db] D -->|submission.json missing or no commit| F[raise NoCandidateError] D -->|commit found| G[sha] E -->|engine.db is None| H[raise RuntimeError - sidecar misconfiguration] E -->|df empty or missing column| F E -->|all rows are base_commit| F E -->|candidates found| I[admin re-score shortlist] I --> G F --> J[catch NoCandidateError in finalize] J --> K[log warning - return floor scores for all targets] H --> L[propagates uncaught outside finalize] G --> M[evaluate_admin per target] M --> N[return reward dict]Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "fix(harbor): floor rewards on no-candida..." | Re-trigger Greptile