docs(harbor): warn in the task instruction that baseline evals create no candidate #12
… no candidate

Companion to the finalize no-candidate fallback (PR #11): in auto_best mode the generated instruction now tells the optimizer that only non-baseline commits are selectable and that evaluating the unmodified baseline spends budget without creating a candidate.

Found live: an optimizer that spent its whole budget measuring the baseline walked into finalize's empty candidate pool blind.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
def test_instruction_warns_baseline_not_selectable(built):
    # auto_best: the agent must be told baseline evals do not create candidates
    # (found live: an optimizer that spent its whole budget measuring the
    # baseline died with "no candidate experiments" at finalize).
    text = (built / "instruction.md").read_text()
    assert "other than the seeded" in text
    assert "spends budget without" in text
Test only checks positive case; no guard for wrong branch
The test verifies the warning is present in auto_best mode, but there is no assertion that it is absent when submit_enabled=True. If the warning text were accidentally moved outside the {% else %} block (i.e., into the unconditional portion of the template), this test would still pass while all submit_enabled tasks would also display the misleading baseline warning. Adding a second fixture or parametrised case that compiles with reward_mode set to a manual-submit mode and asserts neither phrase appears would make the conditional boundary explicit.
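A minimal sketch of the two-sided check suggested above. It is self-contained for illustration only: `render_instruction`, `BASELINE_WARNING`, and the surrounding instruction text are hypothetical stand-ins for the project's real compile step and Jinja2 template, which are not shown here. Only the two asserted phrases come from the existing test.

```python
# Hypothetical stand-in for the compiled instruction text. The real project
# renders instruction.md from a Jinja2 template through its own build step.
BASELINE_WARNING = (
    "Only commits other than the seeded baseline are selectable; "
    "evaluating the unmodified baseline spends budget without creating a candidate."
)

# The two substrings the existing test asserts on.
PHRASES = ("other than the seeded", "spends budget without")


def render_instruction(submit_enabled: bool) -> str:
    """Mimic the template's if/else: the warning lives only in the else branch."""
    lines = ["# Task"]
    if submit_enabled:
        lines.append("Nominate your final commit with the submit step.")
    else:
        lines.append("The best commit is selected automatically.")
        lines.append(BASELINE_WARNING)
    return "\n".join(lines)


def warning_present(text: str) -> bool:
    """True when every expected phrase appears in the rendered text."""
    return all(phrase in text for phrase in PHRASES)


def warning_absent(text: str) -> bool:
    """True when none of the phrases appear in the rendered text."""
    return not any(phrase in text for phrase in PHRASES)


def test_warning_only_in_auto_select_branch():
    # Positive case: auto-select mode must carry the warning.
    assert warning_present(render_instruction(submit_enabled=False))
    # Negative case: manual-submit mode must not carry it.
    assert warning_absent(render_instruction(submit_enabled=True))
```

In the real test suite the same pair of assertions would presumably run against two compiled fixtures, one per mode, rather than against a stand-in renderer.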
Path: vero/tests/test_harbor_build.py
Line: 172-178
Stacked on #9 (`harbor-3-compiler-fixes`). Companion to #11: the generated `instruction.md` (auto_best branch) now warns the optimizer that only commits other than the seeded baseline are selectable, and that evaluating the unmodified baseline spends budget without creating a candidate.

Found in the same live Mode B smoke run as #11: the optimizer spent its whole budget measuring the baseline and walked blind into finalize's empty candidate pool. #11 makes that outcome score 0.0 instead of erroring; this PR makes the agent unlikely to hit it at all.

One rendered-content test added (`test_instruction_warns_baseline_not_selectable`). 9 pass.

🤖 Generated with Claude Code
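To make the failure mode concrete, here is a rough sketch of the selection behaviour the description implies: the baseline is excluded from the candidate pool, and an empty pool falls back to 0.0. The names `Evaluation`, `select_best`, and `final_score` are hypothetical and not taken from the codebase. Returning the best candidate's own score as the final score is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evaluation:
    commit: str   # commit identifier (hypothetical field)
    score: float  # score on the selection split (hypothetical field)


def select_best(evals: list[Evaluation], baseline_commit: str) -> Optional[Evaluation]:
    """Pick the highest-scoring evaluation whose commit is not the baseline."""
    candidates = [e for e in evals if e.commit != baseline_commit]
    # default=None covers the empty pool, e.g. when only the baseline was measured.
    return max(candidates, key=lambda e: e.score, default=None)


def final_score(evals: list[Evaluation], baseline_commit: str) -> float:
    """Fall back to 0.0 when no candidate exists, instead of raising."""
    best = select_best(evals, baseline_commit)
    return best.score if best is not None else 0.0
```

Under these assumptions, a run that only ever evaluates the baseline commit ends with a final score of 0.0 no matter how well the baseline scored, which is the outcome the new instruction text warns against.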
Greptile Summary

Adds a warning to the `auto_best` task instruction to tell the optimizer that evaluating the unmodified baseline does not create a candidate, so budget spent there cannot contribute to the final selection. A new rendered-content test verifies the warning is present in the compiled `instruction.md`.

- `instruction.md.j2`: inside the `{% else %}` (auto-select) branch, appends two sentences explaining that only commits other than the seeded baseline are selectable and that baseline evals consume budget without creating a candidate.
- `test_harbor_build.py`: adds `test_instruction_warns_baseline_not_selectable`, which reads the compiled output and asserts two key substrings are present.

Confidence Score: 5/5
Safe to merge — template and test changes only, no runtime logic altered.
The change is a two-sentence documentation addition inside an existing Jinja2 conditional block, paired with a straightforward substring-presence test. The warning lands only in the auto_best (non-submit) rendering path and does not touch any Python logic. The test fixture already exercises that path via reward_mode='auto_best', confirming the rendered output contains both asserted strings.
No files require special attention.
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[compile_task] --> B{submit_enabled?}
    B -- Yes --> C["Step 5: vero harbor submit\n(manual nomination)"]
    B -- No --> D["Auto-select best commit\non selection_split"]
    D --> E["⚠️ Warning: baseline evals spend\nbudget without creating a candidate.\nEval at least one modified commit."]
    E --> F[finalize selects best non-baseline commit]
    C --> G[finalize uses nominated commit]
Reviews (1): Last reviewed commit: "docs(harbor): warn in the task instructi..."