docs(harbor): warn in the task instruction that baseline evals create no candidate by shehabyasser-scale · Pull Request #12 · scaleapi/vero

shehabyasser-scale · 2026-07-02T10:34:53Z

Stacked on #9 (harbor-3-compiler-fixes). Companion to #11: the generated instruction.md (auto_best branch) now warns the optimizer that only commits other than the seeded baseline are selectable, and that evaluating the unmodified baseline spends budget without creating a candidate.

Found in the same live Mode B smoke run as #11: the optimizer spent its whole budget measuring the baseline and walked blind into finalize's empty candidate pool. #11 makes that outcome score 0.0 instead of erroring; this PR makes the agent unlikely to hit it at all.
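To make the failure mode concrete, here is a minimal sketch of the selection rule as described above. It is not the actual finalize code: `EvalRecord`, `select_final_score`, and the "higher score is better" ordering are assumptions for illustration. Only the two behaviours stated in this PR are modelled: the seeded baseline is never a candidate, and an empty candidate pool scores 0.0 (the #11 fallback).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """Hypothetical record of one evaluation (names are illustrative)."""
    commit: str   # commit that was evaluated
    score: float  # score on the selection split (assumed: higher is better)


def select_final_score(evals: list[EvalRecord], baseline_commit: str) -> float:
    """Return the best non-baseline score, or 0.0 if no candidate exists."""
    # Evaluations of the seeded baseline never enter the candidate pool.
    candidates = [e for e in evals if e.commit != baseline_commit]
    if not candidates:
        # Empty pool: fall back to 0.0 instead of raising.
        return 0.0
    return max(e.score for e in candidates)


# Whole budget spent on the baseline: nothing is selectable.
baseline_only = [EvalRecord("base", 0.71), EvalRecord("base", 0.70)]
# One modified commit evaluated: it becomes the only candidate,
# even though it scores below the baseline.
with_candidate = baseline_only + [EvalRecord("abc123", 0.64)]

print(select_final_score(baseline_only, "base"))   # 0.0
print(select_final_score(with_candidate, "base"))  # 0.64
```

In this sketch the baseline's own score never influences the result, which is why measuring it repeatedly only consumes budget.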

One rendered-content test added (test_instruction_warns_baseline_not_selectable). 9 pass.

🤖 Generated with Claude Code

Greptile Summary

Adds a warning to the auto_best task instruction to tell the optimizer that evaluating the unmodified baseline does not create a candidate, so budget spent there cannot contribute to the final selection. A new rendered-content test verifies the warning is present in the compiled instruction.md.

instruction.md.j2: inside the {% else %} (auto-select) branch, appends two sentences explaining that only commits other than the seeded baseline are selectable and that baseline evals consume budget without creating a candidate.
test_harbor_build.py: adds test_instruction_warns_baseline_not_selectable which reads the compiled output and asserts two key substrings are present.

Confidence Score: 5/5

Safe to merge — template and test changes only, no runtime logic altered.

The change is a two-sentence documentation addition inside an existing Jinja2 conditional block, paired with a straightforward substring-presence test. The warning lands only in the auto_best (non-submit) rendering path and does not touch any Python logic. The test fixture already exercises that path via reward_mode='auto_best', confirming the rendered output contains both asserted strings.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/build/templates/instruction.md.j2 | Adds a warning paragraph in the auto_best (submit_enabled=False) branch telling the optimizer that baseline evals do not create candidates and budget will be wasted if no modified commit is evaluated. |
| vero/tests/test_harbor_build.py | Adds test_instruction_warns_baseline_not_selectable that reads the compiled instruction.md and asserts two key substrings are present; the shared built fixture uses reward_mode='auto_best' so the else-branch renders correctly, but there is no corresponding assertion that the warning is absent when submit_enabled=True. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[compile_task] --> B{submit_enabled?}
    B -- Yes --> C["Step 5: vero harbor submit\n(manual nomination)"]
    B -- No --> D["Auto-select best commit\non selection_split"]
    D --> E["⚠️ Warning: baseline evals spend\nbudget without creating a candidate.\nEval at least one modified commit."]
    E --> F[finalize selects best non-baseline commit]
    C --> G[finalize uses nominated commit]


Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/tests/test_harbor_build.py:172-178
**Test only checks positive case; no guard for wrong branch**

The test verifies the warning is present in `auto_best` mode, but there is no assertion that it is absent when `submit_enabled=True`. If the warning text were accidentally moved outside the `{% else %}` block (i.e., into the unconditional portion of the template), this test would still pass while all `submit_enabled` tasks would also display the misleading baseline warning. Adding a second fixture or parametrised case that compiles with `reward_mode` set to a manual-submit mode and asserts neither phrase appears would make the conditional boundary explicit.

Reviews (1): Last reviewed commit: "docs(harbor): warn in the task instructi..."

Greptile also left 1 inline comment on this PR.

… no candidate

Companion to the finalize no-candidate fallback (PR #11): in auto_best mode the generated instruction now tells the optimizer that only non-baseline commits are selectable and that evaluating the unmodified baseline spends budget without creating a candidate.

Found live: an optimizer that spent its whole budget measuring the baseline walked into finalize's empty candidate pool blind.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-07-02T10:37:02Z

+def test_instruction_warns_baseline_not_selectable(built):
+    # auto_best: the agent must be told baseline evals do not create candidates
+    # (found live: an optimizer that spent its whole budget measuring the
+    # baseline died with "no candidate experiments" at finalize).
+    text = (built / "instruction.md").read_text()
+    assert "other than the seeded" in text
+    assert "spends budget without" in text


Test only checks positive case; no guard for wrong branch

The test verifies the warning is present in auto_best mode, but there is no assertion that it is absent when submit_enabled=True. If the warning text were accidentally moved outside the {% else %} block (i.e., into the unconditional portion of the template), this test would still pass while all submit_enabled tasks would also display the misleading baseline warning. Adding a second fixture or parametrised case that compiles with reward_mode set to a manual-submit mode and asserts neither phrase appears would make the conditional boundary explicit.
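One way to pin the conditional boundary is to pair the existing presence check with an absence check on the other branch. The sketch below is self-contained and does not call the real compiler: `render_instruction` is a hypothetical stand-in for the two template branches, and the warning wording is a placeholder that only preserves the two substrings the existing test asserts. In the real suite, the second case would presumably need a fixture compiled with submit enabled; how that fixture is built is not shown in this PR.

```python
# Placeholder wording; only the two asserted substrings come from the test.
AUTO_BEST_WARNING = (
    "Only commits other than the seeded baseline are selectable; "
    "evaluating the unmodified baseline spends budget without "
    "creating a candidate."
)

PHRASES = ("other than the seeded", "spends budget without")


def render_instruction(submit_enabled: bool) -> str:
    """Hypothetical stand-in for the compiled instruction.md."""
    if submit_enabled:
        # Manual-nomination branch: no baseline warning.
        body = "Step 5: nominate a commit with `vero harbor submit`."
    else:
        # Auto-select branch: the warning belongs here and only here.
        body = (
            "The best commit on the selection split is chosen automatically. "
            + AUTO_BEST_WARNING
        )
    return "# Task\n\n" + body + "\n"


def test_warning_present_in_auto_best() -> None:
    text = render_instruction(submit_enabled=False)
    assert all(phrase in text for phrase in PHRASES)


def test_warning_absent_when_submit_enabled() -> None:
    text = render_instruction(submit_enabled=True)
    assert not any(phrase in text for phrase in PHRASES)


# Run both checks directly so the sketch works without a test runner.
test_warning_present_in_auto_best()
test_warning_absent_when_submit_enabled()
```

If the warning were moved into the unconditional part of the template, the second check would fail while the first kept passing, which is the regression the review comment describes.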


greptile-apps Bot reviewed Jul 2, 2026

shehabyasser-scale wants to merge 1 commit into harbor-3-compiler-fixes from harbor-3-compiler-instruction-warning
