docs(harbor): warn in the task instruction that baseline evals create no candidate #12

Open
shehabyasser-scale wants to merge 1 commit into harbor-3-compiler-fixes from harbor-3-compiler-instruction-warning

shehabyasser-scale (Collaborator) commented Jul 2, 2026

Stacked on #9 (harbor-3-compiler-fixes). Companion to #11: the generated instruction.md (auto_best branch) now warns the optimizer that only commits other than the seeded baseline are selectable, and that evaluating the unmodified baseline spends budget without creating a candidate.

Found in the same live Mode B smoke run as #11: the optimizer spent its whole budget measuring the baseline and walked blind into finalize's empty candidate pool. #11 makes that outcome score 0.0 instead of erroring; this PR makes the agent unlikely to hit it at all.
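The failure mode above can be sketched in a few lines. This is an illustration only, not the actual finalize code: `EvalRecord`, `finalize`, `BASELINE_SHA`, and the scores are hypothetical names and values, and the 0.0 fallback simply mirrors the behaviour described for #11.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout: this is not the real vero finalize code.
BASELINE_SHA = "seed000"


@dataclass(frozen=True)
class EvalRecord:
    commit: str   # commit that was evaluated
    score: float  # score on the selection split
    cost: int     # budget units consumed by this eval


def finalize(
    evals: list[EvalRecord], baseline: str = BASELINE_SHA
) -> tuple[Optional[str], float]:
    """Pick the best non-baseline commit; fall back to 0.0 if none exists."""
    candidates = [e for e in evals if e.commit != baseline]
    if not candidates:
        # Behaviour described for #11: score 0.0 instead of raising.
        return None, 0.0
    best = max(candidates, key=lambda e: e.score)
    return best.commit, best.score


# Budget spent entirely on the baseline: nothing is selectable.
baseline_only = [EvalRecord(BASELINE_SHA, 0.71, 5), EvalRecord(BASELINE_SHA, 0.70, 5)]
# One evaluated modified commit is enough to populate the candidate pool.
with_candidate = baseline_only[:1] + [EvalRecord("abc123", 0.64, 5)]

print(finalize(baseline_only))   # (None, 0.0)
print(finalize(with_candidate))  # ('abc123', 0.64)
```

In the second case the modified commit is selected even though it scores below the baseline, because in this sketch the baseline is never part of the pool.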

One rendered-content test added (test_instruction_warns_baseline_not_selectable). 9 pass.

🤖 Generated with Claude Code

Greptile Summary

Adds a warning to the auto_best task instruction to tell the optimizer that evaluating the unmodified baseline does not create a candidate, so budget spent there cannot contribute to the final selection. A new rendered-content test verifies the warning is present in the compiled instruction.md.

  • instruction.md.j2: inside the {% else %} (auto-select) branch, appends two sentences explaining that only commits other than the seeded baseline are selectable and that baseline evals consume budget without creating a candidate.
  • test_harbor_build.py: adds test_instruction_warns_baseline_not_selectable which reads the compiled output and asserts two key substrings are present.

Confidence Score: 5/5

Safe to merge — template and test changes only, no runtime logic altered.

The change is a two-sentence documentation addition inside an existing Jinja2 conditional block, paired with a straightforward substring-presence test. The warning lands only in the auto_best (non-submit) rendering path and does not touch any Python logic. The test fixture already exercises that path via reward_mode='auto_best', confirming the rendered output contains both asserted strings.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/build/templates/instruction.md.j2 | Adds a warning paragraph in the auto_best (submit_enabled=False) branch telling the optimizer that baseline evals do not create candidates and budget will be wasted if no modified commit is evaluated. |
| vero/tests/test_harbor_build.py | Adds test_instruction_warns_baseline_not_selectable that reads the compiled instruction.md and asserts two key substrings are present; the shared built fixture uses reward_mode='auto_best' so the else-branch renders correctly, but there is no corresponding assertion that the warning is absent when submit_enabled=True. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[compile_task] --> B{submit_enabled?}
    B -- Yes --> C["Step 5: vero harbor submit\n(manual nomination)"]
    B -- No --> D["Auto-select best commit\non selection_split"]
    D --> E["⚠️ Warning: baseline evals spend\nbudget without creating a candidate.\nEval at least one modified commit."]
    E --> F[finalize selects best non-baseline commit]
    C --> G[finalize uses nominated commit]


Reviews (1): Last reviewed commit: "docs(harbor): warn in the task instructi..."

Greptile also left 1 inline comment on this PR.

… no candidate

Companion to the finalize no-candidate fallback (PR #11): in auto_best
mode the generated instruction now tells the optimizer that only
non-baseline commits are selectable and that evaluating the unmodified
baseline spends budget without creating a candidate. Found live: an
optimizer that spent its whole budget measuring the baseline walked into
finalize's empty candidate pool blind.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +172 to +178
def test_instruction_warns_baseline_not_selectable(built):
    # auto_best: the agent must be told baseline evals do not create candidates
    # (found live: an optimizer that spent its whole budget measuring the
    # baseline died with "no candidate experiments" at finalize).
    text = (built / "instruction.md").read_text()
    assert "other than the seeded" in text
    assert "spends budget without" in text


P2 Test only checks positive case; no guard for wrong branch

The test verifies the warning is present in auto_best mode, but there is no assertion that it is absent when submit_enabled=True. If the warning text were accidentally moved outside the {% else %} block (i.e., into the unconditional portion of the template), this test would still pass while all submit_enabled tasks would also display the misleading baseline warning. Adding a second fixture or parametrised case that compiles with reward_mode set to a manual-submit mode and asserts neither phrase appears would make the conditional boundary explicit.
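A minimal sketch of the paired positive and negative checks this comment asks for. It assumes a stand-in `render_instruction` function in place of the real compile step; that function, the `WARNING` wording, and the test names are hypothetical. Only the two asserted substrings are taken from the existing test. In the real suite the negative case would compile the task through a second fixture or a parametrised case, as suggested above.

```python
# Hypothetical stand-in for compiling instruction.md; the real test would
# build the task through the project's own fixture instead.
WARNING = (
    "Only commits other than the seeded baseline are selectable. "
    "Evaluating the unmodified baseline spends budget without creating a candidate."
)  # illustrative wording, not the template's exact text

# The two substrings asserted by the existing test.
PHRASES = ("other than the seeded", "spends budget without")


def render_instruction(submit_enabled: bool) -> str:
    """Mimic the template's if/else: the warning lives only in the else branch."""
    if submit_enabled:
        return "Step 5: nominate a commit with `vero harbor submit`."
    return "The best commit on the selection split is chosen automatically. " + WARNING


def test_warning_present_in_auto_select():
    text = render_instruction(submit_enabled=False)
    assert all(phrase in text for phrase in PHRASES)


def test_warning_absent_when_submit_enabled():
    # Guards the conditional boundary: fails if the warning leaks out of the
    # else branch into the unconditional part of the instruction.
    text = render_instruction(submit_enabled=True)
    assert not any(phrase in text for phrase in PHRASES)


test_warning_present_in_auto_select()
test_warning_absent_when_submit_enabled()
```

Checking absence with `not any(...)` rather than `not all(...)` means the negative case fails even if only one of the two sentences leaks into the submit-enabled rendering.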

