Skip to content

[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3

Open
varunursekar wants to merge 3 commits into
mainfrom
harbor-1-core
Open

[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3
varunursekar wants to merge 3 commits into
mainfrom
harbor-1-core

Conversation

@varunursekar

@varunursekar varunursekar commented Jun 24, 2026

Copy link
Copy Markdown

Draft · Stack 1 of 4 — review/merge in order (1→4). No Harbor concepts here; foundations the rest builds on (and useful on their own).

  • 3-tier split visibility: SplitAccessLevel gains no_access (hidden) alongside visible / non_viewable (aggregate-only).
  • BudgetLedger (core/budget.py): metered, optionally-durable per-split eval budget.
  • Staged eval: TaskOutput is a model; inference and scoring run as separate resumable disk-backed stages, with label-field scrubbing before inference.
  • Evaluation refactor: consolidate under vero/evaluation/ as a shared EvaluationEngine + Evaluator, port ExperimentRunnerTool onto it, and add the injectable EvalStrategy seam (Mode B's extension point).

Review focus: the visibility enum, BudgetLedger, the EvaluationEngine API, and the strategy seam. Much of evaluation/ is relocated code (skim).

Part of the Harbor integration (full squashed view: #2). Stack: this → [2/4] sidecar → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the core evaluation pieces for split visibility, budget metering, and staged evaluation. The main changes are:

  • A new no_access split tier and fail-closed split access resolution.
  • A BudgetLedger for per-split run and sample budgets.
  • A shared EvaluationEngine used by evaluation frontends.
  • Staged inference/scoring persistence with label-field scrubbing.
  • An injectable evaluation strategy seam for alternate execution modes.

Confidence Score: 4/5

Merge is blocked by the durable budget reload behavior because persisted spend state is ignored after restart.

The implementation is generally well-scoped, but the persistence path for budget accounting does not restore recorded remaining values, which can allow repeated spending across process restarts.

vero/src/vero/core/budget.py

T-Rex T-Rex Logs

What T-Rex did

  • Reproduced the persisted budget reset behavior by running the focused tempfile repro script with PYTHONPATH=vero/src, which used a direct module-load fallback due to a missing toml.
  • The first reserve reduced remaining_run_budget to 0 and remaining_sample_budget to 2 and wrote those values to the persisted JSON.
  • Constructed a fresh BudgetLedger with the same persist_path and configured totals to confirm remaining_run_budget=1 and remaining_sample_budget=5 after restart.
  • The fresh ledger allowed check(3) and a second reserve, proving it spent from reset totals rather than the persisted remaining budget.
  • The split-visibility gating behavior changed from before to after, with the after state satisfying the requested gating and policy budget-tier contracts.
  • The staged eval resume workflow moved from a TypeError caused by create_task with an unexpected keyword argument to a successful after-state with has_run_inference_stage True and a persisted JSON showing output, counters, and later scoring updates.
  • The evaluation engine strategy before showed a ModuleNotFoundError for vero.evaluation.engine, while the after shows head imports ok, engine_type as EvaluationEngine, and budget decrement during evaluation.
  • A supplemental attempt shows pytest is unavailable in the environment, so the pytest run failed for environmental reasons rather than PR code behavior.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (1)

  1. vero/src/vero/core/budget.py, line 175-193 (link)

    P1 Blocking I/O inside asyncio.Lock in reserve()

    _flush() calls Path.write_text() (synchronous) while reserve() holds self._lock — an asyncio.Lock. Any concurrent coroutine that awaits reserve() will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes in asyncio.to_thread() (or use aiofiles) and await them inside reserve(), keeping the lock protecting only the in-memory check+decrement step.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: vero/src/vero/core/budget.py
    Line: 175-193
    
    Comment:
    **Blocking I/O inside `asyncio.Lock` in `reserve()`**
    
    `_flush()` calls `Path.write_text()` (synchronous) while `reserve()` holds `self._lock` — an `asyncio.Lock`. Any concurrent coroutine that awaits `reserve()` will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes in `asyncio.to_thread()` (or use `aiofiles`) and `await` them inside `reserve()`, keeping the lock protecting only the in-memory check+decrement step.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/core/budget.py:102-110
**Persisted budgets reset**

`_flush()` writes `remaining_sample_budget` and `remaining_run_budget` to `persist_path`, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, `BudgetLedger(configured_budgets, persist_path=...)` rebuilds fresh `SplitBudget` objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used.

Reviews (2): Last reviewed commit: "Merge pull request #7 from scaleapi/harb..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

…ne refactor

Foundational changes that the Harbor integration builds on, useful on their own:

- 3-tier split visibility: `SplitAccessLevel` gains `no_access` (hidden) alongside
  `visible` and `non_viewable` (aggregate-only).
- `BudgetLedger` (core/budget.py): a metered, optionally durable per-split eval
  budget extracted from the experiment runner.
- `TaskOutput` is now a model; the task pipeline runs inference and scoring as
  separate, resumable, disk-backed stages, with label-field scrubbing before inference.
- Evaluation refactor: consolidate the eval path under `vero/evaluation/` as a shared
  `EvaluationEngine` + `Evaluator`, port `ExperimentRunnerTool` onto it, and add an
  injectable `EvalStrategy` seam (the extension point Harbor Mode B uses).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@varunursekar varunursekar requested a review from a team June 24, 2026 18:17
@varunursekar varunursekar marked this pull request as ready for review June 24, 2026 18:21
Comment thread vero/src/vero/core/task/task.py
Comment thread vero/src/vero/policy.py
@@ -337,6 +343,13 @@ async def init(self) -> None:
self._validate_budget_splits()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_validate_budget_splits() only checks budget splits against the dataset; it never reconciles budget against split_accesses. These two lists jointly enforce visibility (budget = which splits the agent may RUN; split_accesses = what gets WRITTEN back per tier). An init-time cross-check would close a real hole: every agent-budgeted split should be explicitly tiered non_viewable/no_access, raising on viewable-or-unlisted. Combined with tier_for_split defaulting an unlisted split to viewable (over in harbor/protocol.py), the concrete failure is: budget a split, forget to tier it, and the agent both runs it and gets its per-sample labels written to its own volume.

Comment thread vero/src/vero/evaluation/engine.py Outdated
async def evaluate(self, req: EvalRequest, *, admin: bool = False) -> Experiment:
"""Meter (unless admin) and run one evaluation; return the full Experiment.

``no_access`` gating is implicit: those splits are absent from the budget

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring is honest that no_access gating is implicit — it relies on those splits being absent from the budget ledger so reserve() raises. That couples a security tier to a budgeting structure: a split that is in the ledger but should be no_access would not be rejected here. Recommend evaluate() consult split_accesses directly and hard-reject tier == no_access for non-admin callers, independent of ledger membership.

]
self.persist_path.parent.mkdir(parents=True, exist_ok=True)
tmp = self.persist_path.with_suffix(self.persist_path.suffix + ".tmp")
tmp.write_text(json.dumps(data, indent=2))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reserve() holds self._lock (an asyncio.Lock) across this synchronous tmp.write_text() + replace(). In the durable sidecar path under concurrency this stalls the event loop on every reservation. Keep the lock around only the in-memory check+decrement, and push the flush out with await asyncio.to_thread(...). (Confirms Greptile's P1.)

…f-loop ledger flush

Core trust-boundary hardening for the eval engine (review findings on PR #3):

- engine: explicit, fail-closed no_access gate in evaluate(). Resolve the
  split's tier (unlisted defaults to no_access) and reject no_access for
  non-admin callers before the budget ledger, instead of relying implicitly on
  the split being absent from the ledger. Adds resolve_split_access() in
  core.dataset.base (no vero.harbor import) as the single fail-closed resolver.
- policy: reconcile budgets and split access at init. Auto-tier every
  agent-budgeted split to non_viewable when unlisted (evaluable, labels hidden),
  then reject any budgeted split explicitly tiered viewable or no_access. Keeps
  the train_budget convenience path working while closing the leak.
- budget: move the durable ledger flush off the event loop (asyncio.to_thread)
  while keeping it under the reservation lock, so the sync write no longer
  blocks the loop and concurrent flushes cannot race the temp file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shehabyasser-scale

shehabyasser-scale commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

On the design question raised in review (should split access be a dataset property checked in real time?): yes. I implemented the minimal version in the fix PR for this branch (#7) and wrote up the full reasoning there: #7 (comment) (budget vs access as separate axes, trust ownership, flexibility, and the deferred DatasetInfo follow-up).

fix(harbor): fail-closed split access, budget/tier reconciliation, off-loop ledger flush [fixes 1/4 core]
@socket-security

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedpypi/​fastapi@​0.137.1100100100100100

View full report

@greptile-apps

greptile-apps Bot commented Jun 30, 2026

Copy link
Copy Markdown

T-Rex pricing update — T-Rex was free through June 2026. Effective July 1, 2026, T-Rex adds 2 credits on top of the standard 1-credit review (3 total). T-Rex settings

Comment on lines +102 to +110
budgets: list[SplitBudget] | None = None,
*,
persist_path: Path | str | None = None,
):
self._budgets: dict[tuple[str, str], SplitBudget] = {
(b.split, b.dataset_id): b for b in (budgets or [])
}
self.persist_path = Path(persist_path) if persist_path else None
self._lock = asyncio.Lock()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Persisted budgets reset

_flush() writes remaining_sample_budget and remaining_run_budget to persist_path, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, BudgetLedger(configured_budgets, persist_path=...) rebuilds fresh SplitBudget objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used.

Artifacts

Repro: focused persisted budget reset script

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: execution output showing persisted remaining budgets ignored after restart

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/core/budget.py
Line: 102-110

Comment:
**Persisted budgets reset**

`_flush()` writes `remaining_sample_budget` and `remaining_run_budget` to `persist_path`, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, `BudgetLedger(configured_budgets, persist_path=...)` rebuilds fresh `SplitBudget` objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants