[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor by varunursekar · Pull Request #3 · scaleapi/vero

varunursekar · 2026-06-24T18:12:06Z

Draft · Stack 1 of 4 — review/merge in order (1→4). No Harbor concepts here; foundations the rest builds on (and useful on their own).

3-tier split visibility: SplitAccessLevel gains no_access (hidden) alongside visible / non_viewable (aggregate-only).
BudgetLedger (core/budget.py): metered, optionally-durable per-split eval budget.
Staged eval: TaskOutput is a model; inference and scoring run as separate resumable disk-backed stages, with label-field scrubbing before inference.
Evaluation refactor: consolidate under vero/evaluation/ as a shared EvaluationEngine + Evaluator, port ExperimentRunnerTool onto it, and add the injectable EvalStrategy seam (Mode B's extension point).

Review focus: the visibility enum, BudgetLedger, the EvaluationEngine API, and the strategy seam. Much of evaluation/ is relocated code (skim).

Part of the Harbor integration (full squashed view: #2). Stack: this → [2/4] sidecar → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the core evaluation pieces for split visibility, budget metering, and staged evaluation. The main changes are:

A new no_access split tier and fail-closed split access resolution.
A BudgetLedger for per-split run and sample budgets.
A shared EvaluationEngine used by evaluation frontends.
Staged inference/scoring persistence with label-field scrubbing.
An injectable evaluation strategy seam for alternate execution modes.

Confidence Score: 4/5

Merge is blocked by the durable budget reload behavior because persisted spend state is ignored after restart.

The implementation is generally well-scoped, but the persistence path for budget accounting does not restore recorded remaining values, which can allow repeated spending across process restarts.

vero/src/vero/core/budget.py

T-Rex Logs

What T-Rex did

Reproduced the persisted budget reset behavior by running the focused tempfile repro script with PYTHONPATH=vero/src, which used a direct module-load fallback due to a missing toml.
The first reserve reduced remaining_run_budget to 0 and remaining_sample_budget to 2 and wrote those values to the persisted JSON.
Constructed a fresh BudgetLedger with the same persist_path and configured totals to confirm remaining_run_budget=1 and remaining_sample_budget=5 after restart.
The fresh ledger allowed check(3) and a second reserve, proving it spent from reset totals rather than the persisted remaining budget.
The split-visibility gating behavior changed from before to after, with the after state satisfying the requested gating and policy budget-tier contracts.
The staged eval resume workflow moved from a TypeError caused by create_task with an unexpected keyword argument to a successful after-state with has_run_inference_stage True and a persisted JSON showing output, counters, and later scoring updates.
The evaluation engine strategy before showed a ModuleNotFoundError for vero.evaluation.engine, while the after shows head imports ok, engine_type as EvaluationEngine, and budget decrement during evaluation.
A supplemental attempt shows pytest is unavailable in the environment, so the pytest run failed for environmental reasons rather than PR code behavior.

_{Ran code and verified through T-Rex}

Comments Outside Diff (1)

vero/src/vero/core/budget.py, line 175-193 (link)

Blocking I/O inside asyncio.Lock in reserve()

_flush() calls Path.write_text() (synchronous) while reserve() holds self._lock — an asyncio.Lock. Any concurrent coroutine that awaits reserve() will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes in asyncio.to_thread() (or use aiofiles) and await them inside reserve(), keeping the lock protecting only the in-memory check+decrement step.

Prompt To Fix With AI

This is a comment left during a code review.
Path: vero/src/vero/core/budget.py
Line: 175-193

Comment:
**Blocking I/O inside `asyncio.Lock` in `reserve()`**

`_flush()` calls `Path.write_text()` (synchronous) while `reserve()` holds `self._lock` — an `asyncio.Lock`. Any concurrent coroutine that awaits `reserve()` will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes in `asyncio.to_thread()` (or use `aiofiles`) and `await` them inside `reserve()`, keeping the lock protecting only the in-memory check+decrement step.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/core/budget.py:102-110
**Persisted budgets reset**

`_flush()` writes `remaining_sample_budget` and `remaining_run_budget` to `persist_path`, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, `BudgetLedger(configured_budgets, persist_path=...)` rebuilds fresh `SplitBudget` objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used.

_{Reviews (2): Last reviewed commit: "Merge pull request #7 from scaleapi/harb..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

…ne refactor Foundational changes that the Harbor integration builds on, useful on their own: - 3-tier split visibility: `SplitAccessLevel` gains `no_access` (hidden) alongside `visible` and `non_viewable` (aggregate-only). - `BudgetLedger` (core/budget.py): a metered, optionally durable per-split eval budget extracted from the experiment runner. - `TaskOutput` is now a model; the task pipeline runs inference and scoring as separate, resumable, disk-backed stages, with label-field scrubbing before inference. - Evaluation refactor: consolidate the eval path under `vero/evaluation/` as a shared `EvaluationEngine` + `Evaluator`, port `ExperimentRunnerTool` onto it, and add an injectable `EvalStrategy` seam (the extension point Harbor Mode B uses). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale · 2026-06-30T08:32:55Z

@@ -337,6 +343,13 @@ async def init(self) -> None:
        self._validate_budget_splits()


_validate_budget_splits() only checks budget splits against the dataset; it never reconciles budget against split_accesses. These two lists jointly enforce visibility (budget = which splits the agent may RUN; split_accesses = what gets WRITTEN back per tier). An init-time cross-check would close a real hole: every agent-budgeted split should be explicitly tiered non_viewable/no_access, raising on viewable-or-unlisted. Combined with tier_for_split defaulting an unlisted split to viewable (over in harbor/protocol.py), the concrete failure is: budget a split, forget to tier it, and the agent both runs it and gets its per-sample labels written to its own volume.

shehabyasser-scale · 2026-06-30T08:33:01Z

+    async def evaluate(self, req: EvalRequest, *, admin: bool = False) -> Experiment:
+        """Meter (unless admin) and run one evaluation; return the full Experiment.
+
+        ``no_access`` gating is implicit: those splits are absent from the budget


This docstring is honest that no_access gating is implicit — it relies on those splits being absent from the budget ledger so reserve() raises. That couples a security tier to a budgeting structure: a split that is in the ledger but should be no_access would not be rejected here. Recommend evaluate() consult split_accesses directly and hard-reject tier == no_access for non-admin callers, independent of ledger membership.

shehabyasser-scale · 2026-06-30T08:33:10Z

+        ]
+        self.persist_path.parent.mkdir(parents=True, exist_ok=True)
+        tmp = self.persist_path.with_suffix(self.persist_path.suffix + ".tmp")
+        tmp.write_text(json.dumps(data, indent=2))


reserve() holds self._lock (an asyncio.Lock) across this synchronous tmp.write_text() + replace(). In the durable sidecar path under concurrency this stalls the event loop on every reservation. Keep the lock around only the in-memory check+decrement, and push the flush out with await asyncio.to_thread(...). (Confirms Greptile's P1.)

…f-loop ledger flush Core trust-boundary hardening for the eval engine (review findings on PR #3): - engine: explicit, fail-closed no_access gate in evaluate(). Resolve the split's tier (unlisted defaults to no_access) and reject no_access for non-admin callers before the budget ledger, instead of relying implicitly on the split being absent from the ledger. Adds resolve_split_access() in core.dataset.base (no vero.harbor import) as the single fail-closed resolver. - policy: reconcile budgets and split access at init. Auto-tier every agent-budgeted split to non_viewable when unlisted (evaluable, labels hidden), then reject any budgeted split explicitly tiered viewable or no_access. Keeps the train_budget convenience path working while closing the leak. - budget: move the durable ledger flush off the event loop (asyncio.to_thread) while keeping it under the reservation lock, so the sync write no longer blocks the loop and concurrent flushes cannot race the temp file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale · 2026-06-30T10:27:37Z

On the design question raised in review (should split access be a dataset property checked in real time?): yes. I implemented the minimal version in the fix PR for this branch (#7) and wrote up the full reasoning there: #7 (comment) (budget vs access as separate axes, trust ownership, flexibility, and the deferred DatasetInfo follow-up).

fix(harbor): fail-closed split access, budget/tier reconciliation, off-loop ledger flush [fixes 1/4 core]

socket-security · 2026-06-30T23:39:38Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	pypi/fastapi@0.137.1

View full report

greptile-apps · 2026-06-30T23:43:48Z

T-Rex pricing update — T-Rex was free through June 2026. Effective July 1, 2026, T-Rex adds 2 credits on top of the standard 1-credit review (3 total). T-Rex settings

greptile-apps · 2026-06-30T23:43:52Z

+        budgets: list[SplitBudget] | None = None,
+        *,
+        persist_path: Path | str | None = None,
+    ):
+        self._budgets: dict[tuple[str, str], SplitBudget] = {
+            (b.split, b.dataset_id): b for b in (budgets or [])
+        }
+        self.persist_path = Path(persist_path) if persist_path else None
+        self._lock = asyncio.Lock()


Persisted budgets reset

_flush() writes remaining_sample_budget and remaining_run_budget to persist_path, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, BudgetLedger(configured_budgets, persist_path=...) rebuilds fresh SplitBudget objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used.

Artifacts

Repro: focused persisted budget reset script

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: execution output showing persisted remaining budgets ignored after restart

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/core/budget.py Line: 102-110 Comment: **Persisted budgets reset** `_flush()` writes `remaining_sample_budget` and `remaining_run_budget` to `persist_path`, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, `BudgetLedger(configured_budgets, persist_path=...)` rebuilds fresh `SplitBudget` objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used. How can I resolve this? If you propose a fix, please make it concise.

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:17

varunursekar marked this pull request as ready for review June 24, 2026 18:21

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread vero/src/vero/core/task/task.py

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

fix(harbor): fail-closed split access, budget/tier reconciliation, off-loop ledger flush [fixes 1/4 core] #7

Merged

Merge pull request #7 from scaleapi/harbor-1-core-fixes

e7c8962

fix(harbor): fail-closed split access, budget/tier reconciliation, off-loop ledger flush [fixes 1/4 core]

greptile-apps Bot reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jul 2, 2026

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A) #4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3

[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3
varunursekar wants to merge 3 commits into
mainfrom
harbor-1-core

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale commented Jun 30, 2026 •

edited

Loading

Uh oh!

socket-security Bot commented Jun 30, 2026

Uh oh!

greptile-apps Bot commented Jun 30, 2026

Uh oh!

greptile-apps Bot Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -337,6 +343,13 @@ async def init(self) -> None:
		self._validate_budget_splits()

Uh oh!

Conversation

varunursekar commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

T-Rex Logs

Comments Outside Diff (1)

Uh oh!

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

shehabyasser-scale commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

socket-security Bot commented Jun 30, 2026

Uh oh!

greptile-apps Bot commented Jun 30, 2026

Uh oh!

greptile-apps Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

shehabyasser-scale commented Jun 30, 2026 •

edited

Loading