[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3
[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3varunursekar wants to merge 3 commits into
Conversation
…ne refactor Foundational changes that the Harbor integration builds on, useful on their own: - 3-tier split visibility: `SplitAccessLevel` gains `no_access` (hidden) alongside `visible` and `non_viewable` (aggregate-only). - `BudgetLedger` (core/budget.py): a metered, optionally durable per-split eval budget extracted from the experiment runner. - `TaskOutput` is now a model; the task pipeline runs inference and scoring as separate, resumable, disk-backed stages, with label-field scrubbing before inference. - Evaluation refactor: consolidate the eval path under `vero/evaluation/` as a shared `EvaluationEngine` + `Evaluator`, port `ExperimentRunnerTool` onto it, and add an injectable `EvalStrategy` seam (the extension point Harbor Mode B uses). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| @@ -337,6 +343,13 @@ async def init(self) -> None: | |||
| self._validate_budget_splits() | |||
There was a problem hiding this comment.
_validate_budget_splits() only checks budget splits against the dataset; it never reconciles budget against split_accesses. These two lists jointly enforce visibility (budget = which splits the agent may RUN; split_accesses = what gets WRITTEN back per tier). An init-time cross-check would close a real hole: every agent-budgeted split should be explicitly tiered non_viewable/no_access, raising on viewable-or-unlisted. Combined with tier_for_split defaulting an unlisted split to viewable (over in harbor/protocol.py), the concrete failure is: budget a split, forget to tier it, and the agent both runs it and gets its per-sample labels written to its own volume.
| async def evaluate(self, req: EvalRequest, *, admin: bool = False) -> Experiment: | ||
| """Meter (unless admin) and run one evaluation; return the full Experiment. | ||
|
|
||
| ``no_access`` gating is implicit: those splits are absent from the budget |
There was a problem hiding this comment.
This docstring is honest that no_access gating is implicit — it relies on those splits being absent from the budget ledger so reserve() raises. That couples a security tier to a budgeting structure: a split that is in the ledger but should be no_access would not be rejected here. Recommend evaluate() consult split_accesses directly and hard-reject tier == no_access for non-admin callers, independent of ledger membership.
| ] | ||
| self.persist_path.parent.mkdir(parents=True, exist_ok=True) | ||
| tmp = self.persist_path.with_suffix(self.persist_path.suffix + ".tmp") | ||
| tmp.write_text(json.dumps(data, indent=2)) |
There was a problem hiding this comment.
reserve() holds self._lock (an asyncio.Lock) across this synchronous tmp.write_text() + replace(). In the durable sidecar path under concurrency this stalls the event loop on every reservation. Keep the lock around only the in-memory check+decrement, and push the flush out with await asyncio.to_thread(...). (Confirms Greptile's P1.)
…f-loop ledger flush Core trust-boundary hardening for the eval engine (review findings on PR #3): - engine: explicit, fail-closed no_access gate in evaluate(). Resolve the split's tier (unlisted defaults to no_access) and reject no_access for non-admin callers before the budget ledger, instead of relying implicitly on the split being absent from the ledger. Adds resolve_split_access() in core.dataset.base (no vero.harbor import) as the single fail-closed resolver. - policy: reconcile budgets and split access at init. Auto-tier every agent-budgeted split to non_viewable when unlisted (evaluable, labels hidden), then reject any budgeted split explicitly tiered viewable or no_access. Keeps the train_budget convenience path working while closing the leak. - budget: move the durable ledger flush off the event loop (asyncio.to_thread) while keeping it under the reservation lock, so the sync write no longer blocks the loop and concurrent flushes cannot race the temp file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
On the design question raised in review (should split access be a dataset property checked in real time?): yes. I implemented the minimal version in the fix PR for this branch (#7) and wrote up the full reasoning there: #7 (comment) (budget vs access as separate axes, trust ownership, flexibility, and the deferred DatasetInfo follow-up). |
fix(harbor): fail-closed split access, budget/tier reconciliation, off-loop ledger flush [fixes 1/4 core]
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
T-Rex pricing update — T-Rex was free through June 2026. Effective July 1, 2026, T-Rex adds 2 credits on top of the standard 1-credit review (3 total). T-Rex settings |
| budgets: list[SplitBudget] | None = None, | ||
| *, | ||
| persist_path: Path | str | None = None, | ||
| ): | ||
| self._budgets: dict[tuple[str, str], SplitBudget] = { | ||
| (b.split, b.dataset_id): b for b in (budgets or []) | ||
| } | ||
| self.persist_path = Path(persist_path) if persist_path else None | ||
| self._lock = asyncio.Lock() |
There was a problem hiding this comment.
_flush() writes remaining_sample_budget and remaining_run_budget to persist_path, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, BudgetLedger(configured_budgets, persist_path=...) rebuilds fresh SplitBudget objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used.
Artifacts
Repro: focused persisted budget reset script
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: execution output showing persisted remaining budgets ignored after restart
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/core/budget.py
Line: 102-110
Comment:
**Persisted budgets reset**
`_flush()` writes `remaining_sample_budget` and `remaining_run_budget` to `persist_path`, but constructing a ledger with that same path never reads the file back. After the durable Harbor sidecar spends budget and restarts, `BudgetLedger(configured_budgets, persist_path=...)` rebuilds fresh `SplitBudget` objects with their original totals, so callers can spend the same run/sample budget again even though the JSON file already recorded it as used.
How can I resolve this? If you propose a fix, please make it concise.
Draft · Stack 1 of 4 — review/merge in order (1→4). No Harbor concepts here; foundations the rest builds on (and useful on their own).
SplitAccessLevelgainsno_access(hidden) alongsidevisible/non_viewable(aggregate-only).BudgetLedger(core/budget.py): metered, optionally-durable per-split eval budget.TaskOutputis a model; inference and scoring run as separate resumable disk-backed stages, with label-field scrubbing before inference.vero/evaluation/as a sharedEvaluationEngine+Evaluator, portExperimentRunnerToolonto it, and add the injectableEvalStrategyseam (Mode B's extension point).Review focus: the visibility enum,
BudgetLedger, theEvaluationEngineAPI, and the strategy seam. Much ofevaluation/is relocated code (skim).Part of the Harbor integration (full squashed view: #2). Stack: this → [2/4] sidecar → [3/4] compiler → [4/4] docs.
🤖 Generated with Claude Code
Greptile Summary
This PR adds the core evaluation pieces for split visibility, budget metering, and staged evaluation. The main changes are:
no_accesssplit tier and fail-closed split access resolution.BudgetLedgerfor per-split run and sample budgets.EvaluationEngineused by evaluation frontends.Confidence Score: 4/5
Merge is blocked by the durable budget reload behavior because persisted spend state is ignored after restart.
The implementation is generally well-scoped, but the persistence path for budget accounting does not restore recorded remaining values, which can allow repeated spending across process restarts.
vero/src/vero/core/budget.py
What T-Rex did
Comments Outside Diff (1)
vero/src/vero/core/budget.py, line 175-193 (link)asyncio.Lockinreserve()_flush()callsPath.write_text()(synchronous) whilereserve()holdsself._lock— anasyncio.Lock. Any concurrent coroutine that awaitsreserve()will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes inasyncio.to_thread()(or useaiofiles) andawaitthem insidereserve(), keeping the lock protecting only the in-memory check+decrement step.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (2): Last reviewed commit: "Merge pull request #7 from scaleapi/harb..." | Re-trigger Greptile