From f5c76ca5d1c9072026ebcb1ad0e5d9434af60f25 Mon Sep 17 00:00:00 2001 From: Albert Wu Date: Mon, 8 Jun 2026 21:05:57 -0700 Subject: [PATCH] docs: add gateway list api rfc --- doc/rfc/submitqueue/list-api.md | 227 ++++++++++++++++++++++++++++++++ 1 file changed, 227 insertions(+) create mode 100644 doc/rfc/submitqueue/list-api.md diff --git a/doc/rfc/submitqueue/list-api.md b/doc/rfc/submitqueue/list-api.md new file mode 100644 index 00000000..6ad64c4e --- /dev/null +++ b/doc/rfc/submitqueue/list-api.md @@ -0,0 +1,227 @@ +# Gateway List API + +Design notes for a gateway `List` API that powers a queue-scoped UX for +observing SubmitQueue requests over a time window. + +This document captures **design decisions and rationale only**. + +## Problem + +Users need to inspect what happened in a queue during a time window: which +requests were still running, which reached a terminal state, and what useful +state each request is currently in. The existing `Status` API answers that +question for one `sqid`; the UX needs the same gateway-owned view, but across a +queue and time range. + +The gateway owns the request log. The orchestrator may emit request-log events, +but it does not persist or read the log. `List` should preserve that ownership +boundary and must not read orchestrator-owned working tables. + +## API Shape + +`List` is a read-only gateway RPC, named tersely to match `Land`, `Cancel`, and +`Status`. + +At a high level: + +- **Input** — queue name, time window, optional status filters, and pagination + cursor. +- **Output** — a page of request summaries and the next cursor. + +Each request summary should include: + +- `sqid` +- queue +- current customer-facing status +- change URIs submitted with the request +- last error, if any +- display/debug metadata +- time the request entered SubmitQueue +- time the visible state last changed +- time the request completed, if terminal +- whether the request is terminal + +The summary intentionally exposes gateway/user-facing lifecycle information, not +orchestrator implementation details such as batch IDs, internal request states, +or speculation-tree structure. + +## Time Window Semantics + +The time window is a lifecycle-overlap filter, not a "started during this +window" filter. + +A request belongs in `[T1, T2)` when it was active at any point in that interval: + +- it started before `T2`, and +- it either has not completed, or completed at or after `T1`. + +This is the behavior the UX needs for questions like "what was running between +10:00 and 10:30?" A request that began at 09:55 and completed at 10:05 should +appear. A request that began at 10:20 and is still running should appear. A +request that completed at 09:59 should not. + +`List` returns the request's **current** reconciled status at read time for rows +that match the window. It is not a historical "status as of T2" API. A historical +snapshot API would be a different product shape and should be designed +separately if needed. + +## Status Filtering + +`List` should support filtering by the same customer-facing status strings that +`Status` returns: examples include `accepted`, `validating`, `building`, +`landing`, `landed`, `error`, `cancelling`, and `cancelled`. + +This keeps the API stable at the same abstraction level as `Status`. Clients do +not need to learn an internal enum or translate orchestrator state-machine +values into display states. + +The status filter applies to the request's **current** reconciled status after +the queue/time-window match has been computed. It does not mean "requests that +ever had this status during the window." That historical event query belongs +with a timeline/debug API, not the queue summary list. + +The filter should accept multiple statuses so the UX can ask for groups such as +"currently active" or "terminal outcomes" without making separate RPC calls. The +server should validate status strings against the public status vocabulary it +can emit; unknown statuses are caller errors rather than silent misses. + +## Read Model + +Serving `List` directly from the append-only request log would force the gateway +to scan and reconcile many log rows per request. That is the wrong shape for a +queue dashboard. + +The gateway should maintain a request-summary read model derived from the +request log. Every request-log write updates two gateway-owned views: + +- the immutable request log, used for audit/debug history and point + reconciliation; +- the mutable request summary, used for bounded queue/time-window listing. + +The summary row is a materialized current view of the same state that `Status` +would report. `Status` may continue reading and reconciling from the log during +rollout; the important invariant is that both views use the same reconciliation +rules. + +This is deliberately a query store, unlike the mostly key-oriented stores used +by the pipeline. Its boundary should be page-in/page-out: queue, time window, +statuses, cursor, and limit in; rows plus next cursor out. The backend owns the +indexing strategy for lifecycle overlap. For SQL, avoid an unindexed open-ended +OR by representing "still running" with an index-friendly sentinel completion +time or by splitting active and completed scans. + +Every request-log persistence path must update this read model through the same +helper: direct gateway writes such as `Land` and `Cancel`, plus the gateway log +sink that persists orchestrator-emitted events. The invariant is +`RequestLogStore.Insert` paired with a guarded summary upsert, not best-effort +ad hoc updates at each call site. + +Request-log events should carry `queue` as first-class data. The log sink only +receives the log event, so relying on `sqid` parsing would make the read model +depend on an ID-format convention. Legacy backfills may parse queue from `sqid` +as a fallback, but new events should be queue-attributable at the source. + +## Change URIs + +Request summaries should include the change URIs submitted with the request. The +UX needs them to make each row recognizable and actionable without an additional +lookup. + +To support this cleanly, the gateway must capture change URIs at request +acceptance time. `Land` already receives the change set before handing work to +the orchestrator, so it is the right boundary to persist that display data into +the gateway-owned request log and summary read model. + +This should not be implemented by joining from `List` into orchestrator-owned +request tables. That would break the service ownership model and couple a UX +read path to pipeline internals. + +For existing requests, change URIs are available only if they can be recovered +from gateway-owned data. If old request-log entries do not contain them, the +backfill can still build summaries, but those older rows will have empty change +URIs unless a separate one-time migration from an authoritative source is +accepted explicitly. + +## Reconciliation + +Request-log timestamps are useful for display and broad ordering, but they are +not always the strongest signal for "current state." Some log entries reflect +informational progress, while others reflect versioned request-state changes. + +`Status` reconciles by reading all request-log rows at once. The summary must do +the equivalent incrementally, one incoming event at a time. Each update is a +guarded merge between the stored winner and the incoming log record, never a +blind last-write-wins overwrite. + +The summary should persist enough comparison state to make that decision: +winning status, winning request version, winning timestamp, and whether the +winner is a versioned terminal state. The incoming event replaces the stored +winner only when it would have won in the full-log reconciliation: + +- terminal request-state records with a request version are authoritative; +- among versioned terminal records, the highest request version wins, with + timestamp as a tie-breaker; +- if no terminal versioned winner exists yet, the newest log timestamp wins. + +When the winning state is terminal, the summary records a completion time. When +the winning state is non-terminal, completion time is empty and the request is +considered active for future time-window overlap. + +## Pagination + +`List` should be cursor-paginated. Offset pagination is the wrong fit because the +underlying set changes while users page through it. + +The cursor should be opaque to clients and tied to the original query shape: +queue, time window, status filter, and the last row seen. Reusing a cursor with a +different queue, time window, or status filter should be rejected. + +Default page size should be modest. The API should cap page size so a single UX +request cannot force an unbounded queue scan. + +## Retention + +The first retention target is 30 days after completion. Non-terminal requests +must never be purged by age alone; a request that started 40 days ago and is +still running must appear in a current overlap query. + +Terminal summaries and detailed logs can expire 30 days after completion. +Detailed logs may have a separate policy later only if the UX no longer needs +timeline/debug information for the same period. + +## Flow + +``` + ┌────────────────────────────────────────────┐ + │ gateway:Land / gateway:Cancel / log sink │ + │ persist request-log event │ + │ update request summary │ + └──────────────────────────┬─────────────────┘ + │ + ▼ + ┌────────────────────────────────────────────┐ + │ gateway:List │ + │ validate queue + time window + statuses │ + │ read summaries by lifecycle/status match │ + │ return page of current request summaries │ + └────────────────────────────────────────────┘ +``` + +## Why Not Reuse `Status` + +`Status` is a point lookup: one `sqid`, one current answer. Keeping it narrow +makes it cheap and predictable for polling and integrations. + +`List` is a collection query: one queue, one time window, many request summaries. +It needs pagination, time filtering, optional status filtering, and a read model +shaped for queue UX. Those semantics do not belong in `Status`. + +## Why Not Return Timelines + +Timelines are useful for debugging, but they are not part of the first `List` +shape. Returning per-request histories in every list row would make page cost +scale with both the number of requests and the number of events per request. + +The first API should return summaries only. If the UX later needs row expansion, +add a dedicated timeline/debug API that reads the append-only request log for one +`sqid`.