From f5c76ca5d1c9072026ebcb1ad0e5d9434af60f25 Mon Sep 17 00:00:00 2001
From: Albert Wu <wua@uber.com>
Date: Mon, 8 Jun 2026 21:05:57 -0700
Subject: [PATCH] docs: add gateway list api rfc

---
 doc/rfc/submitqueue/list-api.md | 227 ++++++++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 doc/rfc/submitqueue/list-api.md

diff --git a/doc/rfc/submitqueue/list-api.md b/doc/rfc/submitqueue/list-api.md
new file mode 100644
index 00000000..6ad64c4e
--- /dev/null
+++ b/doc/rfc/submitqueue/list-api.md
@@ -0,0 +1,227 @@
+# Gateway List API
+
+Design notes for a gateway `List` API that powers a queue-scoped UX for
+observing SubmitQueue requests over a time window.
+
+This document captures **design decisions and rationale only**.
+
+## Problem
+
+Users need to inspect what happened in a queue during a time window: which
+requests were still running, which reached a terminal state, and which
+customer-facing status each request currently has. The existing `Status` API
+answers that question for one `sqid`; the UX needs the same gateway-owned view,
+but across a queue and time range.
+
+The gateway owns the request log. The orchestrator may emit request-log events,
+but it does not persist or read the log. `List` should preserve that ownership
+boundary and must not read orchestrator-owned working tables.
+
+## API Shape
+
+`List` is a read-only gateway RPC, named tersely to match `Land`, `Cancel`, and
+`Status`.
+
+At a high level:
+
+- **Input** — queue name, time window, optional status filters, and pagination
+  cursor.
+- **Output** — a page of request summaries and the next cursor.
+
+Each request summary should include:
+
+- `sqid`
+- queue
+- current customer-facing status
+- change URIs submitted with the request
+- last error, if any
+- display/debug metadata
+- time the request entered SubmitQueue
+- time the visible state last changed
+- time the request completed, if terminal
+- whether the request is terminal
+
+The summary intentionally exposes gateway/user-facing lifecycle information, not
+orchestrator implementation details such as batch IDs, internal request states,
+or speculation-tree structure.
+
+## Time Window Semantics
+
+The time window is a lifecycle-overlap filter, not a "started during this
+window" filter.
+
+A request belongs in `[T1, T2)` when it was active at any point in that interval:
+
+- it started before `T2`, and
+- it either has not completed, or completed at or after `T1`.
+
+This is the behavior the UX needs for questions like "what was running between
+10:00 and 10:30?" A request that began at 09:55 and completed at 10:05 should
+appear. A request that began at 10:20 and is still running should appear. A
+request that completed at 09:59 should not.
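+
+The overlap rule can be sketched as a single predicate. This is a minimal
+illustration, not the gateway implementation: the function and parameter names
+are hypothetical, and `completed_at=None` is assumed to mean "not completed yet".
+
+```python
+from datetime import datetime
+from typing import Optional
+
+
+def overlaps_window(
+    started_at: datetime,
+    completed_at: Optional[datetime],
+    t1: datetime,
+    t2: datetime,
+) -> bool:
+    """Return True when a request was active at some point in [t1, t2)."""
+    # Started before the window closed...
+    started_in_time = started_at < t2
+    # ...and either still running, or completed at or after the window opened.
+    not_done_before = completed_at is None or completed_at >= t1
+    return started_in_time and not_done_before
+```
+
+With a window of 10:00 to 10:30, a request running from 09:55 to 10:05 matches,
+a request started at 10:20 with no completion time matches, and a request that
+completed at 09:59 does not.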
+
+`List` returns the request's **current** reconciled status at read time for rows
+that match the window. It is not a historical "status as of T2" API. A historical
+snapshot API would be a different product shape and should be designed
+separately if needed.
+
+## Status Filtering
+
+`List` should support filtering by the same customer-facing status strings that
+`Status` returns: examples include `accepted`, `validating`, `building`,
+`landing`, `landed`, `error`, `cancelling`, and `cancelled`.
+
+This keeps the API stable at the same abstraction level as `Status`. Clients do
+not need to learn an internal enum or translate orchestrator state-machine
+values into display states.
+
+The status filter applies to the request's **current** reconciled status after
+the queue/time-window match has been computed. It does not mean "requests that
+ever had this status during the window." That historical event query belongs
+with a timeline/debug API, not the queue summary list.
+
+The filter should accept multiple statuses so the UX can ask for groups such as
+"currently active" or "terminal outcomes" without making separate RPC calls. The
+server should validate status strings against the public status vocabulary it
+can emit; unknown statuses are caller errors rather than silent misses.
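+
+A minimal sketch of that validation, assuming the public vocabulary is exactly
+the example statuses listed above (the real set may be larger) and that an
+empty filter means "no status filtering". `PUBLIC_STATUSES` and
+`validate_status_filter` are hypothetical names.
+
+```python
+from typing import FrozenSet, Iterable
+
+# Assumed vocabulary: the example statuses named in this section.
+PUBLIC_STATUSES = frozenset(
+    {
+        "accepted",
+        "validating",
+        "building",
+        "landing",
+        "landed",
+        "error",
+        "cancelling",
+        "cancelled",
+    }
+)
+
+
+def validate_status_filter(statuses: Iterable[str]) -> FrozenSet[str]:
+    """Return the de-duplicated filter, or raise on any unknown status."""
+    requested = frozenset(statuses)
+    unknown = sorted(requested - PUBLIC_STATUSES)
+    if unknown:
+        # Caller error: fail loudly instead of silently matching nothing.
+        raise ValueError(f"unknown status filter values: {unknown}")
+    return requested
+```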
+
+## Read Model
+
+Serving `List` directly from the append-only request log would force the gateway
+to scan and reconcile many log rows per request. That is the wrong shape for a
+queue dashboard.
+
+The gateway should maintain a request-summary read model derived from the
+request log. Every request-log write updates two gateway-owned views:
+
+- the immutable request log, used for audit/debug history and point
+  reconciliation;
+- the mutable request summary, used for bounded queue/time-window listing.
+
+The summary row is a materialized current view of the same state that `Status`
+would report. `Status` may continue reading and reconciling from the log during
+rollout; the important invariant is that both views use the same reconciliation
+rules.
+
+This is deliberately a query store, unlike the mostly key-oriented stores used
+by the pipeline. Its boundary should be page-in/page-out: queue, time window,
+statuses, cursor, and limit in; rows plus next cursor out. The backend owns the
+indexing strategy for lifecycle overlap. For SQL, avoid an unindexed open-ended
+OR by representing "still running" with an index-friendly sentinel completion
+time or by splitting active and completed scans.
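+
+The sentinel option might look like the following sketch, which uses SQLite
+only to stay self-contained. The table name, column names, index, and the
+`STILL_RUNNING` value are all assumptions for illustration; timestamps are
+plain integers here.
+
+```python
+import sqlite3
+
+# Hypothetical sentinel for "not completed yet": sorts after any real timestamp.
+STILL_RUNNING = 2**62
+
+
+def open_store() -> sqlite3.Connection:
+    """Create an in-memory summary table; names and columns are illustrative."""
+    conn = sqlite3.connect(":memory:")
+    conn.execute(
+        """CREATE TABLE request_summary (
+               sqid TEXT PRIMARY KEY,
+               queue TEXT NOT NULL,
+               started_at INTEGER NOT NULL,
+               completed_at INTEGER NOT NULL
+           )"""
+    )
+    conn.execute(
+        "CREATE INDEX idx_overlap "
+        "ON request_summary (queue, completed_at, started_at)"
+    )
+    return conn
+
+
+def list_overlapping(conn, queue, t1, t2, limit):
+    """Return sqids whose lifecycle overlaps [t1, t2), with no OR needed."""
+    rows = conn.execute(
+        "SELECT sqid FROM request_summary "
+        "WHERE queue = ? AND completed_at >= ? AND started_at < ? "
+        "ORDER BY started_at, sqid LIMIT ?",
+        (queue, t1, t2, limit),
+    )
+    return [sqid for (sqid,) in rows]
+```
+
+Because running requests store the sentinel instead of `NULL`, the
+`completed_at >= ?` range condition covers both completed and active rows.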
+
+Every request-log persistence path must update this read model through the same
+helper: direct gateway writes such as `Land` and `Cancel`, plus the gateway log
+sink that persists orchestrator-emitted events. The invariant is
+`RequestLogStore.Insert` paired with a guarded summary upsert, not best-effort
+ad hoc updates at each call site.
+
+Request-log events should carry `queue` as first-class data. The log sink only
+receives the log event, so relying on `sqid` parsing would make the read model
+depend on an ID-format convention. Legacy backfills may parse queue from `sqid`
+as a fallback, but new events should be queue-attributable at the source.
+
+## Change URIs
+
+Request summaries should include the change URIs submitted with the request. The
+UX needs them to make each row recognizable and actionable without an additional
+lookup.
+
+To support this cleanly, the gateway must capture change URIs at request
+acceptance time. `Land` already receives the change set before handing work to
+the orchestrator, so it is the right boundary to persist that display data into
+the gateway-owned request log and summary read model.
+
+This should not be implemented by joining from `List` into orchestrator-owned
+request tables. That would break the service ownership model and couple a UX
+read path to pipeline internals.
+
+For existing requests, change URIs are available only if they can be recovered
+from gateway-owned data. If old request-log entries do not contain them, the
+backfill can still build summaries, but those older rows will have empty change
+URIs unless a separate one-time migration from an authoritative source is
+accepted explicitly.
+
+## Reconciliation
+
+Request-log timestamps are useful for display and broad ordering, but they are
+not always the strongest signal for "current state." Some log entries reflect
+informational progress, while others reflect versioned request-state changes.
+
+`Status` reconciles by reading all of a request's log rows at once. The summary
+must do the equivalent incrementally, one incoming event at a time. Each update
+is a guarded merge between the stored winner and the incoming log record, never
+a blind last-write-wins overwrite.
+
+The summary should persist enough comparison state to make that decision:
+winning status, winning request version, winning timestamp, and whether the
+winner is a versioned terminal state. The incoming event replaces the stored
+winner only when it would have won in the full-log reconciliation:
+
+- terminal request-state records with a request version are authoritative;
+- among versioned terminal records, the highest request version wins, with
+  timestamp as a tie-breaker;
+- if no terminal versioned winner exists yet, the newest log timestamp wins.
+
+When the winning state is terminal, the summary records a completion time. When
+the winning state is non-terminal, completion time is empty and the request is
+considered active for future time-window overlap.
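+
+The guarded merge described above can be sketched as a pure function over the
+stored winner and the incoming record. The `Record` type and its field names
+are hypothetical, and keeping the stored winner on an exact tie is an
+assumption this document does not specify.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass(frozen=True)
+class Record:
+    status: str
+    timestamp: float
+    version: Optional[int] = None  # set only on versioned request-state records
+    terminal: bool = False
+
+    @property
+    def versioned_terminal(self) -> bool:
+        return self.terminal and self.version is not None
+
+
+def merge(stored: Optional[Record], incoming: Record) -> Record:
+    """Return the record that would win a full-log reconciliation."""
+    if stored is None:
+        return incoming
+    if stored.versioned_terminal and incoming.versioned_terminal:
+        # Highest request version wins; timestamp breaks ties.
+        incoming_key = (incoming.version, incoming.timestamp)
+        stored_key = (stored.version, stored.timestamp)
+        return incoming if incoming_key > stored_key else stored
+    if incoming.versioned_terminal:
+        return incoming  # authoritative over any non-versioned winner
+    if stored.versioned_terminal:
+        return stored  # later informational events cannot displace it
+    # No versioned terminal winner yet: newest log timestamp wins.
+    return incoming if incoming.timestamp > stored.timestamp else stored
+```
+
+Folding `merge` over a request's events in any arrival order should yield the
+same winner as reconciling the full log, which is the property the summary
+depends on.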
+
+## Pagination
+
+`List` should be cursor-paginated. Offset pagination is the wrong fit because the
+underlying set changes while users page through it.
+
+The cursor should be opaque to clients and tied to the original query shape:
+queue, time window, status filter, and the last row seen. Reusing a cursor with a
+different queue, time window, or status filter should be rejected.
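+
+One way to bind a cursor to its query shape is to embed a fingerprint of the
+query next to the last row key, as in this sketch. The encoding, function
+names, and the `[started_at, sqid]` row key are assumptions; base64 only makes
+the cursor opaque by convention, so a real implementation would likely also
+sign or encrypt it.
+
+```python
+import base64
+import hashlib
+import json
+
+
+def _query_fingerprint(queue, t1, t2, statuses):
+    """Hash the query shape; status order does not matter."""
+    canonical = json.dumps([queue, t1, t2, sorted(statuses)])
+    return hashlib.sha256(canonical.encode()).hexdigest()
+
+
+def encode_cursor(queue, t1, t2, statuses, last_row_key):
+    """Pack the query fingerprint and the last row seen into one token."""
+    payload = {
+        "q": _query_fingerprint(queue, t1, t2, statuses),
+        "last": last_row_key,
+    }
+    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
+
+
+def decode_cursor(cursor, queue, t1, t2, statuses):
+    """Return the last row key, rejecting cursors from a different query."""
+    payload = json.loads(base64.urlsafe_b64decode(cursor.encode()))
+    if payload["q"] != _query_fingerprint(queue, t1, t2, statuses):
+        raise ValueError("cursor does not match queue, time window, or statuses")
+    return payload["last"]
+```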
+
+Default page size should be modest. The API should cap page size so a single UX
+request cannot force an unbounded queue scan.
+
+## Retention
+
+The first retention target is 30 days after completion. Non-terminal requests
+must never be purged by age alone; a request that started 40 days ago and is
+still running must appear in a current overlap query.
+
+Terminal summaries and detailed logs can expire 30 days after completion.
+Detailed logs may have a separate policy later only if the UX no longer needs
+timeline/debug information for the same period.
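+
+As a small sketch of the purge rule, assuming `completed_at` is `None` for
+non-terminal requests and that eligibility begins exactly 30 days after
+completion (`RETENTION` and `is_purgeable` are hypothetical names):
+
+```python
+from datetime import datetime, timedelta
+from typing import Optional
+
+RETENTION = timedelta(days=30)
+
+
+def is_purgeable(completed_at: Optional[datetime], now: datetime) -> bool:
+    """Only terminal requests age out; start time is deliberately ignored."""
+    if completed_at is None:
+        return False  # still running, however long ago it started
+    return now - completed_at >= RETENTION
+```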
+
+## Flow
+
+```
+   ┌────────────────────────────────────────────┐
+   │ gateway:Land / gateway:Cancel / log sink   │
+   │   persist request-log event                │
+   │   update request summary                   │
+   └──────────────────────────┬─────────────────┘
+                              │
+                              ▼
+   ┌────────────────────────────────────────────┐
+   │ gateway:List                               │
+   │   validate queue + time window + statuses  │
+   │   read summaries by lifecycle/status match │
+   │   return page of current request summaries │
+   └────────────────────────────────────────────┘
+```
+
+## Why Not Reuse `Status`
+
+`Status` is a point lookup: one `sqid`, one current answer. Keeping it narrow
+makes it cheap and predictable for polling and integrations.
+
+`List` is a collection query: one queue, one time window, many request summaries.
+It needs pagination, time filtering, optional status filtering, and a read model
+shaped for queue UX. Those semantics do not belong in `Status`.
+
+## Why Not Return Timelines
+
+Timelines are useful for debugging, but they are not part of the first `List`
+shape. Returning per-request histories in every list row would make page cost
+scale with both the number of requests and the number of events per request.
+
+The first API should return summaries only. If the UX later needs row expansion,
+add a dedicated timeline/debug API that reads the append-only request log for one
+`sqid`.