Skip to content

Add authentication#538

Open
taylordowns2000 wants to merge 10 commits into
mainfrom
api-token
Open

Add authentication#538
taylordowns2000 wants to merge 10 commits into
mainfrom
api-token

Conversation

@taylordowns2000

@taylordowns2000 taylordowns2000 commented Jun 18, 2026

Copy link
Copy Markdown
Member

Adds a lightning_clients table so that Apollo can:

  1. authenticate lightning instances who attempt to use its services
  2. handle the issuance of ai service keys (e.g., anthropic API keys) and associate those keys with specific clients

It is designed as a MVP, and built to be opt-in.

closes #545

To migrate

  1. add INSTANCE_AUTH=true to your .env
  2. add APOLLO_ENC_KEY to your .env (or secret store) with openssl rand -base64 32
  3. add the lightning_clients table with set -a; . ./.env; set +a; psql "$POSTGRES_URL" -f services/_instance_auth/schema.sql
  4. provision a token for any lightning instance you want to connect with bun services/_instance_auth/provision_client.ts <some-identifier-123> <some-sk-ant-token-blah-456>
  5. Add this hashed/encrypted record to the Apollo DB
  6. Add this un-hashed AI_ASSISTANT_API_KEY (i.e., the secret!) to your lightning .env (or secret store)
  7. Repeat for as many instances as you'd like to connect.
  8. Optionally, merge this PR so that Lightning users see better error messages if their tokens are rejected by Apollo and/or downstream AI providers: optional: nicer apollo errors for lightning users lightning#4879

After this PR (what's next)

Related to @elias-ba 's concurrent proposal to enhance error handling on Lightning, my next goal is to let Lightning reliably tell "we rejected your api_key" from "your api_key is fine, but the Anthropic key we use for you failed / is over its limit." HTTP status collides (both are 401), so the discriminator is the type field, not the status code.

Proposed contract (type is the discriminator)

type HTTP Meaning to client Source
UNAUTHORIZED 401 Lightning api_key not recognised The gate, only the gate
AUTH_ERROR 401 api_key fine; Anthropic key invalid/revoked Anthropic/LiteLLM AuthenticationError
RATE_LIMIT 429 Anthropic key hit a rate/usage limit, retry later (retry_after) Anthropic/LiteLLM RateLimitError
BILLING (new) 402 Anthropic key out of credit / over spend cap billing_error type + "credit balance too low"
KEY_UNAVAILABLE (new) 503 Server-side: stored key can't be decrypted (APOLLO_ENC_KEY misconfig) auth.ts, not the client's fault

Rule: UNAUTHORIZED is reserved for credential rejection. Anything touching the Anthropic/LiteLLM key returns AUTH_ERROR / RATE_LIMIT / BILLING, never UNAUTHORIZED.

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

@taylordowns2000 taylordowns2000 marked this pull request as ready for review June 19, 2026 14:57

@elias-ba elias-ba left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @taylordowns2000, this is genuinely strong work and a clever evolution of the concept. Using the existing api_key field as the credential is the right call, it gets us per-client keys with no Lightning code change. Encryption fails closed correctly, the single-flight + stale-while-revalidate cache is solid and well tested, and swapping the loopback exemption for a per-process internal token is a real security improvement. Two blockers before we enable this, then a couple of things I'd like you to confirm.

Blockers

  1. .tool-versions still pins bun 1.1.13, but auth.ts imports { SQL } from "bun", which doesn't exist before Bun 1.2.0 (I checked, typeof SQL is undefined on the current toolchain). With INSTANCE_AUTH=true, new SQL(url) throws, gets caught, and the gate fails closed, so every external caller gets a 401. #536 bumped this to bun 1.3.14 for exactly this reason and it needs to come along here.

  2. The pinned cross-language hash test from #536 got dropped. We now have three SHA-256 sites with nothing enforcing they agree, and any drift silently 401s every provisioned client. Please restore it. (inline)

Needs confirmation

  • The gate is all-or-nothing per Apollo process, so turning INSTANCE_AUTH on for our shared production Apollo would 401 every instance we haven't provisioned yet. Totally fine for dedicated Apollo instances. But concerning for shared ones. I just want to confirm that's the plan. Also worth being precise that "no Lightning-side change" is true for code, but each instance's configured key value still has to be swapped to the minted credential.

  • WS transport is fully disabled while auth is on, since the upgrade has no body to carry the key. Should be fine since Lightning uses POST and /stream, but can you confirm nothing still speaks WS?

  • No changeset entry on this one and we've been keeping those (#536 had one), worth adding before release.

Comment on lines +198 to +202
try {
sql = new SQL(url);
const rows = (await sql`
SELECT to_regclass('public.lightning_clients') AS t
`) as Array<{ t: string | null }>;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the path breaks on breaks on any environment respecting the pinned Bun version in .tools-version: SQL is undefined on <1.2.0, so new SQL(url) throws here. It's swallowed by the catch below, so there's no crash - just a silent fail-closed gate and a log line. The whole real-DB path (this + loadClients) has no CI coverage because every test goes through the __setLoaderForTest seam. Worth one integration test that exercises initAuth/loadClients against a real PG so this can't rot again.

Comment on lines +249 to +255
const apiKey =
typeof ctx.body?.api_key === "string" ? ctx.body.api_key.trim() : "";
if (!apiKey) return unauthorized(ctx);

const lookup = lookupOverride ?? dbLookup;
const client = await lookup(hashToken(apiKey));
if (!client) return unauthorized(ctx);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantics changed from #536: that PR recognized tokens and swapped keys but passed unknown callers through untouched, so several instances could share one Apollo and the un-provisioned ones kept working as today. Here, any unknown or missing api_key is a hard 401 when enabled - it's all-or-nothing per Apollo process. That's correct for a dedicated Apollo instance, but flipping INSTANCE_AUTH on our shared production Apollo would 401 every not-yet-provisioned instance. Worth really noting and paying attention to. Can you confirm this is what we really want ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely. That's the idea of the initiative.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too big and too restrictive for now. The approach I would take and that I proposed in #536 was to unblock the current with the minimum and we discuss this and build it in another phase.

Comment on lines +18 to +19
def hash_token(token: str) -> str:
return hashlib.sha256(token.encode()).hexdigest()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says this "must match the hash Apollo computes in auth.ts" - but #538 dropped the pinned-hash test that enforced it (#536 had KNOWN_TOKEN/KNOWN_HASH asserted on both the TS and Python sides). With three independent SHA-256 sites now (this, auth.ts:62, provision_client.ts:34) and nothing pinning them, a drift would silently 401 every provisioned client - the exact "nightmare to debug" failure mode. Please restore the cross-language pinned test.

}

const apiKey = randomBytes(32).toString("base64url");
const authTokenHash = createHash("sha256").update(apiKey).digest("hex");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-implements the hash inline instead of importing hashToken from auth.ts. Since this is the canonical provisioning path, a drift here vs. the runtime hasher silently locks clients out. Import it instead:

import { hashToken } from "../../platform/src/middleware/auth";
...
const authTokenHash = hashToken(apiKey);

Comment on lines +40 to +47
// MULTI-PROCESS: when unset, each process mints its OWN token. apollo() self-calls
// hit 127.0.0.1:{port} and normally land on the same process, but if processes
// share a port (SO_REUSEPORT / clustering) a self-call can hit a sibling and 401.
// Set APOLLO_INTERNAL_TOKEN to the SAME value across processes in that case.
export const INTERNAL_HEADER = "x-apollo-internal";
const INTERNAL_TOKEN =
process.env.APOLLO_INTERNAL_TOKEN ?? randomBytes(32).toString("hex");
process.env.APOLLO_INTERNAL_TOKEN = INTERNAL_TOKEN;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-process caveat is documented here - just flagging for the deploy: if prod runs Apollo with SO_REUSEPORT/clustering and APOLLO_INTERNAL_TOKEN unset, apollo() self-calls can hit a sibling process with a different token and 401. Confirm our topology is single-process, or set a shared APOLLO_INTERNAL_TOKEN.


// In-memory client cache (keyed by token hash), refreshed on a TTL so DB changes
// are picked up within CACHE_TTL_MS.
const CACHE_TTL_MS = 60_000;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming this is an accepted trade-off: a revoked/rotated credential keeps working for up to ~60s (until the cache TTL refreshes). Fine for rotation; just want it acknowledged for the security posture, since "revoke" isn't immediate without a restart.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, intentional. I don't think there's a benefit in hard cutoff.

@github-project-automation github-project-automation Bot moved this to New Issues in Core Jun 22, 2026
@github-project-automation github-project-automation Bot moved this from New Issues to In review in Core Jun 22, 2026
@theroinaochieng theroinaochieng moved this from In review to In progress in Core Jun 22, 2026
expect(body.type).toBe("UNAUTHORIZED");
});

it("rejects an unknown api_key with 401", async () => {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks production

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct! Did you catch up with @stuartc about this? Migration notes are in the PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Epic: Authentication MVP

4 participants