docs: RFC for Firecracker snapshots (instant bazel-diff starts)#376

Merged
tinder-maxwellelliott merged 3 commits into master from claude/musing-jang-a8e7d7
Jun 22, 2026
@tinder-maxwellelliott

Collaborator

Summary

Adds a design RFC (docs/firecracker-snapshots.md) for using Firecracker microVM snapshots to give bazel-diff near-instant starts.

The win: bazel-diff's own JVM CLI starts in <1s. The cost is the bazel query deps(//...) it shells out to, which pays for full Bazel server warmup, external-repo/bzlmod resolution, and Skyframe graph load on every cold start (minutes on a large monorepo). A snapshot captures that warm state once and restores it in roughly under a second, so the PR-time path only re-analyzes changed packages.

Decided scope

  • CLI hooks in the Kotlin tool (warmup, fingerprint) + a Go orchestration tool (tools/firecracker/)
  • Captures a full warm Bazel server + repo cache
  • Targets self-hosted CI (we control host kernel + CPU model)

What the RFC covers

  • Record/consume lifecycle and the host-vs-CLI architectural split
  • New CLI surface: warmup (record entrypoint, clean exit = "safe to snapshot") and fingerprint (cache key); consume reuses existing generate-hashes + get-impacted-targets
  • Correctness (the linchpin): fingerprint cache key over bazel version / MODULE.bazel.lock / .bazelrc / bazel-diff version / flag set, a fail-safe fall-back-to-cold on any mismatch, the fact that SourceFileHasher already makes content correctness independent of server incrementality, and a CI canary
  • Firecracker self-hosted specifics (CPU pinning, COW overlay, UFFD, clock/net resync)
  • Snapshot store layout, Go tool UX, phasing, and open questions
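
To make the cache-key idea concrete, here is a minimal Go sketch of a fingerprint over the inputs listed above. It is not the RFC's implementation: the `FingerprintInputs` type, its field names, the length-prefixed encoding, and the choice of SHA-256 are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// FingerprintInputs lists the inputs the RFC names for the cache key.
// The type, field names, and encoding are illustrative assumptions.
type FingerprintInputs struct {
	BazelVersion     string
	ModuleLockSHA256 string   // digest of MODULE.bazel.lock
	BazelrcSHA256    string   // digest of .bazelrc
	BazelDiffVersion string
	Flags            []string // query-affecting flags, in command-line order
}

// Fingerprint hashes the inputs in a fixed order. Flag order is preserved
// because a later Bazel flag can override an earlier one.
func Fingerprint(in FingerprintInputs) string {
	fields := []string{
		in.BazelVersion, in.ModuleLockSHA256,
		in.BazelrcSHA256, in.BazelDiffVersion,
	}
	fields = append(fields, in.Flags...)

	h := sha256.New()
	for _, f := range fields {
		// Length-prefix each field so adjacent fields cannot blur together.
		fmt.Fprintf(h, "%d:%s\n", len(f), f)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fp := Fingerprint(FingerprintInputs{
		BazelVersion:     "7.0.0",
		ModuleLockSHA256: "aaaa",
		BazelrcSHA256:    "bbbb",
		BazelDiffVersion: "1.2.3",
		Flags:            []string{"--keep_going"},
	})
	fmt.Println(fp)
}
```

Any change to one input yields a different key, which is what lets the consume side treat a mismatch as "fall back to cold" rather than risk a stale warm server.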

This PR is docs-only

No code changes. Intended as the review artifact before implementation. Phase 1 (the fingerprint + warmup subcommands) would follow in a separate PR.

🤖 Generated with Claude Code

Design doc for capturing a warm Bazel server in a Firecracker microVM
snapshot so PR-time bazel-diff runs restore in ~sub-second instead of
paying full server warmup + external-repo fetch on every cold start.

Scope: CLI hooks (warmup, fingerprint) + a Go orchestration tool,
full warm-server snapshots, self-hosted CI. Centers the correctness
story (fingerprint cache key + fail-safe fall-back-to-cold) since an
incorrect affected set is worse than none.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tinder-maxwellelliott tinder-maxwellelliott merged commit 16d088f into master Jun 22, 2026
15 checks passed
@tinder-maxwellelliott tinder-maxwellelliott deleted the claude/musing-jang-a8e7d7 branch June 22, 2026 15:14
tinder-maxwellelliott added a commit that referenced this pull request Jun 22, 2026
…) (#381)

* feat: Firecracker snapshot harness for instant bazel-diff starts (#376)

Implements + validates the Firecracker snapshot design (PR #376):

CLI hooks (//cli, RFC Phase 1):
- `fingerprint` subcommand + FingerprintInteractor (pure, unit-tested):
  snapshot cache key over bazel version, MODULE.bazel.lock, .bazelrc,
  bazel-diff version, and the query-affecting flag set.
- `warmup` subcommand: record-side entrypoint = generate-hashes for the base
  revision + writes base_hashes.json/fingerprint.json + clean-exit contract.
  Extends GenerateHashesCommand so base hashes are byte-identical to a cold run.

Go orchestrator (tools/firecracker/, stdlib-only static binary):
- `bazel-diff-snap record|consume`; Firecracker REST API over a unix socket.
- local driver (no VM, runs anywhere) + firecracker driver (Linux+KVM).
- consume is fail-safe: fingerprint match + nearest-ancestor resolution,
  exit 2 -> cold fallback. Pure logic + API client unit-tested.
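
The fail-safe consume rule can be sketched as follows. This is a hedged illustration only: `resolveSnapshot`, the in-memory `store` map, and the commit names are hypothetical, and whether the real tool keeps walking past a mismatched ancestor is an assumption. Only the "exit 2 -> cold fallback" contract comes from the commit message.

```go
package main

import (
	"fmt"
	"os"
)

// exitColdFallback mirrors the "exit 2 -> cold fallback" contract.
const exitColdFallback = 2

// resolveSnapshot walks ancestors (nearest first) and returns the first
// commit that has a snapshot recorded under the same fingerprint.
// store maps commit SHA -> fingerprint of the snapshot recorded there.
func resolveSnapshot(ancestors []string, store map[string]string, fingerprint string) (string, bool) {
	for _, sha := range ancestors {
		if fp, ok := store[sha]; ok && fp == fingerprint {
			return sha, true
		}
	}
	return "", false
}

func main() {
	store := map[string]string{"c1": "fp-old", "c2": "fp-new"}
	ancestors := []string{"c3", "c2", "c1"} // nearest ancestor first

	sha, ok := resolveSnapshot(ancestors, store, "fp-new")
	if !ok {
		fmt.Fprintln(os.Stderr, "no usable snapshot; falling back to a cold run")
		os.Exit(exitColdFallback)
	}
	fmt.Println("restore snapshot recorded at", sha) // prints "restore snapshot recorded at c2"
}
```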

Benchmark harness (tools/firecracker/bench/):
- gen_project.py: synthetic large-Bazel-project generator (layered genrule DAG,
  bounded depth, no external toolchains, two git revisions).
- bench.py: cold-vs-warm analysis-time benchmark; asserts warm == cold.
- Dockerfile + scripts to run on Linux at scale.

Validated on Linux at ~149.5k targets: cold consume 11.7s vs warm 4.9s (58%
faster), warm output byte-identical to cold; orchestrator local record/consume
produced a bounded impacted-target cone.

Real microVM record/consume requires Linux + /dev/kvm (the self-hosted CI host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(firecracker): wire guest networking + real-microVM e2e validation

Close the firecracker driver's networking gap and add an end-to-end
validation path so the snapshot flow can be proven on real KVM.

Driver:
- fcapi: add networkInterface payload + addNetworkInterface
  (PUT /network-interfaces/{id}).
- fcDriver: attach a TAP-backed virtio-net NIC before InstanceStart, bake a
  static `ip=` directive into the kernel cmdline (matches the guest image's
  MAC->IP fcnet convention), and check the host TAP exists before a restore.
- main: --tap-device/--guest-ip/--host-ip/--netmask/--guest-mac flags.

Bug fixes in the previously-unrun driver:
- bootArgs now passes `root=/dev/vda rw` (Firecracker does not synthesize it),
  so a disk-backed guest actually boots.
- record now checks out the base SHA in the guest before warmup, mirroring
  localDriver.record and consume, so baked base hashes are for the base rev.
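
As an illustration of the boot-argument fix, here is a small sketch that assembles a kernel command line with an explicit root device and a static `ip=` directive. The `netConfig` fields, the `eth0` device name, and the console/reboot/panic/pci arguments are assumptions borrowed from common Firecracker examples; the driver's real argument list is not shown in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// netConfig holds the static guest network settings (field names are assumed).
type netConfig struct {
	GuestIP string
	HostIP  string
	Netmask string
}

// bootArgs builds a kernel command line for a disk-backed guest.
func bootArgs(n netConfig) string {
	args := []string{
		"console=ttyS0", "reboot=k", "panic=1", "pci=off",
		// Firecracker does not add a root device itself, so name it explicitly.
		"root=/dev/vda", "rw",
		// Kernel format: ip=<client>:<server>:<gateway>:<netmask>:<hostname>:<device>:<autoconf>
		fmt.Sprintf("ip=%s::%s:%s::eth0:off", n.GuestIP, n.HostIP, n.Netmask),
	}
	return strings.Join(args, " ")
}

func main() {
	fmt.Println(bootArgs(netConfig{
		GuestIP: "172.16.0.2", HostIP: "172.16.0.1", Netmask: "255.255.255.0",
	}))
	// prints: console=ttyS0 reboot=k panic=1 pci=off root=/dev/vda rw ip=172.16.0.2::172.16.0.1:255.255.255.0::eth0:off
}
```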

Image + host setup:
- bench/build_guest_image.sh: build kernel + rootfs.base.ext4 with JDK, bazel,
  git, bazel-diff, the workspace, and a standalone (non-socket-activated) sshd
  that survives snapshot restore.
- bench/setup_tap.sh: privileged host TAP setup (driver stays privilege-free).

Validation:
- fc_integration_test.go (build tag `fcintegration`): drives fcDriver
  record+consume against a real microVM, env-configured.
- .github/workflows/firecracker-e2e.yml: workflow_dispatch job that builds the
  image, boots a real microVM on an x86_64 + /dev/kvm runner, and asserts the
  snapshot-consumed impacted set is byte-identical to the cold/local set
  (RFC §5.3).
- README: networking flags, build/setup docs, and validation notes — incl. the
  known aarch64 16KB-page-host post-restore userspace freeze.

Unit-tested (fcapi network call, netConfig boot args, ensureTapExists);
go test + vet clean for both default and fcintegration tag sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(firecracker): make guest image build + driver work end-to-end

Found while running the §5.3 canary on real Linux (in a nested-virt VM): the
guest image build and the firecracker driver had several real bugs that would
also bite on a Linux+KVM CI host. Fixes:

build_guest_image.sh (the firecracker-ci minimized base exposed these):
- run apt in the chroot with the sandbox off + create /tmp, apt spool/log dirs
  (else "Couldn't create temporary file" / repos "not signed")
- mount /dev/pts in the chroot (JDK postinst calls posix_openpt)
- create /usr/share/man/manN (JDK update-alternatives man symlinks)
- chown the baked /work to root (git "detected dubious ownership" -> exit 128)
- actually switch sshd from socket-activation to a standalone always-on
  ssh.service (the README claimed this but the script never did it; socket sshd
  doesn't reliably serve connections after a snapshot restore)

driver_firecracker.go:
- add waitForGuest() and poll guest ssh after instanceStart (record) and after
  snapshot resume (consume) before issuing commands — the driver previously
  raced the guest's boot / resume and the first ssh would fail.

fc_integration_test.go:
- ssh ConnectTimeout + BatchMode so the readiness poll fails fast and never
  hangs on a prompt.

With these, the canary boots -> NIC/TAP -> standalone sshd -> guest exec ->
git checkout -> java all work. (Local nested-virt runs are then gated only by
~70x JVM slowdown under Apple's L2 nested virtualization, which does not exist
on a bare-metal Linux+KVM host; bazel's server can't start within its timeout
in the nested guest. Run the canary on real hardware via firecracker-e2e.yml.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(firecracker): raise fcClient timeout for snapshot create/load

The HTTP client used a 60s timeout for all Firecracker API calls, but
/snapshot/create and /snapshot/load dump/load the guest's full memory to/from
disk and take well over a minute for a multi-GB VM. On a real KVM host the
canary's record step hit "PUT /snapshot/create: context deadline exceeded"
mid-write. Raise the timeout to 15m (other calls are instant over the socket).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(firecracker): gitignore the go orchestrator build output

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(firecracker): raise Go + Kotlin coverage for the snapshot tooling

Go orchestrator (tools/firecracker): 30% -> 66% statement coverage.
- Extract the in-guest command builders (warmupCommand/consumeScript) and make
  waitForGuest's poll interval injectable, so the record/consume logic is unit-
  testable without a microVM.
- Add tests: main.go runRecord/runConsume end-to-end via the local driver with a
  fake bazel-diff (+ makeDriver branches, multiFlag, arg validation, cold
  fallback); store newEntry/writeMetadata/path accessors/mustAbs; readBazelLabel;
  fingerprint error paths; fcClient resume; driver helpers (copyFile, baseRootfs,
  netConfig, waitForGuest, sshGuest args, boot error, teardown).
- Fix TestEnsureTapExists to skip the lo check off-Linux (filepath.Glob returns a
  nil error on no match, so it was wrongly running + failing on macOS).
  Remaining 0% is the genuinely VM/ssh-bound record/consume/boot path, covered by
  the fcintegration canary.
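
The `filepath.Glob` pitfall mentioned above is easy to reproduce: a pattern that matches nothing yields a nil slice and a nil error, so "does this exist?" has to be answered from the match count. A minimal sketch, with a hypothetical helper name:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// globHasMatch reports whether pattern matches at least one path.
// filepath.Glob only returns an error for a malformed pattern,
// never for "no match", so the error alone cannot signal absence.
func globHasMatch(pattern string) (bool, error) {
	matches, err := filepath.Glob(pattern)
	if err != nil {
		return false, err
	}
	return len(matches) > 0, nil
}

func main() {
	// On Linux the loopback interface appears under /sys/class/net;
	// on macOS this path does not exist and the result is (false, nil).
	ok, err := globHasMatch("/sys/class/net/lo")
	fmt.Println(ok, err)
}
```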

Kotlin CLI hooks (//cli, protects the >=90% bazel coverage gate):
- Extract WarmupCommand.writeFingerprint() out of call() so the fingerprint
  emission is testable without the bazel-backed generate-hashes run.
- Add FingerprintGathererTest, FingerprintCommandTest, WarmupCommandTest.
- New-file coverage: FingerprintInteractor 100%, FingerprintCommand 91%,
  FingerprintGatherer 92%, WarmupCommand 76% (remainder is super.call()).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Maxwell Elliott <maxwell@elliott.now>