docs: RFC for Firecracker snapshots (instant bazel-diff starts) #376
Merged
Design doc for capturing a warm Bazel server in a Firecracker microVM snapshot so that PR-time bazel-diff runs restore in roughly sub-second time instead of paying for full server warmup and external-repo fetch on every cold start.

Scope: CLI hooks (`warmup`, `fingerprint`) + a Go orchestration tool, full warm-server snapshots, self-hosted CI. The doc centers the correctness story (fingerprint cache key + fail-safe fall-back-to-cold), since an incorrect affected set is worse than none.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
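To make the fail-safe rule concrete, here is a minimal Go sketch of the decision it implies: use the snapshot only when a stored fingerprint matches the current one exactly, and otherwise signal a cold run. The `snapshot` type, the `decide` function, and the constant names are hypothetical; only the "exit 2 means cold fallback" convention comes from this thread, and this is not the orchestrator's actual code.

```go
package main

import "fmt"

// Exit codes assumed for this sketch. The thread states that exit 2 means
// "fall back to cold"; exitOK = 0 is just the usual success convention.
const (
	exitOK           = 0
	exitColdFallback = 2
)

// snapshot is a hypothetical record describing one stored snapshot.
type snapshot struct {
	BaseSHA     string
	Fingerprint string
}

// decide returns exitOK only when a snapshot exists and its fingerprint
// matches the current workspace exactly. Every other case, including
// missing or empty data, resolves to the cold path.
func decide(snap *snapshot, currentFingerprint string) int {
	if snap == nil || snap.Fingerprint == "" || currentFingerprint == "" {
		return exitColdFallback
	}
	if snap.Fingerprint != currentFingerprint {
		return exitColdFallback
	}
	return exitOK
}

func main() {
	snap := &snapshot{BaseSHA: "abc123", Fingerprint: "f1"} // sample data
	fmt.Println("matching fingerprint ->", decide(snap, "f1"))
	fmt.Println("changed fingerprint  ->", decide(snap, "f2"))
	fmt.Println("no snapshot          ->", decide(nil, "f1"))
}
```

The point of structuring it this way is that the warm path has to be earned: any doubt costs time on a cold run, never correctness of the affected set.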
tinder-maxwellelliott added a commit that referenced this pull request on Jun 22, 2026
…) (#381)

* feat: Firecracker snapshot harness for instant bazel-diff starts (#376)

Implements + validates the Firecracker snapshot design (PR #376):

CLI hooks (//cli, RFC Phase 1):
- `fingerprint` subcommand + FingerprintInteractor (pure, unit-tested): snapshot cache key over bazel version, MODULE.bazel.lock, .bazelrc, bazel-diff version, and the query-affecting flag set.
- `warmup` subcommand: record-side entrypoint = generate-hashes for the base revision + writes base_hashes.json/fingerprint.json + clean-exit contract. Extends GenerateHashesCommand so base hashes are byte-identical to a cold run.

Go orchestrator (tools/firecracker/, stdlib-only static binary):
- `bazel-diff-snap record|consume`; Firecracker REST API over a unix socket.
- local driver (no VM, runs anywhere) + firecracker driver (Linux+KVM).
- consume is fail-safe: fingerprint match + nearest-ancestor resolution, exit 2 -> cold fallback. Pure logic + API client unit-tested.

Benchmark harness (tools/firecracker/bench/):
- gen_project.py: synthetic large-Bazel-project generator (layered genrule DAG, bounded depth, no external toolchains, two git revisions).
- bench.py: cold-vs-warm analysis-time benchmark; asserts warm == cold.
- Dockerfile + scripts to run on Linux at scale.

Validated on Linux at ~149.5k targets: cold consume 11.7s vs warm 4.9s (58% faster), warm output byte-identical to cold; orchestrator local record/consume produced a bounded impacted-target cone. Real microVM record/consume requires Linux + /dev/kvm (the self-hosted CI host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(firecracker): wire guest networking + real-microVM e2e validation

Close the firecracker driver's networking gap and add an end-to-end validation path so the snapshot flow can be proven on real KVM.

Driver:
- fcapi: add networkInterface payload + addNetworkInterface (PUT /network-interfaces/{id}).
- fcDriver: attach a TAP-backed virtio-net NIC before InstanceStart, bake a static `ip=` directive into the kernel cmdline (matches the guest image's MAC->IP fcnet convention), and check the host TAP exists before a restore.
- main: --tap-device/--guest-ip/--host-ip/--netmask/--guest-mac flags.

Bug fixes in the previously-unrun driver:
- bootArgs now passes `root=/dev/vda rw` (Firecracker does not synthesize it), so a disk-backed guest actually boots.
- record now checks out the base SHA in the guest before warmup, mirroring localDriver.record and consume, so baked base hashes are for the base rev.

Image + host setup:
- bench/build_guest_image.sh: build kernel + rootfs.base.ext4 with JDK, bazel, git, bazel-diff, the workspace, and a standalone (non-socket-activated) sshd that survives snapshot restore.
- bench/setup_tap.sh: privileged host TAP setup (driver stays privilege-free).

Validation:
- fc_integration_test.go (build tag `fcintegration`): drives fcDriver record+consume against a real microVM, env-configured.
- .github/workflows/firecracker-e2e.yml: workflow_dispatch job that builds the image, boots a real microVM on an x86_64 + /dev/kvm runner, and asserts the snapshot-consumed impacted set is byte-identical to the cold/local set (RFC §5.3).
- README: networking flags, build/setup docs, and validation notes, incl. the known aarch64 16KB-page-host post-restore userspace freeze.

Unit-tested (fcapi network call, netConfig boot args, ensureTapExists); go test + vet clean for both default and fcintegration tag sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(firecracker): make guest image build + driver work end-to-end

Found while running the §5.3 canary on real Linux (in a nested-virt VM): the guest image build and the firecracker driver had several real bugs that would also bite on a Linux+KVM CI host.
Fixes:

build_guest_image.sh (the firecracker-ci minimized base exposed these):
- run apt in the chroot with the sandbox off + create /tmp, apt spool/log dirs (else "Couldn't create temporary file" / repos "not signed")
- mount /dev/pts in the chroot (JDK postinst calls posix_openpt)
- create /usr/share/man/manN (JDK update-alternatives man symlinks)
- chown the baked /work to root (git "detected dubious ownership" -> exit 128)
- actually switch sshd from socket-activation to a standalone always-on ssh.service (the README claimed this but the script never did it; socket sshd doesn't reliably serve connections after a snapshot restore)

driver_firecracker.go:
- add waitForGuest() and poll guest ssh after instanceStart (record) and after snapshot resume (consume) before issuing commands; the driver previously raced the guest's boot / resume and the first ssh would fail.

fc_integration_test.go:
- ssh ConnectTimeout + BatchMode so the readiness poll fails fast and never hangs on a prompt.

With these, the canary boots -> NIC/TAP -> standalone sshd -> guest exec -> git checkout -> java all work. (Local nested-virt runs are then gated only by ~70x JVM slowdown under Apple's L2 nested virtualization, which does not exist on a bare-metal Linux+KVM host; bazel's server can't start within its timeout in the nested guest. Run the canary on real hardware via firecracker-e2e.yml.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(firecracker): raise fcClient timeout for snapshot create/load

The HTTP client used a 60s timeout for all Firecracker API calls, but /snapshot/create and /snapshot/load dump/load the guest's full memory to/from disk and take well over a minute for a multi-GB VM. On a real KVM host the canary's record step hit "PUT /snapshot/create: context deadline exceeded" mid-write. Raise the timeout to 15m (other calls are instant over the socket).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(firecracker): gitignore the go orchestrator build output

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(firecracker): raise Go + Kotlin coverage for the snapshot tooling

Go orchestrator (tools/firecracker): 30% -> 66% statement coverage.
- Extract the in-guest command builders (warmupCommand/consumeScript) and make waitForGuest's poll interval injectable, so the record/consume logic is unit-testable without a microVM.
- Add tests: main.go runRecord/runConsume end-to-end via the local driver with a fake bazel-diff (+ makeDriver branches, multiFlag, arg validation, cold fallback); store newEntry/writeMetadata/path accessors/mustAbs; readBazelLabel; fingerprint error paths; fcClient resume; driver helpers (copyFile, baseRootfs, netConfig, waitForGuest, sshGuest args, boot error, teardown).
- Fix TestEnsureTapExists to skip the lo check off-Linux (filepath.Glob returns a nil error on no match, so it was wrongly running + failing on macOS).

Remaining 0% is the genuinely VM/ssh-bound record/consume/boot path, covered by the fcintegration canary.

Kotlin CLI hooks (//cli, protects the >=90% bazel coverage gate):
- Extract WarmupCommand.writeFingerprint() out of call() so the fingerprint emission is testable without the bazel-backed generate-hashes run.
- Add FingerprintGathererTest, FingerprintCommandTest, WarmupCommandTest.
- New-file coverage: FingerprintInteractor 100%, FingerprintCommand 91%, FingerprintGatherer 92%, WarmupCommand 76% (remainder is super.call()).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Maxwell Elliott <maxwell@elliott.now>
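For context on the timeout fix above, here is a hedged Go sketch of a stdlib-only HTTP client that talks to an API over a unix socket. It deliberately shows a variation rather than the committed change: the commit describes raising the client's timeout to 15m, while this sketch picks a deadline per call based on the request path. The names `apiClient`, `newAPIClient`, `timeoutFor`, and `put` are hypothetical and are not the real fcClient; the only facts taken from the thread are the 60s and 15m values and the `/snapshot/create` and `/snapshot/load` paths.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net"
	"net/http"
	"strings"
	"time"
)

// apiClient talks HTTP over a unix domain socket (hypothetical type).
type apiClient struct {
	hc *http.Client
}

func newAPIClient(socketPath string) *apiClient {
	transport := &http.Transport{
		// Ignore the host derived from the URL and always dial the socket.
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", socketPath)
		},
	}
	return &apiClient{hc: &http.Client{Transport: transport}}
}

// timeoutFor gives snapshot calls a long deadline because they move the
// guest's full memory to or from disk; other calls keep a short one.
func timeoutFor(path string) time.Duration {
	if strings.HasPrefix(path, "/snapshot/") {
		return 15 * time.Minute
	}
	return 60 * time.Second
}

// put sends a JSON body with PUT and treats any status >= 300 as an error.
func (c *apiClient) put(path string, body interface{}) error {
	payload, err := json.Marshal(body)
	if err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeoutFor(path))
	defer cancel()
	// "localhost" is a placeholder; the dialer decides where bytes go.
	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"http://localhost"+path, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.hc.Do(req)
	if err != nil {
		return fmt.Errorf("PUT %s: %w", path, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		msg, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("PUT %s: status %d: %s", path, resp.StatusCode, msg)
	}
	return nil
}

func main() {
	// Print the deadline each path would get; no socket is needed for this.
	for _, p := range []string{"/machine-config", "/snapshot/create"} {
		fmt.Printf("%-18s timeout %s\n", p, timeoutFor(p))
	}
}
```

A per-path deadline keeps a hung short call from blocking for 15 minutes, at the cost of one more piece of logic to keep in sync with the API; a single long timeout, as in the commit, is simpler.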
Summary
Adds a design RFC (`docs/firecracker-snapshots.md`) for using Firecracker microVM snapshots to give instant starts of bazel-diff.

The win: bazel-diff's own JVM CLI starts in <1s; the cost is the `bazel query deps(//...)` it shells out to, which pays full Bazel server warmup + external-repo/bzlmod resolution + Skyframe graph load on every cold start (minutes on a large monorepo). A snapshot captures that warm state once and restores it in roughly sub-second time, so the PR-time path only re-analyzes changed packages.

Decided scope
- CLI hooks (`warmup`, `fingerprint`) + a Go orchestration tool (`tools/firecracker/`)

What the RFC covers
- `warmup` (record entrypoint, clean exit = "safe to snapshot") and `fingerprint` (cache key); consume reuses existing `generate-hashes` + `get-impacted-targets`
- … `MODULE.bazel.lock` / `.bazelrc` / bazel-diff version / flag set, a fail-safe fall-back-to-cold on any mismatch, the fact that `SourceFileHasher` already makes content correctness independent of server incrementality, and a CI canary

This PR is docs-only
No code changes. Intended as the review artifact before implementation. Phase 1 (the `fingerprint` + `warmup` subcommands) would follow in a separate PR.

🤖 Generated with Claude Code
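As an illustration of the cache-key idea in the summary, the following Go sketch hashes the listed ingredients (Bazel version, `MODULE.bazel.lock`, `.bazelrc`, bazel-diff version, flag set) into one SHA-256 digest. It is a sketch under assumptions, not the planned `fingerprint` subcommand: the `inputs` type, its field names, the field labels, and the choice of SHA-256 with length-prefixed fields are all hypothetical, and the sample values are made up.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// inputs mirrors the fingerprint ingredients named in the summary.
// The type and field names are hypothetical.
type inputs struct {
	BazelVersion     string
	ModuleLock       []byte // contents of MODULE.bazel.lock
	Bazelrc          []byte // contents of .bazelrc
	BazelDiffVersion string
	Flags            []string // query-affecting flags, in the order given
}

// fingerprint hashes every input into a single hex digest. Each field is
// written with a label and a length prefix so that bytes cannot shift
// between adjacent fields without changing the result.
func fingerprint(in inputs) string {
	h := sha256.New()
	write := func(label string, data []byte) {
		fmt.Fprintf(h, "%s:%d:", label, len(data))
		h.Write(data)
	}
	write("bazel", []byte(in.BazelVersion))
	write("lock", in.ModuleLock)
	write("bazelrc", in.Bazelrc)
	write("bazel-diff", []byte(in.BazelDiffVersion))
	// Flag order is kept as-is: treating a reordering as a different key
	// can only cause an extra cold run, never a wrong warm one.
	for _, f := range in.Flags {
		write("flag", []byte(f))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	sample := inputs{ // sample values, not taken from a real workspace
		BazelVersion:     "7.1.0",
		ModuleLock:       []byte(`{"lockFileVersion": 6}`),
		Bazelrc:          []byte("build --enable_bzlmod\n"),
		BazelDiffVersion: "example-version",
		Flags:            []string{"--keep_going"},
	}
	fmt.Println(fingerprint(sample))
}
```

Any change to one of these inputs yields a different key, which is what lets a mismatch be detected cheaply before a snapshot is trusted.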