Skip to content

Replace Ruby bosh-nats-sync with Golang implementation#2746

Draft
aramprice wants to merge 15 commits into
mainfrom
experiment-golang-bosh-nats-sync
Draft

Replace Ruby bosh-nats-sync with Golang implementation#2746
aramprice wants to merge 15 commits into
mainfrom
experiment-golang-bosh-nats-sync

Conversation

@aramprice

Copy link
Copy Markdown
Member

Summary

  • Replaces the Ruby bosh-nats-sync gem with a Go binary (src/bosh-nats-sync/) that mirrors all Ruby behavior including TLS verification preferences (uaa_ca_cert > director_ca_cert), HTTP peer verification, and startup reliability
  • Removes all Ruby runtime dependencies from the nats BOSH job and package (director-ruby-3.3, gem bundling, BUNDLE_GEMFILE/GEM_HOME env vars)
  • Adds GitHub Actions CI coverage for the new Go code (lint + test jobs in go.yml); removes the defunct nats_sync:parallel Ruby matrix entry from ruby.yml

Commits

  1. Golang bosh-nats-sync — full Go implementation with parity to the Ruby version:

    • Correct TLS CA selection (uaa_ca_cert preferred over director_ca_cert)
    • Proper Director API TLS peer verification (no InsecureSkipVerify)
    • isConnectionError retry logic covering deadline exceeded / eof (matching Ruby's Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET)
    • Fatal startup behavior on failed NATS reload (matching Ruby)
    • Full Ginkgo/Gomega test suite
  2. Remove ruby-isms now that nats is golang — strips Ruby from jobs/nats/, packages/nats/, src/Gemfile, and CI:

    • packages/nats/packaging now builds the Go binary via go build
    • jobs/nats/ wrapper script calls Go binary directly (no Ruby runtime sourcing)
    • CI go.yml: adds lint (bosh-nats-sync) and test (bosh-nats-sync) jobs
    • CI ruby.yml: removes nats_sync:parallel matrix entry
    • Adds src/bosh-nats-sync/.golangci.yml

Test plan

  • go.yml / lint (bosh-nats-sync) passes
  • go.yml / test (bosh-nats-sync) passes
  • ruby.yml / unit_specs passes for all remaining sub-projects (common:parallel, monitor:parallel, release)
  • Release ERB template specs (spec/nats_templates_spec.rb) pass

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3d80f5f8-33e2-4d34-9ad5-3cd7392e0c25

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch experiment-golang-bosh-nats-sync

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Comment thread .github/workflows/go.yml Fixed
Comment thread .github/workflows/go.yml Fixed

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/go.yml:
- Line 21: The actions/checkout steps lack the persist-credentials parameter
which increases the risk of token reuse in subsequent steps and artifacts. Add
the parameter `persist-credentials: false` to both instances of the
`actions/checkout@v7` action to disable credential persistence and reduce token
exposure risk. This parameter should be added as a separate line within the uses
section for both checkout actions.
- Around line 21-22: Replace the GitHub Actions version tag pins with full
commit SHA pins for enhanced security. Update the `uses:` entries for
`actions/checkout@v7` and `actions/setup-go@v6` (and any other actions mentioned
in the comment) to use their full commit SHA instead of the `@v*` tag format.
This applies to all `uses:` statements throughout the workflow file where
actions are currently pinned to version tags rather than specific commit hashes.

In `@src/bosh-nats-sync/go.mod`:
- Line 3: The Go toolchain and dependencies contain multiple security
vulnerabilities that must be addressed before release. Update the go directive
from version 1.26.1 to 1.26.3 to resolve five known security vulnerabilities.
Additionally, upgrade golang.org/x/net from v0.49.0 to v0.53.0 to address the
GO-2026-4918 vulnerability affecting net/http infinite loops, and upgrade
golang.org/x/sys from v0.40.0 to v0.44.0 to address the GO-2026-5024
vulnerability affecting sys/windows integer overflow. After making these version
updates in the go.mod file, refresh the module graph by running go mod tidy to
ensure all transitive dependencies are properly resolved and consistent.

In `@src/bosh-nats-sync/pkg/authprovider/auth_provider_test.go`:
- Around line 305-308: The test assertion in the "Token expiration deadline"
describe block is tautological because it compares two identical constant
expressions (60 * time.Second to 60 * time.Second) which will always pass
regardless of actual production code changes. Replace the hardcoded 60 *
time.Second on the left side of the Expect call in the It block with
authprovider.ExpirationDeadline so the test actually verifies the production
constant has the expected value, while keeping the right side as 60 *
time.Second for the comparison target.

In `@src/bosh-nats-sync/pkg/authprovider/auth_provider.go`:
- Around line 87-91: In the error handling block following the failed token
acquisition call to a.cfg.Token(a.ctx), change the return statement from
returning ("", nil) to returning ("", err) to properly propagate the
authentication error instead of suppressing it. This ensures that downstream
code receives the error and can handle the transient failure appropriately
through the intended retry and error handling flow.
- Around line 109-111: The http.Client being returned from this function lacks
an explicit timeout, which can cause token requests to hang indefinitely. Add a
Timeout field to the http.Client struct that is being returned to specify a
reasonable timeout duration for UAA token requests. This ensures that network
operations on the client will not block indefinitely and will allow sync
progress to continue even if token requests become unresponsive.
- Around line 99-107: The code currently silently ignores read errors and parse
failures when loading the configured CA certificate file in the block around
caCertPath and x509.NewCertPool(). Instead of only proceeding when err is nil,
you should fail fast by returning an error when either the file read fails or
when AppendCertsFromPEM fails to parse the certificates. This ensures that
configuration issues with the CA file are surfaced immediately rather than
silently degrading TLS trust behavior at runtime.

In `@src/bosh-nats-sync/pkg/config/config.go`:
- Around line 55-60: The NewLogger function dereferences the cfg parameter
without first checking if it is nil, which will cause a panic if a nil config is
passed. Add a nil check at the start of the NewLogger function before accessing
cfg.LogFile, and when cfg is nil, fall back to creating a logger that writes to
stdout instead of attempting to access cfg fields.
- Around line 41-53: The Load function unmarshals the YAML data into the Config
struct but does not validate the configuration before returning it, which can
lead to runtime failures when downstream code attempts to use fields like
Intervals.PollUserSync with time.NewTicker. After the yaml.Unmarshal call
succeeds, add validation logic to check that required fields are present and
that interval values (specifically Intervals.PollUserSync) are positive non-zero
durations. If validation fails, return an error with a descriptive message. This
ensures the config is semantically valid before being used.

In `@src/bosh-nats-sync/pkg/runner/runner_test.go`:
- Around line 193-204: The failCmdRunner in this test fails immediately on the
first call (startup reload), causing Run() to exit before reaching the periodic
ticker loop, so the periodic sync failure path is never tested. Modify the
failCmdRunner to succeed on the first invocation (to allow the startup reload to
complete) and fail on subsequent invocations (to test the periodic sync failure
path). Use a counter or state tracking within the failCmdRunner function to
differentiate between the first call and later calls, ensuring the test actually
validates periodic-sync failure as described.

In `@src/bosh-nats-sync/pkg/runner/runner.go`:
- Around line 57-59: The code creates a time.NewTicker with a duration from
r.config.Intervals.PollUserSync without validating that the value is greater
than zero, which can cause a panic if the config value is unset or invalid. Add
validation logic before creating the ticker to check if the interval duration is
greater than zero, and return an appropriate error or handle the invalid
configuration gracefully instead of allowing time.NewTicker to panic. This
validation should occur before the line where time.NewTicker is called.

In `@src/bosh-nats-sync/pkg/userssync/users_sync_test.go`:
- Around line 500-549: The test context is titled "when director becomes
available after retries" but the It block only verifies that Execute() succeeds
without validating that retries actually occurred. The /info handler is designed
to return StatusServiceUnavailable on the first request and StatusOK on
subsequent requests, but the test never asserts this retry behavior happened.
Add an assertion after the sync.Execute() call in the It block to check that
requestCount >= 2, ensuring the /info endpoint was called multiple times and the
retry mechanism in withDirectorConnection() was actually exercised.

In `@src/bosh-nats-sync/pkg/userssync/users_sync.go`:
- Around line 307-309: The deploymentName parameter in the getVMsByDeployment
method is directly embedded into the API path using fmt.Sprintf without URL
encoding, which will cause issues if the deployment name contains reserved URL
characters like forward slashes, percent signs, or spaces. Fix this by
URL-escaping the deploymentName before embedding it into the path string using
the appropriate URL encoding function from the net/url package, specifically
url.QueryEscape or url.PathEscape as appropriate for path parameters.
- Around line 107-127: The ConnectionWaitTimeout is used to calculate retry
attempts but doesn't enforce an overall elapsed-time budget across all probes.
Each HTTP call can block independently for up to 30 seconds, causing startup to
stall far longer than the configured timeout. In the withDirectorConnection()
method, create a deadline based on the configured timeout at the start and track
elapsed time throughout the retry loop. Before each call to
boshAPIResponseBody() and before sleeping, check if the elapsed time would
exceed the deadline and bail out if so. This ensures the total time spent across
all attempts, including individual HTTP timeouts, never exceeds the
ConnectionWaitTimeout. Apply the same deadline-checking pattern to all other
callers of boshAPIResponseBody() in getAuthHeader(), queryAllDeployments(), and
getVMsByDeployment() methods.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74b1ea7c-2402-441d-a6e3-cda6eee9250b

📥 Commits

Reviewing files that changed from the base of the PR and between a13a1db and 980d15b.

⛔ Files ignored due to path filters (2)
  • src/Gemfile.lock is excluded by !**/*.lock
  • src/bosh-nats-sync/go.sum is excluded by !**/*.sum
📒 Files selected for processing (49)
  • .github/workflows/go.yml
  • .github/workflows/ruby.yml
  • jobs/nats/spec
  • jobs/nats/templates/bosh_nats_sync
  • jobs/nats/templates/bpm.yml
  • packages/nats/packaging
  • packages/nats/spec
  • src/Gemfile
  • src/bosh-nats-sync/.gitignore
  • src/bosh-nats-sync/.golangci.yml
  • src/bosh-nats-sync/bin/bosh-nats-sync
  • src/bosh-nats-sync/bosh-nats-sync.gemspec
  • src/bosh-nats-sync/cmd/bosh-nats-sync/main.go
  • src/bosh-nats-sync/go.mod
  • src/bosh-nats-sync/lib/nats_sync.rb
  • src/bosh-nats-sync/lib/nats_sync/auth_provider.rb
  • src/bosh-nats-sync/lib/nats_sync/config.rb
  • src/bosh-nats-sync/lib/nats_sync/nats_auth_config.rb
  • src/bosh-nats-sync/lib/nats_sync/runner.rb
  • src/bosh-nats-sync/lib/nats_sync/users_sync.rb
  • src/bosh-nats-sync/lib/nats_sync/version.rb
  • src/bosh-nats-sync/lib/nats_sync/yaml_helper.rb
  • src/bosh-nats-sync/pkg/authprovider/auth_provider.go
  • src/bosh-nats-sync/pkg/authprovider/auth_provider_suite_test.go
  • src/bosh-nats-sync/pkg/authprovider/auth_provider_test.go
  • src/bosh-nats-sync/pkg/config/config.go
  • src/bosh-nats-sync/pkg/config/config_suite_test.go
  • src/bosh-nats-sync/pkg/config/config_test.go
  • src/bosh-nats-sync/pkg/natsauthconfig/nats_auth_config.go
  • src/bosh-nats-sync/pkg/natsauthconfig/nats_auth_config_suite_test.go
  • src/bosh-nats-sync/pkg/natsauthconfig/nats_auth_config_test.go
  • src/bosh-nats-sync/pkg/runner/runner.go
  • src/bosh-nats-sync/pkg/runner/runner_suite_test.go
  • src/bosh-nats-sync/pkg/runner/runner_test.go
  • src/bosh-nats-sync/pkg/userssync/users_sync.go
  • src/bosh-nats-sync/pkg/userssync/users_sync_suite_test.go
  • src/bosh-nats-sync/pkg/userssync/users_sync_test.go
  • src/bosh-nats-sync/spec/nats_sync/auth_provider_spec.rb
  • src/bosh-nats-sync/spec/nats_sync/config_spec.rb
  • src/bosh-nats-sync/spec/nats_sync/nats_auth_config_spec.rb
  • src/bosh-nats-sync/spec/nats_sync/runner_spec.rb
  • src/bosh-nats-sync/spec/nats_sync/users_sync_spec.rb
  • src/bosh-nats-sync/spec/nats_sync_spec.rb
  • src/bosh-nats-sync/spec/spec_helper.rb
  • src/bosh-nats-sync/spec/support/uaa_helpers.rb
  • src/bosh-nats-sync/testdata/director-subject
  • src/bosh-nats-sync/testdata/hm-subject
  • src/bosh-nats-sync/testdata/sample_config.yml
  • src/tasks/spec.rake
💤 Files with no reviewable changes (25)
  • src/bosh-nats-sync/lib/nats_sync/version.rb
  • jobs/nats/templates/bosh_nats_sync
  • src/bosh-nats-sync/spec/support/uaa_helpers.rb
  • src/bosh-nats-sync/lib/nats_sync/nats_auth_config.rb
  • .github/workflows/ruby.yml
  • src/bosh-nats-sync/.gitignore
  • src/bosh-nats-sync/lib/nats_sync/yaml_helper.rb
  • src/bosh-nats-sync/spec/nats_sync/auth_provider_spec.rb
  • src/bosh-nats-sync/lib/nats_sync/runner.rb
  • src/bosh-nats-sync/spec/nats_sync/runner_spec.rb
  • src/bosh-nats-sync/spec/nats_sync/users_sync_spec.rb
  • src/bosh-nats-sync/bin/bosh-nats-sync
  • src/bosh-nats-sync/bosh-nats-sync.gemspec
  • src/bosh-nats-sync/spec/nats_sync/nats_auth_config_spec.rb
  • src/bosh-nats-sync/spec/nats_sync/config_spec.rb
  • src/bosh-nats-sync/lib/nats_sync.rb
  • src/Gemfile
  • jobs/nats/spec
  • src/bosh-nats-sync/lib/nats_sync/config.rb
  • src/bosh-nats-sync/spec/nats_sync_spec.rb
  • src/bosh-nats-sync/testdata/sample_config.yml
  • src/bosh-nats-sync/spec/spec_helper.rb
  • src/bosh-nats-sync/lib/nats_sync/users_sync.rb
  • src/bosh-nats-sync/lib/nats_sync/auth_provider.rb
  • jobs/nats/templates/bpm.yml

Comment thread .github/workflows/go.yml
Comment thread .github/workflows/go.yml
Comment thread src/bosh-nats-sync/go.mod Outdated
Comment thread src/bosh-nats-sync/pkg/authprovider/auth_provider_test.go
Comment thread src/bosh-nats-sync/pkg/authprovider/auth_provider.go Outdated
Comment thread src/bosh-nats-sync/pkg/runner/runner_test.go
Comment thread src/bosh-nats-sync/pkg/runner/runner.go
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync_test.go
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go Outdated
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go Outdated
@github-project-automation github-project-automation Bot moved this from Inbox to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Jun 19, 2026
@aramprice aramprice force-pushed the experiment-golang-bosh-nats-sync branch 2 times, most recently from 7577681 to d923c47 Compare June 19, 2026 23:49
@aramprice aramprice requested a review from Copilot June 19, 2026 23:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 19, 2026
@github-project-automation github-project-automation Bot moved this from Waiting for Changes | Open for Contribution to Pending Merge | Prioritized in Foundational Infrastructure Working Group Jun 19, 2026
@Alphasite Alphasite requested a review from Copilot June 19, 2026 23:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 46 out of 51 changed files in this pull request and generated 6 comments.

Comment thread src/bosh-nats-sync/pkg/authprovider/auth_provider.go
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go Outdated
Comment thread packages/nats/spec
Comment thread packages/nats/packaging Outdated
Comment thread src/bosh-nats-sync/pkg/runner/runner.go Outdated
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 20, 2026
@aramprice aramprice force-pushed the experiment-golang-bosh-nats-sync branch 4 times, most recently from 84f41f0 to 8d773dc Compare June 20, 2026 01:47
@colins colins self-assigned this Jun 23, 2026
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go
@aramprice aramprice force-pushed the experiment-golang-bosh-nats-sync branch 2 times, most recently from fddaf21 to 5980267 Compare June 25, 2026 20:36
@colins colins force-pushed the experiment-golang-bosh-nats-sync branch from 118abae to 5b1d58a Compare June 26, 2026 18:31
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go
Comment thread src/bosh-nats-sync/pkg/authprovider/auth_provider.go
Comment thread src/bosh-nats-sync/pkg/runner/runner.go
Comment thread src/bosh-nats-sync/pkg/natsauthconfig/nats_auth_config_test.go Outdated
Comment thread src/bosh-nats-sync/pkg/authprovider/auth_provider_test.go Outdated
Comment thread src/bosh-nats-sync/pkg/userssync/users_sync.go
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 30, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 30, 2026
@aramprice

Copy link
Copy Markdown
Member Author

Behavioral parity audit: Ruby vs Go

A complete pass comparing the Ruby source (src/bosh-nats-sync/lib/nats_sync/) against this Go implementation. All deviations are noted; deviations marked improvement are intentional and correct.

Startup sequence

# Aspect Ruby Go
1 Signal handling Signal.trap('INT') only SIGINT and SIGTERM ✅ improvement (BPM sends SIGTERM)
2 Initial NATS reload Always sends reload; picks up whatever config pre-start wrote Bootstrap(): reads subject files, writes explicit director/HM config, then sends reload ✅ improvement
3 First sync timing Rufus::Scheduler.interval fires after the interval time.Ticker fires after the interval — identical timing

execute_users_sync / Execute

# Aspect Ruby Go
4 AuthProvider lifecycle NATSSync::AuthProvider.new(info, …) created on every authenticated request → new UAAToken → new UAA grant per request AuthProvider created once per Execute pass; token cached across all /deployments + /deployments/*/vms calls ✅ improvement (reduces UAA load)
5 UAA token failure Logs error, sets @uaa_token = nil, returns nil auth header → director returns 401 → RuntimeError caught → fallback Returns error to caller → fallback. Same outcome, clearer error chain ✅
6 Connection error retry Bosh::Common.retryable(tries: N, on: DIRECTOR_CONNECTION_ERRORS) — number-of-attempts based Deadline-based with 1 s sleep between retries — equivalent timeout
7 Deployment URL in path /deployments/#{deployment}/vmsno URL encoding /deployments/ + url.PathEscape(name) + /vms — URL-encodes name ✅ (BOSH names are [a-z0-9-] so no practical difference)
8 Empty agent_id VMs No check — writes CN=.agent.bosh-internal (invalid entry) Skips VMs with empty agent_id ✅ bug fixed
9 JSON serialization JSON.unparse — no HTML-escaping (preserves > literally) json.NewEncoder with SetEscapeHTML(false) — identical output ✅
10 Change-detection hash Digest::MD5.file(path).hexdigest sha256.Sum256 — same semantics (non-cryptographic change detection) ✅
11 read_subject_file (whitespace-only) File.empty? is false for whitespace-only; strip returns "" → returns "" (not nil!) TrimSpace → returns nil

auth_header / AuthProvider

# Aspect Ruby Go
12 Basic auth encoding Base64.encode64(…).strip — strip removes leading/trailing whitespace but NOT mid-string newlines Ruby inserts every 60 chars. For long credentials, produces an invalid RFC 2617 header base64.StdEncoding.EncodeToString — no newlines, RFC-compliant ✅
13 CA cert fallback uaa_ca_certdirector_ca_cert → system trust CAFilePath(): identical fallback chain ✅
14 TLS verification verify_mode = VERIFY_PEER always tls.Config{MinVersion: TLS12} — always verifies ✅

Error / shutdown handling

# Aspect Ruby Go
15 Sync error Scheduler on_error logs fatal + shutdown Runner returns error → main.go os.Exit(1) — equivalent
16 Stop @scheduler.shutdown Close stopCh + wait for Run() — equivalent graceful shutdown

NatsAuthConfig

# Aspect Ruby Go
17 Permissions structure Identical publish/subscribe arrays per user type Identical ✅
18 CN format "C=USA, O=Cloud Foundry, CN=#{cn}.agent.bosh-internal" Identical ✅
19 bootstrap user Included when !vm['permanent_nats_credentials'] Identical ✅

Verdict: The Go implementation is a faithful and improved replacement for the Ruby version. Every behavioral difference is an improvement (SIGTERM support, per-Execute auth caching, empty-agent-id guard, RFC-compliant Base64, clearer UAA error propagation).

@aramprice

Copy link
Copy Markdown
Member Author

Idiomatic Go audit

A second pass reviewing the implementation for Go clarity and convention (not Ruby-translation patterns).

Strengths ✅

  • errors.Is/errors.As for connection-error classification — correct and idiomatic vs. string-matching
  • context.Context propagated through every I/O call — correct
  • sync.Once for lazy HTTP client — correct (documented as permanently-fatal, which is the right semantic)
  • Functional options pattern (WithCommandRunner) — idiomatic for testability
  • slog.Logger — modern (Go 1.21+)
  • export_test.go for exposing unexported symbols in external test packages — correct Go idiom
  • Error wrapping with %w throughout — correct
  • defer func() { _ = resp.Body.Close() }() — correct ignore-close-error pattern
  • Table-driven tests via DescribeTable/Entry — well-suited to the isConnectionError matrix

Issues found and fixed in this review pass

  1. getCommandRunner() was dead codeNewUsersSync now normalises a nil cmdRunner in the constructor, and all call-sites use u.commandRunner directly. Removed.

  2. Runner test temp file leakdirSubjectFile and hmSubjectFile were never cleaned up in AfterEach. Fixed (analogous to the users_sync_test.go fix).

Design observations (no code change required, but worth noting)

  1. uaaTokenHeader holds the mutex for the entire HTTP calla.mu.Lock() is held while a.cfg.Token(tokenCtx) makes an outbound HTTP request (up to 30 s). Since there is only one sync goroutine at a time, this is safe and correct. A future refactor could use double-checked locking or sync.Once per token, but the current design is fine.

  2. readSubjectFile returns *string — this is idiomatic in Go when nil is semantically meaningful. An alternative (string, bool) return would also be idiomatic; either is fine.

  3. Module path bosh-nats-sync without a domain prefix — acceptable for a standalone binary; would need a qualified path (e.g., github.com/cloudfoundry/bosh/src/bosh-nats-sync) if external importability were ever needed.

  4. Bootstrap() always sends a NATS reload even if the config hasn't changed — unlike Execute which skips reload when the hash matches, Bootstrap always reloads after writing. This is correct for startup (we want NATS to definitely apply the new config), and the extra SIGHUP is harmless.

aramprice and others added 15 commits June 29, 2026 18:29
The following files reference the Ruby gem and will need updates once the Go binary is ready. These are **not** part of this plan's implementation scope but are enumerated for completeness:

1. **[packages/nats/packaging](packages/nats/packaging)** -- Currently builds the Ruby gem via `gem build` and `bosh_bundle_local`. Must be replaced with `go build -o ${BOSH_INSTALL_TARGET}/bin/bosh-nats-sync ./cmd/bosh-nats-sync/`. Remove Ruby gem build steps for bosh-nats-sync.
2. **[packages/nats/spec](packages/nats/spec)** -- Currently lists `bosh-nats-sync/**/`*, `bosh-common/**/`*, and `vendor/cache/*.gem` as files. The `bosh-nats-sync/**/*` glob stays but `bosh-common` and `vendor/cache` entries may no longer be needed for this package. A Go toolchain dependency must be added (replacing `director-ruby-3.3` for the sync binary, though nats-server itself doesn't need Ruby either).
3. **[jobs/nats/templates/bosh_nats_sync](jobs/nats/templates/bosh_nats_sync)** -- The wrapper script currently sources Ruby runtime env and runs `exec /var/vcap/packages/nats/bin/bosh-nats-sync -c ...`. Remove the Ruby runtime source line; the Go binary is self-contained.
4. **[jobs/nats/templates/bpm.yml](jobs/nats/templates/bpm.yml)** -- Remove `BUNDLE_GEMFILE` and `GEM_HOME` env vars from the `bosh_nats_sync` process config since Go binary needs no Ruby environment.
5. **[jobs/nats/spec](jobs/nats/spec)** -- The `packages` list includes `director-ruby-3.3`. If nats-server itself doesn't need Ruby, this dependency can be removed. A `golang-1.x` package dependency must be added for compilation (or the Go binary can be cross-compiled and vendored as a blob).
6. **[src/Gemfile](src/Gemfile)** -- Remove the `gem 'bosh-nats-sync', path: 'bosh-nats-sync'` line.
7. **[src/Gemfile.lock](src/Gemfile.lock)** -- Will be regenerated after removing bosh-nats-sync from Gemfile.
8. **[spec/nats_templates_spec.rb](spec/nats_templates_spec.rb)** -- The `bosh_nats_sync_config.yml.erb` test remains valid (config format is unchanged). No changes needed unless the config format changes.
9. **[scripts/rsync-vbox](scripts/rsync-vbox)** -- References bosh-nats-sync for syncing to dev VMs; may need path adjustments.

Made-with: Cursor
- Replace golang-1.26-linux spec.lock (vendored, no GCS blob yet) with
  a proper spec+packaging that can be built from source. The CI pipeline's
  bump-golang-package job will convert this back to the vendor-package
  approach once the blob is uploaded to GCS.
- Fix packages/nats/spec: BOSH prepends src/ automatically, so the file
  pattern should be bosh-nats-sync/**/* not src/bosh-nats-sync/**/*
- Fix packages/nats/packaging: BOSH strips src/ when extracting package
  files, so the compile target is bosh-nats-sync/ not src/bosh-nats-sync/
- Add src/golang-1.26-linux/compile.env.unix and runtime.env.unix from
  bosh-package-golang-release for the packaging script
…t store)

Implements the Ruby spec coverage missing from the Go test suite:

  Ruby: spec/nats_sync/auth_provider_spec.rb
    context 'user has not provided director_ca_cert' do
      it_behaves_like :auth_provider_shared_tests
    end

The shared examples applied in that context were:
  - 'returns auth header provided by UAA'
  - 'reuses the same token for subsequent requests'
  - 'when token is about to expire / obtains new token'
  - 'when getting token fails / does not raise'

The new Go context "when no CA cert is provided (system trust store)" adds:
  - CAFilePath() returns "" (equivalent to Ruby verifying ssl_cert_store with
    set_default_paths is used instead of ssl_ca_file)
  - All four token lifecycle cases above
  - An additional TLS test not present in Ruby: confirms InsecureSkipVerify is
    NOT set by asserting that a TLS UAA server with a self-signed cert fails to
    connect when no CA cert is configured
Implements the Ruby spec coverage missing from the Go test suite:

  Ruby: spec/nats_sync/users_sync_spec.rb
    context 'with various network errors' do
      [ Errno::ECONNREFUSED, Errno::ECONNRESET, Errno::ETIMEDOUT,
        Net::OpenTimeout, SocketError ].each do |error_class|
        it "retries on #{error_class}"
      end
    end

The Go implementation gates retries on isConnectionError(), which matches
connection-class errors by substring. The new DescribeTable in
users_sync_test.go covers every string pattern in that function, each
annotated with its Ruby error-class equivalent:

  Ruby error class          Go error string matched
  ─────────────────────     ────────────────────────
  Errno::ECONNREFUSED     → "connection refused"
  Errno::ECONNRESET       → "connection reset by peer"
  Errno::ETIMEDOUT        → "connection timed out"
  Net::OpenTimeout        → "i/o timeout", "deadline exceeded"
  SocketError             → "no such host"
  Errno::EHOSTUNREACH     → "host is unreachable"
  Errno::ENETUNREACH      → "network is unreachable"
  (Go-specific)           → "eof", "unexpected eof"

Negative cases verify that HTTP-level errors (401, 500) and nil do not
trigger retries.

Uses export_test.go to expose isConnectionError to the external test
package without modifying the production API.
On a fresh bosh create-env, pre-start writes a fooBar token placeholder to
auth.json. The old startup code only sent a SIGHUP to reload this placeholder,
switching NATS to token-based auth and causing every health_monitor connection
attempt to fail with "Authorization Violation" until the first periodic sync ran.

Add UsersSync.Bootstrap() which reads the director-subject and hm-subject files
written by pre-start and immediately writes a proper user-based NATS config,
then sends SIGHUP — all without querying the director. Runner.Run() now calls
bootstrapNATSConfig() at startup instead of a bare ReloadNATSServerConfig,
ensuring health_monitor can authenticate against NATS before the director
finishes initializing. Bootstrap failures are non-fatal and logged; the periodic
sync loop continues so the full user list is populated once the director is up.
Go's json.Marshal HTML-escapes '>' as '\u003e' by default (to prevent XSS
in HTML contexts). NATS's config parser does not support \u Unicode escape
sequences and rejects the file on reload with:

  Failed to reload server configuration: error parsing include file
  '../../../data/nats/auth.json', Parse error on line 1: 'Invalid escape
  character 'u'.'

This meant every SIGHUP sent by bosh-nats-sync caused NATS to reject the
new auth.json and continue running with the original fooBar token placeholder,
leaving health_monitor unable to authenticate for the full 5-minute deploy
timeout.

The fix switches writeNATSConfigFile to use json.NewEncoder with
SetEscapeHTML(false), which writes '>' literally so NATS can parse the file.
Also adds a regression test that asserts 'director.>' is not escaped.

Also adds /var/vcap/sys/log/nats/bosh-nats-sync.log and
/var/vcap/sys/log/nats/nats.log to the director diagnostic collection in
run-bats-pipeline.sh, which is what exposed this bug.
Introduces `bundle exec rake fly:bats` (from src/) so developers can run
the full BOSH Acceptance Test suite against their local branch without
waiting for the shared CI pipeline.

How it works
------------
The rake task sets a standalone Concourse pipeline (ci/fly-bats.yml),
then triggers and streams the single `bats` job.  The pipeline:

  1. Checks out the current branch from GitHub (branch is pushed
     automatically before set-pipeline).
  2. Runs make-candidate.yml to build a dev release tarball.
  3. Compiles the tarball via bosh-deployment/ci/tasks/shared/bosh-agent-compile.yml.
  4. Runs ci/tasks/run-bats-pipeline.sh which provisions a GCP environment
     with terraform, deploys the BOSH director, runs BATs, and tears down.

All GCP credentials are resolved from the shared Concourse credential
store (((gcp_json_key)) / ((gcp_project_id))).

Files added
-----------
- ci/fly-bats.yml            — standalone Concourse pipeline (replaces the
                               two separate fly-execute task files)
- ci/tasks/run-bats-pipeline.sh — orchestrates the full BATs lifecycle in GCP

Env-var overrides (all optional)
---------------------------------
  BATS_ENV_NAME     terraform env name (default: bats-local)
  STEMCELL_NAME     override GCP stemcell
  DEPLOY_ARGS       extra ops-files for bosh create-env
  BAT_RSPEC_FLAGS   extra RSpec flags (e.g. "--tag wip")
  CONCOURSE_TARGET  target alias (default: bosh)

Cherry-picked from fly-bats-pipeline branch (PR #2753).
…nt_id

When a VM is being created or its agent has not yet registered, the
Director returns it with an empty agent_id.  CreateConfig was processing
those entries and emitting a user with CN="" in auth.json:

  {"username":"C=USA, O=Cloud Foundry, CN=.agent.bosh-internal", ...}

That entry replaced the real agent credentials on the next NATS reload,
so after a VM reboot the agent could not reconnect to NATS for ~10 min.

Fixes
-----
- nats_auth_config.go: skip VMs with empty AgentID in CreateConfig
- users_sync.go: Bootstrap() is a no-op when auth.json already contains
  real user entries (preserves agent credentials across process restarts)
- users_sync.go: Execute() logs total/registered VM counts and logs when
  NATS is reloaded (aids diagnosis)
The golang-1.26-linux package on this branch is a stub (only compile.env /
runtime.env; the real Go binary blob hasn't been uploaded to GCS yet).
Replace the reference to bosh-deployment/ci/tasks/shared/bosh-agent-compile.yml
with an inline task that first installs Go 1.26.3 and drops a shim at
/usr/local/bin/go so that nats packaging can find the Go toolchain.
…or classification

Address review findings on the Go bosh-nats-sync rewrite, focused on a
behavioral regression against the Ruby implementation plus several
idiomatic-Go and shutdown-correctness gaps.

Behavior / correctness:
- Memoize /info and the UAA AuthProvider per sync pass. Previously
  getAuthHeader built a fresh AuthProvider and re-fetched /info on every
  authenticated request, so a UAA director was hit with one /info and one
  client_credentials token grant per deployment. Execute now fetches /info
  once, builds one AuthProvider, and reuses it for all authed calls in the
  pass (matching the Ruby @director_info / @uaa_token memoization).
- Reuse a single *http.Client/*http.Transport for the process instead of
  rebuilding it (and re-reading the CA file from disk) on every request.
- Thread context.Context through Run -> Execute -> the HTTP layer; use
  http.NewRequestWithContext and a cancellable retry sleep so Stop()/SIGTERM
  unwinds an in-flight Execute promptly instead of blocking past bpm's grace
  period. Remove the context.Context stored on AuthProvider.
- Return the fatal sync error from Runner.Run() so main exits non-zero;
  drop the awkward `go r.Stop()` self-call.
- Replace isConnectionError substring matching with errors.Is/errors.As over
  real error types (syscall errnos, *net.DNSError, *net.OpError, net.Error
  timeouts, io.EOF/ErrUnexpectedEOF, context.DeadlineExceeded). The old
  "eof"/"deadline exceeded" substrings could misclassify HTTP error bodies
  as retryable connection errors.
- A configured-but-unparseable director_ca_cert is now a hard error instead
  of silently falling back to the system trust store; missing/empty certs
  still fall back, matching Ruby's usable_director_ca_cert? predicate.
- Validate director.url and the NATS paths in config.Load; remove the dead
  url.Parse check; set tls.MinVersion to TLS 1.2 explicitly; extract the
  HTTP timeout constant; use http.StatusOK.

Structure / idiom:
- Add NewUsersSync constructor and make UsersSync fields unexported; the
  runner builds it once and reuses it for bootstrap and the periodic sync.
- Replace runner.NewWithCommandRunner with a functional-options New
  (WithCommandRunner); rename the stdlib-shadowing `sync` local.
- Use structured slog attributes consistently.

Tests / CI:
- Add regression tests: one /info + one token per Execute across multiple
  deployments; context cancellation unwinds Execute promptly; directorCACertPool
  system-store-vs-fail-loudly behavior.
- Rebuild the isConnectionError table around real error values, including a
  guard that an HTTP 500 body containing "eof" is not retried.
- Remove time.Sleep from the runner tests in favor of Eventually and
  atomic counters; assert os.WriteFile results in test setup.
- Run `go test -race` in the go.yml workflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Replace path.Join (filesystem semantics) with explicit string
  concatenation for the /deployments/<name>/vms URL, removing the
  risk of silent path-segment elision on an empty deployment name.
  Removes the unused "path" import.

- Remove the TOCTOU os.Stat call in readSubjectFile; a single
  os.ReadFile already handles missing and zero-byte files cleanly.

- Switch natsFileHash from crypto/md5 to crypto/sha256 to silence
  gosec G501. Extract hashBytes() helper so writeNATSConfigFile can
  return the hash of what it just wrote, eliminating the second disk
  read in Execute.

- Guard the "Could not query all running vms" error log with a
  ctx.Err() check; a context-cancellation (graceful shutdown) no
  longer produces a misleading VM-query error message.

- Document the intentional difference between userFileOverwritable
  (Execute-time "is the JSON object empty?" gate) and hasExistingUsers
  (Bootstrap-time "are agent credentials already written?" gate).

- Document that a failed sync.Once in httpClient() is permanent by
  design: CA-cert misconfiguration requires a restart, not a retry.

- Add TLS caveat to the isConnectionError catch-all *net.OpError
  branch noting that x509 verification failures are NOT net.OpError
  and therefore correctly fall through as non-retryable.

- Add cross-reference comments between the two httpClientTimeout
  constants (authprovider vs userssync) so they do not silently
  diverge in future.

ai-assisted=yes
[TNZ-113908] Convert BOSH nats-sync job to Golang
- Fix temp directory leak in users_sync_test.go: the BeforeEach
  created a tmpDir via os.MkdirTemp but AfterEach only removed the
  individual subject files, leaking the directory. Now uses
  os.RemoveAll(subjectTmpDir).

- Fix misleading fixture data in nats_auth_config_test.go:
  directorSubject had a spurious "subject=" prefix and referred to
  "hm.bosh-internal" instead of "director.bosh-internal". Updated to
  use the realistic value that the pre-start script would produce.

- Strengthen the auth_provider_test.go "with director_ca_cert" test:
  add a second case that uses httptest.NewTLSServer and the server's
  own certificate as the CA cert, verifying that the configured cert
  is actually trusted in a real TLS handshake (not just that the HTTP
  client builds without error). Clarify the first case's description
  to reflect that it exercises the HTTP client build path only.

ai-assisted=yes
[TNZ-113908] Convert BOSH nats-sync job to Golang
…in Execute

The Ruby implementation raised ECONNREFUSED after exhausting director
retries, causing BPM to restart the process. The Go implementation instead
logs the error and falls back to writing a director/HM-only config when
auth.json is empty, improving startup reliability by giving health_monitor
a chance to authenticate against NATS while the director is still starting.

Add a comment to Execute calling this out explicitly, addressing a PR
reviewer question about the missing 'raises an error after exhausting
retries' test.

ai-assisted=yes
[TNZ-113908] Convert BOSH nats-sync job to Golang
- Remove getCommandRunner() helper: now that NewUsersSync normalises a
  nil cmdRunner to DefaultCommandRunner at construction time, the nil
  check inside getCommandRunner() is unreachable. All call-sites updated
  to use u.commandRunner directly, which is always non-nil after New.

- Fix temp file leak in runner_test.go: dirSubjectFile and hmSubjectFile
  were created with os.CreateTemp but never removed in AfterEach. Promote
  them to suite-level variables and clean them up alongside natsConfigFile.

ai-assisted=yes
[TNZ-113908] Convert BOSH nats-sync job to Golang
@aramprice aramprice force-pushed the experiment-golang-bosh-nats-sync branch from aedbed4 to 1196d5a Compare June 30, 2026 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Pending Merge | Prioritized

Development

Successfully merging this pull request may close these issues.

4 participants