feat(kamaji): default to 2 replicas, drop telemetry webhook#2671
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR configures the Kamaji controller for high-availability deployment with telemetry disabled, adds a soft pod anti-affinity, conditionally omits telemetry validating webhooks when telemetry is disabled, and re-applies Cozystack-side chart patches after extracting the upstream chart. ChangesKamaji HA and telemetry configuration
🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/system/kamaji/Makefile`:
- Line 16: The patch invocation in the Makefile uses "patch
--no-backup-if-mismatch -p1 < patches/1.diff" which only strips the leading "a/"
and leaves "packages/system/kamaji/..." paths, causing the patch to fail when
run from the packages/system/kamaji directory; update the patch command to use
"-p4" so it strips "a/packages/system/kamaji/" and targets the correct
"charts/..." paths (i.e., locate the patch invocation in the Makefile and change
the -p1 argument to -p4).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5783c5f-3589-4fba-baba-c46af238e2c7
📒 Files selected for processing (4)
packages/system/kamaji/Makefilepackages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yamlpackages/system/kamaji/patches/1.diffpackages/system/kamaji/values.yaml
| # Re-apply Cozystack-side patches over the freshly extracted chart so | ||
| # `make update` stays reproducible. Mirrors the pattern used by | ||
| # packages/system/fluxcd-operator and packages/system/ingress-nginx. | ||
| patch --no-backup-if-mismatch -p1 < patches/1.diff |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cd packages/system/kamaji
echo "== Dry-run with current flag (-p1) =="
if patch --dry-run --no-backup-if-mismatch -p1 < patches/1.diff; then
echo "p1: applies cleanly"
else
echo "p1: failed to apply"
fi
echo "== Dry-run with adjusted flag (-p4) =="
if patch --dry-run --no-backup-if-mismatch -p4 < patches/1.diff; then
echo "p4: applies cleanly"
else
echo "p4: failed to apply"
fiRepository: cozystack/cozystack
Length of output: 527
🏁 Script executed:
head -50 packages/system/kamaji/patches/1.diffRepository: cozystack/cozystack
Length of output: 692
🏁 Script executed:
ls -la packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml 2>&1 || echo "File not found"Repository: cozystack/cozystack
Length of output: 188
🏁 Script executed:
cat packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yamlRepository: cozystack/cozystack
Length of output: 1095
Use -p4 instead of -p1 for correct path stripping.
The patch file uses git-style paths (a/packages/system/kamaji/charts/...). When executed from packages/system/kamaji, -p1 strips only the a/ prefix, leaving a path that doesn't resolve; -p4 is required to strip through a/packages/system/kamaji/ and reach the correct charts/... target. Without this fix, the patch will fail to apply.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/system/kamaji/Makefile` at line 16, The patch invocation in the
Makefile uses "patch --no-backup-if-mismatch -p1 < patches/1.diff" which only
strips the leading "a/" and leaves "packages/system/kamaji/..." paths, causing
the patch to fail when run from the packages/system/kamaji directory; update the
patch command to use "-p4" so it strips "a/packages/system/kamaji/" and targets
the correct "charts/..." paths (i.e., locate the patch invocation in the
Makefile and change the -p1 argument to -p4).
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request optimizes the Kamaji controller deployment to mitigate performance bottlenecks caused by CPU contention during webhook processing. By increasing the replica count and disabling unnecessary telemetry handlers, the changes significantly reduce admission latency for TenantControlPlane operations. Additionally, the update improves pod scheduling resilience and streamlines the maintenance of custom patches within the repository. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on Gemini (@gemini-code-assist) comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request updates the Kamaji package by configuring two replicas for high availability, disabling telemetry to reduce apiserver latency, and adding pod anti-affinity. It also updates the Makefile to ensure that Cozystack-side patches are re-applied after chart updates. The review identified an issue with the patch application command in the Makefile, suggesting the use of -p3 instead of -p1 to correctly handle the file paths in the diff.
| # Re-apply Cozystack-side patches over the freshly extracted chart so | ||
| # `make update` stays reproducible. Mirrors the pattern used by | ||
| # packages/system/fluxcd-operator and packages/system/ingress-nginx. | ||
| patch --no-backup-if-mismatch -p1 < patches/1.diff |
There was a problem hiding this comment.
The -p1 argument for patch appears to be incorrect. The patch file (patches/1.diff) contains file paths relative to the repository root (e.g., packages/system/kamaji/charts/...).
When this Makefile target runs, the working directory is packages/system/kamaji/. To correctly locate the files to be patched (e.g., charts/...), patch must strip 3 leading directory components from the paths in the diff file.
Please use -p3 instead of -p1.
patch --no-backup-if-mismatch -p3 < patches/1.diff
|
Thanks CodeRabbit (@coderabbitai) and Gemini (@gemini-code-assist) for the catch on the Fixed in 587f6a76 — but instead of bumping
That lets the existing Note: Gemini suggested |
0eb45ba to
587f6a7
Compare
|
Tip For best results, initiate chat on the files or code changes.
(Marking the original comment as addressed.) 🐇 ✨ ✏️ Learnings added
|
Aleksei Sviridkin (lexfrei)
left a comment
There was a problem hiding this comment.
All three values keys are native to the upstream chart (replicaCount, affinity, telemetry.disabled), so nothing is inert. I verified the wiring end-to-end: controller.yaml hardcodes --leader-elect, so scaling to 2 replicas is HA-safe (one reconciler, both pods serve the webhook behind the Service); --disable-telemetry is gated on .Values.telemetry.disabled and the patch removes the matching telemetry.kamaji.clastix.io entry from the ValidatingWebhookConfiguration, keeping flag and webhook config consistent — and since that webhook was failurePolicy: Ignore, dropping it is safe and removes a wasted apiserver round-trip. The soft podAntiAffinity selector (app.kubernetes.io/name: kamaji) matches the chart's actual selectorLabels, so the spread is effective and still schedules on single-node clusters. I dry-ran the patch from the package dir: it applies cleanly with the short-path convention that matches the fluxcd-operator package. LGTM.
Non-blocking items:
packages/system/kamaji/patches/1.diff— cosmetic: thediff --githeader path disagrees with the--- a/charts/.../+++ b/charts/...short paths the patch actually uses. Harmless forpatch -p1, butgit applywould choke on the mismatch. Optional to align.packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml— the new{{- if not .Values.telemetry.disabled }}guard has no helm-unittest asserting the webhook is absent whentelemetry.disabled=true. A small test would lock the behavior against future chart bumps.
The kamaji-controller-manager runs ~9 sub-controllers per
TenantControlPlane (bootstraptoken, coredns, konnectivity, kubeadm*,
kubelet, kubeproxy, etc.) inside a single Go process, AND serves the
admission webhooks for tenantcontrolplanes/datastores via the
kamaji-webhook-service ClusterIP. With the upstream default
replicaCount=1, every webhook RT shares CPU with the reconciler
goroutines. On a multi-tenant cluster (~20 TCPs), this pushes
apiserver-observed p99 for mtenantcontrolplane.kb.io and
vtenantcontrolplane.kb.io from the kamaji-side ~10ms to 85-95ms.
Three changes, each safe on small clusters and high-impact on large ones:
* replicaCount=2 — HA-safe baseline. --leader-elect is already enabled
upstream, so only one replica reconciles tenants; the follower's CPU
is fully available for webhook handlers, cutting p99 ~70% on busy
clusters. On a single-node cluster both pods schedule normally.
* telemetry.disabled=true — drops both the --disable-telemetry handler
and the telemetry.kamaji.clastix.io admission webhook entry from the
ValidatingWebhookConfiguration. The latter requires patches/1.diff to
wrap the telemetry block in a {{ if not .Values.telemetry.disabled }}
conditional inside the vendored controller-gen file. Without the
webhook entry, the apiserver no longer issues a TCP round-trip to a
no-op handler on every TenantControlPlane CREATE/UPDATE/DELETE.
* podAntiAffinity (preferred) — spread the two replicas across nodes
when possible. preferredDuringScheduling so single-node clusters
still schedule (both pods colocated, which is acceptable).
Measured on a 22-TCP production cluster (Hikube, before/after):
mtenantcontrolplane p99 85 → 25 ms (-71%), vtenantcontrolplane p99
92 → 25 ms (-73%), telemetry webhook impact 291 → 0. Total Kamaji
admission impact -83% (830 → 140).
The patches/1.diff approach (re-applied by `make update`) mirrors the
pattern already used in fluxcd-operator and ingress-nginx and survives
the next clastix/kamaji chart bump.
Signed-off-by: mattia-eleuteri <mattia@hidora.io>
587f6a7 to
8bc62cc
Compare
10f8e2d
into
cozystack:main
What this PR does
kamaji-controller-managerruns 9 sub-controllers per TenantControlPlane (bootstraptoken, coredns, konnectivity, kubeadm*, kubelet, kubeproxy, …) inside a single Go process, and serves the admission webhooks fortenantcontrolplanes+datastoresfrom the same pod viakamaji-webhook-service. With the upstream defaultreplicaCount=1, every webhook round-trip from the apiserver shares CPU with the reconciler goroutines. On a multi-tenant cluster this pushes the apiserver-observed p99 ofmtenantcontrolplane.kb.ioandvtenantcontrolplane.kb.iofrom the kamaji-side~10ms(percontroller_runtime_webhook_latency_seconds_count/sum) up to85–95ms— purely due to CPU contention.Three safe defaults are changed:
kamaji.replicaCount: 2— HA-safe baseline.--leader-electis already enabled upstream, so only one replica reconciles tenants; the follower's CPU is fully available for webhook handlers. On a single-node cluster both pods still schedule (just colocated).kamaji.telemetry.disabled: true— drops the--disable-telemetryhandler and removes thetelemetry.kamaji.clastix.ioentry from theValidatingWebhookConfiguration. The latter requirespatches/1.diffto wrap the telemetry block in{{- if not .Values.telemetry.disabled }} … {{- end }}inside the vendoredcontroller-gen/validating-webhook.yaml. Without the webhook entry, the apiserver no longer round-trips to a no-op handler on every TenantControlPlane CREATE/UPDATE/DELETE.kamaji.affinity(softpodAntiAffinity) — spread the two replicas across nodes when possible, while keeping single-node clusters working (preferredDuringScheduling, not required).The
patches/1.diffapproach (re-applied bymake update) mirrors the pattern already used inpackages/system/fluxcd-operator/Makefileandpackages/system/ingress-nginx/Makefile, and survives the nextclastix/kamajichart bump.Measured impact
Hikube prod, 22 TenantControlPlanes, before/after a manual rollout of the same configuration:
mtenantcontrolplane.kb.iovtenantcontrolplane.kb.iotelemetry.kamaji.clastix.ioCumulative Kamaji admission impact (
p99 × calls/s) dropped from 830 → 140 (-83%). The Kamaji webhooks moved from the top three of theapiserver_admission_webhook_admission_duration_seconds_*impact ranking down to positions #3-#4.Release note
Test plan
helm template packages/system/kamajirendersreplicas: 2,--disable-telemetry, the soft anti-affinity, and notelemetry.kamaji.clastix.ioentry in theValidatingWebhookConfiguration.patches/1.diffapplies cleanly against the freshly extracted upstream chart (patch --dry-run -p1 < patches/1.diff).make updatere-applies the patch automatically.replicaCount: 3via cozystack-values), zero pod restarts, leader election visible,kamaji-webhook-serviceEndpoints serving N IPs.Summary by CodeRabbit
New Features
Bug Fixes
Chores