Skip to content

feat(kamaji): default to 2 replicas, drop telemetry webhook#2671

Merged
Aleksei Sviridkin (lexfrei) merged 1 commit into
cozystack:mainfrom
mattia-eleuteri:feat/kamaji-ha-disable-telemetry
Jul 3, 2026
Merged

feat(kamaji): default to 2 replicas, drop telemetry webhook#2671
Aleksei Sviridkin (lexfrei) merged 1 commit into
cozystack:mainfrom
mattia-eleuteri:feat/kamaji-ha-disable-telemetry

Conversation

@mattia-eleuteri

@mattia-eleuteri mattia-eleuteri commented May 18, 2026

Copy link
Copy Markdown
Contributor

What this PR does

kamaji-controller-manager runs 9 sub-controllers per TenantControlPlane (bootstraptoken, coredns, konnectivity, kubeadm*, kubelet, kubeproxy, …) inside a single Go process, and serves the admission webhooks for tenantcontrolplanes + datastores from the same pod via kamaji-webhook-service. With the upstream default replicaCount=1, every webhook round-trip from the apiserver shares CPU with the reconciler goroutines. On a multi-tenant cluster this pushes the apiserver-observed p99 of mtenantcontrolplane.kb.io and vtenantcontrolplane.kb.io from the kamaji-side ~10ms (per controller_runtime_webhook_latency_seconds_count/sum) up to 85–95ms — purely due to CPU contention.

Three safe defaults are changed:

  • kamaji.replicaCount: 2 — HA-safe baseline. --leader-elect is already enabled upstream, so only one replica reconciles tenants; the follower's CPU is fully available for webhook handlers. On a single-node cluster both pods still schedule (just colocated).
  • kamaji.telemetry.disabled: true — drops the --disable-telemetry handler and removes the telemetry.kamaji.clastix.io entry from the ValidatingWebhookConfiguration. The latter requires patches/1.diff to wrap the telemetry block in {{- if not .Values.telemetry.disabled }} … {{- end }} inside the vendored controller-gen/validating-webhook.yaml. Without the webhook entry, the apiserver no longer round-trips to a no-op handler on every TenantControlPlane CREATE/UPDATE/DELETE.
  • kamaji.affinity (soft podAntiAffinity) — spread the two replicas across nodes when possible, while keeping single-node clusters working (preferredDuringScheduling, not required).

The patches/1.diff approach (re-applied by make update) mirrors the pattern already used in packages/system/fluxcd-operator/Makefile and packages/system/ingress-nginx/Makefile, and survives the next clastix/kamaji chart bump.

Measured impact

Hikube prod, 22 TenantControlPlanes, before/after a manual rollout of the same configuration:

webhook p99 before p99 after Δ
mtenantcontrolplane.kb.io 85 ms 25 ms -71%
vtenantcontrolplane.kb.io 92 ms 25 ms -73%
telemetry.kamaji.clastix.io 94 ms webhook removed -100%

Cumulative Kamaji admission impact (p99 × calls/s) dropped from 830 → 140 (-83%). The Kamaji webhooks moved from the top three of the apiserver_admission_webhook_admission_duration_seconds_* impact ranking down to positions #3-#4.

Release note

feat(kamaji): default to 2 replicas with soft pod anti-affinity and disable telemetry (handler + webhook entry). Reduces apiserver admission p99 for tenantcontrolplane mutating/validating webhooks by ~70% on multi-tenant clusters.

Test plan

  • helm template packages/system/kamaji renders replicas: 2, --disable-telemetry, the soft anti-affinity, and no telemetry.kamaji.clastix.io entry in the ValidatingWebhookConfiguration.
  • patches/1.diff applies cleanly against the freshly extracted upstream chart (patch --dry-run -p1 < patches/1.diff).
  • make update re-applies the patch automatically.
  • Live deploy on a 22-TCP cluster: 3 pods Running (when overriden to replicaCount: 3 via cozystack-values), zero pod restarts, leader election visible, kamaji-webhook-service Endpoints serving N IPs.

Summary by CodeRabbit

  • New Features

    • Telemetry can be disabled via configuration.
    • Controller configured for high availability with 2 replicas.
    • Pod anti-affinity added to spread replicas across nodes for better resilience.
  • Bug Fixes

    • Validating webhook is now omitted when telemetry is disabled to avoid unintended behavior.
  • Chores

    • Update process improved to re-apply chart patches after extraction for reproducible updates.

Review Change Stack

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 192b562f-791b-4894-b3cf-ea1b2eecb9f6

📥 Commits

Reviewing files that changed from the base of the PR and between 0eb45ba and 587f6a7.

📒 Files selected for processing (4)
  • packages/system/kamaji/Makefile
  • packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml
  • packages/system/kamaji/patches/1.diff
  • packages/system/kamaji/values.yaml
✅ Files skipped from review due to trivial changes (1)
  • packages/system/kamaji/Makefile
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/system/kamaji/values.yaml
  • packages/system/kamaji/patches/1.diff

📝 Walkthrough

Walkthrough

This PR configures the Kamaji controller for high-availability deployment with telemetry disabled, adds a soft pod anti-affinity, conditionally omits telemetry validating webhooks when telemetry is disabled, and re-applies Cozystack-side chart patches after extracting the upstream chart.

Changes

Kamaji HA and telemetry configuration

Layer / File(s) Summary
Values configuration for HA and telemetry
packages/system/kamaji/values.yaml
Sets replicaCount to 2, disables telemetry via telemetry.disabled: true, and adds soft pod anti-affinity (preferredDuringSchedulingIgnoredDuringExecution) to encourage replica spread across nodes.
Conditional telemetry webhook in Helm template
packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml, packages/system/kamaji/patches/1.diff
Wraps the telemetry ValidatingWebhookConfiguration stanza with {{- if not .Values.telemetry.disabled }} / {{- end }} so the webhook is omitted when telemetry is disabled.
Patch application in build process
packages/system/kamaji/Makefile
Updates the make update target to re-apply Cozystack-side chart patches (patches/1.diff) after extracting the upstream Kamaji chart tarball, documenting the extract-then-patch pattern.

🎯 2 (Simple) | ⏱️ ~10 minutes

🐰 Two replicas now dance with glee,
Telemetry quiet, as HA should be,
Patches re-apply when charts get new,
Affinity nudges pods across the crew,
I nibble logs and hop — all tests pass too!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: configuring 2 replicas for HA and removing the telemetry webhook, which directly addresses the PR's primary objective of reducing admission latency.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dosubot dosubot Bot added area/platform Issues or PRs related to platform infrastructure (bundle, flux, talos, installer) kind/feature Categorizes issue or PR as related to a new feature labels May 18, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/system/kamaji/Makefile`:
- Line 16: The patch invocation in the Makefile uses "patch
--no-backup-if-mismatch -p1 < patches/1.diff" which only strips the leading "a/"
and leaves "packages/system/kamaji/..." paths, causing the patch to fail when
run from the packages/system/kamaji directory; update the patch command to use
"-p4" so it strips "a/packages/system/kamaji/" and targets the correct
"charts/..." paths (i.e., locate the patch invocation in the Makefile and change
the -p1 argument to -p4).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a5783c5f-3589-4fba-baba-c46af238e2c7

📥 Commits

Reviewing files that changed from the base of the PR and between d0c12c6 and 0eb45ba.

📒 Files selected for processing (4)
  • packages/system/kamaji/Makefile
  • packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml
  • packages/system/kamaji/patches/1.diff
  • packages/system/kamaji/values.yaml

# Re-apply Cozystack-side patches over the freshly extracted chart so
# `make update` stays reproducible. Mirrors the pattern used by
# packages/system/fluxcd-operator and packages/system/ingress-nginx.
patch --no-backup-if-mismatch -p1 < patches/1.diff

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

cd packages/system/kamaji

echo "== Dry-run with current flag (-p1) =="
if patch --dry-run --no-backup-if-mismatch -p1 < patches/1.diff; then
  echo "p1: applies cleanly"
else
  echo "p1: failed to apply"
fi

echo "== Dry-run with adjusted flag (-p4) =="
if patch --dry-run --no-backup-if-mismatch -p4 < patches/1.diff; then
  echo "p4: applies cleanly"
else
  echo "p4: failed to apply"
fi

Repository: cozystack/cozystack

Length of output: 527


🏁 Script executed:

head -50 packages/system/kamaji/patches/1.diff

Repository: cozystack/cozystack

Length of output: 692


🏁 Script executed:

ls -la packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml 2>&1 || echo "File not found"

Repository: cozystack/cozystack

Length of output: 188


🏁 Script executed:

cat packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml

Repository: cozystack/cozystack

Length of output: 1095


Use -p4 instead of -p1 for correct path stripping.

The patch file uses git-style paths (a/packages/system/kamaji/charts/...). When executed from packages/system/kamaji, -p1 strips only the a/ prefix, leaving a path that doesn't resolve; -p4 is required to strip through a/packages/system/kamaji/ and reach the correct charts/... target. Without this fix, the patch will fail to apply.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/system/kamaji/Makefile` at line 16, The patch invocation in the
Makefile uses "patch --no-backup-if-mismatch -p1 < patches/1.diff" which only
strips the leading "a/" and leaves "packages/system/kamaji/..." paths, causing
the patch to fail when run from the packages/system/kamaji directory; update the
patch command to use "-p4" so it strips "a/packages/system/kamaji/" and targets
the correct "charts/..." paths (i.e., locate the patch invocation in the
Makefile and change the -p1 argument to -p4).

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the Kamaji controller deployment to mitigate performance bottlenecks caused by CPU contention during webhook processing. By increasing the replica count and disabling unnecessary telemetry handlers, the changes significantly reduce admission latency for TenantControlPlane operations. Additionally, the update improves pod scheduling resilience and streamlines the maintenance of custom patches within the repository.

Highlights

  • Performance Optimization: Increased the default replica count to 2 and disabled telemetry to reduce CPU contention, resulting in a ~70% improvement in apiserver admission p99 latency.
  • Configuration Changes: Introduced soft pod anti-affinity to improve distribution of Kamaji controller replicas across nodes.
  • Patch Management: Updated the Makefile to automatically re-apply custom patches after chart updates, ensuring reproducibility.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files
  • Ignored by pattern: **/*.diff (1)
    • packages/system/kamaji/patches/1.diff
  • Ignored by pattern: **/charts/** (1)
    • packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment Gemini (@gemini-code-assist) Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on Gemini (@gemini-code-assist) comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Kamaji package by configuring two replicas for high availability, disabling telemetry to reduce apiserver latency, and adding pod anti-affinity. It also updates the Makefile to ensure that Cozystack-side patches are re-applied after chart updates. The review identified an issue with the patch application command in the Makefile, suggesting the use of -p3 instead of -p1 to correctly handle the file paths in the diff.

# Re-apply Cozystack-side patches over the freshly extracted chart so
# `make update` stays reproducible. Mirrors the pattern used by
# packages/system/fluxcd-operator and packages/system/ingress-nginx.
patch --no-backup-if-mismatch -p1 < patches/1.diff

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The -p1 argument for patch appears to be incorrect. The patch file (patches/1.diff) contains file paths relative to the repository root (e.g., packages/system/kamaji/charts/...).

When this Makefile target runs, the working directory is packages/system/kamaji/. To correctly locate the files to be patched (e.g., charts/...), patch must strip 3 leading directory components from the paths in the diff file.

Please use -p3 instead of -p1.

	patch --no-backup-if-mismatch -p3 < patches/1.diff

@github-actions github-actions Bot added area/uncategorized PR auto-labeler could not map title scope to a known area/*; please review size/M This PR changes 30-99 lines, ignoring generated files labels May 18, 2026
@mattia-eleuteri

Copy link
Copy Markdown
Contributor Author

Thanks CodeRabbit (@coderabbitai) and Gemini (@gemini-code-assist) for the catch on the patch -p flag.

Fixed in 587f6a76 — but instead of bumping -p1 to -p4 (CodeRabbit) or -p3 (Gemini), I aligned the patch format with the existing packages/system/fluxcd-operator/patches/*.diff convention:

  • diff --git a/<full-repo-path> header (informational, matches what git diff would emit)
  • --- a/<short-path-from-chart-dir> / +++ b/<short-path-from-chart-dir> hunks

That lets the existing patch -p1 invocation in the Makefile work unchanged and keeps the convention consistent across the system packages. Verified with patch --dry-run --no-backup-if-mismatch -p1 < patches/1.diff against a freshly extracted upstream chart — applies cleanly, produces the exact same result as the in-tree modification.

Note: Gemini suggested -p3 which strips a/packages/system/ and leaves kamaji/charts/... — that would have failed (the cwd is packages/system/kamaji/, so the file is charts/kamaji/...). The correct strip count with the original long paths was -p4. The short-path version with -p1 is what's now committed.

@mattia-eleuteri mattia-eleuteri force-pushed the feat/kamaji-ha-disable-telemetry branch from 0eb45ba to 587f6a7 Compare May 18, 2026 14:42
@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

@mattia-eleuteri Thanks for the thorough explanation! The approach makes perfect sense — keeping the short-path convention (--- a/<short-path-from-chart-dir>) consistent with packages/system/fluxcd-operator/patches/ is cleaner than bumping the strip count, and it means the patch -p1 call in the Makefile stays untouched across all system packages. The dry-run verification against a freshly extracted tarball is exactly the right way to confirm it.

(Marking the original comment as addressed.)

🐇 ✨


✏️ Learnings added
Learnt from: mattia-eleuteri
URL: https://github.com/cozystack/cozystack/pull/2671

Timestamp: 2026-05-18T14:42:48.676Z
Learning: In the cozystack repository, patch diff files under `packages/system/*/patches/*.diff` follow a two-part header convention:
1. `diff --git a/<full-repo-path>` — informational git-style header
2. `--- a/<short-path-from-chart-dir>` / `+++ b/<short-path-from-chart-dir>` — actual hunk paths relative to the package chart directory (e.g. `charts/kamaji/controller-gen/validating-webhook.yaml`)

This convention allows `patch --no-backup-if-mismatch -p1 < patches/1.diff` to work unchanged in the Makefile (cwd is `packages/system/<pkg>/`), and is consistent across all system packages (e.g. `packages/system/fluxcd-operator/patches/`). When reviewing patch files in this repo, verify that `--- a/` paths are relative to the chart directory (not the repo root), so `-p1` strips only the leading `a/` prefix.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@lexfrei Aleksei Sviridkin (lexfrei) removed the area/uncategorized PR auto-labeler could not map title scope to a known area/*; please review label May 25, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three values keys are native to the upstream chart (replicaCount, affinity, telemetry.disabled), so nothing is inert. I verified the wiring end-to-end: controller.yaml hardcodes --leader-elect, so scaling to 2 replicas is HA-safe (one reconciler, both pods serve the webhook behind the Service); --disable-telemetry is gated on .Values.telemetry.disabled and the patch removes the matching telemetry.kamaji.clastix.io entry from the ValidatingWebhookConfiguration, keeping flag and webhook config consistent — and since that webhook was failurePolicy: Ignore, dropping it is safe and removes a wasted apiserver round-trip. The soft podAntiAffinity selector (app.kubernetes.io/name: kamaji) matches the chart's actual selectorLabels, so the spread is effective and still schedules on single-node clusters. I dry-ran the patch from the package dir: it applies cleanly with the short-path convention that matches the fluxcd-operator package. LGTM.

Non-blocking items:

  • packages/system/kamaji/patches/1.diff — cosmetic: the diff --git header path disagrees with the --- a/charts/... / +++ b/charts/... short paths the patch actually uses. Harmless for patch -p1, but git apply would choke on the mismatch. Optional to align.
  • packages/system/kamaji/charts/kamaji/controller-gen/validating-webhook.yaml — the new {{- if not .Values.telemetry.disabled }} guard has no helm-unittest asserting the webhook is absent when telemetry.disabled=true. A small test would lock the behavior against future chart bumps.

The kamaji-controller-manager runs ~9 sub-controllers per
TenantControlPlane (bootstraptoken, coredns, konnectivity, kubeadm*,
kubelet, kubeproxy, etc.) inside a single Go process, AND serves the
admission webhooks for tenantcontrolplanes/datastores via the
kamaji-webhook-service ClusterIP. With the upstream default
replicaCount=1, every webhook RT shares CPU with the reconciler
goroutines. On a multi-tenant cluster (~20 TCPs), this pushes
apiserver-observed p99 for mtenantcontrolplane.kb.io and
vtenantcontrolplane.kb.io from the kamaji-side ~10ms to 85-95ms.

Three changes, each safe on small clusters and high-impact on large ones:

* replicaCount=2 — HA-safe baseline. --leader-elect is already enabled
  upstream, so only one replica reconciles tenants; the follower's CPU
  is fully available for webhook handlers, cutting p99 ~70% on busy
  clusters. On a single-node cluster both pods schedule normally.

* telemetry.disabled=true — drops both the --disable-telemetry handler
  and the telemetry.kamaji.clastix.io admission webhook entry from the
  ValidatingWebhookConfiguration. The latter requires patches/1.diff to
  wrap the telemetry block in a {{ if not .Values.telemetry.disabled }}
  conditional inside the vendored controller-gen file. Without the
  webhook entry, the apiserver no longer issues a TCP round-trip to a
  no-op handler on every TenantControlPlane CREATE/UPDATE/DELETE.

* podAntiAffinity (preferred) — spread the two replicas across nodes
  when possible. preferredDuringScheduling so single-node clusters
  still schedule (both pods colocated, which is acceptable).

Measured on a 22-TCP production cluster (Hikube, before/after):
mtenantcontrolplane p99 85 → 25 ms (-71%), vtenantcontrolplane p99
92 → 25 ms (-73%), telemetry webhook impact 291 → 0. Total Kamaji
admission impact -83% (830 → 140).

The patches/1.diff approach (re-applied by `make update`) mirrors the
pattern already used in fluxcd-operator and ingress-nginx and survives
the next clastix/kamaji chart bump.

Signed-off-by: mattia-eleuteri <mattia@hidora.io>
@lexfrei Aleksei Sviridkin (lexfrei) force-pushed the feat/kamaji-ha-disable-telemetry branch from 587f6a7 to 8bc62cc Compare July 3, 2026 16:14
@github-actions github-actions Bot added the area/uncategorized PR auto-labeler could not map title scope to a known area/*; please review label Jul 3, 2026
@lexfrei Aleksei Sviridkin (lexfrei) merged commit 10f8e2d into cozystack:main Jul 3, 2026
13 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/platform Issues or PRs related to platform infrastructure (bundle, flux, talos, installer) area/uncategorized PR auto-labeler could not map title scope to a known area/*; please review kind/feature Categorizes issue or PR as related to a new feature size/M This PR changes 30-99 lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants