Skip to content

Rate limiting observability (metrics and tracing) PR A#5701

Open
Sanskarzz wants to merge 1 commit into
stacklok:mainfrom
Sanskarzz:ratelimiting-olly
Open

Rate limiting observability (metrics and tracing) PR A#5701
Sanskarzz wants to merge 1 commit into
stacklok:mainfrom
Sanskarzz:ratelimiting-olly

Conversation

@Sanskarzz

@Sanskarzz Sanskarzz commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Emit rate-limit decisions, Redis errors, and Lua check latency from the shared
    limiter used by both MCPServer and VirtualMCPServer.
  • Keep scope and operation metadata private to the limiter instead of expanding
    Decision or RateLimitedError.
  • Count allowed decisions once per applicable bucket and rejected decisions
    only for the first bucket rejected by the atomic Redis check.
  • Add ManualReader unit coverage, vMCP Prometheus E2E validation, and metric
    reference documentation.

Part of #4553

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (focused package tests listed below)
  • E2E tests (task test-e2e)
  • Linting (focused package lint listed below)
  • Manual testing (describe below)

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Changes

File Change
pkg/ratelimit/limiter.go Associates private check metadata and emits metrics at the Redis decision point.
pkg/ratelimit/observability.go Defines instruments, recording behavior, and bounded Redis error classification.
pkg/ratelimit/observability_test.go Verifies metric names, labels, counts, error paths, latency, and no-op behavior.
test/e2e/thv-operator/virtualmcp/virtualmcp_rate_limiting_test.go Scrapes Prometheus after allowed and rejected vMCP traffic.
docs/observability.md Documents rate-limit metric contracts and counting semantics.
go.mod Marks the existing Prometheus parser modules as direct E2E test dependencies.

Does this introduce a user-facing change?

Yes. Operators with telemetry enabled can observe rate-limit decisions, Redis errors, and Redis Lua check latency. Rate-limit enforcement, responses, Redis keys, retry timing, and fail-open behavior are unchanged.

Implementation plan

Approved implementation plan
  1. Keep scope and operation metadata private on each applicable check.
  2. Build OpenTelemetry instruments once when the limiter is constructed.
  3. Record latency and Redis errors around the existing atomic Lua call.
  4. Record allowed decisions per applicable bucket and rejection for only the
    first rejected bucket.
  5. Validate the instruments with an OTel ManualReader and the vMCP Prometheus
    endpoint.

Follow-up PRs

This is the first PR in the rate-limit observability work. To keep each review
focused, the remaining changes will be opened sequentially after the preceding
PR merges:

  1. Request-span attributes

    Annotate the existing request span rather than creating a second span.
    Allowed requests will report rate_limit.decision=allowed,
    rate_limit.rejected_by=none, and rate_limit.fail_open=false. Rejected
    requests will add a private rejection identifier to limitCheck at its
    point of use. MCPServer middleware ordering will move telemetry outside rate
    limiting so the ambient request span is active.

  2. Fail-open observability

    Add toolhive_rate_limit_fail_open and set the fail-open request-span
    attributes after a Redis error reaches the shared enforcement adapter.
    Focused tests will verify one increment per affected request and continued
    delegation through both HTTP and vMCP paths.

Each follow-up will contain the tests and documentation as needed.

Special notes for reviewers

The private metadata uses the rejected index already returned by
bucket.ConsumeAll; it does not perform another Redis check or introduce
parallel HTTP/vMCP instrumentation. Decision, RateLimitedError, and the vMCP
decorator public behavior remain unchanged.

@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label Jul 1, 2026
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.87879% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.65%. Comparing base (01d4c9a) to head (9f421bb).

Files with missing lines Patch % Lines
pkg/ratelimit/observability.go 83.72% 7 Missing and 7 partials ⚠️
pkg/ratelimit/limiter.go 95.65% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5701      +/-   ##
==========================================
+ Coverage   70.60%   70.65%   +0.04%     
==========================================
  Files         681      682       +1     
  Lines       68804    68921     +117     
==========================================
+ Hits        48579    48696     +117     
+ Misses      16675    16669       -6     
- Partials     3550     3556       +6     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jerm-dro

jerm-dro commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

suggestion: Consider instrumenting Allow directly instead of surfacing RejectedBy through the public API.

This PR's value is entirely in service of the follow-up metrics/tracing PRs — but everything those PRs need is already right here, at the one point both enforcement paths funnel through. At the rejection branch the rejected bucket is checks[rejectedIdx], the Redis error is in scope, and the request context is available. So the decision/error signal can be emitted at the source rather than threaded out as a new field on Decision and RateLimitedError:

if rejectedIdx >= 0 {
    src := checks[rejectedIdx].rejectedBy
    l.telemetry.recordRejection(ctx, src) // meter + span left to the telemetry layer
    return &Decision{Allowed: false, RetryAfter: buckets[rejectedIdx].RetryAfter()}, nil
}

How the span is handled is open — annotate the ambient request span from ctx, or start a dedicated rate-limit span; the meter can be injected at NewLimiter (nil = no-op). The point is only that the emission has everything it needs here, so it needn't be plumbed to a caller.

Two reasons this seems preferable to the metadata approach:

  1. Single emission point. There are two enforcement call sites — the vMCP decorator (pkg/vmcp/ratelimit/decorator.go) and the MCPServer middleware (pkg/ratelimit/middleware.go). Emitting from Allow covers both; surfacing RejectedBy means the follow-up PRs emit the same signals in both callers and keep them in sync.
  2. No public API growth. RejectionSource, Decision.RejectedBy, and RateLimitedError.RejectedBy become a compatibility surface that only exists to carry a value to a caller that then emits it. Kept internal, it stays an implementation detail.

Since fail-open is out of scope for these signals, the one argument I can see for the current shape is keeping pkg/ratelimit free of a telemetry dependency and Allow a pure decision function. Is that a deliberate constraint — core libraries stay telemetry-free, all instrumentation at boundaries? If so, this shape is the consistent choice and the duplication is the accepted price; worth a line in the PR saying so. If not, instrumenting Allow directly looks like the leaner path and would let this PR fold into the metrics one.

@Sanskarzz

Copy link
Copy Markdown
Contributor Author

Thanks, agreed. Keeping pkg/ratelimit telemetry-free was not a deliberate constraint; I was optimizing for a small preparatory PR and ended up introducing public plumbing that has no independent domain value.

I’ll rework this PR to remove RejectionSource and both RejectedBy fields, keep the bucket source private in limitCheck, and fold the core metrics into this PR. Decision, Redis-error, and Lua-latency signals will be emitted directly from the concrete limiter where the rejected index, error, timing boundary, and context are already available.

For tracing, I’ll annotate the ambient request span in a focused follow-up rather than introduce a dedicated rate-limit span. MCPServer also needs its telemetry middleware moved outside rate limiting so that span is active. Fail-open will remain a separate edge-level change because the raw limiter cannot know whether a caller actually proceeds after an infrastructure error.

I’ll update the implementation plan and PR description accordingly.

@Sanskarzz Sanskarzz force-pushed the ratelimiting-olly branch from 733dc33 to 437d289 Compare July 3, 2026 06:01
@github-actions github-actions Bot added size/S Small PR: 100-299 lines changed size/L Large PR: 600-999 lines changed and removed size/S Small PR: 100-299 lines changed labels Jul 3, 2026
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@Sanskarzz Sanskarzz force-pushed the ratelimiting-olly branch from 437d289 to 9f421bb Compare July 3, 2026 06:16
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Jul 3, 2026
@Sanskarzz

Copy link
Copy Markdown
Contributor Author

@jerm-dro I've updated this PR with the suggested changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants