Rate limiting observability (metrics and tracing) PR A by Sanskarzz · Pull Request #5701 · stacklok/toolhive

Sanskarzz · 2026-07-01T20:16:52Z

Summary

Emit rate-limit decisions, Redis errors, and Lua check latency from the shared
limiter used by both MCPServer and VirtualMCPServer.
Keep scope and operation metadata private to the limiter instead of expanding
Decision or RateLimitedError.
Count allowed decisions once per applicable bucket and rejected decisions
only for the first bucket rejected by the atomic Redis check.
Add ManualReader unit coverage, vMCP Prometheus E2E validation, and metric
reference documentation.

Part of #4553

Type of change

Test plan

Unit tests (focused package tests listed below)
E2E tests (task test-e2e)
Linting (focused package lint listed below)
Manual testing (describe below)

API Compatibility

This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Changes

File	Change
`pkg/ratelimit/limiter.go`	Associates private check metadata and emits metrics at the Redis decision point.
`pkg/ratelimit/observability.go`	Defines instruments, recording behavior, and bounded Redis error classification.
`pkg/ratelimit/observability_test.go`	Verifies metric names, labels, counts, error paths, latency, and no-op behavior.
`test/e2e/thv-operator/virtualmcp/virtualmcp_rate_limiting_test.go`	Scrapes Prometheus after allowed and rejected vMCP traffic.
`docs/observability.md`	Documents rate-limit metric contracts and counting semantics.
`go.mod`	Marks the existing Prometheus parser modules as direct E2E test dependencies.

Does this introduce a user-facing change?

Yes. Operators with telemetry enabled can observe rate-limit decisions, Redis errors, and Redis Lua check latency. Rate-limit enforcement, responses, Redis keys, retry timing, and fail-open behavior are unchanged.

Implementation plan

Approved implementation plan

Keep scope and operation metadata private on each applicable check.
Build OpenTelemetry instruments once when the limiter is constructed.
Record latency and Redis errors around the existing atomic Lua call.
Record allowed decisions per applicable bucket and rejection for only the
first rejected bucket.
Validate the instruments with an OTel ManualReader and the vMCP Prometheus
endpoint.

Follow-up PRs

This is the first PR in the rate-limit observability work. To keep each review
focused, the remaining changes will be opened sequentially after the preceding
PR merges:

Request-span attributes

Annotate the existing request span rather than creating a second span.
Allowed requests will report rate_limit.decision=allowed,
rate_limit.rejected_by=none, and rate_limit.fail_open=false. Rejected
requests will add a private rejection identifier to limitCheck at its
point of use. MCPServer middleware ordering will move telemetry outside rate
limiting so the ambient request span is active.
Fail-open observability

Add toolhive_rate_limit_fail_open and set the fail-open request-span
attributes after a Redis error reaches the shared enforcement adapter.
Focused tests will verify one increment per affected request and continued
delegation through both HTTP and vMCP paths.

Each follow-up will contain the tests and documentation as needed.

Special notes for reviewers

The private metadata uses the rejected index already returned by
bucket.ConsumeAll; it does not perform another Redis check or introduce
parallel HTTP/vMCP instrumentation. Decision, RateLimitedError, and the vMCP
decorator public behavior remain unchanged.

codecov · 2026-07-01T20:23:48Z

Codecov Report

❌ Patch coverage is 87.87879% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.65%. Comparing base (01d4c9a) to head (9f421bb).

Files with missing lines	Patch %	Lines
pkg/ratelimit/observability.go	83.72%	7 Missing and 7 partials ⚠️
pkg/ratelimit/limiter.go	95.65%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5701      +/-   ##
==========================================
+ Coverage   70.60%   70.65%   +0.04%     
==========================================
  Files         681      682       +1     
  Lines       68804    68921     +117     
==========================================
+ Hits        48579    48696     +117     
+ Misses      16675    16669       -6     
- Partials     3550     3556       +6

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jerm-dro · 2026-07-02T21:00:54Z

suggestion: Consider instrumenting Allow directly instead of surfacing RejectedBy through the public API.

This PR's value is entirely in service of the follow-up metrics/tracing PRs — but everything those PRs need is already right here, at the one point both enforcement paths funnel through. At the rejection branch the rejected bucket is checks[rejectedIdx], the Redis error is in scope, and the request context is available. So the decision/error signal can be emitted at the source rather than threaded out as a new field on Decision and RateLimitedError:

if rejectedIdx >= 0 {
    src := checks[rejectedIdx].rejectedBy
    l.telemetry.recordRejection(ctx, src) // meter + span left to the telemetry layer
    return &Decision{Allowed: false, RetryAfter: buckets[rejectedIdx].RetryAfter()}, nil
}

How the span is handled is open — annotate the ambient request span from ctx, or start a dedicated rate-limit span; the meter can be injected at NewLimiter (nil = no-op). The point is only that the emission has everything it needs here, so it needn't be plumbed to a caller.

Two reasons this seems preferable to the metadata approach:

Single emission point. There are two enforcement call sites — the vMCP decorator (pkg/vmcp/ratelimit/decorator.go) and the MCPServer middleware (pkg/ratelimit/middleware.go). Emitting from Allow covers both; surfacing RejectedBy means the follow-up PRs emit the same signals in both callers and keep them in sync.
No public API growth. RejectionSource, Decision.RejectedBy, and RateLimitedError.RejectedBy become a compatibility surface that only exists to carry a value to a caller that then emits it. Kept internal, it stays an implementation detail.

Since fail-open is out of scope for these signals, the one argument I can see for the current shape is keeping pkg/ratelimit free of a telemetry dependency and Allow a pure decision function. Is that a deliberate constraint — core libraries stay telemetry-free, all instrumentation at boundaries? If so, this shape is the consistent choice and the duplication is the accepted price; worth a line in the PR saying so. If not, instrumenting Allow directly looks like the leaner path and would let this PR fold into the metrics one.

Sanskarzz · 2026-07-02T22:55:55Z

Thanks, agreed. Keeping pkg/ratelimit telemetry-free was not a deliberate constraint; I was optimizing for a small preparatory PR and ended up introducing public plumbing that has no independent domain value.

I’ll rework this PR to remove RejectionSource and both RejectedBy fields, keep the bucket source private in limitCheck, and fold the core metrics into this PR. Decision, Redis-error, and Lua-latency signals will be emitted directly from the concrete limiter where the rejected index, error, timing boundary, and context are already available.

For tracing, I’ll annotate the ambient request span in a focused follow-up rather than introduce a dedicated rate-limit span. MCPServer also needs its telemetry middleware moved outside rate limiting so that span is active. Fail-open will remain a separate edge-level change because the raw limiter cannot know whether a caller actually proceeds after an infrastructure error.

I’ll update the implementation plan and PR description accordingly.

Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>

Sanskarzz · 2026-07-03T13:17:59Z

@jerm-dro I've updated this PR with the suggested changes.

Sanskarzz requested review from ChrisJBurns, JAORMX, amirejaz, jerm-dro, jhrozek and tgrunnagle as code owners July 1, 2026 20:16

github-actions Bot added the size/S Small PR: 100-299 lines changed label Jul 1, 2026

Sanskarzz force-pushed the ratelimiting-olly branch from 733dc33 to 437d289 Compare July 3, 2026 06:01

Sanskarzz requested review from blkt, rdimitrov and reyortiz3 as code owners July 3, 2026 06:01

github-actions Bot added size/S Small PR: 100-299 lines changed size/L Large PR: 600-999 lines changed and removed size/S Small PR: 100-299 lines changed labels Jul 3, 2026

Add rate limit observability metrics

9f421bb

Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>

Sanskarzz force-pushed the ratelimiting-olly branch from 437d289 to 9f421bb Compare July 3, 2026 06:16

github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Jul 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rate limiting observability (metrics and tracing) PR A#5701

Rate limiting observability (metrics and tracing) PR A#5701
Sanskarzz wants to merge 1 commit into
stacklok:mainfrom
Sanskarzz:ratelimiting-olly

Sanskarzz commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

jerm-dro commented Jul 2, 2026

Uh oh!

Sanskarzz commented Jul 2, 2026

Uh oh!

Sanskarzz commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Sanskarzz commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of change

Test plan

API Compatibility

Changes

Does this introduce a user-facing change?

Implementation plan

Follow-up PRs

Special notes for reviewers

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jerm-dro commented Jul 2, 2026

Uh oh!

Sanskarzz commented Jul 2, 2026

Uh oh!

Sanskarzz commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Sanskarzz commented Jul 1, 2026 •

edited

Loading

codecov Bot commented Jul 1, 2026 •

edited

Loading