feat: report /metrics for the OpenXLA serve path (#449 M3 Stage 2d) #485
Merged
The XLA serve worker received the `batch_metrics` / `batch_observability` handles but populated neither, so the `/metrics` endpoint reported all zeros for OpenXLA serving (active slots, queue depth, sequences, token throughput) even under load. The path was operationally blind.

Thread both handles into `XlaServeWorker` and populate them the same way the MLX `BatchScheduler` does:

- `BatchMetrics`: the active-count and queue-depth gauges each serve iteration (from the engine's `active_len` / `pending_len`), and a per-sequence completion with its generated-token count.
- `BatchObservability`: `record_prefill_start` on admit (sequences started + prompt tokens), `record_decode_step` per pump with the step's token count (decode tokens + steps), and `record_sequence_completed` on finish.

The cache-pool / paged gauges stay zero (this path has neither; `slots_available` already conveys the live batch size).

Validation (E2E, CUDA on GB10): with `--metrics` on Qwen2.5-0.5B, three concurrent `/v1/completions` (prompts of 5/4/5 tokens, 16 tokens each) move the metrics exactly as expected:

- `sequences_started` and `sequences_completed` 0 -> 3
- `prefill_tokens` 0 -> 14 (5+4+5)
- `decode_tokens` 0 -> 48 (3x16)
- `decode_steps` 15 (continuous batching)
- the `slots_available` / `queue_depth` gauges return to 4 / 0 once drained

The MLX serving path is unchanged.
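As a rough illustration of the bookkeeping described above, the sketch below models the two handles as plain Python counters and drives them from a toy serve loop. It is not the project's code: `ToyEngine`, `serve_iteration`, `set_gauges`, `record_completion`, and all method signatures are hypothetical. Only the names `BatchMetrics`, `BatchObservability`, `active_len`, `pending_len`, `record_prefill_start`, `record_decode_step`, and `record_sequence_completed` come from the description. The toy engine emits one token per active sequence per pump, which is an assumption and need not match the real engine's step accounting.

```python
from dataclasses import dataclass, field


@dataclass
class BatchMetrics:
    """Hypothetical stand-in: two gauges plus per-sequence completions."""
    active: int = 0
    queue_depth: int = 0
    completed: int = 0
    generated_tokens: int = 0

    def set_gauges(self, active_len: int, pending_len: int) -> None:
        self.active, self.queue_depth = active_len, pending_len

    def record_completion(self, generated: int) -> None:
        self.completed += 1
        self.generated_tokens += generated


@dataclass
class BatchObservability:
    """Hypothetical stand-in: monotonically increasing counters."""
    sequences_started: int = 0
    sequences_completed: int = 0
    prefill_tokens: int = 0
    decode_tokens: int = 0
    decode_steps: int = 0

    def record_prefill_start(self, prompt_tokens: int) -> None:
        self.sequences_started += 1
        self.prefill_tokens += prompt_tokens

    def record_decode_step(self, step_tokens: int) -> None:
        self.decode_steps += 1
        self.decode_tokens += step_tokens

    def record_sequence_completed(self) -> None:
        self.sequences_completed += 1


@dataclass
class ToyEngine:
    """Assumed engine model: one token per active sequence per pump."""
    max_slots: int
    pending: list = field(default_factory=list)  # (prompt_tokens, max_new)
    active: list = field(default_factory=list)   # [generated, max_new]

    def active_len(self) -> int:
        return len(self.active)

    def pending_len(self) -> int:
        return len(self.pending)


def serve_iteration(engine, metrics, obs):
    # Admit queued requests while slots are free; count the prefill.
    while engine.pending and engine.active_len() < engine.max_slots:
        prompt_tokens, max_new = engine.pending.pop(0)
        engine.active.append([0, max_new])
        obs.record_prefill_start(prompt_tokens)
    # Pump one decode step and record how many tokens it produced.
    if engine.active:
        for seq in engine.active:
            seq[0] += 1
        obs.record_decode_step(len(engine.active))
    # Retire finished sequences and record their completion.
    remaining = []
    for generated, max_new in engine.active:
        if generated >= max_new:
            metrics.record_completion(generated)
            obs.record_sequence_completed()
        else:
            remaining.append([generated, max_new])
    engine.active = remaining
    # Refresh the gauges once per iteration from the engine's view.
    metrics.set_gauges(engine.active_len(), engine.pending_len())


engine = ToyEngine(max_slots=2)
engine.pending += [(3, 2), (2, 2), (4, 1)]  # made-up workload
metrics, obs = BatchMetrics(), BatchObservability()
while engine.active_len() or engine.pending_len():
    serve_iteration(engine, metrics, obs)
print(obs, metrics)
```

With this made-up workload the loop drains in three iterations, ending with 3 sequences started and completed, 9 prefill tokens, 5 decode tokens over 3 decode steps, and both gauges back at zero.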
Summary

The XLA serve worker received the `batch_metrics` / `batch_observability` handles but populated neither, so `/metrics` reported all zeros for OpenXLA serving (active slots, queue depth, sequences, token throughput) even under load. This makes the path observable, mirroring how the MLX `BatchScheduler` populates the same metrics. Part of #449 M3 Stage 2d.

What changed

`XlaServeWorker` now holds both handles and populates them:

- `BatchMetrics`: active-count and queue-depth gauges each serve iteration (from the engine's `active_len` / `pending_len`), plus per-sequence completion with the generated-token count.
- `BatchObservability`: `record_prefill_start` on admit (sequences started + prompt tokens), `record_decode_step` per pump with the step's token count (decode tokens + steps), `record_sequence_completed` on finish.

The cache-pool / paged gauges stay zero (this path has neither; `slots_available` already conveys the live batch size).

Validation (E2E, CUDA on GB10)

With `--metrics` on Qwen2.5-0.5B, three concurrent `/v1/completions` (prompts of 5/4/5 tokens, 16 tokens each):

| Metric | Observed |
| --- | --- |
| `mlxcel_batch_sequences_started` | 0 -> 3 |
| `mlxcel_batch_sequences_completed` | 0 -> 3 |
| `mlxcel_batch_prefill_tokens_total` | 0 -> 14 (5+4+5) |
| `mlxcel_batch_decode_tokens_total` | 0 -> 48 (3x16) |
| `mlxcel_batch_decode_steps_total` | 15 (continuous batching) |
| `mlxcel_slots_available` / `mlxcel_queue_depth` | return to 4 / 0 once drained |

The MLX serving path is unchanged.
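The token counters in this validation run follow directly from the workload, which a few lines of arithmetic make explicit. The sketch below derives the expected counter deltas from the stated prompts and generation length and compares them with the values reported for the run. `counter_deltas` and the snapshot dictionaries are hypothetical helpers, not part of the project. `mlxcel_batch_decode_steps_total` is left out because its value depends on how the engine schedules steps, not on the workload alone.

```python
# Workload from the validation run: three prompts, 16 generated tokens each.
prompt_lengths = [5, 4, 5]
tokens_per_request = 16

# Counter deltas the workload should produce.
expected = {
    "mlxcel_batch_sequences_started": len(prompt_lengths),
    "mlxcel_batch_sequences_completed": len(prompt_lengths),
    "mlxcel_batch_prefill_tokens_total": sum(prompt_lengths),
    "mlxcel_batch_decode_tokens_total": len(prompt_lengths) * tokens_per_request,
}


def counter_deltas(before, after):
    """Change of each counter between two snapshots (hypothetical helper)."""
    return {name: after[name] - before.get(name, 0) for name in after}


# Snapshots filled in with the values reported for this run.
before = {name: 0 for name in expected}
after = {
    "mlxcel_batch_sequences_started": 3,
    "mlxcel_batch_sequences_completed": 3,
    "mlxcel_batch_prefill_tokens_total": 14,
    "mlxcel_batch_decode_tokens_total": 48,
}

deltas = counter_deltas(before, after)
mismatches = {
    name: (expected[name], deltas[name])
    for name in expected
    if deltas[name] != expected[name]
}
print(mismatches or "all counters match the workload")
```

The same comparison could be run against two real scrapes of `/metrics` taken before and after the requests, once the snapshots are parsed into dictionaries.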
Refs #449.