feat: report /metrics for the OpenXLA serve path (#449 M3 Stage 2d) #485

Merged: inureyes merged 1 commit into main from feat/449-xla-serve-metrics on Jun 30, 2026

@inureyes

Summary

The XLA serve worker received the batch_metrics / batch_observability handles but populated neither, so /metrics reported all zeros for OpenXLA serving (active slots, queue depth, sequences, token throughput) even under load. This PR makes the path observable by populating both handles, mirroring how the MLX BatchScheduler populates the same metrics. Part of #449 M3 Stage 2d.

Stacked on #484 (sharded loading) for a clean incremental diff; this PR only touches the server worker plumbing.

What changed

XlaServeWorker now holds both handles and populates them:

  • BatchMetrics: active-count and queue-depth gauges each serve iteration (from the engine's active_len/pending_len), plus per-sequence completion with the generated-token count.
  • BatchObservability: record_prefill_start on admit (sequences started + prompt tokens), record_decode_step per pump with the step's token count (decode tokens + steps), record_sequence_completed on finish. The cache-pool / paged gauges stay zero (this path has neither; slots_available already conveys the live batch size).
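
The hook placement described above can be sketched as follows. This is a minimal Python illustration, not the project's code: the `ToyEngine` class, the `serve_iteration` function, and the `set_gauges` / `record_completion` method names are hypothetical, and the signatures of `record_prefill_start`, `record_decode_step`, and `record_sequence_completed` are assumed from the description rather than taken from the source.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BatchMetrics:
    """Hypothetical stand-in: two gauges plus per-sequence completions."""
    active: int = 0
    queue_depth: int = 0
    completed_token_counts: list = field(default_factory=list)

    def set_gauges(self, active, queue_depth):  # assumed method name
        self.active, self.queue_depth = active, queue_depth

    def record_completion(self, generated_tokens):  # assumed method name
        self.completed_token_counts.append(generated_tokens)


@dataclass
class BatchObservability:
    """Hypothetical stand-in: counters that only ever increase."""
    sequences_started: int = 0
    sequences_completed: int = 0
    prefill_tokens: int = 0
    decode_tokens: int = 0
    decode_steps: int = 0

    def record_prefill_start(self, prompt_tokens):
        self.sequences_started += 1
        self.prefill_tokens += prompt_tokens

    def record_decode_step(self, tokens):
        self.decode_steps += 1
        self.decode_tokens += tokens

    def record_sequence_completed(self):
        self.sequences_completed += 1


class ToyEngine:
    """Toy continuous-batching engine; not the real XLA engine."""

    def __init__(self, max_slots):
        self.max_slots = max_slots
        self.pending = deque()  # queued (prompt_tokens, max_new_tokens)
        self.active = []        # running [generated_so_far, max_new_tokens]

    def submit(self, prompt_tokens, max_new_tokens):
        self.pending.append((prompt_tokens, max_new_tokens))

    def active_len(self):
        return len(self.active)

    def pending_len(self):
        return len(self.pending)

    def admit(self):
        """Move queued requests into free slots; return their prompt sizes."""
        admitted = []
        while self.pending and len(self.active) < self.max_slots:
            prompt_tokens, budget = self.pending.popleft()
            self.active.append([0, budget])
            admitted.append(prompt_tokens)
        return admitted

    def pump(self):
        """One decode step: every active sequence emits one token."""
        for seq in self.active:
            seq[0] += 1
        emitted = len(self.active)
        finished = [seq[0] for seq in self.active if seq[0] >= seq[1]]
        self.active = [seq for seq in self.active if seq[0] < seq[1]]
        return emitted, finished


def serve_iteration(engine, metrics, obs):
    """One serve iteration with the metric hooks placed as described above."""
    for prompt_tokens in engine.admit():
        obs.record_prefill_start(prompt_tokens)   # on admit
    if engine.active_len() > 0:
        emitted, finished = engine.pump()
        obs.record_decode_step(emitted)           # per pump
        for generated in finished:
            metrics.record_completion(generated)  # on finish
            obs.record_sequence_completed()
    # Gauges are refreshed every iteration from the engine's own counts.
    metrics.set_gauges(engine.active_len(), engine.pending_len())
```

Submitting three requests with prompts of 5, 4, and 5 tokens and a 16-token budget each, then calling `serve_iteration` until the toy engine drains, leaves `sequences_started` and `sequences_completed` at 3, `prefill_tokens` at 14, and `decode_tokens` at 48. The toy emits exactly one token per active sequence per step, so its step count is not meant to reproduce the step count of the real engine.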

Validation (E2E, CUDA on GB10)

With --metrics on Qwen2.5-0.5B, three concurrent /v1/completions requests (prompts of 5/4/5 tokens, 16 generated tokens each):

| metric | before | after |
| --- | --- | --- |
| mlxcel_batch_sequences_started | 0 | 3 |
| mlxcel_batch_sequences_completed | 0 | 3 |
| mlxcel_batch_prefill_tokens_total | 0 | 14 (5+4+5) |
| mlxcel_batch_decode_tokens_total | 0 | 48 (3×16) |
| mlxcel_batch_decode_steps_total | 0 | 15 (continuous batching) |
| mlxcel_slots_available / mlxcel_queue_depth | 4 / 0 | 4 / 0 (drained) |
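
A before/after comparison like the one above can be scripted by diffing two scrapes of /metrics. The sketch below assumes the endpoint serves Prometheus-style text lines of the form `name value` with no trailing timestamps, which the source does not state. `parse_metrics` and `metric_deltas` are hypothetical helper names, and the two sample scrapes are built from the values in the table rather than captured from a real server.

```python
def parse_metrics(text):
    """Parse 'name value' sample lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so the value is always the final field.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def metric_deltas(before_text, after_text):
    """Return after - before for every metric present in the second scrape."""
    before = parse_metrics(before_text)
    after = parse_metrics(after_text)
    return {name: after[name] - before.get(name, 0.0) for name in after}


# Sample scrapes reconstructed from the table (assumed text layout).
BEFORE = """\
mlxcel_batch_sequences_started 0
mlxcel_batch_sequences_completed 0
mlxcel_batch_prefill_tokens_total 0
mlxcel_batch_decode_tokens_total 0
mlxcel_batch_decode_steps_total 0
mlxcel_slots_available 4
mlxcel_queue_depth 0
"""

AFTER = """\
mlxcel_batch_sequences_started 3
mlxcel_batch_sequences_completed 3
mlxcel_batch_prefill_tokens_total 14
mlxcel_batch_decode_tokens_total 48
mlxcel_batch_decode_steps_total 15
mlxcel_slots_available 4
mlxcel_queue_depth 0
"""

# Totals implied by the workload: prompts of 5/4/5 tokens, 16 tokens each.
PROMPT_TOKENS = [5, 4, 5]
GENERATED_PER_REQUEST = 16

deltas = metric_deltas(BEFORE, AFTER)
prefill_ok = deltas["mlxcel_batch_prefill_tokens_total"] == sum(PROMPT_TOKENS)
decode_ok = (deltas["mlxcel_batch_decode_tokens_total"]
             == len(PROMPT_TOKENS) * GENERATED_PER_REQUEST)
```

The counters are compared as deltas, while the two gauges are compared by absolute value: a gauge that returns to its idle reading after the requests drain shows a delta of zero, which is the expected result here.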

The MLX serving path is unchanged.

Refs #449.

@inureyes inureyes added area:architecture Architecture and code structure changes priority:medium Medium priority status:done Completed type:enhancement New features, capabilities, or significant additions labels Jun 29, 2026
@inureyes inureyes force-pushed the feat/449-xla-serve-metrics branch from 564a460 to e39c3f0 Compare June 29, 2026 17:48
@inureyes inureyes changed the base branch from feat/449-xla-sharded-safetensors to main June 29, 2026 17:48
The XLA serve worker received the `batch_metrics` / `batch_observability`
handles but populated neither, so the `/metrics` endpoint reported all
zeros for OpenXLA serving (active slots, queue depth, sequences, token
throughput) even under load. The path was operationally blind.

Thread both handles into `XlaServeWorker` and populate them the same way
the MLX `BatchScheduler` does:

- `BatchMetrics`: the active-count and queue-depth gauges each serve
  iteration (from the engine's `active_len` / `pending_len`), and a
  per-sequence completion with its generated-token count.
- `BatchObservability`: `record_prefill_start` on admit (sequences
  started + prompt tokens), `record_decode_step` per pump with the step's
  token count (decode tokens + steps), and `record_sequence_completed` on
  finish. The cache-pool / paged gauges stay zero (this path has neither;
  `slots_available` already conveys the live batch size).

Validation (E2E, CUDA on GB10): with `--metrics` on Qwen2.5-0.5B, three
concurrent `/v1/completions` (prompts of 5/4/5 tokens, 16 tokens each)
move the metrics exactly as expected: `sequences_started` and
`sequences_completed` 0 -> 3, `prefill_tokens` 0 -> 14 (5+4+5),
`decode_tokens` 0 -> 48 (3x16), `decode_steps` 15 (continuous batching),
and the `slots_available` / `queue_depth` gauges return to 4 / 0 once
drained. The MLX serving path is unchanged.
@inureyes inureyes force-pushed the feat/449-xla-serve-metrics branch from e39c3f0 to f68373c Compare June 29, 2026 17:50
@inureyes inureyes merged commit b6b1a0c into main Jun 30, 2026
5 checks passed
@inureyes inureyes deleted the feat/449-xla-serve-metrics branch June 30, 2026 12:29