fix(vllm): stop forcing --enforce-eager on all LLM endpoints#851
fix(vllm): stop forcing --enforce-eager on all LLM endpoints#851olliestanley wants to merge 1 commit into
Conversation
diazagasatya
left a comment
There was a problem hiding this comment.
Reviewed against the code — solid fix, nice catch on the global regression. A couple of things verified + one thing to confirm before I approve.
Verified
- The removed block is the only place
enforce_eageris force-set — the only other references are the DTO default incommon/dtos/llms/vllm.pyand explicit model configs (e.g.inference/vllm/examples/v2/llama-3.2-vision/config.jsonsetsenforce_eager: true). So removal cleanly hands control back to explicitadditional_args/ vLLM's default. 👍 - The "always true" analysis checks out:
infer_addition_engine_args_from_model_nameseedsgpu_memory_utilization = 0.9(0.95 for ≥70B), sogpu_memory_utilization is not Nonewas effectively always true → eager forced on every endpoint. Accurate. - Test update is correct (not-set →
--enforce-eagerabsent → vLLM hybrid/CUDA-graph default).
One thing to confirm before approval (the OOM risk)
#524 originally forced eager specifically for llama-3-70b at 0.95 mem-util, because at that utilization there's little headroom for CUDA-graph capture → OOM. ≥70B models still default to 0.95 (infer_addition_engine_args_from_model_name), so after this change they re-enable CUDA graphs at 0.95 and could OOM on startup unless they explicitly set enforce_eager: true. You've confirmed the vision/qwen entries set it — can you confirm the ≥70B models (esp. llama-3-70b) also set enforce_eager: true explicitly (or lower their mem-util)? Otherwise this re-introduces exactly the OOM #524 guarded against for the large models. Since the change only takes effect on bundle recreation, rollout is staggered — worth watching ≥70B endpoints as they recycle.
Minor nit
Consider adding a test that an explicit enforce_eager: true still emits --enforce-eager — current test only covers the auto-add-removed (not-present) path, and that explicit path is now the supported way for models that need eager.
Net: approve once the ≥70B enforce_eager question is confirmed — the perf win (global eager was disabling CUDA graphs / hurting decode everywhere) is well worth it.
5ca087c to
e4a588c
Compare
The vLLM bundle command set enforce_eager=True whenever gpu_memory_utilization was set. Since infer_addition_engine_args_from_model_name always returns a default gpu_memory_utilization (0.9, or 0.95 for >=70B models), this forced eager mode on every llmengine vLLM endpoint and silently overrode any explicit enforce_eager=False from a user. Eager mode disables CUDA graphs, which massively slows down decode. The override originated as a targeted workaround in #524 (hardcoded for llama-3-70b, paired with --gpu-memory-utilization 0.95 to avoid OOM during CUDA graph capture at high memory utilization) and was unintentionally generalized to all models in the #637 client-data-types refactor. Restrict the auto-injection to the regime it was actually meant for: only default eager mode on when gpu_memory_utilization >= 0.95, where there is little headroom for CUDA graph capture. Below that (the default 0.9 for <70B models), CUDA graphs stay enabled for faster decode. An explicit enforce_eager from the caller (True or False) is always respected.
e4a588c to
38e8621
Compare
|
@diazagasatya I'm not sure if the OOM issue would even be relevant with modern vLLM versions and more recent large models. ~Nobody should really be deploying models without graph capture in 2026. In interests of backwards compatibility, I'll re-enable this guard by default in cases where util target |
Summary
Removes the code that unconditionally adds
--enforce-eagerto the vLLM startup command for every endpoint deployed via thellmengineprovider pathway.In
_create_vllm_bundle_command(llm_model_endpoint_use_cases.py) we had:Because
infer_addition_engine_args_from_model_name()always seeds a defaultgpu_memory_utilization(0.9, or 0.95 for ≥70B models), this condition was effectively always true. The net effect:--enforce-eager.enforce_eager: falsehad it silently clobbered back totrue— there was no way to turn eager mode off.--enforce-eagerdisables CUDA graphs and forces eager-mode PyTorch, which massively slows down decode. We were paying that cost globally.Historical context
llama-3-70b, as a coupled pair:--gpu-memory-utilization 0.95 --enforce-eager. The intent was a targeted OOM workaround — at very high memory utilization there is little headroom for vLLM's CUDA-graph capture, so eager mode was forced alongside the high mem-util for that one large model.gpu_memory_utilizationis set, forceenforce_eager." Combined with the always-on default mem-util, a 70B-specific safety hack quietly became a global default that disabled CUDA graphs on all endpoints.Change
Remove the auto-injection. Endpoints now use vLLM's default hybrid eager + CUDA graph mode (better decode throughput). Models that genuinely need eager mode (e.g. several vision / qwen entries in the internal model zoo) already set
enforce_eager: trueexplicitly viaadditional_argsand are unaffected.Risk
Re-enabling CUDA graphs requires a little extra VRAM for graph capture. At the default
gpu_memory_utilizationof 0.9–0.95 this is the scenario the original workaround guarded against, so watch for startup OOM on memory-tight models after rollout. Mitigations if needed: lower the default mem-util, or have callers setenforce_eager: truefor specific models.Testing
test_update_vllm_force_bundle_recreation_preserves_legacy_vllm_argsto reflect thatenforce_eageris no longer auto-added.pytest tests/unit/domain/test_llm_use_cases.py— 74 passed.Greptile Summary
This PR removes a historical accident where
--enforce-eagerwas silently injected on every vLLM endpoint because any non-Nonegpu_memory_utilization(which is always populated by the default inference helper) triggered the flag. The fix narrows the auto-injection to only the>= 0.95utilization band — where CUDA-graph capture is genuinely VRAM-tight — and skips it entirely when the caller has already supplied an explicitenforce_eagervalue._create_vllm_bundle_commandnow guards onenforce_eager is None and gpu_memory_utilization >= 0.95, preserving the original 70B OOM workaround while re-enabling CUDA graphs for all lower-utilization endpoints.gpu_memory_utilization=0.75no longer produces--enforce-eager, with the explicitenforce_eager: Truefixture value removed accordingly.Confidence Score: 5/5
Safe to merge; the change is a targeted reduction of scope that restores vLLM default hybrid CUDA-graph mode for endpoints below 0.95 GPU utilization while keeping the existing OOM guard for large models.
The logic change is small and well-reasoned: one compound condition replaces an over-broad one, and the new condition preserves the original safety guard exactly at the threshold it was designed for. Models that explicitly opt into eager mode via additional_args are unaffected.
No files require special attention; both changed files are straightforward and internally consistent.
Important Files Changed
gpu_memory_utilization is not None → enforce_eager = Truewith a targeted check that only auto-enables eager mode when the caller didn't set an explicit value and utilization is ≥ 0.95, restoring CUDA-graph mode for all endpoints below that threshold.enforce_eager: Truefrom the fixture input and flips the assertion fromintonot infor--enforce-eager, correctly reflecting that a 0.75 utilization no longer triggers eager mode.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[_create_vllm_bundle_command] --> B{is_worker?} B -- Yes --> Z[Skip vllm_args tuning] B -- No --> C[Set tensor_parallel_size] C --> D{enforce_eager is None?} D -- No\ncaller set it explicitly --> E[Respect caller's value] D -- Yes --> F{gpu_memory_utilization\nis not None?} F -- No --> G[Leave enforce_eager = None\nCUDA graphs ON] F -- Yes --> H{gpu_memory_utilization\n>= 0.95?} H -- No\n< 0.95 e.g. default 0.9 --> G H -- Yes\ne.g. 70B default 0.95 --> I[enforce_eager = True\nCUDA graphs OFF] G --> J[Build vllm_cmd] I --> J E --> J%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[_create_vllm_bundle_command] --> B{is_worker?} B -- Yes --> Z[Skip vllm_args tuning] B -- No --> C[Set tensor_parallel_size] C --> D{enforce_eager is None?} D -- No\ncaller set it explicitly --> E[Respect caller's value] D -- Yes --> F{gpu_memory_utilization\nis not None?} F -- No --> G[Leave enforce_eager = None\nCUDA graphs ON] F -- Yes --> H{gpu_memory_utilization\n>= 0.95?} H -- No\n< 0.95 e.g. default 0.9 --> G H -- Yes\ne.g. 70B default 0.95 --> I[enforce_eager = True\nCUDA graphs OFF] G --> J[Build vllm_cmd] I --> J E --> JReviews (2): Last reviewed commit: "fix(vllm): only force --enforce-eager at..." | Re-trigger Greptile