[Fix] Adds CUDA graph stats to execution state (#6331)
### What this PR does / why we need it?
Adds a CUDA graph profiling stats field to the execution state and
updates the NPU model runner to set, unpack, and forward those stats
during execution. This preserves CUDA graph metrics across state
transitions, improving observability for later use and diagnostics.
### Does this PR introduce _any_ user-facing change?
Enable this by setting:
```python
llm = LLM(
    ...,
    disable_log_stats=False,
    cudagraph_metrics=True,
    ...,
)
```
or pass `--cudagraph-metrics` on the command line, making sure log stats are not disabled.
After that, you should see output like the following, which is helpful for
light debugging:
```
[loggers.py:257] Engine 000: Avg prompt throughput: 32.3 tokens/s, Avg generation throughput: 114.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
[cuda_graph.py:117] **CUDAGraph Config Settings:**
[cuda_graph.py:117]
[cuda_graph.py:117] - Mode: FULL_DECODE_ONLY
[cuda_graph.py:117] - Capture sizes: [1, 2, 4, 8, 16, 24, 32]
[cuda_graph.py:117]
[cuda_graph.py:117] **CUDAGraph Stats:**
[cuda_graph.py:117]
[cuda_graph.py:117] | Unpadded Tokens | Padded Tokens | Num Paddings | Runtime Mode | Count |
[cuda_graph.py:117] |-----------------|---------------|--------------|--------------|-------|
[cuda_graph.py:117] | 4 | 4 | 0 | FULL | 18 |
[cuda_graph.py:117] | 5 | 5 | 0 | NONE | 1 |
[cuda_graph.py:117] | 1 | 1 | 0 | FULL | 1 |
[cuda_graph.py:117] | 18 | 18 | 0 | NONE | 1 |
```
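Each row of the stats table above is effectively a frequency count keyed on (unpadded tokens, padded tokens, runtime mode). A minimal sketch of that kind of aggregation, with hypothetical names (`record` and `summarize` are illustrative, not the actual vllm-ascend API):

```python
from collections import Counter

# Tally of executed batches, keyed by padding outcome and runtime mode.
stats: Counter[tuple[int, int, str]] = Counter()

def record(unpadded: int, padded: int, mode: str) -> None:
    """Tally one executed batch (hypothetical helper)."""
    stats[(unpadded, padded, mode)] += 1

def summarize() -> list[tuple[int, int, int, str, int]]:
    """Rows of (unpadded, padded, num_paddings, mode, count), like the log table."""
    return [(u, p, p - u, m, c) for (u, p, m), c in stats.most_common()]

# Replaying the batches behind the sample log above:
for _ in range(18):
    record(4, 4, "FULL")
record(5, 5, "NONE")
record(1, 1, "FULL")
record(18, 18, "NONE")
```

`most_common()` puts the hottest (unpadded, padded, mode) bucket first, which matches how the table surfaces the dominant capture size at a glance.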
### How was this patch tested?
None.
- vLLM version: v0.14.1
- vLLM main: dc917cceb8
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
```diff
@@ -192,6 +192,7 @@ class ExecuteModelState(NamedTuple):
     attn_metadata: "PerLayerAttnMetadata"
     positions: torch.Tensor
     ec_connector_output: "ECConnectorOutput | None"
+    cudagraph_stats: CUDAGraphStat | None


 class NPUModelRunner(GPUModelRunner):
@@ -1353,6 +1354,7 @@ class NPUModelRunner(GPUModelRunner):
             attn_metadata,
             positions,
             ec_connector_output,
+            cudagraph_stats,
         )
         self.kv_connector_output = kv_connector_output
         return None
@@ -1389,6 +1391,7 @@ class NPUModelRunner(GPUModelRunner):
             attn_metadata,
             positions,
             ec_connector_output,
+            cudagraph_stats,
         ) = self.execute_model_state
         # Clear ephemeral state.
         self.execute_model_state = None
@@ -1466,6 +1469,7 @@ class NPUModelRunner(GPUModelRunner):
             ec_connector_output=ec_connector_output
             if self.supports_mm_inputs
             else None,
+            cudagraph_stats=cudagraph_stats,
         )

         durations = ProfileExecuteDuration().pop_captured_sync()
```