xc-llm-ascend

linfeng-yuan 05a561129e [Graph][Bugfix] Set default cudagraph max capture size via platform defaults (#7572)

### What this PR does / why we need it?

This PR lets the NPU platform provide its own default
`max_cudagraph_capture_size` via
`NPUPlatform.apply_config_platform_defaults()`.

Previously, when cudagraph sizing was left unset, Ascend inherited
vLLM's upstream default heuristic in `_set_cudagraph_sizes()`, which
uses `max_num_seqs * decode_query_len * 2`. This PR changes Ascend's
default to `min(max_num_seqs * decode_query_len, 512)` while keeping the
rest of vLLM's cudagraph sizing logic unchanged.
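
The two defaults described above can be compared with a minimal sketch. The function names are hypothetical and are not part of vLLM or vllm-ascend; the upstream heuristic is modeled only as far as this description states it, so any additional clamping done inside `_set_cudagraph_sizes()` is not shown.

```python
def upstream_default_max(max_num_seqs: int, decode_query_len: int) -> int:
    """Upstream heuristic as described in this PR (hypothetical helper)."""
    return max_num_seqs * decode_query_len * 2


def ascend_default_max(max_num_seqs: int, decode_query_len: int, cap: int = 512) -> int:
    """Ascend default introduced by this PR (hypothetical helper).

    Drops the upstream `* 2` factor and caps the result at `cap`.
    """
    return min(max_num_seqs * decode_query_len, cap)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the PR.
    for max_num_seqs, decode_query_len in [(64, 1), (256, 1), (1024, 1), (16, 4)]:
        print(
            max_num_seqs,
            decode_query_len,
            upstream_default_max(max_num_seqs, decode_query_len),
            ascend_default_max(max_num_seqs, decode_query_len),
        )
```

For example, with `max_num_seqs=256` and `decode_query_len=1`, the sketch yields 512 for the upstream formula and 256 for the Ascend default.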

### Does this PR introduce _any_ user-facing change?

Yes, but only for Ascend when users do not explicitly configure
cudagraph sizing.

If `max_cudagraph_capture_size` and `cudagraph_capture_sizes` are both
unset, Ascend now uses `max_num_seqs * decode_query_len` (capped at `512`)
instead of the upstream `* 2` default. Explicit user settings are
unchanged.
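
The rule for when the default applies can be sketched as follows. This is an illustration under assumptions, not the code in `NPUPlatform.apply_config_platform_defaults()`: the helper name and its flat argument list are hypothetical, and `None` stands in for "unset".

```python
from typing import Optional, Sequence

ASCEND_DEFAULT_CAP = 512  # cap stated in this PR description


def ascend_default_to_inject(
    max_cudagraph_capture_size: Optional[int],
    cudagraph_capture_sizes: Optional[Sequence[int]],
    max_num_seqs: int,
    decode_query_len: int,
) -> Optional[int]:
    """Return the default max to inject, or None if the user set either option.

    Hypothetical helper: explicit user settings are never overridden.
    """
    if max_cudagraph_capture_size is not None or cudagraph_capture_sizes is not None:
        return None
    return min(max_num_seqs * decode_query_len, ASCEND_DEFAULT_CAP)
```

Returning `None` for explicit settings keeps the sketch aligned with the statement that user-provided values are left unchanged.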

### How was this patch tested?

Added unit tests covering:

- default max injection via `apply_config_platform_defaults()`
- explicit `max_cudagraph_capture_size` is preserved
- explicit `cudagraph_capture_sizes` are preserved
- Ascend default max no longer uses the upstream `* 2`
- late `_set_cudagraph_sizes()` recomputation reuses the current max
input

- vLLM version: v0.18.0
- vLLM main: ed359c497a

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>

2026-03-25 17:57:19 +08:00

e2e

[310P]fused recurrent gated delta rule pytorch core and ut (#7398)

2026-03-25 08:53:14 +08:00

[Graph][Bugfix] Set default cudagraph max capture size via platform defaults (#7572)

2026-03-25 17:57:19 +08:00

__init__.py

[SpecDecode] Add spec decode support (#500)

2025-04-17 20:16:32 +08:00