xc-llm-ascend/vllm_ascend/worker
Yizhou 1f25d60870 [Fix] Cap max tokens to prevent potential OOM (#3720)
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.

This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
(OOM) errors.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
None.

- vLLM version: v0.11.0rc3
- vLLM main: 17c540a993

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 11:23:21 +08:00