xc-llm-ascend

Files

Yizhou 1bc61031e5 [v0.11.0][Fix] Cap max tokens to prevent potential OOM (#3720) (#3744)

### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.

This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.
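
As a rough illustration of the idea (not the actual patch), the cap amounts to clamping a computed token count before it is used to size a buffer. The helper name `cap_max_tokens`, its parameters, and the constant name `MAX_TOKENS_CAP` are hypothetical; only the value 512 comes from the description above.

```python
# Minimal sketch, assuming the cap is a simple upper bound on a computed value.
# All names here are illustrative and do not come from the repository.

MAX_TOKENS_CAP = 512  # cap value stated in the PR description


def cap_max_tokens(calculated_max_tokens: int, cap: int = MAX_TOKENS_CAP) -> int:
    """Clamp a calculated token count so a buffer sized from it stays bounded."""
    if calculated_max_tokens < 0:
        raise ValueError("calculated_max_tokens must be non-negative")
    # min() leaves small values untouched and bounds large ones at the cap.
    return min(calculated_max_tokens, cap)


if __name__ == "__main__":
    # A large fallback value (e.g. when no capture size was given) is clamped.
    print(cap_max_tokens(8192))
    # A value already under the cap passes through unchanged.
    print(cap_max_tokens(256))
```

With this shape, any buffer sized from the returned value never exceeds 512 tokens' worth of entries, regardless of how large the uncapped calculation would have been.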

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

2025-10-25 15:46:56 +08:00

__init__.py

[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)

2025-07-19 09:42:32 +08:00

block_table.py

[HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007)

2025-09-18 21:43:22 +08:00

model_runner_v1.py

[v0.11.0][Fix] Cap max tokens to prevent potential OOM (#3720) (#3744)

2025-10-25 15:46:56 +08:00

npu_input_batch.py

Drop 0.10.2 (#3284)

2025-10-09 10:28:38 +08:00

worker_v1.py

[main] support cpu binding (#3546)

2025-10-21 09:17:03 +08:00