xc-llm-ascend

Files

Angazenn 1d3544c887 [BugFix]converting pa get_workspace back to capturing (#5833)

### What this PR does / why we need it?

This fixes a bug in the paged attention (pa) `get_workspace` handling. The
earlier implementation called `_npu_paged_attention_get_workspace` in
`_update_pa_attn_params`. However, this can cause memory problems, because
each call to this API dynamically allocates new memory for the workspace.
Therefore, we move the call back to capturing and use a fixed
`SEQ_LEN_WITH_MAX_PA_WORKSPACE` to obtain the maximum workspace.
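
The idea can be illustrated with a minimal, device-free sketch. It is not the code from this PR: `workspace_bytes`, `CapturedAttention`, and `update_params` are hypothetical stand-ins, the value given to `SEQ_LEN_WITH_MAX_PA_WORKSPACE` is a placeholder, and the toy size model simply assumes the workspace grows with sequence length. It only shows the pattern of allocating once at capture time for the largest case and reusing that buffer on every later update.

```python
# Placeholder value, assumed for illustration only.
SEQ_LEN_WITH_MAX_PA_WORKSPACE = 6144


def workspace_bytes(seq_len: int, bytes_per_token: int = 64) -> int:
    """Toy size model (assumption): workspace grows linearly with seq_len."""
    return seq_len * bytes_per_token


class CapturedAttention:
    """Hypothetical stand-in for a captured attention op."""

    def __init__(self, max_seq_len: int) -> None:
        # Capture time: allocate once, sized for the largest sequence length.
        self.workspace = bytearray(workspace_bytes(max_seq_len))
        self.allocations = 1

    def update_params(self, seq_len: int) -> memoryview:
        # Update time: reuse the captured buffer instead of allocating.
        needed = workspace_bytes(seq_len)
        if needed > len(self.workspace):
            raise ValueError("seq_len exceeds the captured workspace")
        return memoryview(self.workspace)[:needed]


attn = CapturedAttention(SEQ_LEN_WITH_MAX_PA_WORKSPACE)
views = [attn.update_params(n) for n in (1, 128, 4096)]
```

Every view returned by `update_params` is backed by the same buffer, so memory use stays fixed after capture regardless of how many updates run.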

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

Signed-off-by: Angazenn <supperccell@163.com>

2026-01-22 15:49:22 +08:00

model runner v2 support triton of penalty (#5854)

2026-01-20 12:26:05 +00:00

__init__.py

[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)

2025-07-19 09:42:32 +08:00

block_table.py

[feature] support pcp + mtp in full graph (#4572)

2025-12-22 16:13:39 +08:00

model_runner_v1.py

[BugFix]converting pa get_workspace back to capturing (#5833)

2026-01-22 15:49:22 +08:00