xc-llm-ascend

Files

Mengqing Cao 900086fdc6 [HybridKV][Bugfix] Fix Hybrid kvcache sharing bug in same attention type (#3760)

### What this PR does / why we need it?
Part of https://github.com/vllm-project/vllm-ascend/pull/3106
Fix the hybrid KV cache sharing bug for layers of the same attention type.
Change the `shared_by` logic so that layers with the same attention spec
share one buffer instead of allocating additional HBM.
After this PR, KV cache memory usage on Qwen3-Next is 50% lower than
before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` can be raised to `0.8` on Qwen3-Next when
running on A2 (64 GB per card) with tp4.
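
The buffer-sharing idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code changed by this PR: `CacheTensor`, `allocate`, `total_bytes`, and the layer names are hypothetical, and a `bytearray` stands in for device memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheTensor:
    """Hypothetical description of one cache buffer."""
    size: int                   # buffer size in bytes
    shared_by: tuple[str, ...]  # names of the layers that use this buffer


def allocate(tensors):
    """Allocate one buffer per CacheTensor and map every layer in
    `shared_by` to that same buffer object."""
    buffers = {}
    for tensor in tensors:
        buf = bytearray(tensor.size)  # stand-in for a device allocation
        for layer_name in tensor.shared_by:
            buffers[layer_name] = buf
    return buffers


def total_bytes(buffers):
    """Count each distinct buffer once, however many layers share it."""
    unique = {id(buf): buf for buf in buffers.values()}
    return sum(len(buf) for buf in unique.values())


# Two layers with the same spec sharing one buffer (hypothetical names).
shared = allocate([
    CacheTensor(1024, ("layers.0.linear_attn", "layers.4.linear_attn")),
])

# The same two layers, each with its own buffer.
separate = allocate([
    CacheTensor(1024, ("layers.0.linear_attn",)),
    CacheTensor(1024, ("layers.4.linear_attn",)),
])
```

In this toy setup, `total_bytes(shared)` counts a single 1024-byte buffer while `total_bytes(separate)` counts two, which is the kind of duplication the `shared_by` change is meant to avoid.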

<img width="2833" height="1540" alt="image"
src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe"
/>

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tests pass with the latest e2e test case on Qwen3-Next.

- vLLM version: v0.11.0rc3
- vLLM main: c9461e05a4

---------

Signed-off-by: MengqingCao <cmq0113@163.com>

2025-10-29 14:18:52 +08:00

test_data_parallel.py

ACLgraph enable: Test cases revisions for all features (#3388)

2025-10-17 17:15:19 +08:00

test_expert_parallel.py

ACLgraph enable: Test cases revisions for all features (#3388)

2025-10-17 17:15:19 +08:00

test_external_launcher.py

[Test] enable external launcher and add e2e test for sleep mode in level2 (#3344)

2025-10-11 17:29:38 +08:00

test_full_graph_mode.py

fix pagedattention to support fullgraph. (#3436)