11 Commits

Author SHA1 Message Date
meihanc
bff4fbfca5 upgrade to 0.18.0 (#7502)
### What this PR does / why we need it?
1. upgrade to 0.18.0
2. ensure `kernel_block_sizes` is an int for the Eagle drafter
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-03-21 16:05:38 +08:00
pu-zhe
e8f7b2e3f1 [Refactor] [310p] Support Mamba Cache and support attn_head_size larger than 128 (#7372)
### What this PR does / why we need it?
1. Mamba Cache Support on 310P: Implemented logic to correctly
initialize and allocate KV cache for Mamba models on the 310P platform,
including handling of state tensors and page size alignment.
2. Increased Attention Head Size Support: Modified the attention backend
to support `attn_head_size` larger than 128 by dynamically selecting
appropriate kernel block sizes based on hardware limitations (e.g.,
`block_size * head_size <= 16384`); see the sketch after this list.
3. Refactored KV Cache Allocation: Consolidated and improved the KV
cache allocation mechanism, moving from separate size calculation and
allocation steps to a unified _allocate_kv_cache_tensors method that
handles both Attention and Mamba specific cache structures.
4. Dynamic Mamba Config Patching: Introduced conditional loading of
Mamba configuration patches, specifically using patch_mamba_config_310
for the 310P platform to ensure platform-specific optimizations and
validations.
5. Reserve a reasonable amount of memory when allocating the KV cache to
avoid OOM issues with the default `gpu_memory_utilization`.
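A minimal sketch of the block-size selection described in item 2; the constant, helper name, and halving strategy are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical illustration of choosing a kernel block size under the
# hardware constraint block_size * head_size <= 16384; names are assumed.
MAX_BLOCK_ELEMS = 16384

def pick_kernel_block_size(head_size: int, preferred: int = 128) -> int:
    """Largest block size <= preferred that satisfies the hardware limit."""
    block_size = preferred
    while block_size > 1 and block_size * head_size > MAX_BLOCK_ELEMS:
        block_size //= 2
    return block_size

assert pick_kernel_block_size(head_size=128) == 128  # 128 * 128 = 16384, fits
assert pick_kernel_block_size(head_size=256) == 64   # larger heads get smaller blocks
```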
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3.5 E2E test
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-03-19 09:16:22 +08:00
zhangyiming
1c954ff264 [main2main] upgrade vllm to 0308 (#7213)
### What this PR does / why we need it?
Update main2main to vLLM 0308.
Breaking changes:

* https://github.com/vllm-project/vllm/pull/30681
* https://github.com/vllm-project/vllm/pull/35552 remove
self.cudagraph_batch_sizes
* https://github.com/vllm-project/vllm/pull/35158 clear_metadata ->
defer_finalize
* https://github.com/vllm-project/vllm/pull/36006 remove
CacheConfig.cpu_offload_gb
* https://github.com/vllm-project/vllm/pull/35472
* https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder
* https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens
* https://github.com/vllm-project/vllm/pull/28053 

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: MrZ20 <2609716663@qq.com>
2026-03-18 09:24:43 +08:00
Li Wang
88c56e3bf2 [Misc] Fix main lint to make CI happy (#7204)
### What this PR does / why we need it?
Fix the lint failure caused by the merging of a previous PR.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 18:27:48 +08:00
Shaoxu Cheng
e5343d6eb3 [310P][Bugfix]: fix ngram graph replay accuracy error (#7134)
### What this PR does / why we need it?
On the 310P device, when running ACLGraph together with the n-gram
speculative decoding algorithm, both graph capture and graph replay
require `uniform_decode_query_len` and do not depend on
`attention_state`. This leads to a rather interesting and unexpected
issue on 310P: during decode-only, execution does **not** enter the
graph, while in the split-fuse state (that is, the chunked prefill
state), it instead enters graph execution directly.

The issue can be resolved by forcibly setting `uniform_decode_query_len`
to `1`, so that 310P captures only the decode-only graph, and replay is
then controlled through `attention_state`.
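A hedged sketch of the workaround; the helper name and parameters are placeholders, not the actual 310P graph-capture code:

```python
# Hypothetical sketch of the fix above; is_310p and num_spec_tokens are
# illustrative placeholders.
def uniform_decode_query_len(is_310p: bool, num_spec_tokens: int) -> int:
    if is_310p:
        # Capture only the decode-only graph on 310P; replay is then gated
        # by attention_state rather than by the query length.
        return 1
    # Elsewhere the query length accounts for the speculative tokens.
    return 1 + num_spec_tokens
```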

### Does this PR introduce _any_ user-facing change?
NO

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-12 17:08:08 +08:00
linfeng-yuan
68d8d20ca2 [misc] move mxfp_compat into device to decouple from quantization init chain (#6918)
### What this PR does / why we need it?
`mxfp_compat` only provides dtype/symbol compatibility helpers for
different `torch_npu` versions, but it was placed under
`vllm_ascend.quantization`. Importing it from device/ops paths could
trigger `quantization/__init__.py` and pull in heavy quantization method
dependencies, increasing startup coupling and causing import-cycle risk
(especially on 310P paths).

### Does this PR introduce _any_ user-facing change?
No functional behavior change intended.

### How was this patch tested?
CI passed.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-02 18:17:01 +08:00
Icey
ee59429015 upgrade main to 0212 (#6712)
### What this PR does / why we need it?
Fixes the `transformers_utils/processors/__init__` import error, due to
https://github.com/vllm-project/vllm/pull/33247
Fixes the Fused MoE break introduced by the `MoERunner` abstraction, due to
https://github.com/vllm-project/vllm/pull/32344

> Delete `AscendMoERunner` when
https://github.com/vllm-project/vllm/pull/35178 is merged

Fixes the `Make Qwen3VL compatible with Transformers v5` breakage, due to
https://github.com/vllm-project/vllm/pull/34262

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-02-25 09:17:29 +08:00
pu-zhe
a8e951e6f5 [Feat] 310p supports PrefillCacheHit State (#6756)
### What this PR does / why we need it?
This PR extends the Ascend 310P attention backend to support the
`PrefillCacheHit` state. Previously, only `PrefillNoCache`,
`DecodeOnly`, and `ChunkedPrefill` were supported.
This PR handles this state by routing it to the existing
`forward_chunked_prefill_310` implementation, which is suitable for this
scenario.
The changes also include refactoring the main `forward_impl` dispatch
method for better clarity and updating unit tests to cover the new state
and ensure correctness.
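A self-contained sketch of the dispatch described above; the enum and function names are placeholders for illustration, not the backend's actual symbols:

```python
from enum import Enum, auto

class AttnState(Enum):  # placeholder names mirroring the states listed above
    PrefillNoCache = auto()
    PrefillCacheHit = auto()
    DecodeOnly = auto()
    ChunkedPrefill = auto()

def dispatch(state: AttnState) -> str:
    """Return the name of the forward path a given state is routed to."""
    if state is AttnState.PrefillNoCache:
        return "forward_prefill_no_cache"
    if state is AttnState.DecodeOnly:
        return "forward_decode_only"
    # PrefillCacheHit reuses the existing chunked-prefill implementation.
    if state in (AttnState.PrefillCacheHit, AttnState.ChunkedPrefill):
        return "forward_chunked_prefill_310"
    raise NotImplementedError(state)

print(dispatch(AttnState.PrefillCacheHit))  # forward_chunked_prefill_310
```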
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Accuracy test when chunked prefill is disabled.
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-24 16:48:05 +08:00
debuger
c1618a0427 [Bugfix]Fix the compatibility issue of may_reinitialize_input_batch (#6290)
### What this PR does / why we need it?
Added a check in the `may_reinitialize_input_batch` method to verify
whether the backend implements the `get_supported_block_size` method.
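A minimal sketch of the compatibility guard, assuming a hypothetical fallback; only `get_supported_block_size` comes from the PR description:

```python
def resolve_block_sizes(backend, default_block_size: int) -> list[int]:
    """Prefer the backend's supported block sizes; fall back if the hook is absent."""
    if hasattr(backend, "get_supported_block_size"):
        return list(backend.get_supported_block_size())
    # Older backends without the method keep the previous behaviour.
    return [default_block_size]
```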

### Does this PR introduce _any_ user-facing change?
no user-facing change

### How was this patch tested?
Only a few lines of code within the methods were modified, and the
format check has passed.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Debuuuuger <huangzr@cmbchina.com>
Signed-off-by: debuger <102402761+huangazazaz@users.noreply.github.com>
Signed-off-by: Debuuuuger <12110718@mail.sustech.edu.cn>
Co-authored-by: Debuuuuger <huangzr@cmbchina.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 19:16:26 +08:00
pu-zhe
57fd6e4bd9 [Refact.]: refactoring 310p-kv cache allocator, align with main branch (#6270)
### What this PR does / why we need it?
Refactor the 310P KV cache allocator to align with the main branch.

- vLLM version: v0.14.0
- vLLM main: https://github.com/vllm-project/vllm-ascend/pull/6270

Qwen2.5-7B E2E test

---------

Signed-off-by: pu-zhe <puzhe1@h-partners.com>
Signed-off-by: pu-zhe <zpuaa@outlook.com>
Co-authored-by: pu-zhe <puzhe1@h-partners.com>
2026-01-27 16:26:48 +08:00
Shaoxu Cheng
fbae41697e [310P]: refactoring for 310p kvcache and some ops class (#6117)
### What this PR does / why we need it?
* Refactor the LayerNorm and activation operator classes to decouple the
310P device implementation from the main branch.
* Refactor `mm_encoder_attention` on 310P to use the
`torch_npu._npu_flash_attention_unpad` operator.
* Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P
so they are no longer padded to 16× alignment.
* Refactor `model_runner` on 310P to align the KV-cache initialization
logic with the mainline implementation.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Verified with the e2e tests.

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-01-24 20:34:29 +08:00