xc-llm-ascend

Author	SHA1	Message	Date
pu-zhe	e8f7b2e3f1	[Refactor] [310p] Support Mamba Cache and support attn_head_size larger than 128 (#7372 ) ### What this PR does / why we need it? 1. Mamba Cache Support on 310P: Implemented logic to correctly initialize and allocate KV cache for Mamba models on the 310P platform, including handling of state tensors and page size alignment. 2. Increased Attention Head Size Support: Modified the attention backend to support attn_head_size larger than 128 by dynamically selecting appropriate kernel block sizes based on hardware limitations (e.g., block_size * head_size <= 16384). 3. Refactored KV Cache Allocation: Consolidated and improved the KV cache allocation mechanism, moving from separate size calculation and allocation steps to a unified _allocate_kv_cache_tensors method that handles both Attention and Mamba specific cache structures. 4. Dynamic Mamba Config Patching: Introduced conditional loading of Mamba configuration patches, specifically using patch_mamba_config_310 for the 310P platform to ensure platform-specific optimizations and validations. 5. Reserve reasonable memory to allocate KV cache to avoid OOM issue with default gpu_memory_utilization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3.5 E2E test - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-19 09:16:22 +08:00
Shaoxu Cheng	ddc78dbade	[300I] support decode-only aclgraph mode (#6849 ) ### What this PR does / why we need it? Supports aclgraph mode on 310P, with some known problems: - Because of the hardware event-id limit, the number of graphs is limited. - The CANN version that supports this feature is not available outside Huawei. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? local test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-02 14:15:14 +08:00
pu-zhe	e76b69b9ef	[BugFix] [310p] Fix attention accuracy issue (#6803 ) ### What this PR does / why we need it? This pull request resolves an attention accuracy issue by enhancing the AttentionMaskBuilder310 to correctly handle the maximum model length. The change ensures that the attention mask generation process is properly parameterized by the model's configuration, rather than relying on a fixed internal value. This leads to more accurate attention mask creation, which is crucial for the correct functioning of the attention mechanism. Update fused_moe to main branch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3 dense mode & moe model e2e test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-26 14:30:39 +08:00
pu-zhe	a8e951e6f5	[Feat] 310p supports PrefillCacheHit State (#6756 ) ### What this PR does / why we need it? This PR extends the Ascend 310P attention backend to support the `PrefillCacheHit` state. Previously, only `PrefillNoCache`, `DecodeOnly`, and `ChunkedPrefill` were supported. This PR handles this state by routing it to the existing `forward_chunked_prefill_310` implementation, which is suitable for this scenario. The changes also include refactoring the main `forward_impl` dispatch method for better clarity and updating unit tests to cover the new state and ensure correctness. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Accuracy test when chunked prefill is disabled. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-24 16:48:05 +08:00
pu-zhe	4f33e25046	[Refactor]refactor 310p attention impl and add ut (#6579 ) ### What this PR does / why we need it? This pull request significantly refactors the attention mechanism for the Ascend 310P hardware, enhancing its architecture by separating mask generation concerns from the core attention implementation. It introduces a dedicated mask builder class capable of handling various mask types, including causal, splitfuse, and sliding window attention masks, all optimized for the NPU's fractal data format. This change not only cleans up the codebase but also lays the groundwork for more robust and feature-rich attention operations on Ascend devices, backed by new, extensive unit tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3 and qwen3-moe - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:26:26 +08:00
Shaoxu Cheng	9fadc8df4f	[Fixbugs]: fix refactor cause to 310p chunkprefill error (#6340 ) Adapt modelrunner refactor change to make 310p work - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-28 16:41:32 +08:00
Shaoxu Cheng	fbae41697e	[310P]: refactoring for 310p kvcache and some ops class (#6117 ) ### What this PR does / why we need it? * Refactor the LayerNorm and activation operator classes to decouple the 310P device implementation from the main branch. * Refactor `mm_encoder_attention` on 310P to use the `torch_npu._npu_flash_attention_unpad` operator. * Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P so they are no longer padded to 16× alignment. * Refactor `model_runner` on 310P to align the KV-cache initialization logic with the mainline implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? use the e2e tests. - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-24 20:34:29 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986 ) ### What this PR does / why we need it? This patch optimizes the lint check. The main idea is to reduce unnecessary installation time. 1. Installing vllm is not required; appending the vllm source path to `PYTHONPATH` is sufficient. 2. Installing `requirements-dev.txt` is not required; a pre-built image, `quay.io/ascend-ci/vllm-ascend:lint`, has all the requirements installed in advance. NOTE: image builds are triggered by: 1) a daily scheduled build; 2) a change to the requirements; 3) a manual build. This keeps the dependencies in the image as up to date as possible. 3. `mypy` was separated from the `pre-commit` hook because running it inside the hook performed poorly. 4. Reduce the CPU core consumption from 16 to 8. ### Does this PR introduce _any_ user-facing change? The end-to-end lint time was reduced from 20 min per PR to 8 min per PR. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
Shaoxu Cheng	1ffca8673f	[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 ) ### What this PR does / why we need it? Add basic 310p support. Only dense models work with eager mode now. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-01-17 11:49:18 +08:00

9 Commits