xc-llm-ascend

Author	SHA1	Message	Date
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110) ### What this PR does / why we need it? This PR adds aclgraph support for model runner v2; see RFC #5208. It contains these modifications: - adapt to the newest commit of the vllm main branch. - provide a unified interface for the extra forward context, shared by model runner v1 and model runner v2. - implement graph mode for the main model. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
pu-zhe	e76b69b9ef	[BugFix] [310p] Fix attention accuracy issue (#6803) ### What this PR does / why we need it? This pull request resolves an attention accuracy issue by making AttentionMaskBuilder310 handle the maximum model length correctly. Attention mask generation is now parameterized by the model's configuration rather than by a fixed internal value, so the generated masks match the configured maximum length. It also updates fused_moe to match the main branch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with Qwen3 dense and MoE models - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-26 14:30:39 +08:00
pu-zhe	a8e951e6f5	[Feat] 310p supports PrefillCacheHit State (#6756) ### What this PR does / why we need it? This PR extends the Ascend 310P attention backend to support the `PrefillCacheHit` state. Previously, only `PrefillNoCache`, `DecodeOnly`, and `ChunkedPrefill` were supported. This PR handles the new state by routing it to the existing `forward_chunked_prefill_310` implementation, which is suitable for this scenario. The changes also refactor the main `forward_impl` dispatch method for clarity and update the unit tests to cover the new state. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Accuracy test with chunked prefill disabled. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-24 16:48:05 +08:00
pu-zhe	4f33e25046	[Refactor] Refactor 310p attention impl and add UT (#6579) ### What this PR does / why we need it? This pull request refactors the attention mechanism for the Ascend 310P hardware by separating mask generation from the core attention implementation. It introduces a dedicated mask builder class that handles causal, splitfuse, and sliding window attention masks, all laid out for the NPU's fractal data format. The change cleans up the codebase, prepares the ground for further attention features on Ascend devices, and adds new unit tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3 and qwen3-moe - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:26:26 +08:00

4 Commits