xc-llm-ascend

Author SHA1 Message Date


pu-zhe

e76b69b9ef

[BugFix] [310p] Fix attention accuracy issue (#6803)

### What this PR does / why we need it?
This pull request resolves an attention accuracy issue by enhancing the
AttentionMaskBuilder310 to correctly handle the maximum model length.
The change ensures that the attention mask generation process is
properly parameterized by the model's configuration, rather than relying
on a fixed internal value. This leads to more accurate attention mask
creation, which is crucial for the correct functioning of the attention
mechanism.
Also syncs fused_moe with the main branch.
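
To illustrate the idea of sizing the mask from the model configuration instead of a fixed internal value, here is a minimal sketch in plain Python. The class name `CausalMaskBuilder`, its methods, and the `MASK_VALUE` constant are hypothetical and are not taken from `AttentionMaskBuilder310`. The real builder works on NPU tensors, which this sketch does not model.

```python
# Hypothetical sketch: these names are illustrative, not the vllm-ascend API.
MASK_VALUE = float("-inf")  # additive mask value for blocked positions


class CausalMaskBuilder:
    """Builds a causal mask sized by the model's configured maximum length."""

    def __init__(self, max_model_len: int):
        if max_model_len <= 0:
            raise ValueError("max_model_len must be positive")
        self.max_model_len = max_model_len
        # The full mask is built once; row i blocks every column j > i.
        self._mask = [
            [0.0 if j <= i else MASK_VALUE for j in range(max_model_len)]
            for i in range(max_model_len)
        ]

    def get_mask(self, seq_len: int):
        """Return the top-left seq_len x seq_len slice of the cached mask."""
        if seq_len > self.max_model_len:
            # A hard-coded size would fail here for longer configurations.
            raise ValueError("seq_len exceeds max_model_len")
        return [row[:seq_len] for row in self._mask[:seq_len]]
```

Because the size comes from the constructor argument, a model configured with a longer context gets a correspondingly larger mask. A fixed size would fail or misbehave for any sequence longer than it.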
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E tests with Qwen3 dense and MoE models
- vLLM version: v0.15.0
- vLLM main: `83b47f67b1`

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

2026-02-26 14:30:39 +08:00

pu-zhe

4f33e25046

[Refactor] Refactor 310p attention impl and add UT (#6579)

### What this PR does / why we need it?
This pull request significantly refactors the attention mechanism for
the Ascend 310P hardware, enhancing its architecture by separating mask
generation concerns from the core attention implementation. It
introduces a dedicated mask builder class capable of handling various
mask types, including causal, splitfuse, and sliding window attention
masks, all optimized for the NPU's fractal data format. This change not
only cleans up the codebase but also lays the groundwork for more robust
and feature-rich attention operations on Ascend devices, backed by new,
extensive unit tests.
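
The mask types named above can be sketched with one small function. This is an illustrative example in plain Python, not the PR's mask builder. The function name `build_mask` and its parameters are hypothetical, and reading "splitfuse" as chunked prefill (new query tokens attending to an already cached prefix) is an assumption. The NPU fractal data format is not modeled.

```python
from typing import List, Optional


def build_mask(q_len: int, kv_len: int,
               window: Optional[int] = None) -> List[List[bool]]:
    """Return a q_len x kv_len mask; True marks a key the query may attend to.

    Query row i is assumed to sit at absolute position kv_len - q_len + i,
    so q_len == kv_len gives a plain causal mask and q_len < kv_len gives
    a chunked (splitfuse-style) mask over a cached prefix.
    """
    if q_len <= 0 or q_len > kv_len:
        raise ValueError("require 0 < q_len <= kv_len")
    if window is not None and window <= 0:
        raise ValueError("window must be positive")
    offset = kv_len - q_len
    mask = []
    for i in range(q_len):
        q_pos = offset + i
        row = []
        for k_pos in range(kv_len):
            visible = k_pos <= q_pos  # causal: never look ahead
            if window is not None:
                # sliding window: only the last `window` positions
                visible = visible and k_pos > q_pos - window
            row.append(visible)
        mask.append(row)
    return mask
```

For example, `build_mask(3, 3)` is lower-triangular, `build_mask(2, 4)` lets two new tokens see the two cached tokens before them, and `build_mask(4, 4, window=2)` limits each token to itself and its immediate predecessor.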
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test with qwen3 and qwen3-moe
- vLLM version: v0.15.0
- vLLM main: `d7e17aaacd`

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>

2026-02-07 09:26:26 +08:00

Shaoxu Cheng

1ffca8673f

[Feature]: Support running qwen2.5/3 dense and qwen2.5vl models on 310P devices (#5776)

### What this PR does / why we need it?
Add basic 310p support. Only dense models work with eager mode now.

- vLLM version: v0.13.0
- vLLM main: `2f4e6548ef`

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>

2026-01-17 11:49:18 +08:00

3 Commits