xc-llm-ascend

Files

ApsarasX 643e6f5486 [Bugfix] Fix accuracy problem caused by mask pollution (#1678)

### What this PR does / why we need it?
If a small batch of short requests is sent first and forms a chunk with a
length <128, it corrupts the `attn_mask_cache`. Subsequent requests that do
not form a chunk then suffer from accuracy issues.

The root cause is an in-place multiplication, which modifies the cached
mask itself. Switching to out-of-place multiplication leaves the cache
intact and resolves the accuracy problem.
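
The difference between the two patterns can be sketched in plain Python. This is an illustrative stand-in only, not the code from `attention_mask.py`: the function names, the 4x4 mask, and the `-10000.0` factor are all hypothetical, and nested lists that share row objects play the role of tensor slices that share storage with the cache.

```python
# Hypothetical stand-in for a cached attention mask: a list of rows.
# Slicing the outer list keeps references to the same row objects,
# similar to how a sliced tensor shares storage with the original.

def build_causal_mask(size):
    """Return a size x size mask with 1.0 above the diagonal."""
    return [[1.0 if col > row else 0.0 for col in range(size)]
            for row in range(size)]


def scale_in_place(cache, length, factor):
    """Buggy pattern: scales the cached rows themselves."""
    view = cache[:length]  # new outer list, but the rows still belong to the cache
    for row in view:
        for col in range(length):
            row[col] *= factor  # mutates the shared cache
    return [row[:length] for row in view]


def scale_out_of_place(cache, length, factor):
    """Fixed pattern: builds new rows and leaves the cache untouched."""
    return [[value * factor for value in row[:length]]
            for row in cache[:length]]


# A short request (length 2) scales its mask in place and pollutes the cache.
polluted_cache = build_causal_mask(4)
scale_in_place(polluted_cache, 2, -10000.0)

# The out-of-place variant returns the same scaled mask without side effects.
clean_cache = build_causal_mask(4)
scaled = scale_out_of_place(clean_cache, 2, -10000.0)
```

After the in-place call, the top-left 2x2 block of `polluted_cache` holds scaled values while the rest of the mask is unchanged, so any later request that reads a larger slice sees an inconsistent mask. `clean_cache` still matches a freshly built mask.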


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Yes.

- vLLM version: v0.9.2
- vLLM main: ad6c2e1a0b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>

2025-07-10 14:06:49 +08:00

__init__.py

[Core] Make V1 work and enable V1 engine test (#389)

2025-03-28 19:34:23 +08:00

attention_mask.py

[Bugfix] Fix accuracy problem caused by mask pollution (#1678)

2025-07-10 14:06:49 +08:00

attention_v1_torchair.py

[BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636)

2025-07-07 20:03:02 +08:00

attention_v1.py

Fix W8A8 fused moe bug (#1529)