xc-llm-ascend

Files

Yizhou 5b179c53f1 [FEAT] Support DeepSeek-V3.2 with FULL_DECODE_ONLY mode (#4706 )

### What this PR does / why we need it?
The first commit support `FULL_DECODE_ONLY`:
- Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for
slicing slots and positions, ensuring fixed tensor shapes.
- Implement padding logic for `query_start_loc` in `NPUModelRunner` to
support uniform decode in full graph mode, aligning with GPU runner
behavior.
- Adjust MLA cosine cache allocation to occur independently of graph
mode and switch to using device-resident sequence lengths for attention
metadata.
- Remove redundant slicing of hidden states and outputs in
`AscendSFAImpl` and optimize `sin`/`cos` cache updates.

The second commit take MTP into account:
- Update `AscendSFAMetadataBuilder` to use `num_input_tokens` for
slicing slots and positions, ensuring fixed tensor shapes.
- Implement padding logic for `query_start_loc` in `NPUModelRunner` to
support uniform decode in full graph mode, aligning with GPU runner
behavior.
- Adjust MLA cosine cache allocation to occur independently of graph
mode and switch to using device-resident sequence lengths for attention
metadata.
- Remove redundant slicing of hidden states and outputs in
`AscendSFAImpl` and optimize `sin`/`cos` cache updates.

And the rest of them are just bugfix.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases needed.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

2025-12-10 20:11:09 +08:00

test_attention_mask.py

[Refactor] 2/N Unify all mask generation methods and cache mask (#4779 )

2025-12-09 18:51:00 +08:00

test_attention_v1.py

support async mtp (#4511 )

2025-12-06 17:15:57 +08:00

test_mla_v1.py

support async mtp (#4511 )

2025-12-06 17:15:57 +08:00

test_sfa_v1.py

[FEAT] Support DeepSeek-V3.2 with FULL_DECODE_ONLY mode (#4706 )

2025-12-10 20:11:09 +08:00