xc-llm-ascend

Files

xuyexiong eff3e5fc6f [FEAT] Refactor spec decode to support efficient padded speculation (#3528)

### What this PR does / why we need it?
1. Refactor `mtp_proposer.py` and split the torchair-related code out
into `mtp_torchair_proposer.py`.
2. Following https://github.com/vllm-project/vllm/pull/24539,
implement padded speculative decoding as described in
https://github.com/vllm-project/vllm/issues/21984.
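
As a rough illustration of what "padded" means in item 2, the toy sketch below contrasts two ways of preparing drafter inputs after verification. It assumes that padded speculation keeps rejected token slots in the batch and tracks the index of the last accepted token, instead of compacting each row. The function names and data layout are hypothetical and are not taken from `mtp_proposer.py` or from vLLM.

```python
# Toy sketch only: hypothetical helpers, not the vllm-ascend implementation.

def unpadded_drafter_inputs(token_rows, num_accepted):
    """Drop rejected tokens, so row lengths depend on runtime acceptance."""
    return [row[:n] for row, n in zip(token_rows, num_accepted)]


def padded_drafter_inputs(token_rows, num_accepted):
    """Keep every row at full length and record the last accepted index.

    Rejected slots stay in place as padding, so the batch shape is fixed
    and only the per-row index changes.
    """
    last_accepted_idx = [n - 1 for n in num_accepted]
    return token_rows, last_accepted_idx


# Two requests, each with 2 candidate tokens (assumed layout).
rows = [[11, 12], [21, 22]]
accepted = [2, 1]  # request 0 accepts both tokens, request 1 accepts one

unpadded = unpadded_drafter_inputs(rows, accepted)
padded_rows, last_idx = padded_drafter_inputs(rows, accepted)
print(unpadded)
print(padded_rows, last_idx)
```

In this sketch the unpadded variant yields rows of different lengths, while the padded variant keeps the input shape unchanged, which is the property that makes a fixed-shape batch possible.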
### Does this PR introduce _any_ user-facing change?
Users can set `disable_padded_drafter_batch` to disable or enable padded
speculation. The default is `False`, which leaves padded speculation enabled.
Offline example:
```
speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}
```
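
A minimal sketch of how the two settings might be assembled in an offline script. The helper `build_speculative_config` and its `use_padded_batch` parameter are hypothetical conveniences; only the dictionary keys come from the example above. The commented-out `LLM(...)` call uses a placeholder model path and is not executed here.

```python
def build_speculative_config(num_speculative_tokens=1, use_padded_batch=True):
    """Return a speculative_config dict shaped like the offline example."""
    return {
        "method": "deepseek_mtp",
        "num_speculative_tokens": num_speculative_tokens,
        # False (the default) keeps padded speculation enabled.
        "disable_padded_drafter_batch": not use_padded_batch,
    }


padded_cfg = build_speculative_config()
unpadded_cfg = build_speculative_config(use_padded_batch=False)
print(padded_cfg)
print(unpadded_cfg)

# Assumed usage, requires vLLM with vllm-ascend installed:
# from vllm import LLM
# llm = LLM(model="<path-to-model>", speculative_config=padded_cfg)
```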

### How was this patch tested?

- [x] eager with pad/unpad
- [x] aclgraph with pad/unpad
- [x] torchair with pad/unpad

Performance test of deepseek-r1 with tp16 and dp1 (ITL):
- aclgraph with pad: 168 ms
- aclgraph with unpad: 169 ms
- original: 178 ms
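
For reference, the relative ITL reduction implied by these three numbers can be worked out as below. This is simple arithmetic on the reported values, not an additional measurement; the variable and function names are illustrative.

```python
# ITL values reported above, in milliseconds.
baseline_ms = 178
itl_ms = {"aclgraph with pad": 168, "aclgraph with unpad": 169}


def relative_reduction_pct(new_ms, base_ms):
    """Percentage reduction of new_ms relative to base_ms."""
    return (base_ms - new_ms) / base_ms * 100


reductions = {
    name: round(relative_reduction_pct(value, baseline_ms), 1)
    for name, value in itl_ms.items()
}
print(reductions)
```

Both configurations come out roughly 5 to 6 percent below the original ITL, with the padded and unpadded variants within 1 ms of each other.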


- vLLM version: v0.11.0rc3
- vLLM main: 83f478bb19

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>

2025-10-30 16:53:05 +08:00

test_v1_mtp_correctness.py

[FEAT] Refactor spec decode to support efficient padded speculation (#3528)

2025-10-30 16:53:05 +08:00

test_v1_mtp_torchair_correctness.py

[Feat] Make full graph mode compatible with MTP (#3276)

2025-10-17 20:19:56 +08:00

test_v1_spec_decode.py

ACLgraph enable: Test cases revisions for all features (#3388)

2025-10-17 17:15:19 +08:00