[FEAT] Refactor spec decode to support efficient padded speculation (#3528)

### What this PR does / why we need it?
1. Refactors `mtp_proposer.py`, splitting the torchair-related code out into
`mtp_torchair_proposer.py`.
2. Following https://github.com/vllm-project/vllm/pull/24539,
implements padded speculative decoding as described in
https://github.com/vllm-project/vllm/issues/21984 (conceptual sketch below).
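
At a high level, padded speculation keeps the drafter's input batch at its full padded size instead of compacting it after rejection sampling, so captured graphs see static shapes. A minimal illustrative sketch, assuming a torch tensor layout; the function and tensor names are hypothetical and not this PR's actual API:
```
import torch

def build_drafter_input(hidden_states: torch.Tensor,
                        accepted_mask: torch.Tensor,
                        disable_padded_drafter_batch: bool) -> torch.Tensor:
    """Hypothetical illustration of the padded vs. unpadded drafter batch."""
    if disable_padded_drafter_batch:
        # Unpadded path: gather only the rows whose tokens were accepted.
        # The resulting batch size depends on acceptance results, so graph
        # shapes vary and eager execution (or re-capture) is needed.
        return hidden_states[accepted_mask]
    # Padded path: keep every row and zero out the rejected ones. The batch
    # shape stays constant, so a graph captured once (aclgraph/torchair) can
    # be replayed directly.
    return hidden_states * accepted_mask.unsqueeze(-1).to(hidden_states.dtype)
```
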
### Does this PR introduce _any_ user-facing change?
Users can set `disable_padded_drafter_batch` to disable padded speculation;
the default is `False`, i.e. padded speculation is enabled.
Offline example:
```
speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}
```
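
For reference, a fuller offline snippet might look like the following; the model path, parallel size, and prompt are illustrative assumptions, and only the `speculative_config` dict above comes from this PR:
```
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumption: any MTP-capable checkpoint
    tensor_parallel_size=16,          # assumption: matches the tp16 test setup
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
        # False (the default) keeps padded speculation on; True opts out.
        "disable_padded_drafter_batch": False,
    },
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```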

### How was this patch tested?

- [x] eager with pad/unpad
- [x] aclgraph with pad/unpad
- [x] torchair with pad/unpad
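
For context, these three modes are typically selected as sketched below; the `additional_config` / `torchair_graph_config` keys reflect my reading of the usual vllm-ascend setup and are an assumption, not part of this diff:
```
from vllm import LLM

spec = {"method": "deepseek_mtp", "num_speculative_tokens": 1,
        "disable_padded_drafter_batch": False}  # set True for the "unpad" runs

# eager: no graph capture at all
llm_eager = LLM(model="deepseek-ai/DeepSeek-R1", speculative_config=spec,
                enforce_eager=True)

# aclgraph: the default graph mode when eager is not enforced
llm_aclgraph = LLM(model="deepseek-ai/DeepSeek-R1", speculative_config=spec)

# torchair: enabled through vllm-ascend's additional_config (assumed keys)
llm_torchair = LLM(
    model="deepseek-ai/DeepSeek-R1", speculative_config=spec,
    additional_config={"torchair_graph_config": {"enabled": True}})
```
Note that in torchair mode the runner forces `disable_padded_drafter_batch = True` regardless of the user setting, as shown in the diff excerpt below.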

Performance test of DeepSeek-R1 with tp16, dp1:
- aclgraph with pad ITL: 168 ms
- aclgraph with unpad ITL: 169 ms
- original ITL: 178 ms


- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>

Key diff excerpt:
```
@@ -34,6 +34,7 @@ from vllm.logger import logger
import vllm_ascend.envs as envs_ascend
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.platform import NPUPlatform
from vllm_ascend.spec_decode import get_spec_decode_method
from vllm_ascend.torchair.utils import (
    TORCHAIR_CACHE_DIR, TorchairCommonAttentionMetadata,
    check_torchair_cache_exist, converting_weight_acl_format,

@@ -83,6 +84,20 @@ class NPUTorchairModelRunner(NPUModelRunner):
        self._check_batch_sizes_consistency()

    def _set_up_drafter(self):
        super()._set_up_drafter()
        if self.speculative_config:
            # Torchair does not support disable_padded_drafter_batch,
            # so force this feature off.
            self.speculative_config.disable_padded_drafter_batch = True

    def _get_drafter(self):
        return get_spec_decode_method(self.speculative_config.method,
                                      self.vllm_config,
                                      self.device,
                                      self,
                                      is_torchair_graph=True)

    def _may_pad_kv_consumer_num_seq(self):
        # The PD-disaggregation scenario needs redundant batch sizes so that
        # each batch's seq_len does not exceed 16 tokens.
        # self.max_num_reqs here is greater than the actual maximum request number
```