[FEAT] Refactor spec decode to support efficient padded speculation (#3528)
### What this PR does / why we need it?
1. Refactors `mtp_proposer.py`, splitting the torchair-related code into
`mtp_torchair_proposer.py`.
2. Following https://github.com/vllm-project/vllm/pull/24539, implements
padded speculative decoding as described in
https://github.com/vllm-project/vllm/issues/21984.
### Does this PR introduce _any_ user-facing change?
Users can set `disable_padded_drafter_batch` to disable or enable padded
speculation; the default is `False` (padded speculation enabled).
Offline example:
```
speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}
```
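For completeness, a minimal offline sketch of how this config could be used end to end. The model path and sampling settings are placeholders, and it assumes vLLM's `LLM` entry point accepts `speculative_config` as a dict, as in recent vLLM releases:
```python
# Illustrative only: passes the speculative_config above to vLLM's
# offline LLM entry point. Model path and sampling parameters are
# placeholders, not taken from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # placeholder model path
    tensor_parallel_size=16,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
        # False keeps padded speculation enabled (the default).
        "disable_padded_drafter_batch": False,
    },
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```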
### How was this patch tested?
- [x] eager with pad/unpad
- [x] aclgraph with pad/unpad
- [x] torchair with pad/unpad
Performance test of deepseek-r1 with tp16, dp1:
- aclgraph with pad ITL: 168 ms
- aclgraph with unpad ITL: 169 ms
- original ITL: 178 ms
- vLLM version: v0.11.0rc3
- vLLM main: 83f478bb19
---------
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
```
@@ -34,6 +34,7 @@ from vllm.logger import logger
import vllm_ascend.envs as envs_ascend
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.platform import NPUPlatform
from vllm_ascend.spec_decode import get_spec_decode_method
from vllm_ascend.torchair.utils import (
    TORCHAIR_CACHE_DIR, TorchairCommonAttentionMetadata,
    check_torchair_cache_exist, converting_weight_acl_format,
@@ -83,6 +84,20 @@ class NPUTorchairModelRunner(NPUModelRunner):

        self._check_batch_sizes_consistency()

    def _set_up_drafter(self):
        super()._set_up_drafter()
        if self.speculative_config:
            # Torchair does not support disable_padded_drafter_batch,
            # so force the feature off.
            self.speculative_config.disable_padded_drafter_batch = True

    def _get_drafter(self):
        return get_spec_decode_method(self.speculative_config.method,
                                      self.vllm_config,
                                      self.device,
                                      self,
                                      is_torchair_graph=True)

    def _may_pad_kv_consumer_num_seq(self):
        # The PD disaggregation scenario needs redundant_batch_sizes to keep
        # each batch's seq_len from exceeding 16 tokens.
        # self.max_num_reqs here is greater than the actual maximum request number.
```
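For readers unfamiliar with the feature: padded speculation keeps the drafter's input batch at a static shape, so captured graphs (aclgraph/torchair) can be replayed without re-capture. A minimal sketch of the padding idea follows, assuming per-request hidden states stacked along dim 0; the function name `pad_drafter_batch` and the shapes are illustrative, not this PR's actual implementation:
```python
import torch

def pad_drafter_batch(hidden_states: torch.Tensor,
                      padded_num_reqs: int) -> torch.Tensor:
    """Pad per-request drafter inputs up to a fixed row count.

    Illustrative only: the dummy rows let the draft model run with a
    static batch shape; their outputs are discarded after the forward
    pass.
    """
    num_reqs = hidden_states.shape[0]
    if num_reqs >= padded_num_reqs:
        return hidden_states
    padding = hidden_states.new_zeros(
        padded_num_reqs - num_reqs, hidden_states.shape[-1])
    return torch.cat([hidden_states, padding], dim=0)

# Example: 3 live requests padded to a capture size of 8.
x = torch.randn(3, 16)
print(pad_drafter_batch(x, 8).shape)  # torch.Size([8, 16])
```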