[Lint]Style: reformat markdown files via markdownlint (#5884)

### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main: bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
SILONG ZENG
2026-01-15 09:06:01 +08:00
committed by GitHub
parent 96edd4673f
commit 4811ba62e0
75 changed files with 711 additions and 308 deletions

@@ -1,9 +1,11 @@
# Multi Token Prediction (MTP)
## Why We Need MTP
MTP boosts inference performance by predicting multiple tokens in parallel instead of one token per step. This increases generation throughput and speeds up inference without compromising output quality.
## How to Use MTP
To enable MTP for DeepSeek-V3 models, add the following parameter when starting the service:
--speculative_config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'
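As a minimal sketch, the flag value can be generated from a Python dictionary rather than typed by hand. This assumes the value is parsed as JSON, where booleans are spelled `false`/`true` rather than Python's `False`/`True`. The variable names are illustrative only; the keys and values are the ones shown above.

```python
import json
import shlex

# Keys and values copied from the flag shown above.
speculative_config = {
    "method": "mtp",
    "num_speculative_tokens": 1,
    "disable_padded_drafter_batch": False,  # Python bool, serialized as JSON false
}

# json.dumps emits valid JSON, so the boolean becomes lowercase `false`.
config_json = json.dumps(speculative_config)

# shlex.quote wraps the JSON in single quotes so a POSIX shell passes it through unchanged.
cli_fragment = f"--speculative_config {shlex.quote(config_json)}"
print(cli_fragment)
```

Generating the string this way avoids quoting mistakes and guarantees the value round-trips through a JSON parser.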
@@ -15,7 +17,7 @@ To enable MTP for DeepSeek-V3 models, add the following parameter when starting
### Module Architecture
```
```shell
vllm_ascend
├── sample
│ ├── rejection_sample.py
@@ -28,7 +30,7 @@ vllm_ascend
- *rejection_sample.py*: During decoding, the main model processes the previous round's output token and the predicted tokens together (computing 1+k tokens simultaneously). The first token is always correct, while the second token, referred to as the **bonus token**, is uncertain because it is derived from speculative prediction. We therefore employ the **Greedy Strategy** and the **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module consists of an `AscendRejectionSampler` class whose `forward` method implements the sampling logic.
```
```shell
rejection_sample.py
├── AscendRejectionSampler
│ ├── forward
@@ -37,9 +39,10 @@ rejection_sample.py
**2. spec_decode**
This section covers model preprocessing for spec-decode: loading the model, executing a dummy run, and generating token IDs. Together, these steps make up the data construction and forward invocation for a single spec-decode operation.
- *mtp_proposer.py*: Configures vLLM-Ascend to use speculative decoding in which proposals are generated by the DeepSeek MTP layer.
```
```shell
mtp_proposer.py
├── Proposer
│ ├── load_model
@@ -52,6 +55,7 @@ mtp_proposer.py
### Algorithm
**1. Reject_Sample**
- *Greedy Strategy*
Verify whether the token generated by the main model matches the speculative token predicted by MTP in the previous round. If they match exactly, accept the bonus token; otherwise, reject it and any subsequent tokens derived from that speculation.
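The greedy check described above can be sketched in plain Python. This is a generic illustration of greedy speculative verification, not the code in `AscendRejectionSampler`; the function name `greedy_accept` and its list-based inputs are hypothetical.

```python
def greedy_accept(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Return the tokens emitted by one greedy verification step.

    draft_tokens:  the k tokens proposed by the draft (MTP) model.
    target_tokens: the main model's greedy choice at each of the 1+k positions.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("expected one more target token than draft tokens")

    emitted = []
    for draft, target in zip(draft_tokens, target_tokens):
        # The main model's own token at this position is always emitted.
        emitted.append(target)
        if draft != target:
            # Mismatch: everything derived from this draft token is discarded.
            return emitted

    # Every draft token matched, so the final main-model token is kept as well.
    emitted.append(target_tokens[-1])
    return emitted
```

For example, with `draft_tokens=[5, 7]` and `target_tokens=[5, 8, 9]`, the first draft token matches and the second does not, so the step emits `[5, 8]` and drops the rest.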
@@ -76,7 +80,7 @@ If the bonus token is accepted, the MTP model performs inference for (num_specul
- Currently, the spec_decode scenario supports only certain methods, such as ngram, eagle, eagle3, and mtp. If an unsupported value is passed for the method, the code raises an error to alert the user.
```
```python
def get_spec_decode_method(method,
vllm_config,
device,
@@ -93,9 +97,10 @@ def get_spec_decode_method(method,
```
### Integer Validation
- The current `npu_fused_infer_attention_score` operator only supports integers less than 16 per decode round, so the maximum supported value for MTP is 15. If a value greater than 15 is provided, the code raises an error to alert the user.
```
```python
if self.speculative_config:
spec_token_num = self.speculative_config.num_speculative_tokens
self.decode_threshold += spec_token_num
@@ -105,5 +110,6 @@ if self.speculative_config:
```
## Limitation
- Because DeepSeek's MTP exposes only a single layer of weights, accuracy and performance are not guaranteed when MTP > 1 (especially MTP ≥ 3). In addition, due to current operator limitations, MTP supports a maximum value of 15.
- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).
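
The two numeric limits above can be checked before launching the service. The following is a minimal sketch, not part of vLLM-Ascend: the helper `check_mtp_limits` and the constant name are hypothetical, and the capture-size rule is applied on the assumption that fullgraph mode is in use.

```python
MAX_SPECULATIVE_TOKENS = 15  # operator limit stated in the limitations above


def check_mtp_limits(num_speculative_tokens: int, capture_sizes: list[int]) -> None:
    """Raise ValueError if the settings violate the documented MTP limits."""
    if not 1 <= num_speculative_tokens <= MAX_SPECULATIVE_TOKENS:
        raise ValueError(
            f"num_speculative_tokens must be between 1 and "
            f"{MAX_SPECULATIVE_TOKENS}, got {num_speculative_tokens}"
        )

    # The multiple-of rule is only stated for MTP > 1 (fullgraph mode assumed here).
    if num_speculative_tokens > 1:
        step = num_speculative_tokens + 1
        bad_sizes = [size for size in capture_sizes if size % step != 0]
        if bad_sizes:
            raise ValueError(
                f"capture sizes {bad_sizes} are not multiples of {step}"
            )
```

For instance, `check_mtp_limits(2, [3, 6, 12])` passes because every capture size is a multiple of 3, while `check_mtp_limits(2, [4])` raises a `ValueError`.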