[main][Docs] Fix typos across documentation (#6728)
## Summary
Fix typos and improve grammar consistency across 50 documentation files.
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" → "determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded", "re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of `--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" → "Ascend")
- vLLM version: v0.15.0
- vLLM main: 9562912cea
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@@ -10,7 +10,7 @@ To enable MTP for DeepSeek-V3 models, add the following parameter when starting

--speculative_config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False}'

-- `num_speculative_tokens`: The number of speculative tokens which enable model to predict multiple tokens at once, if provided. It will default to the number in the draft model config if present, otherwise, it is required.
+- `num_speculative_tokens`: The number of speculative tokens that enables the model to predict multiple tokens at once, if provided. It will default to the number in the draft model config if present, otherwise, it is required.
- `disable_padded_drafter_batch`: Disable input padding for speculative decoding. If set to True, speculative input batches can contain sequences of different lengths, which may only be supported by certain attention backends. This currently only affects the MTP method of speculation, default is False.
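
For reference, here is a minimal offline-inference sketch of the same configuration. It is illustrative only and not part of this commit: it assumes vLLM's `speculative_config` engine argument, uses a placeholder model path, and the tensor-parallel size is arbitrary.

```python
# Hedged sketch: enable MTP speculative decoding via the speculative_config
# engine argument. The checkpoint path is a placeholder; adjust
# tensor_parallel_size to the available hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # placeholder checkpoint
    tensor_parallel_size=8,
    speculative_config={
        "method": "mtp",
        "num_speculative_tokens": 1,
        "disable_padded_drafter_batch": False,
    },
)
outputs = llm.generate(["What is speculative decoding?"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```
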
## How It Works
@@ -28,7 +28,7 @@ vllm_ascend
**1. sample**

-- *rejection_sample.py*: During decoding, the main model processes the previous round’s output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token—referred to as the **bonus token**—is uncertain since it is derived from speculative prediction, thus We employ **Greedy Strategy** and **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.
+- *rejection_sample.py*: During decoding, the main model processes the previous round’s output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token—referred to as the **bonus token**—is uncertain since it is derived from speculative prediction, thus we employ **Greedy Strategy** and **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.

```shell
rejection_sample.py
```
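
To make the greedy strategy above concrete, here is a small self-contained sketch. It is illustrative only and does not mirror the actual `AscendRejectionSampler` code: draft tokens are kept while they match the target model's argmax, the first mismatch is replaced by the target's own choice, and a fully matched draft earns the bonus token.

```python
# Minimal sketch of greedy-strategy verification (illustrative, not the
# real AscendRejectionSampler): accept drafts while they equal the target
# model's argmax choice at the same position.
import torch

def greedy_verify(target_logits: torch.Tensor,
                  draft_tokens: torch.Tensor) -> list[int]:
    # target_logits: [k+1, vocab] main-model logits for the 1+k positions
    # draft_tokens:  [k] tokens proposed by the draft (MTP) model
    target_choice = target_logits.argmax(dim=-1)          # [k+1]
    accepted: list[int] = []
    for i, draft in enumerate(draft_tokens.tolist()):
        if draft == target_choice[i].item():
            accepted.append(draft)                        # match: keep draft
        else:
            accepted.append(target_choice[i].item())      # mismatch: take target
            return accepted                               # stop verifying
    accepted.append(target_choice[-1].item())             # all matched: bonus token
    return accepted

k, vocab = 2, 16
logits = torch.randn(k + 1, vocab)
drafts = logits[:k].argmax(dim=-1)          # drafts that happen to match
print(greedy_verify(logits, drafts))        # k accepted drafts + 1 bonus token
```
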
@@ -38,9 +38,9 @@ rejection_sample.py
**2. spec_decode**

-This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token ids. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.
+This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token IDs. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.

-- *mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding where proposals are generated by deepseek mtp layer.
+- *mtp_proposer.py*: Configure vLLM-Ascend to use speculative decoding where proposals are generated by DeepSeek MTP layer.

```shell
mtp_proposer.py
```
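
The three steps named above (load the model, execute a dummy run, generate token IDs) can be pictured with a toy outline. Every name below is a hypothetical stand-in and does not reflect the real `mtp_proposer.py` API:

```python
# Illustrative outline of the spec_decode preprocessing flow:
# load model -> dummy run -> propose draft token IDs. Names are invented.
import torch

class ToyMTPProposer:
    def __init__(self, num_speculative_tokens: int, vocab_size: int = 32):
        self.k = num_speculative_tokens
        self.vocab_size = vocab_size
        self.draft = None

    def load_model(self) -> None:
        # Stand-in for loading the draft (MTP) layer weights.
        self.draft = torch.nn.Linear(8, self.vocab_size)

    def dummy_run(self, batch: int = 1) -> None:
        # Warm-up forward pass so later steps see realistic shapes.
        self.draft(torch.zeros(batch, 8))

    def propose(self, hidden: torch.Tensor) -> torch.Tensor:
        # Produce k draft token IDs for the verifier to check.
        return self.draft(hidden).argmax(dim=-1).repeat(self.k)

proposer = ToyMTPProposer(num_speculative_tokens=1)
proposer.load_model()
proposer.dummy_run()
print(proposer.propose(torch.randn(1, 8)))  # tensor of k draft token IDs
```
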
@@ -54,7 +54,7 @@ mtp_proposer.py
### Algorithm

-**1. Reject_Sample**
+**1. Rejection Sampling**

- *Greedy Strategy*
@@ -68,17 +68,17 @@ For each draft token, acceptance is determined by verifying whether the inequali
The decision logic for each draft token is as follows: if the inequality `P_target / P_draft ≥ U` holds, the draft token is accepted as output; conversely, if `P_target / P_draft < U`, the draft token is rejected.

-When a draft token is rejected, a recovery sampling process is triggered where a "recovered token" is resampled from the adjusted probability distribution defined as `Q = max(P_target - P_draft, 0)`. In the current MTP implementation, since `P_draft` is not provided and defaults to 1, the formulas simplify such that token acceptance occurs when `P_target ≥ U,` and the recovery distribution becomes `Q = max(P_target - 1, 0)`.
+When a draft token is rejected, a recovery sampling process is triggered where a "recovered token" is resampled from the adjusted probability distribution defined as `Q = max(P_target - P_draft, 0)`. In the current MTP implementation, since `P_draft` is not provided and defaults to 1, the formulas simplify such that token acceptance occurs when `P_target ≥ U` and the recovery distribution becomes `Q = max(P_target - 1, 0)`.
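
As a concrete illustration of the rule above, the following standalone sketch (assumed shapes and names, not the production sampler) accepts a draft token when `P_target / P_draft ≥ U` and otherwise resamples a recovered token from the renormalized distribution `Q`:

```python
# Illustrative rejection-sampling step: accept when
# P_target / P_draft >= U with U ~ Uniform(0, 1); on rejection,
# resample from Q = max(P_target - P_draft, 0), renormalized.
import torch

def rejection_step(p_target: torch.Tensor,
                   p_draft: torch.Tensor,
                   draft_token: int) -> int:
    u = torch.rand(())
    if p_target[draft_token] / p_draft[draft_token] >= u:
        return draft_token                        # draft token accepted
    q = torch.clamp(p_target - p_draft, min=0.0)  # recovery distribution
    q = q / q.sum()                               # renormalize (distributions assumed to differ)
    return int(torch.multinomial(q, 1))           # recovered token

vocab = 8
p_target = torch.softmax(torch.randn(vocab), dim=-1)
p_draft = torch.softmax(torch.randn(vocab), dim=-1)
print(rejection_step(p_target, p_draft, draft_token=3))
```

Under the MTP simplification described above (`P_draft` defaulting to 1), the same acceptance test reduces to `P_target ≥ U`.
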
**2. Performance**

-If the bonus token is accepted, the MTP model performs inference for (num_speculative +1) tokens, including original main model output token and bonus token. If rejected, inference is performed for less token, determining on how many tokens accepted.
+If the bonus token is accepted, the MTP model performs inference for (num_speculative + 1) tokens, including original main model output token and bonus token. If rejected, inference is performed for fewer tokens, depending on how many tokens are accepted.
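
For instance, reading the formula above with `num_speculative_tokens = 1`: an accepted bonus token means the next step covers 1 + 1 = 2 tokens, while a rejection shrinks that count to however many tokens were actually accepted.
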
## DFX
### Method Validation

-- Currently, the spec_decode scenario only supports methods such as ngram, eagle, eagle3, and mtp. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.
+- Currently, the spec_decode scenario only supports methods such as n-gram, EAGLE, EAGLE3, and MTP. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.

```python
def get_spec_decode_method(method,
```
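
A minimal sketch of this validation pattern is shown below. It is illustrative only; the real `get_spec_decode_method` in vllm-ascend dispatches to the actual proposer implementations:

```python
# Illustrative method validation (not the real vllm-ascend code):
# unknown method names fail fast with an explicit error.
SUPPORTED_METHODS = {"ngram", "eagle", "eagle3", "mtp"}

def get_spec_decode_method_sketch(method: str) -> str:
    if method not in SUPPORTED_METHODS:
        raise ValueError(
            f"Unknown speculative decoding method: {method!r}. "
            f"Expected one of {sorted(SUPPORTED_METHODS)}.")
    return method  # the real function would return a proposer object

get_spec_decode_method_sketch("mtp")       # ok
# get_spec_decode_method_sketch("medusa")  # would raise ValueError
```
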
@@ -112,4 +112,4 @@ if self.speculative_config:
## Limitations
- Due to the fact that only a single layer of weights is exposed in DeepSeek's MTP, the accuracy and performance are not effectively guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15.
-- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).
+- In the fullgraph mode with MTP > 1, the capture size of each ACLGraph must be an integer multiple of (num_speculative_tokens + 1).
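
As a quick check of that constraint: with `num_speculative_tokens = 2`, each captured graph size must be a multiple of 2 + 1 = 3 (3, 6, 9, ...), so sizes such as 4 or 8 would not be valid in this mode.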