[Doc] Add the setting description of cudagraph_capture_sizes in speculative decoding user guide (#5637)

### What this PR does / why we need it?
Add the setting description of `cudagraph_capture_sizes` to guide users away
from the common mistakes made when using EAGLE together with fullgraph mode.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need for testing
- vLLM version: v0.13.0
- vLLM main:
8be6432bda

---------

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@@ -84,6 +84,11 @@ A few important things to consider when using the EAGLE based draft models:
3. When using EAGLE-3 based draft model, option "method" must be set to "eagle3".
That is, to specify `"method": "eagle3"` in `speculative_config`.
4. After enabling EAGLE, the main model must verify `(1 + K)` tokens in each decoding step: one token produced by the main model plus the `K` tokens proposed by the draft model.
Because fullgraph mode fixes the number of tokens processed during the verification stage,
`cudagraph_capture_sizes` must be a list of capture sizes, where each size is `n * (K + 1)` for every batch size `n` you want to support.
For instance, to support batch sizes from 1 to 4 with `num_speculative_tokens = 4`, `cudagraph_capture_sizes` should be set to `[5, 10, 15, 20]`. A configuration sketch follows this list.
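A minimal sketch of such a configuration, assuming the target and draft model paths shown here are placeholders and that `cudagraph_capture_sizes` is passed through `compilation_config` (adjust both to your deployment):

```python
from vllm import LLM, SamplingParams

# Hedged example: model names/paths below are illustrative placeholders.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # target (main) model, placeholder
    speculative_config={
        "method": "eagle3",                      # EAGLE-3 based draft model
        "model": "path/to/eagle3-draft-model",   # placeholder draft model path
        "num_speculative_tokens": 4,             # K = 4
    },
    compilation_config={
        # With K = 4, each verification step handles n * (K + 1) tokens,
        # so batch sizes 1..4 map to capture sizes [5, 10, 15, 20].
        "cudagraph_capture_sizes": [5, 10, 15, 20],
    },
)

outputs = llm.generate(["The future of AI is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```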
## Speculating using MTP speculators
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi-Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP, see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html)