This guide shows how to use Speculative Decoding with vLLM Ascend. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.
In v0.12.0rc1 of vLLM Ascend, the async scheduler is stable and ready to be enabled. We have adapted it to support EAGLE; you can enable it by setting `async_scheduling=True` as follows. If you encounter any issues, please open an issue on GitHub. As a workaround, you can disable this feature by removing `async_scheduling=True` when initializing the model.
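A minimal configuration sketch for EAGLE with the async scheduler (the target and draft model names below are placeholders, not recommendations; substitute your own checkpoints):

```python
from vllm import LLM, SamplingParams

# Sketch only: model names are examples; pick an EAGLE draft head trained
# for your target model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # EAGLE draft model
        "num_speculative_tokens": 4,                  # K draft tokens per step
    },
    async_scheduling=True,  # remove this argument to disable the async scheduler
)

outputs = llm.generate(["The future of AI is"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```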
4. After enabling EAGLE, the main model must verify `(1 + K)` tokens in each decoding step: one token sampled by the main model plus the `K` tokens proposed by the draft model. Because full-graph mode fixes the number of tokens at the verification stage, `cudagraph_capture_sizes` must be a list of capture sizes, where each size is `n * (K + 1)` for each batch size `n` you want to support. For instance, to support batch sizes 1 to 4 with `num_speculative_tokens = 4`, set `cudagraph_capture_sizes` to `[5, 10, 15, 20]`.
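The sizing rule above can be sketched as a small helper (the function name is illustrative, not part of vLLM Ascend):

```python
def eagle_capture_sizes(max_batch_size: int, num_speculative_tokens: int) -> list[int]:
    """Capture sizes for full-graph mode: each supported batch size n needs
    a graph covering n * (K + 1) tokens at the verification stage."""
    k = num_speculative_tokens
    return [n * (k + 1) for n in range(1, max_batch_size + 1)]

# Batch sizes 1..4 with K = 4 speculative tokens:
print(eagle_capture_sizes(4, 4))  # [5, 10, 15, 20]
```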
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi-Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP, see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html).
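A hedged sketch of an MTP configuration. The `method` value and model name here are assumptions (DeepSeek-V3-style checkpoints ship a built-in MTP module whose weights the speculative config reuses); verify both against the linked MTP feature guide:

```python
from vllm import LLM

# Sketch only: model name and method string are assumptions, not confirmed API.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # example model with a built-in MTP head
    tensor_parallel_size=8,
    speculative_config={
        "method": "deepseek_mtp",     # assumed method name; see the MTP guide
        "num_speculative_tokens": 1,  # one extra token predicted per step
    },
)
```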
The following code configures vLLM to use speculative decoding where proposals are generated using Suffix Decoding [(SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications)](https://arxiv.org/abs/2411.04975).
Like n-gram, Suffix Decoding can generate draft tokens by pattern-matching using the last `n` generated tokens. Unlike n-gram, Suffix Decoding (1) can pattern-match against both the prompt and previous generations, (2) uses frequency counts to propose the most likely continuations, and (3) speculates an adaptive number of tokens for each request at each iteration to get better acceptance rates.
Suffix Decoding can achieve better performance for tasks with high repetition, such as code-editing, agentic loops (e.g. self-reflection, self-consistency), and RL rollouts.
> [!NOTE]
> Suffix Decoding requires Arctic Inference. You can install it with `pip install arctic-inference`.
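A minimal sketch of a Suffix Decoding configuration, assuming Arctic Inference is installed and that `"suffix"` is the `method` value it registers (the target model name is a placeholder):

```python
from vllm import LLM

# Sketch only: requires `pip install arctic-inference` (see note above).
# No draft model is needed; drafts come from suffix pattern-matching against
# the prompt and previous generations.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example target model
    speculative_config={
        "method": "suffix",  # assumed method name registered by Arctic Inference
    },
)
```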