This guide shows how to use Speculative Decoding with vLLM Ascend. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.
In v0.12.0rc1 of vLLM Ascend, the async scheduler is stable and ready to be enabled. We have adapted it to support EAGLE; you can enable it by setting `async_scheduling=True` as follows. If you encounter any issues, please open an issue on GitHub. As a workaround, you can disable this feature by removing `async_scheduling=True` when initializing the model.
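A minimal configuration sketch for EAGLE with the async scheduler (the target and draft model names below are placeholders, not recommendations; substitute your own checkpoints):

```python
from vllm import LLM, SamplingParams

# Sketch only: model names are examples; pick an EAGLE draft head trained
# for your target model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # EAGLE draft model
        "num_speculative_tokens": 4,                  # K draft tokens per step
    },
    async_scheduling=True,  # remove this argument to disable the async scheduler
)

outputs = llm.generate(["The future of AI is"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```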
4. After enabling EAGLE, the main model must verify `(1 + K)` tokens in each decoding step: one token sampled by the main model plus the `K` tokens proposed by the draft model. Because full-graph mode fixes the number of tokens at the verification stage, `cudagraph_capture_sizes` must be a list of capture sizes, where each size is `n * (K + 1)` for each batch size `n` you want to support. For instance, to support batch sizes 1 to 4 with `num_speculative_tokens = 4`, set `cudagraph_capture_sizes` to `[5, 10, 15, 20]`.
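The sizing rule above can be sketched as a small helper (the function name is illustrative, not part of vLLM Ascend):

```python
def eagle_capture_sizes(max_batch_size: int, num_speculative_tokens: int) -> list[int]:
    """Capture sizes for full-graph mode: each supported batch size n needs
    a graph covering n * (K + 1) tokens at the verification stage."""
    k = num_speculative_tokens
    return [n * (k + 1) for n in range(1, max_batch_size + 1)]

# Batch sizes 1..4 with K = 4 speculative tokens:
print(eagle_capture_sizes(4, 4))  # [5, 10, 15, 20]
```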
The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi-Token Prediction), boosting inference performance by parallelizing the prediction of multiple tokens. For more information about MTP, see [Multi_Token_Prediction](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html).
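A hedged sketch of an MTP configuration. The `method` value and model name here are assumptions (DeepSeek-V3-style checkpoints ship a built-in MTP module whose weights the speculative config reuses); verify both against the linked MTP feature guide:

```python
from vllm import LLM

# Sketch only: model name and method string are assumptions, not confirmed API.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # example model with a built-in MTP head
    tensor_parallel_size=8,
    speculative_config={
        "method": "deepseek_mtp",     # assumed method name; see the MTP guide
        "num_speculative_tokens": 1,  # one extra token predicted per step
    },
)
```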
The following code configures vLLM to use speculative decoding where proposals are generated using Suffix Decoding [(SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications)](https://arxiv.org/abs/2411.04975).
Like n-gram, Suffix Decoding can generate draft tokens by pattern-matching using the last `n` generated tokens. Unlike n-gram, Suffix Decoding (1) can pattern-match against both the prompt and previous generations, (2) uses frequency counts to propose the most likely continuations, and (3) speculates an adaptive number of tokens for each request at each iteration to get better acceptance rates.
Suffix Decoding can achieve better performance for tasks with high repetition, such as code-editing, agentic loops (e.g. self-reflection, self-consistency), and RL rollouts.
> [!NOTE]
> Suffix Decoding requires Arctic Inference. You can install it with `pip install arctic-inference`.
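A minimal sketch of a Suffix Decoding configuration, assuming Arctic Inference is installed and that `"suffix"` is the `method` value it registers (the target model name is a placeholder):

```python
from vllm import LLM

# Sketch only: requires `pip install arctic-inference` (see note above).
# No draft model is needed; drafts come from suffix pattern-matching against
# the prompt and previous generations.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example target model
    speculative_config={
        "method": "suffix",  # assumed method name registered by Arctic Inference
    },
)
```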