### What this PR does / why we need it?
Suffix Decoding is a CPU-based speculative decoding optimization. It
accelerates inference by matching patterns in both the prompt and the
generated content and predicting upcoming tokens from how often they
followed those patterns.

This PR adds a document that provides a step-by-step guide for deploying
and evaluating **Suffix Speculative Decoding** on the **Ascend** platform.
The guide analyzes the performance gains across diverse datasets to show
how much the technique accelerates inference, with the goal of helping
developers optimize models efficiently on Ascend hardware.
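
To make the "pattern matching and frequency-based prediction" idea above concrete, here is a minimal toy sketch in Python. It is not the implementation used by vLLM or vLLM Ascend: the function names `build_suffix_table` and `propose_draft`, the `max_context` and `num_draft` parameters, and the flat dictionary (used here in place of a real suffix tree) are all assumptions made for illustration. It only shows how draft tokens can be proposed on the CPU from previously seen text. In actual speculative decoding, the target model then verifies the proposed tokens.

```python
from collections import Counter, defaultdict


def build_suffix_table(tokens, max_context=3):
    """Count which token followed each context of 1..max_context tokens.

    Hypothetical helper: a flat dict stands in for a suffix tree.
    """
    table = defaultdict(Counter)
    for i in range(1, len(tokens)):
        for n in range(1, max_context + 1):
            if i - n < 0:
                break
            context = tuple(tokens[i - n:i])
            table[context][tokens[i]] += 1
    return table


def propose_draft(table, tokens, num_draft=4, max_context=3):
    """Greedily propose up to num_draft tokens.

    At each step, match the longest known suffix of the history and
    pick the token that followed it most frequently.
    """
    draft = []
    history = list(tokens)
    for _ in range(num_draft):
        next_token = None
        # Try the longest suffix first, then fall back to shorter ones.
        for n in range(min(max_context, len(history)), 0, -1):
            counts = table.get(tuple(history[-n:]))
            if counts:
                next_token = counts.most_common(1)[0][0]
                break
        if next_token is None:
            break  # No matching pattern: stop speculating.
        draft.append(next_token)
        history.append(next_token)
    return draft


if __name__ == "__main__":
    # Prompt and generated content are treated as one token stream.
    seen = "the cat sat on the mat the cat".split()
    table = build_suffix_table(seen)
    print(propose_draft(table, seen, num_draft=3))
```

In this sketch the table is built from a single token sequence. Repetitive workloads, where later text reuses earlier spans, are the case in which such a lookup is most likely to propose tokens the target model accepts.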
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main: dc917cceb8
---------
Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
# Tutorials
:::{toctree}
:caption: Models
:maxdepth: 1

Qwen2.5-Omni.md
Qwen2.5-7B.md
Qwen3-Dense.md
Qwen-VL-Dense.md
Qwen3-30B-A3B.md
Qwen3-235B-A22B.md
Qwen3-VL-30B-A3B-Instruct.md
Qwen3-VL-235B-A22B-Instruct.md
Qwen3-Coder-30B-A3B.md
Qwen3_embedding.md
Qwen3-VL-Embedding.md
Qwen3_reranker.md
Qwen3-VL-Reranker.md
Qwen3-8B-W4A8.md
Qwen3-32B-W4A4.md
Qwen3-Next.md
Qwen3-Omni-30B-A3B-Thinking.md
DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
GLM4.x.md
Kimi-K2-Thinking.md
PaddleOCR-VL.md
:::

:::{toctree}
:caption: Features
:maxdepth: 1

pd_colocated_mooncake_multi_instance.md
pd_disaggregation_mooncake_single_node.md
pd_disaggregation_mooncake_multi_node.md
long_sequence_context_parallel_single_node.md
long_sequence_context_parallel_multi_node.md
suffix_speculative_decoding.md
ray
:::

:::{toctree}
:caption: Hardware
:maxdepth: 1

310p.md
:::