### What this PR does / why we need it?
Suffix Decoding is a CPU-based speculative decoding optimization. It
accelerates inference by matching patterns in both the prompt and the
generated content and predicting upcoming tokens from their frequencies.
This PR adds a document that provides a step-by-step guide for deploying
and evaluating **Suffix Speculative Decoding** on the **Ascend** platform.
The guide analyzes performance gains across diverse datasets to show how
much this technique accelerates inference. The goal is to help developers
optimize models efficiently on Ascend hardware.
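
To illustrate the pattern-matching and frequency-based prediction idea described above, here is a minimal toy sketch in Python. It is not the vLLM or vllm-ascend implementation: the function names `build_index` and `propose_draft`, the `max_suffix_len` parameter, and the dictionary-based index are all assumptions made for illustration. In a real speculative decoding setup the drafted tokens would still be verified by the target model before being accepted.

```python
from collections import Counter, defaultdict


def build_index(tokens, max_suffix_len=3):
    """Map every context of 1..max_suffix_len tokens to counts of the token that followed it."""
    index = defaultdict(Counter)
    for i in range(1, len(tokens)):
        for n in range(1, max_suffix_len + 1):
            if i - n < 0:
                break
            index[tuple(tokens[i - n:i])][tokens[i]] += 1
    return index


def propose_draft(index, tokens, num_draft=4, max_suffix_len=3):
    """Propose up to num_draft tokens: longest matching suffix first, most frequent continuation."""
    context = list(tokens)
    draft = []
    for _ in range(num_draft):
        next_token = None
        # Try the longest suffix of the current context, then fall back to shorter ones.
        for n in range(min(max_suffix_len, len(context)), 0, -1):
            counts = index.get(tuple(context[-n:]))
            if counts:
                next_token = counts.most_common(1)[0][0]
                break
        if next_token is None:
            break  # no known pattern: stop drafting
        draft.append(next_token)
        context.append(next_token)
    return draft


# Hypothetical token history standing in for prompt + generated content.
history = "the cat sat on the mat and the cat sat on the".split()
index = build_index(history)
draft = propose_draft(index, history, num_draft=3)
print(draft)
```

The sketch only uses CPU-side dictionary lookups, which is why this style of drafting needs no separate draft model; repetitive workloads tend to yield longer matches and therefore more drafted tokens per step.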
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main: dc917cceb8
---------
Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
# Tutorials

:::{toctree}
:caption: Models
:maxdepth: 1
Qwen2.5-Omni.md
Qwen2.5-7B.md
Qwen3-Dense.md
Qwen-VL-Dense.md
Qwen3-30B-A3B.md
Qwen3-235B-A22B.md
Qwen3-VL-30B-A3B-Instruct.md
Qwen3-VL-235B-A22B-Instruct.md
Qwen3-Coder-30B-A3B.md
Qwen3_embedding.md
Qwen3-VL-Embedding.md
Qwen3_reranker.md
Qwen3-VL-Reranker.md
Qwen3-8B-W4A8.md
Qwen3-32B-W4A4.md
Qwen3-Next.md
Qwen3-Omni-30B-A3B-Thinking.md
DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md
GLM4.x.md
Kimi-K2-Thinking.md
PaddleOCR-VL.md
:::

:::{toctree}
:caption: Features
:maxdepth: 1
pd_colocated_mooncake_multi_instance.md
pd_disaggregation_mooncake_single_node.md
pd_disaggregation_mooncake_multi_node.md
long_sequence_context_parallel_single_node.md
long_sequence_context_parallel_multi_node.md
suffix_speculative_decoding.md
ray
:::

:::{toctree}
:caption: Hardware
:maxdepth: 1
310p.md
:::