[Lint]Style: reformat markdown files via markdownlint (#5884)

### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
This commit is contained in:
SILONG ZENG
2026-01-15 09:06:01 +08:00
committed by GitHub
parent 96edd4673f
commit 4811ba62e0
75 changed files with 711 additions and 308 deletions

@@ -5,6 +5,7 @@
This guide shows how to use Context Parallel, a long-sequence inference optimization technique. Context Parallel includes `PCP` (Prefill Context Parallel) and `DCP` (Decode Context Parallel), which reduce NPU memory usage and improve inference speed in long-sequence LLM inference.
## Benefits of Context Parallel
Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite different characteristics and have quite different SLOs (service level objectives), we need to implement context parallel separately for them. The major considerations are:
- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
@@ -13,13 +14,16 @@ Context parallel mainly solves the problem of serving long context requests. As
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).
## Supported Scenarios
Currently, context parallel can be used together with most other features. The supported features are as follows:
| | Eager | Graph | Prefix <br> Cache | Chunked <br> Prefill | SpecDecode <br> (MTP) | PD <br> disaggregation | MLAPO |
| ------- | ----- | ----- | ------ | ------ | ----- | ----- | ----- |
| **PCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **DCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## How to use Context Parallel
You can enable `PCP` and `DCP` with `prefill_context_parallel_size` and `decode_context_parallel_size`. Refer to the following examples:
- Offline example:
@@ -53,6 +57,7 @@ You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and `decode_co
The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`, so each of the examples above needs 4 NPUs.
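
The world-size arithmetic above can be sketched in a few lines of Python. The helper name `required_npus` and the sizes passed to it are illustrative assumptions, not part of vLLM and not necessarily the values used in the examples of this guide. The only rule taken from the text is that the world size is the product of `tensor_parallel_size` and `prefill_context_parallel_size`.

```python
def required_npus(tensor_parallel_size: int,
                  prefill_context_parallel_size: int = 1) -> int:
    """Return the total world size (number of NPUs) for one instance.

    Hypothetical helper for illustration only. It follows the rule stated
    in this guide: world_size = tensor_parallel_size * prefill_context_parallel_size.
    """
    if tensor_parallel_size < 1 or prefill_context_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * prefill_context_parallel_size


if __name__ == "__main__":
    # Illustrative sizes: every pair whose product is 4 needs 4 NPUs.
    for tp, pcp in [(4, 1), (2, 2), (1, 4)]:
        print(f"TP={tp} PCP={pcp} -> {required_npus(tp, pcp)} NPUs")
```

Note that `decode_context_parallel_size` does not appear in the product, consistent with the formula above.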
## Constraints
- When using DCP, the following constraints must be met:
- For MLA-based models, such as DeepSeek-R1:
- `tensor_parallel_size >= decode_context_parallel_size`
@@ -63,7 +68,7 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`
- When using Context Parallel in scenarios that require KV cache transfer (e.g. KV pooling, PD disaggregation), `cp_kv_cache_interleave_size` must be set to the same value as the KV cache `block_size` (default: 128) to simplify KV cache transmission. This setting makes CP split the KV cache in a block-interleaved style. For example:
```shell
vllm serve deepseek-ai/DeepSeek-V2-Lite \
--tensor-parallel-size 2 \
--decode-context-parallel-size 2 \
@@ -73,16 +78,19 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`
```
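
To make the block-interleave idea concrete, the following sketch maps token positions to context-parallel ranks. It assumes a simple round-robin layout in which each run of `interleave_size` consecutive tokens goes to the next rank. The function name `cp_rank_for_token` and the rank count used here are hypothetical, and the actual placement logic in vLLM may differ in detail.

```python
def cp_rank_for_token(token_idx: int, interleave_size: int, cp_world_size: int) -> int:
    """Return the CP rank that would hold the KV cache entry of a token.

    Assumed layout (illustration only): tokens are grouped into chunks of
    `interleave_size`, and chunks are dealt to ranks round-robin.
    """
    if token_idx < 0:
        raise ValueError("token_idx must be non-negative")
    if interleave_size < 1 or cp_world_size < 1:
        raise ValueError("interleave_size and cp_world_size must be positive")
    return (token_idx // interleave_size) % cp_world_size


if __name__ == "__main__":
    block_size = 128   # default KV cache block size mentioned in this guide
    cp_world_size = 2  # illustrative number of CP ranks
    for idx in (0, 127, 128, 255, 256):
        rank = cp_rank_for_token(idx, block_size, cp_world_size)
        print(f"token {idx:>3} -> rank {rank}")
```

Under this assumed layout, setting the interleave size equal to `block_size` means every KV cache block belongs entirely to one rank, so whole blocks can be transferred without being split.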
## Experimental Results
To evaluate the effectiveness of Context Parallel in long-sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B** and deploy PD-disaggregated instances on 64 Ascend 910C*64G (A3) cards. The configuration and performance data are as follows.
- DeepSeek-R1-W8A8:
| Configuration | Input length <br> 32k | Input length <br> 64k | Input length <br> 128k |
| ----------------------------- | ------------------------- | ------------------------- | ------------------------- |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 9.3s <br> TPOT: 72ms | TTFT: 22.8s <br> TPOT: 74ms | TTFT: 73.2s <br> TPOT: 82ms |
| P node: (PCP2 TP8 DCP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 7.9s <br> TPOT: 74ms | TTFT: 15.9s <br> TPOT: 78ms | TTFT: 46.0s <br> TPOT: 83ms |
- Qwen3-235B:
| Configuration | Input length <br> 32k | Input length <br> 64k | Input length <br> 120k |
| ----------------------------- | ------------------------- | ------------------------- | ------------------------- |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 5.1s <br> TPOT: 65ms | TTFT: 13.1s <br> TPOT: 85ms | TTFT: 33.9s <br> TPOT: 120ms |
| P node: (PCP2 TP8 DCP2 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 3.0s <br> TPOT: 66ms | TTFT: 8.9s <br> TPOT: 86ms | TTFT: 22.7s <br> TPOT: 121ms |
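
The relative TTFT improvement implied by the tables above can be computed with a short script. This is only a convenience sketch: the dictionary name `TTFT_SECONDS` and the helper `ttft_reduction_percent` are made up for illustration, while the numbers are copied from the tables.

```python
# TTFT values in seconds, copied from the tables in this section:
# input length -> (baseline without CP, with PCP/DCP enabled)
TTFT_SECONDS = {
    "DeepSeek-R1-W8A8": {"32k": (9.3, 7.9), "64k": (22.8, 15.9), "128k": (73.2, 46.0)},
    "Qwen3-235B": {"32k": (5.1, 3.0), "64k": (13.1, 8.9), "120k": (33.9, 22.7)},
}


def ttft_reduction_percent(baseline_s: float, cp_s: float) -> float:
    """Return the TTFT reduction of the CP setup relative to the baseline, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline TTFT must be positive")
    return (baseline_s - cp_s) / baseline_s * 100.0


if __name__ == "__main__":
    for model, rows in TTFT_SECONDS.items():
        for input_len, (baseline_s, cp_s) in rows.items():
            pct = ttft_reduction_percent(baseline_s, cp_s)
            print(f"{model} @ {input_len}: {baseline_s}s -> {cp_s}s ({pct:.1f}% lower TTFT)")
```

In the same tables, TPOT differs by at most 4 ms between the two configurations.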