diff --git a/docs/source/developer_guide/feature_guide/context_parallel.md b/docs/source/developer_guide/feature_guide/context_parallel.md index fda8aa2d..c36c9699 100644 --- a/docs/source/developer_guide/feature_guide/context_parallel.md +++ b/docs/source/developer_guide/feature_guide/context_parallel.md @@ -20,8 +20,7 @@ Its main objective is to eliminate duplicated storage of the KV cache by shardin DCP primarily influences the Decode logic, as well as the logic for chunked prefill and cached prefill. ## How to Use CP? - -Please refer to the CP section of the user guide for detailed information: [How to Use CP] +Please refer to the [context parallel user guide](../../user_guide/feature_guide/context_parallel.md) for detailed information. ## How It Works? diff --git a/docs/source/user_guide/feature_guide/context_parallel.md b/docs/source/user_guide/feature_guide/context_parallel.md new file mode 100644 index 00000000..c7df7df8 --- /dev/null +++ b/docs/source/user_guide/feature_guide/context_parallel.md @@ -0,0 +1,88 @@ +# Context Parallel Guide + +## Overview + +This guide shows how to use Context Parallel, a long sequence inference optimization technique. Context Parallel includes `PCP` (Prefill Context Parallel) and `DCP` (Decode Context Parallel), which reduce NPU memory usage and improve inference speed in long sequence LLM inference. + +## Benefits of Context Parallel +Context parallel mainly solves the problem of serving long context requests. Since prefill and decode have quite different characteristics and SLOs (service level objectives), context parallel is implemented separately for each. The major considerations are: + +- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
+- For long context decode, we can use context parallel to reduce KV cache duplication and offer more space for KV cache to increase the batch size (and hence the throughput). + +To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md). + +## Supported Scenarios +Currently, context parallel can be used together with most other features; the supported combinations are as follows: | | Eager | Graph | Prefix<br>Cache | Chunked<br>Prefill | SpecDecode<br>(MTP) | PD<br>disaggregation | MLAPO | + | ------- | ----- | ----- | ------ | ------ | ----- | ----- | ----- | + | **PCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | + | **DCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | + +## How to use Context Parallel +You can enable `PCP` and `DCP` via `prefill_context_parallel_size` and `decode_context_parallel_size`; refer to the following examples: + +- Offline example: + + ```python + from vllm import LLM, SamplingParams + + prompts = [ + "The future of AI is", + ] + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + llm = LLM( + model="deepseek-ai/DeepSeek-V2-Lite", + tensor_parallel_size=2, + decode_context_parallel_size=2, + prefill_context_parallel_size=2, + ) + outputs = llm.generate(prompts, sampling_params) + ``` + +- Online example: + + ```bash + vllm serve deepseek-ai/DeepSeek-V2-Lite \ + --tensor-parallel-size 2 \ + --decode-context-parallel-size 2 \ + --prefill-context-parallel-size 2 + ``` + +The total `world_size` is `tensor_parallel_size` * `prefill_context_parallel_size`, so each of the examples above needs 4 NPUs. + +## Constraints +- When using DCP, the following constraints must be met: + - For MLA-based models, such as DeepSeek-R1: + - `tensor_parallel_size >= decode_context_parallel_size` + - `tensor_parallel_size % decode_context_parallel_size == 0` + - For GQA-based models, such as Qwen3-235B: + - `(tensor_parallel_size // num_key_value_heads) >= decode_context_parallel_size` + - `(tensor_parallel_size // num_key_value_heads) % decode_context_parallel_size == 0` + +- When using Context Parallel in scenarios that require KV cache transfer (e.g. KV pooling, PD disaggregation), `cp_kv_cache_interleave_size` must be set to the same value as the KV cache `block_size` (default: 128) to simplify KV cache transmission; this makes CP split the KV cache in a block-interleaved style.
For example: + ```bash + vllm serve deepseek-ai/DeepSeek-V2-Lite \ + --tensor-parallel-size 2 \ + --decode-context-parallel-size 2 \ + --prefill-context-parallel-size 2 \ + --cp-kv-cache-interleave-size 128 \ + --kv-transfer-config {...} + ``` + +## Experimental Results +To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B** and deploy PD-disaggregated instances in an environment of 64 Ascend 910C 64GB (A3) cards. The configurations and performance data are as follows. + +- DeepSeek-R1-W8A8: + | Configuration | Input length<br>32k | Input length<br>64k | Input length<br>128k | + | ----------------------------- | ------------------------- | ------------------------- | ------------------------- | + | P node: (DP2 TP8 EP16) *2<br>D node: (DP32 EP32) *1 | TTFT: 9.3s<br>TPOT: 72ms | TTFT: 22.8s<br>TPOT: 74ms | TTFT: 73.2s<br>TPOT: 82ms | + | P node: (PCP2 TP8 DCP8 EP16) *2<br>D node: (DP32 EP32) *1 | TTFT: 7.9s<br>TPOT: 74ms | TTFT: 15.9s<br>TPOT: 78ms | TTFT: 46.0s<br>TPOT: 83ms | + +- Qwen3-235B: + | Configuration | Input length<br>32k | Input length<br>64k | Input length<br>120k | + | ----------------------------- | ------------------------- | ------------------------- | ------------------------- | + | P node: (DP2 TP8 EP16) *2<br>D node: (DP32 EP32) *1 | TTFT: 5.1s<br>TPOT: 65ms | TTFT: 13.1s<br>TPOT: 85ms | TTFT: 33.9s<br>TPOT: 120ms | + | P node: (PCP2 TP8 DCP2 EP16) *2<br>D node: (DP32 EP32) *1 | TTFT: 3.0s<br>TPOT: 66ms | TTFT: 8.9s<br>TPOT: 86ms | TTFT: 22.7s<br>TPOT: 121ms | diff --git a/docs/source/user_guide/support_matrix/supported_features.md b/docs/source/user_guide/support_matrix/supported_features.md index 5b8a1461..bcfe18fc 100644 --- a/docs/source/user_guide/support_matrix/supported_features.md +++ b/docs/source/user_guide/support_matrix/supported_features.md @@ -28,6 +28,7 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th | Quantization | 🟢 Functional | W8A8 available; working on more quantization method support (W4A8, etc) | | Graph Mode | 🔵 Experimental| Experimental, see detailed note: [vllm-ascend#767][graph_mode] | | Sleep Mode | 🟢 Functional | | +| Context Parallel | 🔵 Experimental| | - 🟢 Functional: Fully operational, with ongoing optimizations. - 🔵 Experimental: Experimental support, interfaces and functions may change.
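The DCP sizing rules listed in the new context parallel guide above can be sanity-checked before launching a server. The helper below is an illustrative sketch, not part of vllm-ascend: the function name `check_dcp_constraints` and its arguments are hypothetical, and the checks simply transcribe the guide's Constraints section (MLA-based vs. GQA-based models).

```python
def check_dcp_constraints(tensor_parallel_size, decode_context_parallel_size,
                          num_key_value_heads=None):
    """Check the DCP constraints from the context parallel user guide.

    Pass num_key_value_heads for GQA-based models (e.g. Qwen3-235B);
    leave it as None for MLA-based models (e.g. DeepSeek-R1).
    """
    tp, dcp = tensor_parallel_size, decode_context_parallel_size
    if num_key_value_heads is None:
        # MLA-based models: DCP must not exceed TP and must divide it evenly.
        ok = tp >= dcp and tp % dcp == 0
    else:
        # GQA-based models: the TP share per KV head must be a
        # positive multiple of DCP.
        share = tp // num_key_value_heads
        ok = share >= dcp and share % dcp == 0
    if not ok:
        raise ValueError(
            f"decode_context_parallel_size={dcp} is incompatible with "
            f"tensor_parallel_size={tp}")
    return ok

# MLA example: tp=16, dcp=8 satisfies 16 >= 8 and 16 % 8 == 0.
check_dcp_constraints(16, 8)
```

Note that only `PCP` grows the total `world_size` (`tensor_parallel_size * prefill_context_parallel_size`); `DCP` must fit inside the existing sizes, which is why the divisibility checks above apply to it.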
diff --git a/docs/source/user_guide/support_matrix/supported_models.md b/docs/source/user_guide/support_matrix/supported_models.md index 3e38d2a1..66b5c062 100644 --- a/docs/source/user_guide/support_matrix/supported_models.md +++ b/docs/source/user_guide/support_matrix/supported_models.md @@ -6,41 +6,41 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160 ### Generative Models -| Model | Support | Note | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Cache | LoRA | Speculative Decoding | Async Scheduling | Tensor Parallel | Pipeline Parallel | Expert Parallel | Data Parallel | Prefill-decode Disaggregation | Piecewise AclGraph | Fullgraph AclGraph | max-model-len | MLP Weight Prefetch | Doc | -|-------------------------------|-----------|----------------------------------------------------------------------|------|--------------------|------|-----------------|------------------------|------|----------------------|------------------|-----------------|-------------------|-----------------|---------------|-------------------------------|--------------------|--------------------|---------------|---------------------|-----| -| DeepSeek V3/3.1 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 240k || [DeepSeek-V3.1](../../tutorials/DeepSeek-V3.1.md) | -| DeepSeek V3.2 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 160k | ✅ | [DeepSeek-V3.2](../../tutorials/DeepSeek-V3.2.md) | -| DeepSeek R1 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 128k || [DeepSeek-R1](../../tutorials/DeepSeek-R1.md) | -| DeepSeek Distill (Qwen/Llama) | ✅ | ||||||||||||||||||| -| Qwen3 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ ||| ✅ || ✅ | ✅ | 128k | ✅ | [Qwen3-Dense](../../tutorials/Qwen3-Dense.md) | -| Qwen3-based | ✅ | ||||||||||||||||||| -| Qwen3-Coder | ✅ | | ✅ | A2/A3 ||✅|✅|✅|||✅|✅|✅|✅||||||[Qwen3-Coder-30B-A3B tutorial](../../tutorials/Qwen3-Coder-30B-A3B.md)| -| Qwen3-Moe | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | 256k || [Qwen3-235B-A22B](../../tutorials/Qwen3-235B-A22B.md) | -| Qwen3-Next | ✅ | | ✅ | A2/A3 | ✅ |||||| ✅ ||| ✅ || ✅ | ✅ ||| [Qwen3-Next](../../tutorials/Qwen3-Next.md) | -| Qwen2.5 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ |||| ✅ ||| ✅ |||||| [Qwen2.5-7B](../../tutorials/Qwen2.5-7B.md) | -| Qwen2 | ✅ | ||||||||||||||||||| -| Qwen2-based | ✅ | ||||||||||||||||||| -| QwQ-32B | ✅ | ||||||||||||||||||| -| Llama2/3/3.1 | ✅ | ||||||||||||||||||| -| Internlm | ✅ | [#1962](https://github.com/vllm-project/vllm-ascend/issues/1962) ||||||||||||||||||| -| Baichuan | ✅ | ||||||||||||||||||| -| Baichuan2 | ✅ | ||||||||||||||||||| -| Phi-4-mini | ✅ | ||||||||||||||||||| -| MiniCPM | ✅ | ||||||||||||||||||| -| MiniCPM3 | ✅ | ||||||||||||||||||| -| Ernie4.5 | ✅ | ||||||||||||||||||| -| Ernie4.5-Moe | ✅ | ||||||||||||||||||| -| Gemma-2 | ✅ | ||||||||||||||||||| -| Gemma-3 | ✅ | ||||||||||||||||||| -| Phi-3/4 | ✅ | ||||||||||||||||||| -| Mistral/Mistral-Instruct | ✅ | ||||||||||||||||||| -| GLM-4.5 | ✅ | ||||||||||||||||||| -| GLM-4 | ❌ | [#2255](https://github.com/vllm-project/vllm-ascend/issues/2255) ||||||||||||||||||| -| GLM-4-0414 | ❌ | [#2258](https://github.com/vllm-project/vllm-ascend/issues/2258) ||||||||||||||||||| -| ChatGLM | ❌ | [#554](https://github.com/vllm-project/vllm-ascend/issues/554) ||||||||||||||||||| -| DeepSeek V2.5 | 🟡 | Need test ||||||||||||||||||| -| Mllama | 🟡 | Need test ||||||||||||||||||| -| MiniMax-Text | 🟡 | Need test ||||||||||||||||||| +| Model | Support | Note | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Cache | LoRA | Speculative Decoding | Async Scheduling | Tensor Parallel | Pipeline Parallel | Expert Parallel | Data Parallel | Prefill-decode Disaggregation | Piecewise AclGraph | Fullgraph AclGraph | max-model-len | MLP Weight Prefetch | Context Parallel | Doc |
+|-------------------------------|-----------|----------------------------------------------------------------------|------|--------------------|------|-----------------|------------------------|------|----------------------|------------------|-----------------|-------------------|-----------------|---------------|-------------------------------|--------------------|--------------------|---------------|---------------------|-----|-----| +| DeepSeek V3/3.1 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 240k || ✅ | [DeepSeek-V3.1](../../tutorials/DeepSeek-V3.1.md) | +| DeepSeek V3.2 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 160k | ✅ || [DeepSeek-V3.2](../../tutorials/DeepSeek-V3.2.md) | +| DeepSeek R1 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 128k || ✅ | [DeepSeek-R1](../../tutorials/DeepSeek-R1.md) | +| DeepSeek Distill (Qwen/Llama) | ✅ | |||||||||||||||||||| +| Qwen3 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ ||| ✅ || ✅ | ✅ | 128k | ✅ | ✅ | [Qwen3-Dense](../../tutorials/Qwen3-Dense.md) | +| Qwen3-based | ✅ | |||||||||||||||||||| +| Qwen3-Coder | ✅ | | ✅ | A2/A3 ||✅|✅|✅|||✅|✅|✅|✅|||||| ✅ | [Qwen3-Coder-30B-A3B tutorial](../../tutorials/Qwen3-Coder-30B-A3B.md)| +| Qwen3-Moe | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | 256k || ✅ | [Qwen3-235B-A22B](../../tutorials/Qwen3-235B-A22B.md) | +| Qwen3-Next | ✅ | | ✅ | A2/A3 | ✅ |||||| ✅ ||| ✅ || ✅ | ✅ |||| [Qwen3-Next](../../tutorials/Qwen3-Next.md) | +| Qwen2.5 | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ |||| ✅ ||| ✅ ||||||| [Qwen2.5-7B](../../tutorials/Qwen2.5-7B.md) | +| Qwen2 | ✅ | |||||||||||||||||||| +| Qwen2-based | ✅ | |||||||||||||||||||| +| QwQ-32B | ✅ | |||||||||||||||||||| +| Llama2/3/3.1 | ✅ | |||||||||||||||||||| +| Internlm | ✅ | [#1962](https://github.com/vllm-project/vllm-ascend/issues/1962) |||||||||||||||||||| +| Baichuan | ✅ | |||||||||||||||||||| +| Baichuan2 | ✅ | |||||||||||||||||||| +| Phi-4-mini | ✅ | |||||||||||||||||||| +| 
MiniCPM | ✅ | |||||||||||||||||||| +| MiniCPM3 | ✅ | |||||||||||||||||||| +| Ernie4.5 | ✅ | |||||||||||||||||||| +| Ernie4.5-Moe | ✅ | |||||||||||||||||||| +| Gemma-2 | ✅ | |||||||||||||||||||| +| Gemma-3 | ✅ | |||||||||||||||||||| +| Phi-3/4 | ✅ | |||||||||||||||||||| +| Mistral/Mistral-Instruct | ✅ | |||||||||||||||||||| +| GLM-4.5 | ✅ | |||||||||||||||||||| +| GLM-4 | ❌ | [#2255](https://github.com/vllm-project/vllm-ascend/issues/2255) |||||||||||||||||||| +| GLM-4-0414 | ❌ | [#2258](https://github.com/vllm-project/vllm-ascend/issues/2258) |||||||||||||||||||| +| ChatGLM | ❌ | [#554](https://github.com/vllm-project/vllm-ascend/issues/554) |||||||||||||||||||| +| DeepSeek V2.5 | 🟡 | Need test |||||||||||||||||||| +| Mllama | 🟡 | Need test |||||||||||||||||||| +| MiniMax-Text | 🟡 | Need test |||||||||||||||||||| ### Pooling Models