[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>

@@ -25,6 +25,7 @@ Together, these effects allow practitioners to better balance memory, communicat
## Supported Scenarios

### Models

Finegrained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.

### Component & Execution Mode Support

@@ -37,20 +38,24 @@ Finegrained TP is **model-agnostic** and supports all standard dense transformer

| **LMhead** | ✅ | ✅ | ✅ | ✅ | ✅ |

> ⚠️ Note:
>
> - `o_proj` TP is only supported in Graph mode during Decode, because dummy_run in eager mode will not trigger o_proj.
> - `mlp` TP supports dense models, or dense layers in MoE models. For example, the first three dense layers of DeepSeek-R1.

### Configuration Limit:
### Configuration Limit

The Fine-Grained TP size for any component must:

- Be **≤ the data-parallel (DP) size**, and
- **Evenly divide the DP size** (i.e., `dp_size % tp_size == 0`) to ensure valid device assignment and communication grouping.

> ⚠️ Violating these constraints will result in runtime errors or undefined behavior.
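
A minimal sketch of this divisibility rule, using nothing beyond the two constraints stated above (the function and variable names are illustrative, not vLLM-Ascend APIs):

```python
# Minimal sketch of the constraints above; names are illustrative, not the
# actual vLLM-Ascend validation code.
def validate_finegrained_tp(dp_size: int, tp_size: int) -> None:
    """A fine-grained TP size must not exceed the DP size and must divide it evenly."""
    if tp_size > dp_size:
        raise ValueError(f"tp_size {tp_size} exceeds dp_size {dp_size}")
    if dp_size % tp_size != 0:
        raise ValueError(f"dp_size {dp_size} is not divisible by tp_size {tp_size}")

validate_finegrained_tp(dp_size=8, tp_size=4)    # OK: 8 % 4 == 0
# validate_finegrained_tp(dp_size=8, tp_size=3)  # would raise: 8 % 3 != 0
```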

---

## How to Use Finegrained TP

### Configuration Format:
### Configuration Format

Finegrained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.

@@ -65,7 +70,7 @@ Finegrained TP is controlled via the `finegrained_tp_config` field inside `--add
}'
```

### Example Usage:
### Example Usage

```bash
vllm serve deepseek-ai/DeepSeek-R1 \
@@ -96,6 +101,7 @@ To evaluate the effectiveness of fine-grained TP in large-scale service scenario
| **Total** | **9.72 GB** | — |

- We achieved significant memory savings on a single card, along with improved TPOT.

---

## ✅ Deployment Recommendations

@@ -1,9 +1,11 @@
# Multi Token Prediction (MTP)

## Why We Need MTP

MTP boosts inference performance by parallelizing the prediction of multiple tokens, shifting from single-token to multi-token generation. This approach significantly increases generation throughput and achieves multiplicative acceleration in inference speed—all without compromising output quality.

## How to Use MTP

To enable MTP for DeepSeek-V3 models, add the following parameter when starting the service:

--speculative_config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'
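
The same settings can also be passed through the offline Python API. The sketch below assumes the standard `vllm.LLM` entry point accepts a `speculative_config` dict mirroring the CLI flag; the model path and parallel size are placeholders:

```python
from vllm import LLM

# Sketch only: mirrors the --speculative_config flag above. Model path and
# tensor_parallel_size are placeholders; Python booleans are used here.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,
    speculative_config={
        "method": "mtp",
        "num_speculative_tokens": 1,
        "disable_padded_drafter_batch": False,
    },
)
```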
@@ -15,7 +17,7 @@ To enable MTP for DeepSeek-V3 models, add the following parameter when starting

### Module Architecture

```
```shell
vllm_ascend
├── sample
│   ├── rejection_sample.py
@@ -28,7 +30,7 @@ vllm_ascend

- *rejection_sample.py*: During decoding, the main model processes the previous round’s output token and the predicted token together (computing 1+k tokens simultaneously). The first token is always correct, while the second token—referred to as the **bonus token**—is uncertain since it is derived from speculative prediction, so we employ a **Greedy Strategy** and a **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.

```
```shell
rejection_sample.py
├── AscendRejectionSampler
│   ├── forward
@@ -37,9 +39,10 @@ rejection_sample.py
**2. spec_decode**

This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token ids. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.

- *mtp_proposer.py*: Configures vLLM-Ascend to use speculative decoding where proposals are generated by the DeepSeek MTP layer.

```
```shell
mtp_proposer.py
├── Proposer
│   ├── load_model
@@ -52,6 +55,7 @@ mtp_proposer.py
### Algorithm

**1. Reject_Sample**

- *Greedy Strategy*

Verify whether the token generated by the main model matches the speculative token predicted by MTP in the previous round. If they match exactly, accept the bonus token; otherwise, reject it and any subsequent tokens derived from that speculation.
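
A rough Python illustration of that greedy check (not the actual `AscendRejectionSampler` implementation):

```python
# Illustrative greedy acceptance for a single request; not the actual
# AscendRejectionSampler code. draft_tokens are MTP's speculated tokens from the
# previous round, target_tokens are the main model's outputs for the same
# positions, and bonus_token follows the last verified position.
def greedy_accept(draft_tokens, target_tokens, bonus_token):
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        accepted.append(target)      # the main model's token is always kept
        if draft != target:          # first mismatch: drop everything after it
            return accepted
    accepted.append(bonus_token)     # all drafts matched, so keep the bonus token
    return accepted
```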
@@ -76,7 +80,7 @@ If the bonus token is accepted, the MTP model performs inference for (num_specul

- Currently, the spec_decode scenario only supports the ngram, eagle, eagle3, and mtp methods. If an unsupported method is passed, the code raises an error to alert the user.

```
```python
def get_spec_decode_method(method,
                           vllm_config,
                           device,
@@ -93,9 +97,10 @@ def get_spec_decode_method(method,
```

### Integer Validation

- The current npu_fused_infer_attention_score operator only supports integers less than 16 per decode round. Therefore, the maximum supported value for MTP is 15. If a value greater than 15 is provided, the code raises an error to alert the user.

```
```python
if self.speculative_config:
    spec_token_num = self.speculative_config.num_speculative_tokens
    self.decode_threshold += spec_token_num
@@ -105,5 +110,6 @@ if self.speculative_config:
```
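
A condensed sketch of the kind of bound check described above (simplified; the real logic lives in the partially shown block):

```python
# Simplified sketch of the validation described above, not the literal source.
MAX_SPEC_TOKENS = 15  # npu_fused_infer_attention_score handles < 16 tokens per decode round

def check_num_speculative_tokens(num_speculative_tokens: int) -> None:
    if num_speculative_tokens > MAX_SPEC_TOKENS:
        raise ValueError(
            f"num_speculative_tokens={num_speculative_tokens} exceeds "
            f"the operator limit of {MAX_SPEC_TOKENS}"
        )
```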

## Limitation

- Because only a single layer of weights is exposed in DeepSeek's MTP, accuracy and performance are not effectively guaranteed when MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15.
- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).
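
For example, with `num_speculative_tokens = 2` every capture size must be a multiple of 3; a quick illustrative check:

```python
# Illustrative check of the capture-size rule above: sizes must be multiples of
# (num_speculative_tokens + 1) in fullgraph mode with MTP > 1.
num_speculative_tokens = 2
step = num_speculative_tokens + 1        # 3
capture_sizes = [3, 6, 12, 24, 48]
assert all(size % step == 0 for size in capture_sizes)
```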

@@ -5,6 +5,7 @@
This guide shows how to use Context Parallel, a long sequence inference optimization technique. Context Parallel includes `PCP` (Prefill Context Parallel) and `DCP` (Decode Context Parallel), which reduce NPU memory usage and improve inference speed in long sequence LLM inference.

## Benefits of Context Parallel

Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite different characteristics and have quite different SLOs (service level objectives), we need to implement context parallel separately for them. The major considerations are:

- For long context prefill, we can use context parallel to reduce TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
@@ -13,13 +14,16 @@ Context parallel mainly solves the problem of serving long context requests. As
To learn more about the theory and implementation details of context parallel, please refer to the [context parallel developer guide](../../developer_guide/feature_guide/context_parallel.md).

## Supported Scenarios

Currently, context parallel can be used together with most other features; the supported combinations are as follows:

| | Eager | Graph | Prefix <br> Cache | Chunked <br> Prefill | SpecDecode <br> (MTP) | PD <br> disaggregation | MLAPO |
| ------- | ----- | ----- | ------ | ------ | ----- | ----- | ----- |
| **PCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **DCP** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

## How to use Context Parallel

You can enable `PCP` and `DCP` via `prefill_context_parallel_size` and `decode_context_parallel_size`; refer to the following example:

- Offline example:
@@ -53,6 +57,7 @@ You can enable `PCP` and `DCP` by `prefill_context_parallel_size` and `decode_co
The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`, so each of the examples above needs 4 NPUs.
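
An offline sketch matching the parameter names above (this assumes `prefill_context_parallel_size` and `decode_context_parallel_size` are accepted as engine arguments, as this guide indicates; the model is a placeholder):

```python
from vllm import LLM

# Sketch only: TP2 x PCP2 = 4 NPUs, matching the world_size rule above,
# and TP (2) >= DCP (2) satisfies the MLA constraint listed below.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    tensor_parallel_size=2,
    prefill_context_parallel_size=2,
    decode_context_parallel_size=2,
)
outputs = llm.generate("A very long prompt goes here...")
```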

## Constraints

- While using DCP, the following constraints must be met:
  - For MLA based models, such as Deepseek-R1:
    - `tensor_parallel_size >= decode_context_parallel_size`
@@ -63,7 +68,7 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`

- While using Context Parallel in scenarios that require KV cache transfer (e.g. KV pooling, PD disaggregation), to simplify KV cache transmission, `cp_kv_cache_interleave_size` must be set to the same value as the KV cache `block_size` (default: 128), which makes CP split the KV cache in a block-interleaved style. For example:

```
```shell
vllm serve deepseek-ai/DeepSeek-V2-Lite \
--tensor-parallel-size 2 \
--decode-context-parallel-size 2 \
@@ -73,16 +78,19 @@ The total world_size is `tensor_parallel_size` * `prefill_context_parallel_size`
```

## Experimental Results

To evaluate the effectiveness of Context Parallel in long sequence LLM inference scenarios, we use **DeepSeek-R1-W8A8** and **Qwen3-235B**, deploying PD-disaggregated instances on 64 Ascend 910C 64GB cards (A3). The configuration and performance data are as follows.

- DeepSeek-R1-W8A8:

| Configuration | Input length <br> 32k | Input length <br> 64k | Input length <br> 128k |
| ----------------------------- | ------------------------- | ------------------------- | ------------------------- |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 9.3s <br> TPOT: 72ms | TTFT: 22.8s <br> TPOT: 74ms | TTFT: 73.2s <br> TPOT: 82ms |
| P node: (PCP2 TP8 DCP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 7.9s <br> TPOT: 74ms | TTFT: 15.9s <br> TPOT: 78ms | TTFT: 46.0s <br> TPOT: 83ms |

- Qwen3-235B:

| Configuration | Input length <br> 32k | Input length <br> 64k | Input length <br> 120k |
| ----------------------------- | ------------------------- | ------------------------- | ------------------------- |
| P node: (DP2 TP8 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 5.1s <br> TPOT: 65ms | TTFT: 13.1s <br> TPOT: 85ms | TTFT: 33.9s <br> TPOT: 120ms |
| P node: (PCP2 TP8 DCP2 EP16) *2 <br> D node: (DP32 EP32)*1 | TTFT: 3.0s <br> TPOT: 66ms | TTFT: 8.9s <br> TPOT: 86ms | TTFT: 22.7s <br> TPOT: 121ms |

@@ -29,9 +29,11 @@ We are working on further improvements and this feature will support more XPUs i
```

### Supported Models

So far, dynamic batch performs better on several dense models including Qwen and Llama (from 8B to 32B) with `tensor_parallel_size=8`. For different models, a proper `SLO_limits_for_dynamic_batch` parameter is needed. The empirical value of this parameter is generally `35, 50, or 75`, so some additional tests are needed to select the best value.

## Usage

Dynamic batch is used in online inference. A fully executable example is as follows:

```shell

@@ -14,9 +14,12 @@ Expert balancing for MoE models in LLM serving is essential for optimal performa

## Support Scenarios

### Models:
### Models

DeepseekV3/V3.1/R1, Qwen3-MOE

### MOE QuantType:
### MOE QuantType

W8A8-dynamic

## How to Use EPLB
@@ -37,6 +40,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
```

### Static EPLB

#### Initial Setup (Record Expert Map)

We need to add the environment variable `export EXPERT_MAP_RECORD="true"` to record the expert map. Generate the initial expert distribution map using `expert_map_record_path`; this creates a baseline configuration for future deployments.

@@ -54,6 +58,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
```

#### Subsequent Deployments (Use Recorded Map)

Load the pre-recorded expert map for consistent performance. This avoids recalculating distributions at runtime.

```shell

@@ -66,6 +71,7 @@ vllm serve Qwen/Qwen3-235B-A22 \
```

## Critical Considerations

1. Parameter Tuning:
    - num_iterations_eplb_update: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
    - num_wait_worker_iterations: Should be ≥ 30 to avoid premature balancing during startup.
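
The two knobs above can be summarized as plain values; exactly where they are nested inside `--additional-config` follows the serve commands shown earlier (elided in this diff), so this fragment is only illustrative:

```python
# Illustrative starting points for the two EPLB tuning knobs discussed above.
# Their exact nesting inside --additional-config follows the serve commands
# shown earlier in this guide (elided in this diff).
eplb_tuning = {
    "num_iterations_eplb_update": 400,   # stable workloads; use 100-200 for fluctuating traffic
    "num_wait_worker_iterations": 30,    # keep >= 30 to avoid premature balancing at startup
}
```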

@@ -13,33 +13,36 @@ The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_p

This tutorial introduces how to use them.

### Prerequisites:
### Prerequisites

- Python 3.10+
- Install dependencies needed by the load-balance proxy server:

```
```shell
pip install fastapi httpx uvicorn
```

## Starting External DP Servers

First you need to have at least two vLLM servers running in data parallel. These can be mock servers or actual vLLM servers. Note that this proxy also works with only one vLLM server running, but it will fall back to direct request forwarding, which defeats the purpose.

You can start external vLLM DP servers one by one manually or use the launch script in `examples/external_online_dp`. For large DP sizes across multiple nodes, we recommend the launch script for convenience.

### Manually Launch

```
```shell
# This example shows how to manually launch a vLLM service with DP size 2 in one node.
vllm serve --host 0.0.0.0 --port 8100 --data-parallel-size 2 --data-parallel-rank 0 ... # vLLM DP0
vllm serve --host 0.0.0.0 --port 8101 --data-parallel-size 2 --data-parallel-rank 1 ... # vLLM DP1
```

### Use Launch Script

Firstly, you need to modify `examples/external_online_dp/run_dp_template.sh` according to your vLLM configuration. Then you can use `examples/external_online_dp/launch_online_dp.py` to launch multiple vLLM instances with one command on each node. It will internally call `examples/external_online_dp/run_dp_template.sh` for each DP rank with the proper DP-related parameters.

An example of running external DP in one single node:

```
```shell
cd examples/external_online_dp
# running DP4 TP4 in a node with 16 NPUs
python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address x.x.x.x --dp-rpc-port 12342
@@ -47,7 +50,7 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 4 --dp-rank-s

An example of running external DP in two nodes:

```
```shell
cd examples/external_online_dp
# running DP4 TP4 in two nodes with 8 NPUs each
# Node 0 holds DP0 DP1 and node 1 holds DP2 DP3
@@ -61,16 +64,18 @@ python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 2 --dp-rank-s
```

## Starting Load-balance Proxy Server

After all vLLM DP instances are launched, you can launch the load-balance proxy server, which serves as the entrypoint for incoming requests and load-balances them across the vLLM DP instances.

The proxy server has the following features:

- Load balances requests to multiple vLLM servers based on request length.
- Supports OpenAI-compatible `/v1/completions` and `/v1/chat/completions` endpoints.
- Streams responses from backend servers to clients.

To run the proxy server, you need to specify the host and port for each vLLM DP instance:

```
```shell
# For example, we have already started two DP instances in single node:
# python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address x.x.x.x --dp-rpc-port 12342
# By default, launch_online_dp.py will launch vLLM instances from starting port 9000,

@@ -11,10 +11,12 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

There are two kinds of graph mode supported by vLLM Ascend:

- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and Deepseek series models are well tested.
- **XliteGraph**: This is the openeuler xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, and Qwen3-vl are supported.

## Using ACLGraph

ACLGraph is enabled by default. Taking Qwen series models as an example, simply using the V1 Engine is enough.

Offline example:

@@ -3,19 +3,21 @@
## Environmental Dependencies

* Software:
  * Python >= 3.10, < 3.12
  * CANN == 8.3.rc2
  * PyTorch == 2.8.0, torch-npu == 2.8.0
  * vLLM: main branch
  * vLLM-Ascend: main branch

### KV Pool Parameter Description

**kv_connector_extra_config**: additional configurable parameters for pooling.
**lookup_rpc_port**: port for RPC communication between the pooling scheduler process and worker process; each instance requires a unique port.
**load_async**: whether to enable asynchronous loading. The default value is false.
**backend**: the storage backend for the KV pool; the default is mooncake.
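
Putting these parameters together, a `kv_connector_extra_config` fragment might look like the following (values are placeholders; the full connector configuration appears in the Mooncake examples later in this guide):

```python
# Illustrative fragment only; the port and backend values are placeholders.
kv_connector_extra_config = {
    "lookup_rpc_port": 12345,   # must be unique per instance
    "load_async": False,        # asynchronous loading is off by default
    "backend": "mooncake",      # default storage backend for the KV pool
}
```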

### Environment Variable Configuration

To guarantee uniform hash generation, it is required to synchronize the PYTHONHASHSEED environment variable across all nodes upon enabling KV Pool.

```bash
@@ -23,6 +25,7 @@ export PYTHONHASHSEED=0
```

## Example of using Mooncake as a KV Pool backend

* Software:
  * Check NPU HCCN Configuration:

@@ -35,7 +38,7 @@ export PYTHONHASHSEED=0
  * Install Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Installation and Compilation Guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries.
Installation and Compilation Guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.
First, we need to obtain the Mooncake project. Refer to the following command:

```shell
@@ -75,8 +78,8 @@ export PYTHONHASHSEED=0

**Note:**

- Adjust the Python path according to your specific Python installation
- Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`
* Adjust the Python path according to your specific Python installation
* Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`

```shell
export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
@@ -88,7 +91,7 @@ export PYTHONHASHSEED=0

The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path where mooncake.json is located.

```
```shell
{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
@@ -108,7 +111,7 @@ The environment variable **MOONCAKE_CONFIG_PATH** is configured to the full path

Under the mooncake folder:

```
```shell
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1
```

@@ -122,13 +125,13 @@ Using `MultiConnector` to simultaneously utilize both `MooncakeConnectorV1` and

`prefill` Node:

```
```shell
bash multi_producer.sh
```

The content of the multi_producer.sh script:

```
```shell
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONHASHSEED=0
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
@@ -195,13 +198,13 @@ python3 -m vllm.entrypoints.openai.api_server \

`decode` Node:

```
```shell
bash multi_consumer.sh
```

The content of multi_consumer.sh:

```
```shell
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export PYTHONHASHSEED=0
@@ -259,7 +262,7 @@ python3 -m vllm.entrypoints.openai.api_server \

Currently, the key-value pool in PD Disaggregate only stores the kv cache generated by the Prefill node by default. In models using MLA, it is now supported that the Decode node stores the kv cache for use by the Prefill node, enabled by adding `consumer_is_to_put: true` to the AscendStoreConnector. If the Prefill node enables PP, `prefill_pp_size` or `prefill_pp_layer_partition` also needs to be set. Example as follows:

```
```python
{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_consumer",
@@ -273,9 +276,9 @@ Currently, the key-value pool in PD Disaggregate only stores the kv cache genera
}
```

#### 2、Start proxy_server.
#### 2、Start proxy_server

```
```shell
python vllm-ascend/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py \
--host localhost \
--prefiller-hosts localhost \
@@ -292,13 +295,13 @@ Configure the localhost, port, and model weight path in the command to your own

Short question:

```
```shell
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
```

Long question:

```
```shell
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
```

@@ -306,13 +309,13 @@ curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json"

#### 1. Run Mixed Department Script

```
```shell
bash mixed_department.sh
```

Content of mixed_department.sh:

```
```shell
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
@@ -351,12 +354,12 @@ Configure the localhost, port, and model weight path in the command to your own

Short question:

```
```shell
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
```

Long question:

```
```shell
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
```

@@ -7,12 +7,12 @@ Take the deepseek model as an example, use 8 Atlas 800T A3 servers to deploy the

## Verify Multi-Node Communication Environment

### Physical Layer Requirements:
### Physical Layer Requirements

- The physical machines must be located on the same WLAN, with network connectivity.
- All NPUs must be interconnected. For the Atlas A2 generation, intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA. For the Atlas A3 generation, both intra-node and inter-node connectivity are via HCCS.

### Verification Process:
### Verification Process

:::::{tab-set}
::::{tab-item} A3

@@ -5,12 +5,14 @@

**Layer Shard Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.

Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:

- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand via asynchronous broadcast** during the forward pass.

As illustrated in the figure below, this design lets broadcast prefetch weights ahead of use: while the current layer (e.g., MLA or MOE) is being computed, the system **asynchronously broadcasts the next layer's weight** in the background. Because the attention computation in the MLA module is sufficiently latency-bound, the weight transfer for `o_proj` is **fully overlapped with computation**, making the communication **latency-free from the perspective of end-to-end inference**.
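
A toy sketch of the ownership rule and the prefetch-by-broadcast idea, using `torch.distributed` directly (the production logic inside vLLM-Ascend is more involved):

```python
import torch.distributed as dist

# Toy sketch of the idea above, not the vLLM-Ascend implementation. Layer i's
# real weight lives only on rank i % K; every other rank holds a same-shaped
# buffer and receives the weight via an asynchronous broadcast that overlaps
# with the current layer's computation.
def prefetch_next_layer_weight(weight_buffers, next_layer_idx):
    K = dist.get_world_size()
    owner = next_layer_idx % K          # rank that stores the real weight for this layer
    handle = dist.broadcast(weight_buffers[next_layer_idx], src=owner, async_op=True)
    return handle                       # handle.wait() right before that layer executes
```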

This approach **preserves exact computational semantics** while **significantly reducing NPU memory footprint**, which is especially critical for:

- Extremely deep architectures (e.g., DeepSeek-V3/R1 with 61 layers);
- Models using **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** or **[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)**, where the full `O` (output) projection matrix must reside in memory per layer;
- Scenarios where **attention computation latency fully overlaps** (hides) the communication cost of weight broadcasting.
@@ -18,6 +20,7 @@ This approach **preserves exact computational semantics** while **significantly
---

### Flowchart



> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast pre-fetches the next layer's weight while the current layer computes—enabling zero-overhead weight loading.
@@ -68,4 +71,4 @@ vllm serve \
--additional-config '{
    "layer_sharding": ["q_b_proj", "o_proj"]
}'
```

@@ -1,6 +1,7 @@
# LoRA Adapters Guide

## Overview

Like vLLM, vllm-ascend supports LoRA as well. The usage and more details can be found in the [vLLM official document](https://docs.vllm.ai/en/latest/features/lora.html).

You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.
@@ -8,11 +9,12 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor
You can run LoRA with ACLGraph mode now. Please refer to the [Graph Mode Guide](./graph_mode.md) for better LoRA performance.

Address for downloading models:\
base model: https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files \
base model: <https://www.modelscope.cn/models/vllm-ascend/Llama-2-7b-hf/files> \
lora model:
https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files
<https://www.modelscope.cn/models/vllm-ascend/llama-2-7b-sql-lora-test/files>

## Example

We provide a simple LoRA example here, which enables ACLGraph mode by default.

```shell

@@ -3,6 +3,7 @@
This guide shows how to use Speculative Decoding with vLLM Ascend. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.

## Speculating by matching n-grams in the prompt

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by matching n-grams in the prompt.

- Offline inference

@@ -89,6 +89,7 @@ INFO: Application startup complete.
Congratulations, you have successfully started the vLLM server with UCM connector!

## Evaluating UCM Prefix Caching Performance

After launching the vLLM server with `UCMConnector` enabled, the easiest way to observe the prefix caching effect is to run the built-in `vllm bench` CLI. Executing the following command **twice** in a separate terminal shows the improvement clearly.

```bash
@@ -109,32 +110,34 @@ vllm bench serve \
```

### After the first execution

The `vllm bench` terminal prints the benchmark result:

```
```shell
---------------Time to First Token----------------
Mean TTFT (ms): 15323.87
```

Inspecting the vLLM server logs reveals entries like:

```
```shell
INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 0
```

This indicates that for the first inference request, UCM did not hit any cached KV blocks. As a result, the full 16K-token prefill must be computed, leading to a relatively large TTFT.

### After the second execution

Running the same benchmark again produces:

```
```shell
---------------Time to First Token----------------
Mean TTFT (ms): 1920.68
```

The vLLM server logs now contain similar entries:

```
```shell
INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
```