[v0.11.0][Doc] Update doc (#3852)

### What this PR does / why we need it?
Update doc


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
zhangxinyuehfad
2025-10-29 11:32:12 +08:00
committed by GitHub
parent 6188450269
commit 75de3fa172
49 changed files with 724 additions and 701 deletions


@@ -3,18 +3,18 @@
## Getting Start
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate core engine process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallelism (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallelism (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
- Use **Data Parallel (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallel (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division enables attention layers to be replicated across Data Parallel (DP) ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallelism(DP*TP), maximizing hardware utilization and efficiency.
This division allows attention layers to be replicated across DP ranks so that they process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using EP or TP, maximizing hardware utilization and efficiency.
In these cases the data parallel ranks are not completely independent, forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty dummy forward passes are performed in all ranks which dont currently have any requests scheduled. This is handled via a separate DP `Coordinator` process which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process, which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
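As a quick check of the (DP x TP) sizing rule stated above, the following sketch computes the expert-layer group size from a DP size and a TP size. The helper name `ep_group_size` and the example values are illustrative assumptions; they are not part of vLLM-Ascend and are not taken from the deployment in this guide.

```shell
#!/bin/sh
# Hypothetical helper (not part of vLLM-Ascend):
# expert-layer group size = DP size x TP size.
ep_group_size() {
    dp_size=$1
    tp_size=$2
    echo $((dp_size * tp_size))
}

# Example values are assumptions for illustration only.
ep_group_size 4 4
```

With these assumed values, the expert layers would span a group of 16 worker processes.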
## Verify Multi-Node Communication Environment
@@ -56,8 +56,8 @@ for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
hccn_tool -i 0 -ping -g address 10.20.0.20
```
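The single ping above checks one local NPU against one peer address. A minimal sketch of extending that check to every local NPU is shown here as a dry run: it only prints the `hccn_tool` commands, reusing the invocation from the block above, so they can be reviewed before being executed. The helper name `print_ping_cmds` is hypothetical, and the device count of 8 is an assumption based on the `{0..7}` loop used earlier.

```shell
#!/bin/sh
# Hypothetical helper: print one ping command per local NPU (dry run only).
# Nothing is executed; pipe the output to `sh` on a node to actually run it.
print_ping_cmds() {
    peer_ip=$1
    num_devices=$2
    i=0
    while [ "$i" -lt "$num_devices" ]; do
        echo "hccn_tool -i $i -ping -g address $peer_ip"
        i=$((i + 1))
    done
}

# Peer address reused from the example above; 8 devices is an assumption.
print_ping_cmds 10.20.0.20 8
```

Printing the commands first keeps the sketch safe to run on a machine without NPUs.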
## Run with docker
Assume you have two Atlas 800 A2(64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantitative model across multi-node.
## Run with Docker
Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the quantized `deepseek-v3.1-w8a8` model across multiple nodes.
```{code-block} bash
:substitutions:
@@ -66,7 +66,7 @@ export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note if you are running bridge network with docker, Please expose available ports for multiple nodes communication in advance
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
@@ -91,13 +91,13 @@ docker run --rm \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively
Run the following scripts on two nodes respectively.
:::{note}
Before launch the inference server, ensure the following environment variables are set for multi node communication
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**node0**
**Node 0**
```shell
#!/bin/sh
@@ -116,8 +116,8 @@ export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
# The w8a8 weight can obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
@@ -139,7 +139,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**node1**
**Node 1**
```shell
#!/bin/sh
@@ -185,7 +185,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The Deployment view looks like:
The deployment view looks like:
![alt text](../assets/multi_node_dp_deepseek.png)
Once your server is started, you can query the model with input prompts:
@@ -201,8 +201,8 @@ curl http://{ node0 ip:8004 }/v1/completions \
}'
```
## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
## Run Benchmarks
For details, refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
```shell
export VLLM_USE_MODELSCOPE=true