[v0.11.0][Doc] Update doc (#3852)
### What this PR does / why we need it?
Update doc

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
## Getting Started
vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization.
Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size.
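As a quick illustration of combining the two (hypothetical values; the `--data-parallel-size` and `--tensor-parallel-size` flags come from the upstream vLLM CLI), the sketch below launches 2 DP core engines, each owning 4 per-NPU TP workers, for 8 NPUs in total:

```shell
# Hypothetical single-node sketch, not the full multi-node setup below:
# 2 DP "core engine" processes x 4 per-NPU TP workers each = 8 NPUs total.
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
    --data-parallel-size 2 \
    --tensor-parallel-size 4
```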
For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended:
- Use **Data Parallel (DP)** for attention layers, which are replicated across devices and handle separate batches.
- Use **Expert or Tensor Parallel (EP/TP)** for expert layers, which are sharded across devices to distribute the computation.
This division allows attention layers to be replicated across DP ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using Expert or Tensor Parallel (EP/TP), maximizing hardware utilization and efficiency.
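A minimal sketch of this hybrid setup, assuming the upstream vLLM flags `--data-parallel-size`, `--tensor-parallel-size`, and `--enable-expert-parallel` (values are illustrative):

```shell
# Sketch: DP for attention, EP for experts (illustrative values).
# With --enable-expert-parallel, expert layers are sharded across the
# whole DP x TP group instead of being replicated per DP rank.
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
    --data-parallel-size 2 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel
```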
In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned, and expert layers across all ranks must synchronize during every forward pass, even if fewer requests are scheduled than there are DP ranks.
For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process, which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
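The resulting group sizing can be sanity-checked with a trivial calculation (values are illustrative only):

```shell
# Illustrative only: expert layers form one EP/TP group of size DP x TP.
DP_SIZE=4
TP_SIZE=2
EP_GROUP_SIZE=$((DP_SIZE * TP_SIZE))
echo "EP group size: $EP_GROUP_SIZE"
```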
## Verify Multi-Node Communication Environment
```shell
# List the IP address of each NPU NIC on the current node
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done

# Ping the NPU NIC IP address of the peer node to verify reachability
hccn_tool -i 0 -ping -g address 10.20.0.20
```
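Optionally, you can also confirm that every NPU NIC link is up before testing cross-node reachability; the loop below assumes `hccn_tool`'s `-link -g` query is available on your machine:

```shell
# Optional check (assumes hccn_tool supports the -link -g query):
# print the link status (UP/DOWN) of each of the 8 NPU NICs.
for i in {0..7}; do
    hccn_tool -i $i -link -g
done
```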
## Run with Docker
Assume you have two Atlas 800 A2 (64G*8) nodes and want to deploy the `deepseek-v3.1-w8a8` quantized model across multiple nodes.
```{code-block} bash
:substitutions:
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running Docker with a bridge network, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
-it $IMAGE bash
```
Run the following scripts on two nodes respectively.
:::{note}
Before launching the inference server, ensure the following environment variables are set for multi-node communication.
:::
**Node 0**
```shell
#!/bin/sh
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=1024
# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8
# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--host 0.0.0.0 \
--port 8004 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
**Node 1**
```shell
#!/bin/sh
vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
The deployment view looks like:

Once your server is started, you can query the model with input prompts:
```shell
curl http://{ node0 ip:8004 }/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "vllm-ascend/DeepSeek-V3.1-W8A8",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
```
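If the completion request fails, a quick liveness check is to list the served models first; `/v1/models` is a standard endpoint of vLLM's OpenAI-compatible server:

```shell
# Liveness check: returns a JSON model list once the server is ready.
curl http://{ node0 ip:8004 }/v1/models
```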
## Run Benchmarks
For details, refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
```shell
export VLLM_USE_MODELSCOPE=true
```