# Distributed DP Server With Large Scale Expert Parallelism
## Getting Started
vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, vLLM-Ascend uses a distributed DP server. In the PD separation scenario, different optimization strategies can be applied to prefill and decode nodes according to their distinct characteristics, which enables more flexible model deployment. \
Take the DeepSeek model as an example and deploy it on 8 Atlas 800T A3 servers. Assume the server IPs range from 192.0.0.1 to 192.0.0.8. The first 4 servers are used as prefiller nodes and the last 4 as decoder nodes. Each prefiller node is deployed independently as its own master node, while the decoder nodes use 192.0.0.5 as their master node.
- The physical machines must be located on the same LAN, with network connectivity between them.
- All NPUs must be interconnected. For the Atlas A2 generation, intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA. For the Atlas A3 generation, both intra-node and inter-node connectivity are via HCCS.
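
As a quick reference, the node layout described above can be written down as a small data structure. This is only an illustrative sketch: the names `Node`, `build_topology`, and `master_ip` are hypothetical and are not part of vLLM-Ascend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    ip: str
    role: str       # "prefiller" or "decoder"
    master_ip: str  # IP of the master node this node belongs to


def build_topology() -> list[Node]:
    """Return the 8-node example layout: 4 prefillers, then 4 decoders."""
    ips = [f"192.0.0.{i}" for i in range(1, 9)]
    # Each prefiller node is its own master.
    prefillers = [Node(ip, "prefiller", ip) for ip in ips[:4]]
    # All decoder nodes share the first decoder node (192.0.0.5) as master.
    decoders = [Node(ip, "decoder", ips[4]) for ip in ips[4:]]
    return prefillers + decoders


for node in build_topology():
    print(f"{node.ip:<10} {node.role:<9} master={node.master_ip}")
```
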
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
```
3. Get superpodid and SDID
```bash
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
```
4. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
```
::::
::::{tab-item} A2
1. Single Node Verification:
Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```
2. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g;done
```
3. Cross-Node PING Test
```bash
# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
```
::::
:::::
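
The two sets of verification loops differ only in the number of devices per node. The sketch below generates the same `hccn_tool` command lines for either case so they can be logged or reviewed before running. It assumes that the loops over devices 0..15 are the A3 variant and the loops over 0..7 are the A2 variant; `CHECKS`, `DEVICES_PER_NODE`, and `verification_commands` are hypothetical names, and the `| grep Ifname` filter used for the LLDP check is omitted.

```python
# Sub-commands taken from the single-node verification steps above.
CHECKS = ["-lldp -g", "-link -g", "-net_health -g", "-netdetect -g", "-gateway -g"]

# Devices per node as used in the loops above (assumed: A3 -> 0..15, A2 -> 0..7).
DEVICES_PER_NODE = {"A3": 16, "A2": 8}


def verification_commands(generation: str) -> list[str]:
    """Build the per-device hccn_tool command lines for one node."""
    count = DEVICES_PER_NODE[generation]
    return [
        f"hccn_tool -i {dev} {check}"
        for check in CHECKS
        for dev in range(count)
    ]


cmds = verification_commands("A2")
print(len(cmds))  # 5 checks x 8 devices
print(cmds[0])
```

The function only builds strings; it does not execute anything, so it can be run on a machine without NPUs.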
## Large Scale EP model deployment
### Generate script with configurations
In the PD separation scenario, we provide an optimized configuration. You can use the following shell script to configure the prefiller and decoder nodes respectively.
:::::{tab-set}
::::{tab-item} Prefiller node
```shell
# run_dp_template.sh
#!/bin/sh
# this is obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()
for process in processes:
    process.join()
```
::::
:::::
Note that the prefiller nodes and the decoder nodes may have different configurations. In this example, each prefiller node is deployed independently as its own master node, whereas all decoder nodes use the first decoder node as the master node. As a result, the values of `dp_size_local` and `dp_rank_start` differ between the two groups.
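
To make the difference concrete, here is a minimal sketch of how per-node DP settings could be derived for the two groups. It rests on assumptions: the per-node DP counts (2 for prefillers, 16 for decoders) are borrowed from the proxy example in this document, the rule `dp_rank_start = node index * dp_size_local` for a shared master is a plausible reading rather than a confirmed formula, and `dp_layout` is a hypothetical helper.

```python
def dp_layout(num_nodes: int, dp_size_local: int, shared_master: bool) -> list[dict]:
    """Return per-node DP settings for one group of nodes.

    shared_master=False: every node is its own master (one DP group per node).
    shared_master=True:  all nodes join a single DP group led by the first node.
    """
    layout = []
    for node_idx in range(num_nodes):
        if shared_master:
            dp_size = num_nodes * dp_size_local
            dp_rank_start = node_idx * dp_size_local  # assumed rank numbering
        else:
            dp_size = dp_size_local
            dp_rank_start = 0
        layout.append({
            "node": node_idx,
            "dp_size": dp_size,
            "dp_size_local": dp_size_local,
            "dp_rank_start": dp_rank_start,
        })
    return layout


prefill = dp_layout(num_nodes=4, dp_size_local=2, shared_master=False)
decode = dp_layout(num_nodes=4, dp_size_local=16, shared_master=True)
print(prefill[3])
print(decode[3])
```

Under these assumptions every prefiller node starts at rank 0, while the decoder nodes start at ranks 0, 16, 32, and 48 within one group of 64 ranks.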
## Example proxy for Distributed DP Server
In the PD separation scenario, we need a proxy to distribute requests. Execute the following commands to enable the example proxy:
```shell
python load_balance_proxy_server_example.py \
    --port 8000 \
    --host 0.0.0.0 \
    --prefiller-hosts \
        192.0.0.1 \
        192.0.0.2 \
        192.0.0.3 \
        192.0.0.4 \
    --prefiller-hosts-num \
        2 2 2 2 \
    --prefiller-ports \
        9000 9000 9000 9000 \
    --prefiller-ports-inc \
        2 2 2 2 \
    --decoder-hosts \
        192.0.0.5 \
        192.0.0.6 \
        192.0.0.7 \
        192.0.0.8 \
    --decoder-hosts-num \
        16 16 16 16 \
    --decoder-ports \
        9000 9000 9000 9000 \
    --decoder-ports-inc \
        16 16 16 16
```
| Parameter | Meaning |
| --- | --- |
| --port | Proxy service port |
| --host | Proxy service host IP |
| --prefiller-hosts | Hosts of prefiller nodes |
| --prefiller-hosts-num | Number of repetitions for prefiller node hosts |
| --prefiller-ports | Ports of prefiller nodes |
| --prefiller-ports-inc | Number of increments for prefiller node ports |
| --decoder-hosts | Hosts of decoder nodes |
| --decoder-hosts-num | Number of repetitions for decoder node hosts |
| --decoder-ports | Ports of decoder nodes |
| --decoder-ports-inc | Number of increments for decoder node ports |

You can find the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py).
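
The `*-hosts-num` and `*-ports-inc` parameters can be read as a compact way of listing one endpoint per DP rank. The sketch below shows one plausible expansion, based only on the parameter descriptions in the table above; it is not taken from the proxy's source code, and `expand_endpoints` is a hypothetical helper.

```python
def expand_endpoints(hosts, hosts_num, ports, ports_inc):
    """Expand per-node settings into a flat list of (host, port) pairs.

    Assumed reading: each host is repeated `hosts_num` times, and its base
    port is incremented by 0, 1, ..., `ports_inc` - 1.
    """
    endpoints = []
    for host, repeat, base_port, inc in zip(hosts, hosts_num, ports, ports_inc):
        if repeat != inc:
            # Assumption: one port per repetition, so the counts must match.
            raise ValueError(f"{host}: {repeat} repetitions but {inc} port increments")
        endpoints.extend((host, base_port + offset) for offset in range(inc))
    return endpoints


prefillers = expand_endpoints(
    [f"192.0.0.{i}" for i in range(1, 5)], [2] * 4, [9000] * 4, [2] * 4)
decoders = expand_endpoints(
    [f"192.0.0.{i}" for i in range(5, 9)], [16] * 4, [9000] * 4, [16] * 4)
print(len(prefillers), prefillers[:2])
print(len(decoders), decoders[-1])
```

With the values from the command above, this reading yields 8 prefiller endpoints and 64 decoder endpoints.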
## Benchmark
We recommend using the [aisbench](https://gitee.com/aisbench/benchmark) tool to assess performance. Install it by following the instructions in its repository.
You need to disable the HTTP proxy before assessing performance, as follows:
```shell
# unset proxy
unset http_proxy
unset https_proxy
```
- You can place your datasets in the directory `benchmark/ais_bench/datasets`.
- You can change the configuration in the directory `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take `vllm_api_stream_chat.py` as an example:
```python
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr="vllm-api-stream-chat",
        path="vllm-ascend/DeepSeek-R1-W8A8",
        model="dsr1",
        request_rate=28,
        retry=2,
        host_ip="192.0.0.1",  # Proxy service host IP
        host_port=8000,  # Proxy service port
        max_out_len=10,
        batch_size=1536,
        trust_remote_code=True,
        generation_kwargs=dict(
            temperature=0,
            seed=1024,
            ignore_eos=False,
        ),
    )
]
```
- Taking the gsm8k dataset as an example, execute the following commands to assess performance.

For example, suppose the average input length is 3.5k, the output length is 1.1k, the context length is 16k, and the maximum input length in the dataset is 7k. For this scenario, we give a recommended configuration for the distributed DP server with large-scale EP. Here we use 4 nodes for prefill and 4 nodes for decode.
Note that these configurations are not related to optimization. You need to adjust these parameters based on actual scenarios.
:::
## FAQ
### 1. Prefiller nodes need to warm up
Some NPU operators need several rounds of warm-up before they reach their best performance. We therefore recommend sending some requests to the service before running performance tests, so that the measured end-to-end throughput reflects the warmed-up state.
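
A warm-up pass can be as simple as sending a handful of short requests through the proxy. The following is a minimal sketch under stated assumptions: the proxy address and model name are taken from the examples in this document, the `/v1/completions` path assumes the proxy exposes an OpenAI-compatible completions endpoint, and the number of rounds is arbitrary. `build_warmup_request` and `warm_up` are hypothetical helpers.

```python
import json
import urllib.request

# Values taken from the examples above; adjust them to your deployment.
PROXY_URL = "http://192.0.0.1:8000/v1/completions"  # endpoint path is an assumption
MODEL_NAME = "dsr1"


def build_warmup_request(prompt: str, max_tokens: int = 16) -> urllib.request.Request:
    """Build (but do not send) one completion request for warm-up."""
    payload = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(rounds: int = 8, timeout: float = 60.0) -> int:
    """Send `rounds` requests sequentially and return how many succeeded."""
    succeeded = 0
    for i in range(rounds):
        request = build_warmup_request(f"Warm-up request {i}: say hello.")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                if response.status == 200:
                    succeeded += 1
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(f"warm-up request {i} failed: {exc}")
    return succeeded
```

Call `warm_up()` once the proxy and all prefiller and decoder nodes are up, then start the benchmark. How many rounds are actually needed depends on the deployment.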