[P/D][main]Offline the llmdatadist connector related parts of the code and files. (#4780)
### What this PR does / why we need it?
As support for the mooncake connector is now available, the llmdatadist
connector is no longer being maintained, so the llmdatadist-related
files need to be retired.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
@@ -1,238 +0,0 @@
# Disaggregated Prefill-Decode Deployment Guide

## Overview

This demo document provides instructions for running a disaggregated vLLM-ascend service with separate prefill and decode stages across 4 nodes. It uses 16 Ascend NPUs for the two prefill nodes (P1/P2) and 16 Ascend NPUs for the two decode nodes (D1/D2).

## Prerequisites

- Ascend NPU environment with vLLM 0.9.1 installed
- Network interfaces configured for distributed communication (e.g., eth0)
- Model weights located at `/models/deepseek_r1_w8a8`

## Rank table generation

The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. Run the following command on every node to generate a rank table covering all four nodes, with 16 cards for prefill and 16 cards for decode:

```shell
cd /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 \
    --npus-per-node 8 --network-card-name eth0 --prefill-device-cnt 16 --decode-device-cnt 16
```

The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json`.
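
Before starting any server, it can help to confirm that the generated file splits the devices as intended. The following is a minimal sketch, not part of vllm-ascend: `summarize_ranktable` is a hypothetical helper, and it assumes only the keys that `gen_ranktable.py` writes (`prefill_device_list`, `decode_device_list`, `status`, and the per-device `server_id`). The demo data is made up for illustration.

```python
from collections import Counter


def summarize_ranktable(ranktable: dict) -> dict:
    """Count prefill and decode devices per server in a rank table dict."""
    summary = {}
    for role in ("prefill", "decode"):
        devices = ranktable.get(f"{role}_device_list", [])
        # server_id holds the host IP that each device entry was collected on.
        summary[role] = dict(Counter(dev["server_id"] for dev in devices))
    summary["status"] = ranktable.get("status")
    return summary


# Tiny in-memory example (hypothetical values, two devices per role).
demo = {
    "prefill_device_list": [{"server_id": "172.19.32.175"}] * 2,
    "decode_device_list": [{"server_id": "172.19.123.51"}] * 2,
    "status": "completed",
}
print(summarize_ranktable(demo))
```

To inspect the real file, load it with `json.load` and pass the resulting dict to the helper. With the command above, and assuming each node exposes 8 devices, one would expect 8 prefill devices on each of the first two IPs and 8 decode devices on each of the last two.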

## Start disaggregated vLLM-ascend service

For demonstration purposes, we use the quantized version of DeepSeek-R1. Recommended parallelization strategies:

- P-node: DP2-TP8-EP16 (Data Parallelism 2, Tensor Parallelism 8, Expert Parallelism 16)
- D-node: DP4-TP4-EP16 (Data Parallelism 4, Tensor Parallelism 4, Expert Parallelism 16)
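
The strategy names also determine how many NPUs each role occupies. The sketch below assumes the usual convention that every data-parallel replica spans one tensor-parallel group, so the device count is DP × TP, and that expert parallelism reuses those same devices rather than adding new ones. `npus_required` is an illustrative helper, not a vLLM API.

```python
def npus_required(dp: int, tp: int) -> int:
    """Devices used by `dp` data-parallel replicas of `tp` tensor-parallel ranks each."""
    return dp * tp


# Whole role: DP2-TP8 for prefill, DP4-TP4 for decode.
print(npus_required(dp=2, tp=8), npus_required(dp=4, tp=4))  # 16 16

# Per node: --data-parallel-size-local is 1 on P nodes and 2 on D nodes.
print(npus_required(dp=1, tp=8), npus_required(dp=2, tp=4))  # 8 8
```

Both roles therefore fill the 16 NPUs reserved for them, and each node uses all 8 of its NPUs.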

Execution Sequence

- The 4 configured node IPs are: 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36
- Start Prefill on Node 1 (P1)
- Start Prefill on Node 2 (P2)
- Start Decode on Node 1 (D1)
- Start Decode on Node 2 (D2)
- Start proxy server on Node 1

Run prefill server P1 on the first node:

```shell
export HCCL_IF_IP=172.19.32.175  # node ip
export GLOO_SOCKET_IFNAME="eth0"  # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_ASCEND_LLMDD_RPC_PORT=5559

vllm serve /models/deepseek_r1_w8a8 \
    --host 0.0.0.0 \
    --port 20002 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --api-server-count 2 \
    --data-parallel-address 172.19.32.175 \
    --data-parallel-rpc-port 13356 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
    }'
```

Run prefill server P2 on the second node:

```shell
export HCCL_IF_IP=172.19.241.49
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_ASCEND_LLMDD_RPC_PORT=5659

vllm serve /models/deepseek_r1_w8a8 \
    --host 0.0.0.0 \
    --port 20002 \
    --headless \
    --data-parallel-size 2 \
    --data-parallel-start-rank 1 \
    --data-parallel-size-local 1 \
    --data-parallel-address 172.19.32.175 \
    --data-parallel-rpc-port 13356 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
    }'
```

Run decode server D1 on the third node:

* On the D node, `max-num-batched-tokens` can be set to a smaller value because the D node processes at most `max-num-seqs` sequences concurrently. Since `profile_run` only needs to handle `max-num-seqs` sequences at a time, we can safely set `max-num-batched-tokens` equal to `max-num-seqs`. This reduces activation memory consumption.
```shell
export HCCL_IF_IP=172.19.123.51
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_ASCEND_LLMDD_RPC_PORT=5759

vllm serve /models/deepseek_r1_w8a8 \
    --host 0.0.0.0 \
    --port 20002 \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --api-server-count 2 \
    --data-parallel-address 172.19.123.51 \
    --data-parallel-rpc-port 13356 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek \
    --max-model-len 32768 \
    --max-num-batched-tokens 256 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
    }' \
    --additional-config \
    '{"torchair_graph_config": {"enabled":true}}'
```
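
The note about `max-num-batched-tokens` on the D node can be made concrete with a small calculation. This is a sketch under one stated assumption: during decode each running sequence contributes exactly one new token per step (no speculative decoding, and no prefill work scheduled on the D node). `decode_tokens_per_step` is a hypothetical helper for illustration only.

```python
def decode_tokens_per_step(num_waiting_seqs: int, max_num_seqs: int) -> int:
    """Upper bound on tokens in one decode step.

    Assumption: at most `max_num_seqs` sequences are scheduled per step,
    and each scheduled sequence produces a single token.
    """
    return min(num_waiting_seqs, max_num_seqs)


# Even with far more requests queued, a step never exceeds max-num-seqs tokens,
# so a token budget of 256 is enough when --max-num-seqs is 256.
print(decode_tokens_per_step(num_waiting_seqs=1000, max_num_seqs=256))  # 256
```

Under that assumption the 32768-token budget used on the P nodes is never needed on a D node, which is why the smaller value saves activation memory.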

Run decode server D2 on the last node:

```shell
export HCCL_IF_IP=172.19.190.36
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_ASCEND_LLMDD_RPC_PORT=5859

vllm serve /models/deepseek_r1_w8a8 \
    --host 0.0.0.0 \
    --port 20002 \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-start-rank 2 \
    --data-parallel-size-local 2 \
    --data-parallel-address 172.19.123.51 \
    --data-parallel-rpc-port 13356 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek \
    --max-model-len 32768 \
    --max-num-batched-tokens 256 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
    }' \
    --additional-config \
    '{"torchair_graph_config": {"enabled":true}}'
```

Run the proxy server on the first node:

```shell
cd /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1
python load_balance_proxy_server_example.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002
```

Verification

Check service health using the proxy server endpoint:

```shell
curl http://localhost:1025/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek",
        "prompt": "Who are you?",
        "max_tokens": 100,
        "temperature": 0
    }'
```
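
The same check can be scripted, for example as part of a deployment smoke test. The sketch below uses only the Python standard library and mirrors the `curl` request above; `build_completion_request` and `check_service` are hypothetical helper names, and the response parsing assumes the OpenAI-compatible completions format (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(base_url: str = "http://localhost:1025") -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": "deepseek",
        "prompt": "Who are you?",
        "max_tokens": 100,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_service(base_url: str = "http://localhost:1025", timeout: float = 60.0) -> str:
    """Send the request through the proxy and return the generated text."""
    request = build_completion_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: {"choices": [{"text": "..."}], ...}
    return body["choices"][0]["text"]
```

Calling `check_service()` on the proxy node raises an exception from `urllib` if the proxy is unreachable or returns an error status, which makes it usable as a simple pass/fail probe.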

Performance

Test performance with the vLLM benchmark script:

```shell
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
    --backend vllm \
    --dataset-name random \
    --random-input-len 4096 \
    --random-output-len 1536 \
    --num-prompts 256 \
    --ignore-eos \
    --model deepseek \
    --tokenizer /models/deepseek_r1_w8a8 \
    --host localhost \
    --port 1025 \
    --endpoint /v1/completions \
    --max-concurrency 4 \
    --request-rate 4
```
@@ -1,144 +0,0 @@
import argparse
import json
import os

import torch.distributed as dist

from vllm_ascend.utils import AscendDeviceType, get_ascend_device_type

parser = argparse.ArgumentParser(
    description="Arguments of rank table generator", )
parser.add_argument("--local-host", type=str, required=True, help="local ip")
parser.add_argument("--prefill-device-cnt",
                    type=int,
                    required=True,
                    help="number of prefill devices")
parser.add_argument("--decode-device-cnt",
                    type=int,
                    required=True,
                    help="number of decode devices")
parser.add_argument("--local-device-ids",
                    type=str,
                    required=False,
                    help="local device ids")
parser.add_argument("--ranktable-path",
                    type=str,
                    default="./ranktable.json",
                    help="output rank table path")
args = parser.parse_args()
local_host = args.local_host
prefill_device_cnt = args.prefill_device_cnt
decode_device_cnt = args.decode_device_cnt

print("enter py")

hccn_tool_path = os.environ.get("HCCN_TOOL_PATH",
                                "/usr/local/Ascend/driver/tools/hccn_tool")
master_addr = os.environ.get("MASTER_ADDR")
master_port = os.environ.get("MASTER_PORT")
rank = os.environ.get("RANK")
local_rank = os.environ.get("LOCAL_RANK")
# This variable is set by torchrun,
# and is different from WORLD_SIZE in gen_ranktable.sh.
world_size = os.environ.get("WORLD_SIZE")

device_type = get_ascend_device_type()


def get_cmd_stdout(cmd):
    import subprocess
    return subprocess.run(cmd, capture_output=True,
                          shell=True).stdout.decode("utf-8").strip()


print(f"local_host: {local_host}")
print("gen ranktable.json")

num_cards = get_cmd_stdout("npu-smi info -l | grep \"Total Count\"").split(
    ":")[1].strip()
num_cards = int(num_cards)
chips_per_card = get_cmd_stdout("npu-smi info -l | grep \"Chip Count\"").split(
    "\n")[0].split(":")[1].strip()
chips_per_card = int(chips_per_card)

if args.local_device_ids:
    try:
        local_device_ids = [int(id_str) for id_str in args.local_device_ids.split(',')]
    except ValueError:
        print(f"Error: --local-device-ids must be a comma-separated list of integers. Received: '{args.local_device_ids}'")
        exit(1)
else:
    local_device_ids = []
    for card_id in range(num_cards):
        for chip_id in range(chips_per_card):
            device_id = card_id * chips_per_card + chip_id
            local_device_ids.append(device_id)

# generate local device list for local rank 0, and gather it to all ranks
local_device_list: list[dict[str, str]] = list()
if local_rank == "0":
    super_pod_id = "0"
    for idx in range(len(local_device_ids)):
        device_id = local_device_ids[idx]
        chip_id = device_id % chips_per_card
        card_id = device_id // chips_per_card
        if device_type == AscendDeviceType._910_93:
            device_ip = get_cmd_stdout(
                f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr"
            ).split(":")[1].strip()
            super_device_id = get_cmd_stdout(
                f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID"
            ).split(":")[1].strip()
            super_pod_id = get_cmd_stdout(
                f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\""
            ).split(":")[1].strip()
        else:
            device_ip = get_cmd_stdout(
                f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr"
            ).split(":")[1].strip()

        device_info = {
            "server_id": local_host,
            "device_id": str(device_id),
            "device_ip": str(device_ip),
        }
        if device_type == AscendDeviceType._910_93:
            device_info.update({
                "super_pod_id": str(super_pod_id),
                "super_device_id": str(super_device_id)
            })
        local_device_list.append(device_info)

dist.init_process_group(backend=dist.Backend.GLOO)
global_device_list = [None] * dist.get_world_size()
dist.all_gather_object(global_device_list, local_device_list)
global_device_list = [
    device_info for device_list in global_device_list
    for device_info in device_list  # type: ignore[attr-defined]
]
cnt = 1
for device_info in global_device_list:  # type: ignore[assignment]
    device_info["cluster_id"] = str(cnt)
    cnt += 1
assert (prefill_device_cnt + decode_device_cnt) <= len(global_device_list), \
    "prefill_device_cnt + decode_device_cnt must be less than or equal to number of all devices in cluster"
ranktable = {
    "version":
    "1.2",
    "server_count":
    str(world_size),
    "prefill_device_list":
    global_device_list[:prefill_device_cnt],
    "decode_device_list":
    global_device_list[prefill_device_cnt:prefill_device_cnt +
                       decode_device_cnt],
    "status":
    "completed"
}

if local_rank == '0':
    os.makedirs(os.path.dirname(args.ranktable_path), exist_ok=True)
    with open(args.ranktable_path, "w") as f:
        json.dump(ranktable, f, indent=4)

print("gen ranktable.json done")
@@ -1,89 +0,0 @@
#!/bin/bash

source /usr/local/Ascend/ascend-toolkit/set_env.sh
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}

NPUS_PER_NODE=8
while [[ $# -gt 0 ]]; do
    case "$1" in
        --ips)
            shift
            while [[ $# -gt 0 && ! "$1" == --* ]]; do
                IPs+=("$1")
                shift
            done
            ;;
        --npus-per-node)
            shift
            NPUS_PER_NODE="$1"
            shift
            ;;
        --network-card-name)
            shift
            NETWORK_CARD_NAME="$1"
            shift
            ;;
        --prefill-device-cnt)
            shift
            PREFILL_DEVICE_CNT="$1"
            shift
            ;;
        --decode-device-cnt)
            shift
            DECODE_DEVICE_CNT="$1"
            shift
            ;;
        --local-device-ids)
            shift
            LOCAL_DEVICE_IDS="$1"
            shift
            ;;
    esac
done
LOCAL_HOSTS=($(hostname -I))
LOCAL_HOST="127.0.0.1"
MASTER_ADDR=${IPs[0]}
MASTER_PORT=6657
NNODES=${#IPs[@]}
NODE_RANK="8"
for i in "${!IPs[@]}"; do
    ip="${IPs[$i]}"
    for local_host in "${LOCAL_HOSTS[@]}"; do
        if [[ "$local_host" == "$ip" ]]; then
            LOCAL_HOST=$local_host
            NODE_RANK=$i
            break 2
        fi
    done
done

if [[ $NODE_RANK == "" ]];then
    echo "[Error] para \"NODE_RANK\" must be defined"
    exit 1
fi

WORLD_SIZE=$(($NPUS_PER_NODE * $NNODES))
RANKSTART=`expr $NPUS_PER_NODE \* $NODE_RANK`

echo "========>param:"
echo "LOCAL_HOST": $LOCAL_HOST
echo "WORLD_SIZE: " $WORLD_SIZE
echo "RANKSTART": $RANKSTART
echo "NNODES": $NNODES
echo "NODE_RANK": $NODE_RANK
echo "==============="

if [ -n "$LOCAL_DEVICE_IDS" ]; then
    OPTIONAL_SECTION=" --local-device-ids $LOCAL_DEVICE_IDS"
fi

if [[ -n "${GEN_RANKTABLE}" || ! -e ${PWD}/ranktable.json ]]; then
    # The variable assignment must come before `timeout`; placed after it,
    # `timeout` would try to execute "GLOO_SOCKET_IFNAME=..." as the command.
    GLOO_SOCKET_IFNAME=$NETWORK_CARD_NAME timeout 180s torchrun \
        --nproc_per_node 1 \
        --nnodes ${NNODES} \
        --node_rank ${NODE_RANK} \
        --master_addr ${MASTER_ADDR} \
        --master_port ${MASTER_PORT} \
        gen_ranktable.py --local-host $LOCAL_HOST --prefill-device-cnt $PREFILL_DEVICE_CNT --decode-device-cnt $DECODE_DEVICE_CNT $OPTIONAL_SECTION
fi
@@ -1,30 +0,0 @@
export HCCL_IF_IP=141.61.39.117
export GLOO_SOCKET_IFNAME="enp48s3u1u1"
export TP_SOCKET_IFNAME="enp48s3u1u1"
export HCCL_SOCKET_IFNAME="enp48s3u1u1"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=path-to-rank-table

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

vllm serve model_path \
    --host 0.0.0.0 \
    --port 20002 \
    --tensor-parallel-size 1 \
    --seed 1024 \
    --served-model-name dsv3 \
    --max-model-len 2000 \
    --max-num-batched-tokens 2000 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
    '{"kv_connector": "LLMDataDistCMgrConnector",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": 0,
      "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_connector_v1_a3"
    }' \
    --additional-config \
    '{"enable_graph_mode": "True"}'