[DOC]Add Memcache Usage Guide (#6476)
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
This commit is contained in:
@@ -42,7 +42,7 @@ export PYTHONHASHSEED=0
|
||||
First, we need to obtain the Mooncake project. Refer to the following command:
|
||||
|
||||
```shell
|
||||
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
|
||||
git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
|
||||
```
|
||||
|
||||
(Optional) Replace go install url if the network is poor
|
||||
@@ -369,3 +369,717 @@ Note: For MooncakeStore, it is recommended to perform a warm-up phase before run
|
||||
This is because HCCL one-sided communication connections are created lazily after the instance is launched when Device-to-Device communication is involved. Currently, full-mesh connections between all devices are required. Establishing these connections introduces a one-time time overhead and persistent device memory consumption (4 MB of device memory per connection).
|
||||
|
||||
**For warm-up, it is recommended to issue requests with an input sequence length of 8K and an output sequence length of 1, with the total number of requests being 2–3× the number of devices (cards/dies).**
|
||||
|
||||
## Example of using Memcache as a KV Pool backend
|
||||
|
||||
### Installing Memcache
|
||||
|
||||
**MemCache depends on MemFabric. Therefore, MemFabric must be installed.Installing the memcache after the memfabric is installed.**
|
||||
|
||||
* **memfabric_hybrid**: <https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>
|
||||
|
||||
* **memcache**: <https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>
|
||||
|
||||
### Configuring the memcache Config File
|
||||
|
||||
config Path:/usr/local/memcache_hybrid/latest/config/
|
||||
**Configuration item description**:<https://gitcode.com/Ascend/memcache/blob/develop/doc/memcache_config.md>
|
||||
|
||||
Set TLS certificate configurations. If TLS is disabled, you do not need to upload a certificate. If TLS is enabled, you need to upload a certificate.
|
||||
|
||||
```shell
|
||||
# mmc-meta.conf
|
||||
ock.mmc.tls.enable = false
|
||||
ock.mmc.config_store.tls.enable = false
|
||||
|
||||
# mmc-local.conf
|
||||
ock.mmc.tls.enable = false
|
||||
ock.mmc.config_store.tls.enable = false
|
||||
ock.mmc.local_service.hcom.tls.enable = false
|
||||
```
|
||||
|
||||
You are advised to copy mmc-local.conf and mmc-meta.conf to your own path and modify them, and set the MMC_META_CONFIG_PATH environment variable to the path of your own mmc-meta.conf file.
|
||||
|
||||
**mmc-meta.conf:**
|
||||
|
||||
```shell
|
||||
# Meta service start-up url
|
||||
# It will automatically modified to PodIP at Pod startup in K8s meta service cluster master-standby high availability scenario
|
||||
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
|
||||
# config store url, It will automatically modified to PodIP at Pod startup in K8s
|
||||
ock.mmc.meta_service.config_store_url = tcp://xx.xx.xx.xx:6000
|
||||
# Enable or disable high availability deployment
|
||||
ock.mmc.meta.ha.enable = false
|
||||
# Log level: debug, info, warn, error
|
||||
ock.mmc.log_level = error
|
||||
# Log directory path, supports both relative and absolute paths, the system will automatically append 'logs' directory.
|
||||
# The absolute log path at default value is '/path/to/mmc_meta_service/../logs'
|
||||
# If the path of mmc_meta_service is '/usr/local/mxc/memfabric_hybrid/latest/aarch64-linux/bin'
|
||||
# Then the path of log is '/usr/local/mxc/memfabric_hybrid/latest/aarch64-linux/logs'
|
||||
ock.mmc.log_path = .
|
||||
# Log rotation file size, unit is MB, value range [1,500]
|
||||
ock.mmc.log_rotation_file_size = 20
|
||||
# Log rotation file count, value range [1,50]
|
||||
ock.mmc.log_rotation_file_count = 50
|
||||
|
||||
# The threshold that triggers eviction, measured as a percentage of space usage
|
||||
# 'put' operation will trigger eviction when the threshold is exceeded
|
||||
ock.mmc.evict_threshold_high = 90
|
||||
# The target threshold of eviction, measured as a percentage of space usage
|
||||
ock.mmc.evict_threshold_low = 80
|
||||
|
||||
# TLS configuration for metaservice
|
||||
ock.mmc.tls.enable = false
|
||||
ock.mmc.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
|
||||
ock.mmc.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
|
||||
ock.mmc.tls.cert.path = /opt/ock/security/certs/server.cert.pem
|
||||
ock.mmc.tls.key.path = /opt/ock/security/certs/server.private.key.pem
|
||||
ock.mmc.tls.key.pass.path = /opt/ock/security/certs/server.passphrase
|
||||
ock.mmc.tls.package.path = /opt/ock/security/libs/
|
||||
ock.mmc.tls.decrypter.path =
|
||||
|
||||
# TLS configuration for config store
|
||||
ock.mmc.config_store.tls.enable = false
|
||||
ock.mmc.config_store.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
|
||||
ock.mmc.config_store.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
|
||||
ock.mmc.config_store.tls.cert.path = /opt/ock/security/certs/server.cert.pem
|
||||
ock.mmc.config_store.tls.key.path = /opt/ock/security/certs/server.private.key.pem
|
||||
ock.mmc.config_store.tls.key.pass.path = /opt/ock/security/certs/server.passphrase
|
||||
ock.mmc.config_store.tls.package.path = /opt/ock/security/libs/
|
||||
ock.mmc.config_store.tls.decrypter.path =
|
||||
```
|
||||
|
||||
**Key Focuses:**
|
||||
|
||||
* ock.mmc.meta_service_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
|
||||
* ock.mmc.meta_service.config_store_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
|
||||
* To disable TLS authentication modification, set the following parameters to false:ock.mmc.meta.ha.enable、ock.mmc.config_store.tls.enable
|
||||
|
||||
**mmc-local.conf:**
|
||||
|
||||
```shell
|
||||
# Meta service start-up url
|
||||
# K8s meta service cluster master-standby high availability scenario: ClusterIP address
|
||||
# Non-HA scenario: keep consistent with the same name configuration in mmc-meta.conf
|
||||
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
|
||||
# Log level: debug, info, warn, error
|
||||
ock.mmc.log_level = error
|
||||
|
||||
# TLS configurations for metaservice
|
||||
ock.mmc.tls.enable = false
|
||||
ock.mmc.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
|
||||
ock.mmc.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
|
||||
ock.mmc.tls.cert.path = /opt/ock/security/certs/client.cert.pem
|
||||
ock.mmc.tls.key.path = /opt/ock/security/certs/client.private.key.pem
|
||||
ock.mmc.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
|
||||
ock.mmc.tls.package.path = /opt/ock/security/libs/
|
||||
ock.mmc.tls.decrypter.path =
|
||||
|
||||
# Total count of local service
|
||||
ock.mmc.local_service.world_size = 32
|
||||
# config store url, it will automatically modified to PodIP at Pod startup in HA scenario
|
||||
# keep consistent with the same name configuration in mmc-meta.conf
|
||||
ock.mmc.local_service.config_store_url = tcp://xx.xx.xx.xx:6000
|
||||
# TLS configurations for config_store
|
||||
ock.mmc.config_store.tls.enable = false
|
||||
ock.mmc.config_store.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
|
||||
ock.mmc.config_store.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
|
||||
ock.mmc.config_store.tls.cert.path = /opt/ock/security/certs/client.cert.pem
|
||||
ock.mmc.config_store.tls.key.path = /opt/ock/security/certs/client.private.key.pem
|
||||
ock.mmc.config_store.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
|
||||
ock.mmc.config_store.tls.package.path = /opt/ock/security/libs/
|
||||
ock.mmc.config_store.tls.decrypter.path =
|
||||
|
||||
# Data transfer protocol, 'host_rdma': rdma over host; 'host_tcp': tcp over host; 'device_rdma': rdma over device; 'device_sdma': sdma over device
|
||||
ock.mmc.local_service.protocol = device_sdma
|
||||
# HBM/DRAM space usage, configuration type supports 134217728, 2048KB/2048K, 200MB/200mb/200m, 2.5GB or 1TB, case-insensitive, the maximum value is 1TB
|
||||
# The system automatically calculates and aligns downwards to 2MB (host_sdma or host_tcp) or 1GB (device_sdma or device_rdma)
|
||||
# After alignment, the HBM size and DRAM size cannot both be 0 at the same time
|
||||
ock.mmc.local_service.dram.size = 2GB
|
||||
ock.mmc.local_service.hbm.size = 0
|
||||
|
||||
# If the protocol is host_rdma, the ip needs to be set as RDMA network card ip. Use 'show_gids' command to query it
|
||||
ock.mmc.local_service.hcom_url = tcp://127.0.0.1:7000
|
||||
# HCOM TLS config
|
||||
ock.mmc.local_service.hcom.tls.enable = false
|
||||
ock.mmc.local_service.hcom.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
|
||||
ock.mmc.local_service.hcom.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
|
||||
ock.mmc.local_service.hcom.tls.cert.path = /opt/ock/security/certs/client.cert.pem
|
||||
ock.mmc.local_service.hcom.tls.key.path = /opt/ock/security/certs/client.private.key.pem
|
||||
ock.mmc.local_service.hcom.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
|
||||
ock.mmc.local_service.hcom.tls.decrypter.path =
|
||||
|
||||
# The total retry duration (retry interval is 200ms) when client requests meta service and the connection does not exist
|
||||
# Default value is 0, means no-retry and return immediately, value range [0, 600000]
|
||||
ock.mmc.client.retry_milliseconds = 0
|
||||
|
||||
ock.mmc.client.timeout.seconds = 60
|
||||
|
||||
# read/write thread pool size, value range [1, 64]
|
||||
ock.mmc.client.read_thread_pool.size = 16
|
||||
ock.mmc.client.write_thread_pool.size = 2
|
||||
```
|
||||
|
||||
**Key Focuses:**
|
||||
|
||||
* ock.mmc.meta_service_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
|
||||
* ock.mmc.local_service.config_store_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
|
||||
* ock.mmc.local_service.world_size:Total number of cards for starting services.
|
||||
* ock.mmc.local_service.protocol:host_rdma (default), device_rdma (supported for A2 and A3 when device ROCE available, recommended for A2), device_sdma (supported for A3 when HCCS available, recommended for A3)
|
||||
* ock.mmc.local_service.dram.size:Sets the size of the memory occupied by the master. The configured value is the size of the memory occupied by each card.
|
||||
* To disable TLS authentication modification, set the following parameters to false::ock.mmc.meta.ha.enable、ock.mmc.config_store.tls.enable
|
||||
|
||||
### Memcache environment variables
|
||||
|
||||
```shell
|
||||
source /usr/local/memcache_hybrid/set_env.sh
|
||||
source /usr/local/memfabric_hybrid/set_env.sh
|
||||
# Configuring Environment Variables in the Configuration File
|
||||
export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf
|
||||
```
|
||||
|
||||
### Run Memcache Master
|
||||
|
||||
Starting the MetaService service.
|
||||
|
||||
```shell
|
||||
1. Set environment variables for the configuration file.
|
||||
export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf
|
||||
|
||||
2. Access the Python console or compile the following Python script to start the process:
|
||||
from memcache_hybrid import MetaService
|
||||
MetaService.main()
|
||||
```
|
||||
|
||||
Method 2 for starting the MetaService service.
|
||||
|
||||
```shell
|
||||
source /usr/local/memcache_hybrid/set_env.sh
|
||||
source /usr/local/memfabric_hybrid/set_env.sh
|
||||
export MMC_META_CONFIG_PATH=/home/memcache/shell/mmc-meta.conf # Set it to the path of your own configuration file.
|
||||
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/python3.11.10/lib/
|
||||
/usr/local/memcache_hybrid/latest/aarch64-linux/bin/mmc_meta_service
|
||||
```
|
||||
|
||||
### PD Disaggregation Scenario
|
||||
|
||||
#### 1.Run `prefill` Node and `decode` Node
|
||||
|
||||
Using `MultiConnector` to simultaneously utilize both `MooncakeConnectorV1` and `AscendStoreConnector`. `MooncakeConnectorV1` performs kv_transfer, while `AscendStoreConnector` enables KV Cache Pool
|
||||
|
||||
#### 800I A2/800T A2 Series
|
||||
|
||||
`prefill` Node:
|
||||
|
||||
```shell
|
||||
rm -rf /root/ascend/log/*
|
||||
|
||||
source /usr/local/memfabric_hybrid/set_env.sh
|
||||
source /usr/local/memcache_hybrid/set_env.sh
|
||||
|
||||
# memcache:
|
||||
echo 200000 > /proc/sys/vm/nr_hugepages
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
|
||||
|
||||
# nic_name can be looked up in ifconfig
|
||||
nic_name="xxxxxx"
|
||||
local_ip="xx.xx.xx.xx"
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export PYTHONHASHSEED=0
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
|
||||
rm -rf ./connector.log
|
||||
vllm serve xxxxxxx/Qwen3-32B \
|
||||
--host 0.0.0.0 \
|
||||
--port 30050 \
|
||||
--enforce-eager \
|
||||
--data-parallel-size 2 \
|
||||
--tensor-parallel-size 4 \
|
||||
--seed 1024 \
|
||||
--served-model-name qwen3 \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--max-num_seqs 20 \
|
||||
--no-enable-prefix-caching \
|
||||
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
|
||||
--kv-transfer-config \
|
||||
'{
|
||||
"kv_connector": "MultiConnector",
|
||||
"kv_role": "kv_producer",
|
||||
"engine_id": "2",
|
||||
"kv_connector_extra_config": {
|
||||
"connectors": [
|
||||
{
|
||||
"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_buffer_device": "npu",
|
||||
"kv_rank": 0,
|
||||
"kv_port": "20001",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 4
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"kv_connector": "AscendStoreConnector",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_connector_extra_config":{
|
||||
"backend": "memcache",
|
||||
"lookup_rpc_port":"0"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}' > log_p.log 2>&1
|
||||
```
|
||||
|
||||
`decode` Node:
|
||||
|
||||
```shell
|
||||
rm -rf /root/ascend/log/*
|
||||
|
||||
source /usr/local/memfabric_hybrid/set_env.sh
|
||||
source /usr/local/memcache_hybrid/set_env.sh
|
||||
|
||||
# memcache:
|
||||
echo 200000 > /proc/sys/vm/nr_hugepages
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
|
||||
|
||||
# nic_name can be looked up in ifconfig
|
||||
nic_name="xxxxxx"
|
||||
local_ip="xx.xx.xx.xx"
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export PYTHONHASHSEED=0
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
|
||||
rm -rf ./connector.log
|
||||
vllm serve xxxxxxx/Qwen3-32B \
|
||||
--host 0.0.0.0 \
|
||||
--port 30060 \
|
||||
--enforce-eager \
|
||||
--data-parallel-size 2 \
|
||||
--tensor-parallel-size 4 \
|
||||
--seed 1024 \
|
||||
--served-model-name qwen3 \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--max-num_seqs 20 \
|
||||
--no-enable-prefix-caching \
|
||||
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
|
||||
--kv-transfer-config \
|
||||
'{
|
||||
"kv_connector": "MultiConnector",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_connector_extra_config": {
|
||||
"connectors": [
|
||||
{
|
||||
"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_buffer_device": "npu",
|
||||
"kv_rank": 1,
|
||||
"kv_port": "20002",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 4
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 4
|
||||
}
|
||||
}
|
||||
} ,
|
||||
{
|
||||
"kv_connector": "AscendStoreConnector",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_connector_extra_config":{
|
||||
"backend": "memcache",
|
||||
"lookup_rpc_port":"1"
|
||||
}
|
||||
}
|
||||
|
||||
]
|
||||
}
|
||||
}' > log_d.log 2>&1
|
||||
```
|
||||
|
||||
#### 800I A3/800T A3 Series
|
||||
|
||||
`prefill` Node:
|
||||
|
||||
```shell
|
||||
rm -rf /root/ascend/log/*
|
||||
|
||||
# memcache:
|
||||
echo 200000 > /proc/sys/vm/nr_hugepages
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf
|
||||
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
|
||||
export ACL_OP_INIT_MODE=1
|
||||
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
|
||||
|
||||
export PYTHONHASHSEED=0
|
||||
export HCCL_BUFFSIZE=1024
|
||||
|
||||
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--model=xxxxxxxxx/DeepSeek-R1 \
|
||||
--served-model-name dsv3 \
|
||||
--trust-remote-code \
|
||||
--enforce-eager \
|
||||
--data-parallel-size 2 \
|
||||
--tensor-parallel-size 8 \
|
||||
--port 30050 \
|
||||
--max-num_seqs 28 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
|
||||
--enable_expert_parallel \
|
||||
--quantization ascend \
|
||||
--gpu-memory-utilization 0.90 \
|
||||
--no-enable-prefix-caching \
|
||||
--kv-transfer-config \
|
||||
'{
|
||||
"kv_connector": "MultiConnector",
|
||||
"kv_role": "kv_producer",
|
||||
"engine_id": "2",
|
||||
"kv_connector_extra_config": {
|
||||
"connectors": [
|
||||
{
|
||||
"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_buffer_device": "npu",
|
||||
"kv_rank": 0,
|
||||
"kv_port": "20001",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"kv_connector": "AscendStoreConnector",
|
||||
"kv_role": "kv_producer",
|
||||
"kv_connector_extra_config":{
|
||||
"backend": "memcache",
|
||||
"lookup_rpc_port":"0"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}' > log_p.log 2>&1
|
||||
```
|
||||
|
||||
`decode` Node:
|
||||
|
||||
```shell
|
||||
rm -rf /root/ascend/log/*
|
||||
|
||||
# memcache:
|
||||
echo 200000 > /proc/sys/vm/nr_hugepages
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf
|
||||
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
|
||||
export ACL_OP_INIT_MODE=1
|
||||
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
|
||||
|
||||
export PYTHONHASHSEED=0
|
||||
export HCCL_BUFFSIZE=1024
|
||||
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--model=xxxxxxxxxxxxxxxx/DeepSeek \
|
||||
--served-model-name dsv3 \
|
||||
--trust-remote-code \
|
||||
--data-parallel-size 2 \
|
||||
--tensor-parallel-size 8 \
|
||||
--port 30060 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 5200 \
|
||||
--enforce-eager\
|
||||
--quantization ascend \
|
||||
--no-enable-prefix-caching \
|
||||
--max-num_seqs 28 \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
|
||||
--enable_expert_parallel \
|
||||
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--kv-transfer-config \
|
||||
'{
|
||||
"kv_connector": "MultiConnector",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_connector_extra_config": {
|
||||
"connectors": [
|
||||
{
|
||||
"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_buffer_device": "npu",
|
||||
"kv_rank": 1,
|
||||
"kv_port": "20002",
|
||||
"kv_connector_extra_config": {
|
||||
"use_ascend_direct": true,
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"kv_connector": "AscendStoreConnector",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_connector_extra_config":{
|
||||
"backend": "memcache",
|
||||
"lookup_rpc_port":"1"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}' > log_d.log 2>&1
|
||||
```
|
||||
|
||||
#### [2、Start proxy_server](#2start-proxy_server)
|
||||
|
||||
#### [3、run-inference](#3run-inference)
|
||||
|
||||
### PD-Mixed Scenario
|
||||
|
||||
#### 1.Run Mixed Department Script
|
||||
|
||||
#### 800I A2/800T A2 Series
|
||||
|
||||
The deepseek model needs to be run in a two-node cluster.
|
||||
|
||||
**Run_hunbu_1.sh:**
|
||||
|
||||
```shell
|
||||
rm -rf /root/ascend/log/*
|
||||
|
||||
source /usr/local/memfabric_hybrid/set_env.sh
|
||||
source /usr/local/memcache_hybrid/set_env.sh
|
||||
|
||||
# memcache:
|
||||
echo 200000 > /proc/sys/vm/nr_hugepages
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
|
||||
|
||||
# nic_name can be looked up in ifconfig
|
||||
nic_name="xxxxxxx"
|
||||
local_ip="xx.xx.xx.xx"
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
|
||||
export PYTHONHASHSEED=0
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
|
||||
rm -rf ./connector.log
|
||||
vllm serve xxxxxxx/DeepSeek-R1 \
|
||||
--host 0.0.0.0 \
|
||||
--port 30050 \
|
||||
--enforce-eager \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--api-server-count 2 \
|
||||
--data-parallel-address 141.61.33.167 \
|
||||
--data-parallel-rpc-port 13348 \
|
||||
--tensor-parallel-size 8 \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--quantization ascend \
|
||||
--max-num_seqs 20 \
|
||||
--enable-expert-parallel \
|
||||
--no-enable-prefix-caching \
|
||||
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
|
||||
--kv-transfer-config \
|
||||
'{
|
||||
"kv_connector": "AscendStoreConnector",
|
||||
"kv_role": "kv_both",
|
||||
"kv_connector_extra_config": {
|
||||
"backend": "memcache",
|
||||
"lookup_rpc_port":"0"
|
||||
}
|
||||
}' > log_hunbu_1.log 2>&1
|
||||
|
||||
```
|
||||
|
||||
**Run_hunbu_2.sh:**
|
||||
|
||||
```shell
|
||||
rm -rf /root/ascend/log/*
|
||||
|
||||
source /usr/local/memfabric_hybrid/set_env.sh
|
||||
source /usr/local/memcache_hybrid/set_env.sh
|
||||
|
||||
# memcache:
|
||||
echo 200000 > /proc/sys/vm/nr_hugepages
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
|
||||
|
||||
# nic_name can be looked up in ifconfig
|
||||
nic_name="xxxxxxx"
|
||||
local_ip="xx.xx.xx.xx"
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
|
||||
export PYTHONHASHSEED=0
|
||||
export HCCL_BUFFSIZE=1024
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export VLLM_USE_V1=1
|
||||
# export VLLM_TORCH_PROFILER_DIR="./vllm-profiling"
|
||||
# export VLLM_TORCH_PROFILER_WITH_STACK=0
|
||||
|
||||
rm -rf ./connector.log
|
||||
vllm serve xxxxxxx/DeepSeek-R1 \
|
||||
--host 0.0.0.0 \
|
||||
--port 30050 \
|
||||
--headless \
|
||||
--enforce-eager \
|
||||
--data-parallel-size 2 \
|
||||
--data-parallel-size-local 1 \
|
||||
--data-parallel-start-rank 1 \
|
||||
--data-parallel-address 141.61.33.167 \
|
||||
--data-parallel-rpc-port 13348 \
|
||||
--tensor-parallel-size 8 \
|
||||
--seed 1024 \
|
||||
--served-model-name deepseek \
|
||||
--max-model-len 65536 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--quantization ascend \
|
||||
--max-num_seqs 20 \
|
||||
--enable-expert-parallel \
|
||||
--no-enable-prefix-caching \
|
||||
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
|
||||
--kv-transfer-config \
|
||||
'{
|
||||
"kv_connector": "AscendStoreConnector",
|
||||
"kv_role": "kv_both",
|
||||
"kv_connector_extra_config": {
|
||||
"backend": "memcache",
|
||||
"mooncake_rpc_port":"0"
|
||||
}
|
||||
}' > log_hunbu_2.log 2>&1
|
||||
|
||||
```
|
||||
|
||||
#### 800I A3/800T A3 Series
|
||||
|
||||
```shell
|
||||
bash mixed_department.sh
|
||||
```
|
||||
|
||||
Content of mixed_department.sh:
|
||||
|
||||
```shell
|
||||
rm -rf /root/ascend/log/*
|
||||
|
||||
# memcache:
|
||||
echo 200000 > /proc/sys/vm/nr_hugepages
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf
|
||||
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
|
||||
export ACL_OP_INIT_MODE=1
|
||||
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
|
||||
|
||||
export PYTHONHASHSEED=0
|
||||
export HCCL_BUFFSIZE=1024
|
||||
|
||||
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--model=xxxxxxx/DeepSeek-R1 \
|
||||
--served-model-name dsv3 \
|
||||
--trust-remote-code \
|
||||
--enforce-eager \
|
||||
-dp 2 \
|
||||
-tp 8 \
|
||||
--port 30050 \
|
||||
--max-num_seqs 28 \
|
||||
--max-model-len 16384 \
|
||||
--max-num-batched-tokens 16384 \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
|
||||
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
|
||||
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
|
||||
--enable_expert_parallel \
|
||||
--quantization ascend \
|
||||
--gpu-memory-utilization 0.90 \
|
||||
--no-enable-prefix-caching \
|
||||
--kv-transfer-config \
|
||||
'{
|
||||
"kv_connector": "AscendStoreConnector",
|
||||
"kv_role": "kv_both",
|
||||
"kv_connector_extra_config": {
|
||||
"backend": "memcache",
|
||||
"mooncake_rpc_port":"0"
|
||||
}
|
||||
}' > log_hunbu.log 2>&1
|
||||
|
||||
```
|
||||
|
||||
#### [2.Run Inference](#2run-inference)
|
||||
|
||||
Reference in New Issue
Block a user