[DOC]Add Memcache Usage Guide (#6476)

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
This commit is contained in: DreamerLeader, 2026-02-09 21:55:00 +08:00, committed by GitHub (parent 9564c6bb5d, commit 905f0764e0)

@@ -42,7 +42,7 @@ export PYTHONHASHSEED=0
First, we need to obtain the Mooncake project. Refer to the following command:
```shell
-git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
+git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```
(Optional) Replace the `go install` URL if the network is poor
@@ -369,3 +369,717 @@ Note: For MooncakeStore, it is recommended to perform a warm-up phase before run
This is because HCCL one-sided communication connections are created lazily after the instance is launched when Device-to-Device communication is involved. Currently, full-mesh connections between all devices are required. Establishing these connections introduces a one-time time overhead and persistent device memory consumption (4 MB of device memory per connection).
**For warm-up, it is recommended to issue requests with an input sequence length of 8K and an output sequence length of 1, with the total number of requests being 2~3× the number of devices (cards/dies).**
## Example of using Memcache as a KV Pool backend
### Installing Memcache
**MemCache depends on MemFabric, so MemFabric must be installed first; install MemCache after MemFabric.**
* **memfabric_hybrid**: <https://gitcode.com/Ascend/memfabric_hybrid/tree/master/doc/build.md>
* **memcache**: <https://gitcode.com/Ascend/memcache/blob/master/doc/build.md>
### Configuring the Memcache Config Files
The default config path is `/usr/local/memcache_hybrid/latest/config/`.
**Configuration item description**: <https://gitcode.com/Ascend/memcache/blob/develop/doc/memcache_config.md>
Set the TLS certificate configuration. If TLS is disabled, no certificate needs to be provided; if TLS is enabled, you must provide certificates.
```shell
# mmc-meta.conf
ock.mmc.tls.enable = false
ock.mmc.config_store.tls.enable = false
# mmc-local.conf
ock.mmc.tls.enable = false
ock.mmc.config_store.tls.enable = false
ock.mmc.local_service.hcom.tls.enable = false
```
It is recommended to copy `mmc-local.conf` and `mmc-meta.conf` to your own path, modify them there, and set the `MMC_META_CONFIG_PATH` environment variable to the path of your `mmc-meta.conf` file.
**mmc-meta.conf**
```shell
# Meta service start-up URL
# It will automatically be modified to the Pod IP at Pod startup in the K8s meta service cluster master-standby high availability scenario
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
# Config store URL; it will automatically be modified to the Pod IP at Pod startup in K8s
ock.mmc.meta_service.config_store_url = tcp://xx.xx.xx.xx:6000
# Enable or disable high availability deployment
ock.mmc.meta.ha.enable = false
# Log level: debug, info, warn, error
ock.mmc.log_level = error
# Log directory path, supports both relative and absolute paths, the system will automatically append 'logs' directory.
# The absolute log path at default value is '/path/to/mmc_meta_service/../logs'
# If the path of mmc_meta_service is '/usr/local/mxc/memfabric_hybrid/latest/aarch64-linux/bin'
# Then the path of log is '/usr/local/mxc/memfabric_hybrid/latest/aarch64-linux/logs'
ock.mmc.log_path = .
# Log rotation file size, unit is MB, value range [1,500]
ock.mmc.log_rotation_file_size = 20
# Log rotation file count, value range [1,50]
ock.mmc.log_rotation_file_count = 50
# The threshold that triggers eviction, measured as a percentage of space usage
# 'put' operation will trigger eviction when the threshold is exceeded
ock.mmc.evict_threshold_high = 90
# The target threshold of eviction, measured as a percentage of space usage
ock.mmc.evict_threshold_low = 80
# TLS configuration for metaservice
ock.mmc.tls.enable = false
ock.mmc.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.tls.cert.path = /opt/ock/security/certs/server.cert.pem
ock.mmc.tls.key.path = /opt/ock/security/certs/server.private.key.pem
ock.mmc.tls.key.pass.path = /opt/ock/security/certs/server.passphrase
ock.mmc.tls.package.path = /opt/ock/security/libs/
ock.mmc.tls.decrypter.path =
# TLS configuration for config store
ock.mmc.config_store.tls.enable = false
ock.mmc.config_store.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.config_store.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.config_store.tls.cert.path = /opt/ock/security/certs/server.cert.pem
ock.mmc.config_store.tls.key.path = /opt/ock/security/certs/server.private.key.pem
ock.mmc.config_store.tls.key.pass.path = /opt/ock/security/certs/server.passphrase
ock.mmc.config_store.tls.package.path = /opt/ock/security/libs/
ock.mmc.config_store.tls.decrypter.path =
```
**Key Focuses**
* `ock.mmc.meta_service_url`: Configure the IP address and port number of the master node. The P node and D node can use the same IP address and port number.
* `ock.mmc.meta_service.config_store_url`: Configure the IP address and port number of the master node. The P node and D node can use the same IP address and port number.
* To disable TLS authentication, set `ock.mmc.tls.enable` and `ock.mmc.config_store.tls.enable` to `false`.
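The copy-and-customize recommendation above can be scripted. A minimal sketch, not part of Memcache: `MASTER_IP` and `MY_CONF` are placeholders, and the sample conf is synthesized here so the sketch is self-contained (in a real setup you would copy from `/usr/local/memcache_hybrid/latest/config/`):

```shell
# Copy the shipped config to a private directory and point it at the master node.
MASTER_IP=${MASTER_IP:-192.0.2.10}
MY_CONF=${MY_CONF:-$(mktemp -d)}

# Synthesized minimal mmc-meta.conf standing in for the shipped one.
cat > "$MY_CONF/mmc-meta.conf" <<'EOF'
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
ock.mmc.meta_service.config_store_url = tcp://xx.xx.xx.xx:6000
EOF

# Rewrite both placeholder URLs to the master node's IP.
sed -i "s#tcp://xx.xx.xx.xx:#tcp://${MASTER_IP}:#" "$MY_CONF/mmc-meta.conf"
export MMC_META_CONFIG_PATH="$MY_CONF/mmc-meta.conf"
grep meta_service_url "$MY_CONF/mmc-meta.conf"
```

Keeping the modified copies outside the install tree means a package upgrade cannot silently revert your URLs.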
**mmc-local.conf**
```shell
# Meta service start-up URL
# K8s meta service cluster master-standby high availability scenario: ClusterIP address
# Non-HA scenario: keep consistent with the same-named configuration in mmc-meta.conf
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
# Log level: debug, info, warn, error
ock.mmc.log_level = error
# TLS configurations for metaservice
ock.mmc.tls.enable = false
ock.mmc.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.tls.cert.path = /opt/ock/security/certs/client.cert.pem
ock.mmc.tls.key.path = /opt/ock/security/certs/client.private.key.pem
ock.mmc.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
ock.mmc.tls.package.path = /opt/ock/security/libs/
ock.mmc.tls.decrypter.path =
# Total count of local service
ock.mmc.local_service.world_size = 32
# Config store URL; it will automatically be modified to the Pod IP at Pod startup in the HA scenario
# Keep consistent with the same-named configuration in mmc-meta.conf
ock.mmc.local_service.config_store_url = tcp://xx.xx.xx.xx:6000
# TLS configurations for config_store
ock.mmc.config_store.tls.enable = false
ock.mmc.config_store.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.config_store.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.config_store.tls.cert.path = /opt/ock/security/certs/client.cert.pem
ock.mmc.config_store.tls.key.path = /opt/ock/security/certs/client.private.key.pem
ock.mmc.config_store.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
ock.mmc.config_store.tls.package.path = /opt/ock/security/libs/
ock.mmc.config_store.tls.decrypter.path =
# Data transfer protocol, 'host_rdma': rdma over host; 'host_tcp': tcp over host; 'device_rdma': rdma over device; 'device_sdma': sdma over device
ock.mmc.local_service.protocol = device_sdma
# HBM/DRAM space usage; the value supports formats like 134217728, 2048KB/2048K, 200MB/200mb/200m, 2.5GB or 1TB, case-insensitive; the maximum value is 1TB
# The system automatically calculates and aligns downwards to 2MB (host_rdma or host_tcp) or 1GB (device_sdma or device_rdma)
# After alignment, the HBM size and DRAM size cannot both be 0 at the same time
ock.mmc.local_service.dram.size = 2GB
ock.mmc.local_service.hbm.size = 0
# If the protocol is host_rdma, the ip needs to be set as RDMA network card ip. Use 'show_gids' command to query it
ock.mmc.local_service.hcom_url = tcp://127.0.0.1:7000
# HCOM TLS config
ock.mmc.local_service.hcom.tls.enable = false
ock.mmc.local_service.hcom.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.local_service.hcom.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.local_service.hcom.tls.cert.path = /opt/ock/security/certs/client.cert.pem
ock.mmc.local_service.hcom.tls.key.path = /opt/ock/security/certs/client.private.key.pem
ock.mmc.local_service.hcom.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
ock.mmc.local_service.hcom.tls.decrypter.path =
# The total retry duration (retry interval is 200ms) when client requests meta service and the connection does not exist
# Default value is 0, means no-retry and return immediately, value range [0, 600000]
ock.mmc.client.retry_milliseconds = 0
ock.mmc.client.timeout.seconds = 60
# read/write thread pool size, value range [1, 64]
ock.mmc.client.read_thread_pool.size = 16
ock.mmc.client.write_thread_pool.size = 2
```
**Key Focuses**
* `ock.mmc.meta_service_url`: Configure the IP address and port number of the master node. The P node and D node can use the same IP address and port number.
* `ock.mmc.local_service.config_store_url`: Configure the IP address and port number of the master node. The P node and D node can use the same IP address and port number.
* `ock.mmc.local_service.world_size`: Total number of cards across the started services.
* `ock.mmc.local_service.protocol`: `host_rdma` (default); `device_rdma` (supported on A2 and A3 when device RoCE is available, recommended for A2); `device_sdma` (supported on A3 when HCCS is available, recommended for A3).
* `ock.mmc.local_service.dram.size`: Sets the size of the memory reserved by the local service; the configured value applies per card.
* To disable TLS authentication, set `ock.mmc.tls.enable`, `ock.mmc.config_store.tls.enable`, and `ock.mmc.local_service.hcom.tls.enable` to `false`.
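The downward-alignment rule quoted in the config comments above can be illustrated with a tiny helper (a sketch, not part of Memcache; the 2MB and 1GB granularities come from those comments):

```shell
# Round a byte count down to the alignment granularity:
# 2MB (2097152 bytes) for host protocols, 1GB (1073741824 bytes) for device protocols.
align_down() {
  local bytes=$1 unit=$2
  echo $(( bytes / unit * unit ))
}

# 2GB under device_sdma (1GB granularity) is already aligned:
align_down 2147483648 1073741824   # -> 2147483648
# 2.5GB under device_sdma rounds down to 2GB:
align_down 2684354560 1073741824   # -> 2147483648
```

This is why a `dram.size` just under a granularity boundary can silently lose most of the last unit under device protocols.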
### Memcache environment variables
```shell
source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/memfabric_hybrid/set_env.sh
# Point the environment variable at the configuration file
export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf
```
### Run Memcache Master
Method 1 for starting the MetaService service:
1. Set the environment variable pointing to the configuration file:
```shell
export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf
```
2. In a Python console or script, start the process:
```python
from memcache_hybrid import MetaService
MetaService.main()
```
Method 2 for starting the MetaService service.
```shell
source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/memfabric_hybrid/set_env.sh
export MMC_META_CONFIG_PATH=/home/memcache/shell/mmc-meta.conf # Set it to the path of your own configuration file.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/python3.11.10/lib/
/usr/local/memcache_hybrid/latest/aarch64-linux/bin/mmc_meta_service
```
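Before launching vLLM against the meta service, it can help to wait until the configured port actually accepts connections. A bash sketch (uses bash's `/dev/tcp` redirection; the host and port are whatever you set in `ock.mmc.meta_service_url`):

```shell
# Poll a TCP port until it accepts connections or the retry budget runs out.
wait_for_port() {
  local host=$1 port=$2 tries=${3:-30}
  local i
  for i in $(seq 1 "$tries"); do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example (host/port from your mmc-meta.conf):
# wait_for_port xx.xx.xx.xx 5000 && echo "meta service is up"
```

This avoids connectors failing on their first RPC when vLLM starts faster than the meta service.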
### PD Disaggregation Scenario
#### 1. Run the `prefill` Node and `decode` Node
Use `MultiConnector` to combine `MooncakeConnectorV1` and `AscendStoreConnector`: `MooncakeConnectorV1` performs the KV transfer, while `AscendStoreConnector` enables the KV Cache Pool.
#### 800I A2/800T A2 Series
`prefill` Node
```shell
rm -rf /root/ascend/log/*
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
# nic_name can be looked up in ifconfig
nic_name="xxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
rm -rf ./connector.log
vllm serve xxxxxxx/Qwen3-32B \
--host 0.0.0.0 \
--port 30050 \
--enforce-eager \
--data-parallel-size 2 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name qwen3 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--max-num_seqs 20 \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"engine_id": "2",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_buffer_device": "npu",
"kv_rank": 0,
"kv_port": "20001",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 4
},
"decode": {
"dp_size": 2,
"tp_size": 4
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config":{
"backend": "memcache",
"lookup_rpc_port":"0"
}
}
]
}
}' > log_p.log 2>&1
```
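Because the whole `--kv-transfer-config` value is one quoted JSON string, a stray comma or quote only surfaces deep inside vLLM startup. A pre-flight sketch that validates the string first (the config below mirrors the prefill example above, abbreviated):

```shell
# Validate the kv-transfer-config JSON before passing it to vllm serve.
KV_CFG='{
  "kv_connector": "MultiConnector",
  "kv_role": "kv_producer",
  "kv_connector_extra_config": {
    "connectors": [
      {"kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer"},
      {"kv_connector": "AscendStoreConnector", "kv_role": "kv_producer",
       "kv_connector_extra_config": {"backend": "memcache", "lookup_rpc_port": "0"}}
    ]
  }
}'
echo "$KV_CFG" | python3 -c 'import json,sys; json.load(sys.stdin); print("kv-transfer-config OK")'
# Then: vllm serve ... --kv-transfer-config "$KV_CFG"
```

Holding the config in a variable also keeps the prefill and decode scripts from drifting apart when only `kv_role`, `kv_rank`, and the ports differ.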
`decode` Node
```shell
rm -rf /root/ascend/log/*
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
# nic_name can be looked up in ifconfig
nic_name="xxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
rm -rf ./connector.log
vllm serve xxxxxxx/Qwen3-32B \
--host 0.0.0.0 \
--port 30060 \
--enforce-eager \
--data-parallel-size 2 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name qwen3 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--max-num_seqs 20 \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_buffer_device": "npu",
"kv_rank": 1,
"kv_port": "20002",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 4
},
"decode": {
"dp_size": 2,
"tp_size": 4
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config":{
"backend": "memcache",
"lookup_rpc_port":"1"
}
}
]
}
}' > log_d.log 2>&1
```
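Once both nodes report ready, a quick smoke request confirms the endpoint answers. A sketch: the port 30050 and served model name `qwen3` come from the scripts above, and the payload is sanity-checked offline before the (commented-out) request is sent:

```shell
# Build and sanity-check a completion payload, then send it to the serving endpoint.
payload='{"model": "qwen3", "prompt": "Hello", "max_tokens": 16}'
echo "$payload" | python3 -c 'import json,sys; print(json.load(sys.stdin)["model"])'
# curl -s http://127.0.0.1:30050/v1/completions \
#   -H "Content-Type: application/json" -d "$payload"
```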
#### 800I A3/800T A3 Series
`prefill` Node
```shell
rm -rf /root/ascend/log/*
# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export ACL_OP_INIT_MODE=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
python -m vllm.entrypoints.openai.api_server \
--model=xxxxxxxxx/DeepSeek-R1 \
--served-model-name dsv3 \
--trust-remote-code \
--enforce-eager \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--port 30050 \
--max-num_seqs 28 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--enable_expert_parallel \
--quantization ascend \
--gpu-memory-utilization 0.90 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"engine_id": "2",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_buffer_device": "npu",
"kv_rank": 0,
"kv_port": "20001",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 2,
"tp_size": 8
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config":{
"backend": "memcache",
"lookup_rpc_port":"0"
}
}
]
}
}' > log_p.log 2>&1
```
`decode` Node
```shell
rm -rf /root/ascend/log/*
# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export ACL_OP_INIT_MODE=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
python -m vllm.entrypoints.openai.api_server \
--model=xxxxxxxxxxxxxxxx/DeepSeek \
--served-model-name dsv3 \
--trust-remote-code \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--port 30060 \
--max-model-len 16384 \
--max-num-batched-tokens 5200 \
--enforce-eager \
--quantization ascend \
--no-enable-prefix-caching \
--max-num_seqs 28 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enable_expert_parallel \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_buffer_device": "npu",
"kv_rank": 1,
"kv_port": "20002",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 2,
"tp_size": 8
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config":{
"backend": "memcache",
"lookup_rpc_port":"1"
}
}
]
}
}' > log_d.log 2>&1
```
#### [2. Start proxy_server](#2start-proxy_server)
#### [3. Run Inference](#3run-inference)
### PD-Mixed Scenario
#### 1. Run the Mixed Deployment Script
#### 800I A2/800T A2 Series
The DeepSeek model needs to run on a two-node cluster.
**Run_hunbu_1.sh:**
```shell
rm -rf /root/ascend/log/*
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
# nic_name can be looked up in ifconfig
nic_name="xxxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
rm -rf ./connector.log
vllm serve xxxxxxx/DeepSeek-R1 \
--host 0.0.0.0 \
--port 30050 \
--enforce-eager \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--api-server-count 2 \
--data-parallel-address 141.61.33.167 \
--data-parallel-rpc-port 13348 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--max-num_seqs 20 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false}' \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"backend": "memcache",
"lookup_rpc_port":"0"
}
}' > log_hunbu_1.log 2>&1
```
**Run_hunbu_2.sh:**
```shell
rm -rf /root/ascend/log/*
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
# nic_name can be looked up in ifconfig
nic_name="xxxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
# export VLLM_TORCH_PROFILER_DIR="./vllm-profiling"
# export VLLM_TORCH_PROFILER_WITH_STACK=0
rm -rf ./connector.log
vllm serve xxxxxxx/DeepSeek-R1 \
--host 0.0.0.0 \
--port 30050 \
--headless \
--enforce-eager \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address 141.61.33.167 \
--data-parallel-rpc-port 13348 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--max-num_seqs 20 \
--enable-expert-parallel \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"backend": "memcache",
"mooncake_rpc_port":"0"
}
}' > log_hunbu_2.log 2>&1
```
#### 800I A3/800T A3 Series
```shell
bash mixed_department.sh
```
Content of mixed_department.sh:
```shell
rm -rf /root/ascend/log/*
# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export ACL_OP_INIT_MODE=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
python -m vllm.entrypoints.openai.api_server \
--model=xxxxxxx/DeepSeek-R1 \
--served-model-name dsv3 \
--trust-remote-code \
--enforce-eager \
-dp 2 \
-tp 8 \
--port 30050 \
--max-num_seqs 28 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
--enable_expert_parallel \
--quantization ascend \
--gpu-memory-utilization 0.90 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"backend": "memcache",
"mooncake_rpc_port":"0"
}
}' > log_hunbu.log 2>&1
```
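Each script above reserves huge pages with `echo 200000 > /proc/sys/vm/nr_hugepages`, but the kernel may grant fewer pages than requested. A verification sketch (reads `/proc/meminfo` by default, or any file in the same format):

```shell
# Return success if at least $1 huge pages are actually reserved.
check_hugepages() {
  local want=$1 meminfo=${2:-/proc/meminfo}
  local have
  have=$(awk '/^HugePages_Total:/ {print $2}' "$meminfo")
  [ -n "$have" ] && [ "$have" -ge "$want" ]
}

# Usage after running one of the scripts above:
# check_hugepages 200000 || echo "huge page reservation fell short"
```

Checking this before launch is cheaper than diagnosing an allocation failure inside Memcache.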
#### [2. Run Inference](#2run-inference)