**lookup_rpc_port**: Port for RPC Communication Between Pooling Scheduler Process and Worker Process: Each Instance Requires a Unique Port Configuration.
`eviction_high_watermark_ratio` determines the watermark where Mooncake Store will perform eviction,and `eviction_ratio` determines the portion of stored objects that would be evicted.
Using `MultiConnector` to simultaneously utilize both `MooncakeConnectorV1` and `AscendStoreConnector`. `MooncakeConnectorV1` performs kv_transfer, while `AscendStoreConnector` serves as the prefix-cache node.
# ASCEND_BUFFER_POOL is the environment variable for configuring the number and size of buffer on NPU Device for aggregation and KV transfer,the value 4:8 means we allocate 4 buffers of size 8MB.
# Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
# This ensures that even in the worst-case scenario—where all Decode cards simultaneously attempt to connect to the same Prefill card the connection will not time out.
# Unit: ms. The timeout for one-sided communication transfer is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039).
Currently, the key-value pool in PD Disaggregate only stores the kv cache generated by the Prefill node by default. In models using MLA, it is now supported that the Decode node stores the kv cache for use by the Prefill node, enabled by adding `consumer_is_to_put: true` to the AscendStoreConnector. If the Prefill node enables PP, `prefill_pp_size` or `prefill_pp_layer_partition` also needs to be set. Example as follows:
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
Configure the localhost, port, and model weight path in the command to your own settings. The requests sent will only go to the port where the mixed deployment script is located, and there is no need to start a separate proxy.
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
Note: For MooncakeStore, it is recommended to perform a warm-up phase before running actual performance benchmarks.
This is because HCCL one-sided communication connections are created lazily after the instance is launched when Device-to-Device communication is involved. Currently, full-mesh connections between all devices are required. Establishing these connections introduces a one-time time overhead and persistent device memory consumption (4 MB of device memory per connection).
**For warm-up, it is recommended to issue requests with an input sequence length of 8K and an output sequence length of 1, with the total number of requests being 2–3× the number of devices (cards/dies).**
Set TLS certificate configurations. If TLS is disabled, you do not need to upload a certificate. If TLS is enabled, you need to upload a certificate.
```shell
# mmc-meta.conf
ock.mmc.tls.enable = false
ock.mmc.config_store.tls.enable = false
# mmc-local.conf
ock.mmc.tls.enable = false
ock.mmc.config_store.tls.enable = false
ock.mmc.local_service.hcom.tls.enable = false
```
You are advised to copy mmc-local.conf and mmc-meta.conf to your own path and modify them, and set the MMC_META_CONFIG_PATH environment variable to the path of your own mmc-meta.conf file.
**mmc-meta.conf:**
```shell
# Meta service start-up url
# It will automatically modified to PodIP at Pod startup in K8s meta service cluster master-standby high availability scenario
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
# config store url, It will automatically modified to PodIP at Pod startup in K8s
* ock.mmc.meta_service_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
* ock.mmc.meta_service.config_store_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
* To disable TLS authentication modification, set the following parameters to false:ock.mmc.meta.ha.enable、ock.mmc.config_store.tls.enable
**mmc-local.conf:**
```shell
# Meta service start-up url
# K8s meta service cluster master-standby high availability scenario: ClusterIP address
# Non-HA scenario: keep consistent with the same name configuration in mmc-meta.conf
# Data transfer protocol, 'host_rdma': rdma over host; 'host_tcp': tcp over host; 'device_rdma': rdma over device; 'device_sdma': sdma over device
ock.mmc.local_service.protocol = device_sdma
# HBM/DRAM space usage, configuration type supports 134217728, 2048KB/2048K, 200MB/200mb/200m, 2.5GB or 1TB, case-insensitive, the maximum value is 1TB
# The system automatically calculates and aligns downwards to 2MB (host_sdma or host_tcp) or 1GB (device_sdma or device_rdma)
# After alignment, the HBM size and DRAM size cannot both be 0 at the same time
ock.mmc.local_service.dram.size = 2GB
ock.mmc.local_service.hbm.size = 0
# If the protocol is host_rdma, the ip needs to be set as RDMA network card ip. Use 'show_gids' command to query it
# The total retry duration (retry interval is 200ms) when client requests meta service and the connection does not exist
# Default value is 0, means no-retry and return immediately, value range [0, 600000]
ock.mmc.client.retry_milliseconds = 0
ock.mmc.client.timeout.seconds = 60
# read/write thread pool size, value range [1, 64]
ock.mmc.client.read_thread_pool.size = 16
ock.mmc.client.write_thread_pool.size = 2
```
**Key Focuses:**
* ock.mmc.meta_service_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
* ock.mmc.local_service.config_store_url:Configure the IP address and port number of the master node. The IP address and port number of the P node and D node can be the same.
* ock.mmc.local_service.world_size:Total number of cards for starting services.
* ock.mmc.local_service.protocol:host_rdma (default), device_rdma (supported for A2 and A3 when device ROCE available, recommended for A2), device_sdma (supported for A3 when HCCS available, recommended for A3)
* ock.mmc.local_service.dram.size:Sets the size of the memory occupied by the master. The configured value is the size of the memory occupied by each card.
* To disable TLS authentication modification, set the following parameters to false::ock.mmc.meta.ha.enable、ock.mmc.config_store.tls.enable
### Memcache environment variables
```shell
source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/memfabric_hybrid/set_env.sh
# Configuring Environment Variables in the Configuration File
Using `MultiConnector` to simultaneously utilize both `MooncakeConnectorV1` and `AscendStoreConnector`. `MooncakeConnectorV1` performs kv_transfer, while `AscendStoreConnector` enables KV Cache Pool