`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
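To apply it, the allocator option is passed through an environment variable before the server starts. A minimal sketch, assuming the torch_npu allocator reads `PYTORCH_NPU_ALLOC_CONF` (the NPU counterpart of PyTorch's `PYTORCH_CUDA_ALLOC_CONF`); the 128 MB threshold is only an illustrative value:
```bash
# Illustrative value: do not split cached blocks larger than 128 MB
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:128
```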
Add the `--max-model-len` option to avoid the ValueError raised when the Qwen2.5-7B model's maximum sequence length (32768) is larger than the number of tokens that can be stored in the KV cache (26240). The available KV cache capacity differs across NPU series depending on their HBM size, so adjust this value to one suitable for your NPU.
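For example, a minimal sketch of a launch command with a capped context length, using the OpenAI-compatible entrypoint shown later in this guide (the value 26240 matches the KV-cache capacity mentioned above and should be adjusted for your NPU):
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 26240
```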
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
If the query succeeds, the client receives a response like the following:
```bash
{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"text":" here. It’s not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```bash
INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
```
For multi-node serving with the Ray distributed backend, every worker node must first join the Ray cluster running on the head node (a sketch of the head-node command follows below):
```bash
ray start --address='{head_node_ip}:{port_num}' --num-gpus=8 --node-ip-address={local_ip}
```
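If Ray has not been started on the head node yet, a minimal sketch of the head-node command, assuming 8 NPUs per node and the same `{head_node_ip}` / `{port_num}` placeholders as above:
```bash
# Start the Ray head node; worker nodes connect to {head_node_ip}:{port_num}
ray start --head --port={port_num} --num-gpus=8 --node-ip-address={head_node_ip}
```
Once every node has joined, running `ray status` on the head node should list all nodes before the vLLM server is launched.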
Start the vLLM server on head node:
```shell
export VLLM_HOST_IP={head_node_ip}               # IP the head node uses for vLLM distributed communication
export HCCL_CONNECT_TIMEOUT=120                  # HCCL connection timeout in seconds
export ASCEND_PROCESS_LOG_PATH={plog_save_path}  # directory for Ascend process logs (plog)
export HCCL_IF_IP={head_node_ip}                 # network interface IP used by HCCL

# Remove stale Ascend process logs from previous runs
if [ -d "{plog_save_path}" ]; then
    rm -rf {plog_save_path}
    echo ">>> remove {plog_save_path}"
fi

LOG_FILE="multinode_$(date +%Y%m%d_%H%M).log"
export VLLM_TORCH_PROFILER_DIR=./vllm_profile    # save torch profiler traces under this directory

python -m vllm.entrypoints.openai.api_server \
    --model="Deepseek/DeepSeek-V2-Lite-Chat" \
    --trust-remote-code \
    --enforce-eager \
    --max-model-len {max_model_len} \
    --distributed-executor-backend ray \
    --tensor-parallel-size 16 \
    --disable-log-requests \
    --disable-log-stats \
    --disable-frontend-multiprocessing \
    --port {port_num} \
    2>&1 | tee "$LOG_FILE"                       # keep a copy of the server log in the timestamped file
```
Once your server is started, you can query the model with input prompts:
```shell
curl -X POST http://127.0.0.1:{port_num}/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Deepseek/DeepSeek-V2-Lite-Chat",
        "prompt": "The future of AI is",
        "max_tokens": 24
    }'
```
If the query succeeds, the client receives a response like the following:
```
{"id":"cmpl-6dfb5a8d8be54d748f0783285dd52303","object":"text_completion","created":1739957835,"model":"/home/data/DeepSeek-V2-Lite-Chat/","choices":[{"index":0,"text":" heavily influenced by neuroscience and cognitiveGuionistes. The goalochondria is to combine the efforts of researchers, technologists,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":30,"completion_tokens":24,"prompt_tokens_details":null}}
```
Logs of the vllm server:
```
INFO: 127.0.0.1:59384 - "POST /v1/completions HTTP/1.1" 200 OK
```
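Since DeepSeek-V2-Lite-Chat is a chat-tuned model, you can also query the OpenAI-compatible chat endpoint served by vLLM; a minimal sketch against the same server:
```shell
curl -X POST http://127.0.0.1:{port_num}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Deepseek/DeepSeek-V2-Lite-Chat",
        "messages": [{"role": "user", "content": "What is the future of AI?"}],
        "max_tokens": 24
    }'
```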