xc-llm-ascend/docs/source/tutorials/multi_npu.md
Yikun Jiang 007aeaa48b [Doc] Change distributed_executor_backend to mp (#287)
### What this PR does / why we need it?
Fix `ValueError: Unrecognized distributed executor backend tp. Supported
values are 'ray', 'mp' 'uni', 'external_launcher' or custom ExecutorBase
subclass.`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test on my local node

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-03-10 11:27:26 +08:00

# Multi-NPU (QwQ 32B)
## Run vllm-ascend on Multi-NPU
Run the Docker container:
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
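The command above passes one `--device /dev/davinciN` flag per NPU, so four flags expose four NPUs to the container. As a minimal sketch, assuming the devices are numbered consecutively from 0 as in that command, the flags for a different NPU count could be generated like this (`davinci_device_flags` is a hypothetical helper for illustration, not part of vllm-ascend):

```python
def davinci_device_flags(num_npus: int) -> list[str]:
    """Return docker `--device` flags for NPUs 0..num_npus-1, followed by
    the management devices listed in the docker command above."""
    if num_npus < 1:
        raise ValueError("num_npus must be at least 1")
    # One flag per NPU (assumption: consecutive numbering starting at 0).
    flags = [f"--device /dev/davinci{i}" for i in range(num_npus)]
    # The management devices are shared by all NPUs and are passed once.
    flags += [
        "--device /dev/davinci_manager",
        "--device /dev/devmm_svm",
        "--device /dev/hisi_hdc",
    ]
    return flags


if __name__ == "__main__":
    # Print the flags in the same backslash-continued layout as above.
    print(" \\\n".join(davinci_device_flags(4)))
```

The number of NPU devices exposed here should be at least the tensor parallel size used later in this tutorial (4).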
Set up environment variables:
```bash
# Use Modelscope mirror to speed up model download
export VLLM_USE_MODELSCOPE=True
# To avoid NPU out-of-memory errors, set `max_split_size_mb` to a value lower than the size you need to allocate for QwQ-32B
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
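If you run the offline inference script directly instead of from a shell with these exports, you can set the same variables from Python. This is a minimal sketch; it assumes the variables are read when the libraries initialize, so they are set before `vllm` is imported:

```python
import os

# Values mirror the shell exports above. setdefault() keeps any value that
# was already exported in the surrounding shell.
os.environ.setdefault("VLLM_USE_MODELSCOPE", "True")
os.environ.setdefault("PYTORCH_NPU_ALLOC_CONF", "max_split_size_mb:256")

# Import vllm only after this point (assumption: the variables are picked
# up at import or initialization time).
```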
### Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```
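Loading a 32B model across four NPUs can take a while, so it can help to wait until the server responds before sending requests. The sketch below polls the `/health` endpoint of the OpenAI-compatible server using only the standard library. The helper name `wait_until_ready` and its default timeout values are assumptions for illustration; check that the endpoint is available in your vLLM version.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0,
                     interval_s: float = 2.0) -> bool:
    """Poll `<base_url>/health` until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # The server is not accepting connections yet; retry after a pause.
            pass
        time.sleep(interval_s)
    return False
```

For the server started above, this would be called as `wait_until_ready("http://localhost:8000")`.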
Once the server has started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/QwQ-32B",
"prompt": "QwQ-32B是什么",
"max_tokens": 128,
"top_p": 0.95,
"top_k": 40,
"temperature": 0.6
}'
```
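The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server from the previous step is listening on `localhost:8000`, and the helper names `build_completion_request` and `complete` are hypothetical. The sampling parameters are sent as JSON numbers.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:8000",
                             model: str = "Qwen/QwQ-32B") -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.6,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    # The completions response lists generated texts under "choices".
    return body["choices"][0]["text"]
```

With the server running, `print(complete("QwQ-32B是什么"))` prints the generated continuation.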
### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    # Tear down the parallel groups, then release cached NPU memory.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
# Shard the model across 4 NPUs using the multiprocessing ("mp") backend.
llm = LLM(model="Qwen/QwQ-32B",
          tensor_parallel_size=4,
          distributed_executor_backend="mp",
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Drop the engine before cleaning up so its memory can be reclaimed.
del llm
clean_up()
```
If the script runs successfully, you will see output similar to the following:
```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```