# Multi-NPU (QwQ 32B)

## Run vllm-ascend on Multi-NPU

Start the Docker container:

```{code-block} bash
   :substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
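
Before starting the container, it can help to confirm that the host exposes every device node passed via `--device`. The following is a minimal sketch and not part of vllm-ascend: the helper name `missing_device_nodes` is hypothetical, and the device list simply mirrors the command above (four NPUs, `davinci0` through `davinci3`).

```python
import os
from typing import Callable, Iterable, List

# Device nodes taken from the `docker run` command above.
NPU_COUNT = 4
DEVICE_NODES = [f"/dev/davinci{i}" for i in range(NPU_COUNT)] + [
    "/dev/davinci_manager",
    "/dev/devmm_svm",
    "/dev/hisi_hdc",
]


def missing_device_nodes(
    nodes: Iterable[str] = DEVICE_NODES,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the device nodes that are not present on this host."""
    return [node for node in nodes if not exists(node)]
```

Calling `missing_device_nodes()` on the host returns an empty list when all seven nodes are present. If your machine exposes a different set of NPUs, adjust `NPU_COUNT` and the `--device` flags together.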

Set up environment variables:

```bash
# Load the model from ModelScope to speed up the download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
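
When inference is driven from a Python script, the same variables can also be set in-process. The sketch below assumes that setting them before `torch` or `vllm` is imported is the conservative order. The helper name `apply_env_defaults` is hypothetical.

```python
import os
from typing import Dict, MutableMapping

# Values copied from the shell snippet above.
ENV_DEFAULTS: Dict[str, str] = {
    "VLLM_USE_MODELSCOPE": "True",
    "PYTORCH_NPU_ALLOC_CONF": "max_split_size_mb:256",
}


def apply_env_defaults(env: MutableMapping[str, str] = os.environ) -> Dict[str, str]:
    """Set each variable only if it is not already set.

    Returns the values that end up in effect, so that a value exported
    in the shell takes precedence over the defaults here.
    """
    for key, value in ENV_DEFAULTS.items():
        env.setdefault(key, value)
    return {key: env[key] for key in ENV_DEFAULTS}
```

Call `apply_env_defaults()` at the top of the script, ahead of the `torch` and `vllm` imports.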

### Online Inference on Multi-NPU

Run the following command to start the vLLM server with tensor parallelism across four NPUs:

```bash
vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4
```
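
Loading a 32B model across four NPUs can take some time, so a script that sends requests may want to wait until the server is up. The following sketch polls the server's `/health` endpoint on the port used above. The helper names, the timeout, and the polling interval are illustrative assumptions, not values from vllm-ascend.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait_s: float = 1800.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `max_wait_s` has elapsed."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A typical call would be `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))`, which returns `True` as soon as the server answers and `False` if the wait budget runs out.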

Once the server has started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/QwQ-32B",
        "prompt": "What is QwQ-32B?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.6
    }'
```
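
The same request can be sent from Python using only the standard library. This is a minimal sketch: the function names `build_completion_payload` and `complete` are hypothetical, the sampling settings are passed as JSON numbers, and the response is assumed to follow the OpenAI-style completions layout, with the generated text under `choices[0].text`.

```python
import json
import urllib.request
from typing import Any, Dict


def build_completion_payload(prompt: str, model: str = "Qwen/QwQ-32B") -> Dict[str, Any]:
    """Build a request body with the sampling settings from the curl example."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.6,
    }


def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST to /v1/completions and return the first generated text."""
    body = json.dumps(build_completion_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        result = json.load(resp)
    # Generated text is read from choices[0].text (assumed response layout).
    return result["choices"][0]["text"]
```

With the server running, `print(complete("What is QwQ-32B?"))` prints the generated continuation.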

### Offline Inference on Multi-NPU

Run the following script to execute offline inference on multi-NPU:

```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    """Tear down the distributed state and release cached NPU memory."""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="Qwen/QwQ-32B",
          tensor_parallel_size=4,
          distributed_executor_backend="mp",
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```

If the script runs successfully, you should see output similar to the following:

```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```