[Doc] Add qwen vl example in tutorials for 310I series (#2160)

### What this PR does / why we need it?
Adds a Qwen VL example to the tutorials for the 310I series.

Model: Qwen2.5-VL-3B-Instruct
Accuracy test results on the MMMU validation set:
| Category | 910B3 | 310P3 |
| --- | --- | --- |
| Summary | 0.455 | 0.46 |
| art_and_design | 0.558 | 0.566 |
| business | 0.373 | 0.366 |
| health_and_medicine | 0.513 | 0.52 |
| science | 0.333 | 0.333 |
| tech_and_engineering | 0.362 | 0.380 |
| humanities_and_social_science | 0.691 | 0.691 |

Function test results:

1. Online:
![image](https://github.com/user-attachments/assets/d81bba61-df28-4676-a246-c5d094815ac7)
![image](https://github.com/user-attachments/assets/0be81628-9999-4ef2-93c1-898b3043e09e)

2. Offline:
![image](https://github.com/user-attachments/assets/603275c1-6ed6-4cfc-a6e2-7726156de087)

- vLLM version: v0.10.0
- vLLM main: ad57f23f6a

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>

@@ -1,7 +1,8 @@
# Single Node (Atlas 300I series)
```{note}
-This Atlas 300I series is currently experimental. In future versions, there may be behavioral changes around model coverage, performance improvement.
+1. Support for the Atlas 300I series is currently experimental. Future versions may bring behavioral changes around model coverage and performance.
+2. Currently, the 310I series only supports eager mode, and the data type must be float16.
```
## Run vLLM on Atlas 300I series
@@ -83,7 +84,7 @@ curl http://localhost:8000/v1/completions \
::::
-::::{tab-item} Qwen/Qwen2.5-7B-Instruct
+::::{tab-item} Qwen2.5-7B-Instruct
:sync: qwen7b
Run the following command to start the vLLM server:
@@ -113,6 +114,36 @@ curl http://localhost:8000/v1/completions \
::::
::::{tab-item} Qwen2.5-VL-3B-Instruct
:sync: qwen-vl-2.5-3b
Run the following command to start the vLLM server:
```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
--tensor-parallel-size 1 \
--enforce-eager \
--dtype float16 \
--compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
```
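After the server starts, you can confirm it is ready before sending requests; the OpenAI-compatible `/v1/models` endpoint lists the served models (a minimal check, assuming the default port 8000):
```bash
curl http://localhost:8000/v1/models
```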
Then query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
}'
```
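Because Qwen2.5-VL is a multimodal model, you can also exercise the vision path through the OpenAI-compatible chat completions endpoint. A minimal sketch, not part of the original tutorial; the image URL is a placeholder to replace with any reachable image:
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-VL-3B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }],
        "max_tokens": 64
    }'
```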
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu
@@ -251,6 +282,49 @@ clean_up()
::::
::::{tab-item} Qwen2.5-VL-3B-Instruct
:sync: qwen-vl-2.5-3b
```{code-block} python
:substitutions:
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6)

# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    tensor_parallel_size=1,
    enforce_eager=True,  # Only eager mode is supported on the 300I series.
    dtype="float16",     # IMPORTANT: some ATB ops do not support bf16 on the 300I series.
    compilation_config={"custom_ops": ["none", "+rms_norm", "+rotary_embedding"]},  # Improves performance on the 300I series.
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```
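To run the vision path offline as well, vLLM accepts a dict prompt whose `multi_modal_data` field carries a PIL image. A minimal sketch, assuming a local `example.jpg` and the Qwen2.5-VL chat-template tokens (both are assumptions, not part of the original example); reuse `llm` and `sampling_params` from above, before calling `del llm`:
```python
from PIL import Image

# Hypothetical local image path; replace with a real file.
image = Image.open("example.jpg")

# Qwen2.5-VL prompt with the image placeholder tokens (assumed chat template).
prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image.<|im_end|>\n<|im_start|>assistant\n")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)
```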
::::
::::{tab-item} Pangu-Pro-MoE-72B
:sync: pangu