[Doc] Refresh model tutorial examples and serving commands (#7426)
### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace the `max_completion_tokens` argument of `SamplingParams` with
`max_tokens` (**offline** inference currently **does not support**
`max_completion_tokens` and fails with the traceback below)
```bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
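For reference, the updated examples pass `max_tokens` instead. A minimal sketch matching the tutorial's sampling settings:
```python
from vllm import SamplingParams

# Offline inference limits generation length via max_tokens;
# max_completion_tokens is only an OpenAI-style chat-completions field.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
```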
- refresh the recommended environment variables for
**Qwen3-Omni-30B-A3B-Thinking**
```bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
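Without the larger buffer, MoE dispatch tiling fails with an error like: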
```bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
- fix the **Qwen3-Reranker** example to match the current **pooling
runner** interface and score output access
```python
model = LLM(
    model=model_name,
    task="score",  # needs fixing: replaced by runner="pooling" below
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
```
--->
```python
model = LLM(
    model=model_name,
    runner="pooling",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
```
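With the pooling runner, each score is read from `output.outputs.score` rather than `output.outputs[0].score`. A minimal usage sketch based on the updated example, with `query` and `documents` standing in for the tutorial's actual inputs:
```python
# Score one query against a list of candidate documents; each output
# holds a single pooling result whose .score field is the relevance value.
outputs = model.score(query, documents)
print([output.outputs.score for output in outputs])
```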
- change the **PaddleOCR-VL** environment variable `TASK_QUEUE_ENABLE` from `2` to `1`
```bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
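The working setting, as used in the updated script:
```bash
# TASK_QUEUE_ENABLE=2 is rejected during NPU graph capture; 1 or 0 works.
export TASK_QUEUE_ENABLE=1
```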
These changes are needed because several documentation examples had
drifted from the current runtime behavior and recommended invocation
patterns, which could confuse users who follow the tutorials directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main: 4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
@@ -25,8 +25,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8`(Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
- `DeepSeek-V3.1-w8a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot).
- `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
- `Method of Quantify`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantify the model.

@@ -17,7 +17,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `DeepSeek-V3.2-Exp`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
- `DeepSeek-V3.2-Exp-W8A8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8)
- `DeepSeek-V3.2`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. Model weight in BF16 not found now.
- `DeepSeek-V3.2-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)

@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

- `GLM-5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
- `GLM-5-w4a8`: [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8).
- `GLM-5-w8a8`: [Download model weight](https://ai.gitcode.com/Eco-Tech/GLM-5-w8a8/tree/main).
- `GLM-5-w8a8`: [Download model weight](https://www.modelscope.cn/models/Eco-Tech/GLM-5-w8a8).
- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantify the model naively.

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`

@@ -1308,6 +1308,21 @@ python load_balance_proxy_server_example.py \
6721 6722 6723 6724
```

## Functional Verification

Once your server is started, you can query the model with input prompts:

```shell
curl http://<node0_ip>:<port>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5",
"prompt": "The future of AI is",
"max_completion_tokens": 50,
"temperature": 0
}'
```

## Accuracy Evaluation

Here are two accuracy evaluation methods.

@@ -113,12 +113,24 @@ Run the following script to start the vLLM server on Multi-NPU:
For an Atlas 800 A3 (64G*16) node, tensor-parallel-size should be at least 16.

```bash
vllm serve Kimi-K2-Thinking \
--served-model-name kimi-k2-thinking \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching
#!/bin/bash
export VLLM_USE_MODELSCOPE=True
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

vllm serve "moonshotai/Kimi-K2-Thinking" \
--tensor-parallel-size 16 \
--port 8000 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 12 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-expert-parallel \
--no-enable-prefix-caching
```

Once your server is started, you can query the model with input prompts.

@@ -299,7 +299,7 @@ print(resp.choices[0].message.content)
Or send a request using curl:

```{code-block} bash
curl http://127.0.0.1:8000/v1/chat/completions \
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMax-M2.5",

@@ -74,7 +74,7 @@ Run the following script to start the vLLM server on single 910B4:
#!/bin/sh
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
export TASK_QUEUE_ENABLE=2
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

@@ -142,7 +142,7 @@ llm = LLM(
)

sampling_params = SamplingParams(
max_completion_tokens=512
max_tokens=512
)

image_messages = [

@@ -122,7 +122,7 @@ Not supported yet.
After starting the service, verify functionality using a `curl` request:

```shell
curl http://<IP>:<Port>/v1/completions \
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b-instruct",

@@ -16,8 +16,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)
- `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
- `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-3B)
- `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B)

Following examples use the 7B version by default.

@@ -71,6 +71,8 @@ docker run --rm \
:::{note}
The env `LOCAL_MEDIA_PATH` which allowing API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk. Should only be enabled in trusted environments.

:::

```bash
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"

@@ -104,10 +106,10 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
```bash
export VLLM_USE_MODELSCOPE=true
export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
export LOCAL_MEDIA_PATH=/local_path/to_media/
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
export DP_SIZE=8

vllm serve ${MODEL_PATH}\
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
--port 8000 \
--served-model-name Qwen-Omni \

@@ -137,7 +139,7 @@ INFO: Application startup complete.
Once your server is started, you can query the model with input prompts:

```bash
curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
"model": "Qwen-Omni",
"messages": [
{

@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)
- `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

@@ -174,7 +174,7 @@ export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1

vllm serve vllm-ascend/Qwen3-235B-A22B \
vllm serve Qwen/Qwen3-235B-A22B \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \

@@ -219,7 +219,7 @@ export OMP_NUM_THREADS=1
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1

vllm serve vllm-ascend/Qwen3-235B-A22B \
vllm serve Qwen/Qwen3-235B-A22B \
--host 0.0.0.0 \
--port 8000 \
--headless \

@@ -16,7 +16,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

`Qwen3-Coder-30B-A3B-Instruct`(BF16 version): requires 1 Atlas 800 A3 node (with 16x 64G NPUs) or 1 Atlas 800 A2 node (with 8x 64G/32G NPUs). [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Coder-30B-A3B-Instruct)
`Qwen3-Coder-30B-A3B-Instruct`(BF16 version): requires 1 Atlas 800 A3 node (with 16x 64G NPUs) or 1 Atlas 800 A2 node (with 8x 64G/32G NPUs). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Coder-30B-A3B-Instruct)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`

@@ -8,11 +8,7 @@ Welcome to the tutorial on optimizing Qwen Dense models in the vLLM-Ascend envir

This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, accuracy and performance evaluation.

The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429)

## **Node**

This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429). This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.

## Supported Features

@@ -115,12 +111,13 @@ The specific example scenario is as follows:

### Run docker container

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
- v0.11.0rc2-a3 is image tag, replace this with your actual tag.
- replace this with your actual port: '-p 8113:8113'.
- replace this with your actual card: '--device /dev/davinci0'.
:::

```{code-block} bash
:substitutions:

@@ -142,7 +139,7 @@ docker run --rm \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /model/Qwen3-32B-W8A8:/model/Qwen3-32B-W8A8 \
-v /root/.cache:/root/.cache \
-p 8113:8113 \
-it $IMAGE bash
```

@@ -174,7 +171,7 @@ export HCCL_OP_EXPANSION_MODE="AIV"
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /model/Qwen3-32B-W8A8 \
vllm serve vllm-ascend/Qwen3-32B-W8A8 \
--served-model-name qwen3 \
--trust-remote-code \
--async-scheduling \

@@ -190,15 +187,16 @@ vllm serve /model/Qwen3-32B-W8A8 \
--gpu-memory-utilization 0.9
```

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.

- If the model is not a quantized model, remove the `--quantization ascend` parameter.

- **[Optional]** `--additional-config '{"pa_shape_list":[48,64,72,80]}'`: `pa_shape_list` specifies the batch sizes where you want to switch to the PA operator. This is a temporary tuning knob. Currently, the attention operator dispatch defaults to the FIA operator. In some batch-size (concurrency) settings, FIA may have suboptimal performance. By setting `pa_shape_list`, when the runtime batch size matches one of the listed values, vLLM-Ascend will replace FIA with the PA operator to prevent performance degradation. In the future, FIA will be optimized for these scenarios and this parameter will be removed.

- If the ultimate performance is desired, the cudagraph_capture_sizes parameter can be enabled, reference: [key-optimization-points](./Qwen3-Dense.md#key-optimization-points)、[optimization-highlights](./Qwen3-Dense.md#optimization-highlights). Here is an example of batchsize of 72: `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}'`.
:::

Once your server is started, you can query the model with input prompts

@@ -219,11 +217,12 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso

Run the following script to execute offline inference on multi-NPU.

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.

- If the model is not a quantized model,remove the `quantization="ascend"` parameter.
:::

```python
import gc

@@ -244,7 +243,7 @@ prompts = [
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
llm = LLM(model="/model/Qwen3-32B-W8A8",
llm = LLM(model="vllm-ascend/Qwen3-32B-W8A8",
tensor_parallel_size=4,
trust_remote_code=True,
distributed_executor_backend="mp",

@@ -299,12 +298,13 @@ There are three `vllm bench` subcommands:

Take the `serve` as an example. Run the code as follows.

#### **Node**
:::{note}

- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
:::

```shell
vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model vllm-ascend/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```

After about several minutes, you can get the performance evaluation result.

@@ -16,7 +16,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

## Weight Preparation

Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)
Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Next-80B-A3B-Instruct)

## Deployment

@@ -103,7 +103,7 @@ if __name__ == '__main__':
prompts = [
"Who are you?",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,

@@ -109,7 +109,7 @@ def clean_up():

def main():
MODEL_PATH = "Qwen3/Qwen3-Omni-30B-A3B-Thinking"
MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
llm = LLM(
model=MODEL_PATH,
tensor_parallel_size=2,

@@ -123,7 +123,7 @@ def main():
temperature=0.6,
top_p=0.95,
top_k=20,
max_completion_tokens=16384,
max_tokens=16384,
)

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

@@ -176,6 +176,11 @@ if __name__ == "__main__":
Run the following script to start the vLLM server on Multi-NPU:
For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 1, and for 32 GB of memory, tensor-parallel-size should be at least 2.

```bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_expert_parallel
```

@@ -40,7 +40,7 @@ vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling
Once your server is started, you can query the model with input prompts.

```bash
curl http://127.0.0.1:8000/v1/embeddings -H "Content-Type: application/json" -d '{
curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."

@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen3.5-27B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) nodes or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://huggingface.co/Qwen/Qwen3.5-27B/tree/main)
- `Qwen3.5-27B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp/files)
- `Qwen3.5-27B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) nodes or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3.5-27B)
- `Qwen3.5-27B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

@@ -145,7 +145,7 @@ The parameters are explained as follows:
Once your server is started, you can query the model with input prompts:

```shell
curl http://<node0_ip>:<port>/v1/completions \
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5",

@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea

### Model Weight

- `Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main)
- `Qwen3.5-397B-A17B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp/files)
- `Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)
- `Qwen3.5-397B-A17B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp)

It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.

@@ -41,7 +41,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --runner pooling --host 127.0.0.1 --port 8888
Once your server is started, you can query the model with input prompts.

```bash
curl http://127.0.0.1:8888/v1/embeddings -H "Content-Type: application/json" -d '{
curl http://localhost:8888/v1/embeddings -H "Content-Type: application/json" -d '{
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."

@@ -111,7 +111,7 @@ model_name = "Qwen/Qwen3-Reranker-8B"

model = LLM(
model=model_name,
task="score",
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],

@@ -154,7 +154,7 @@ if __name__ == "__main__":

outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)

print([output.outputs[0].score for output in outputs])
print([output.outputs.score for output in outputs])
```

If you run this script successfully, you will see a list of scores printed to the console, similar to this: