### What this PR does / why we need it?
Update the triton-ascend version to 20251229 and the BiSheng version to 20251225.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
# Qwen3-Next
## Introduction
The Qwen3-Next model is a sparse MoE (Mixture of Experts) model with high sparsity. Compared to the MoE architecture of Qwen3, it introduces key improvements such as a hybrid attention mechanism and a multi-token prediction mechanism, improving training and inference efficiency under long contexts and large total parameter scales.
This document presents the core verification steps for the model, including supported features, environment preparation, and accuracy and performance evaluation. Qwen3-Next currently relies on Triton Ascend, which is in an experimental phase: behavior related to stability and accuracy may change in subsequent versions, and performance will be continuously optimized.
The Qwen3-Next model was first supported in vllm-ascend:v0.10.2rc1.
## Supported Features
Refer to supported features for the model's supported feature matrix.
Refer to the feature guide for each feature's configuration.
## Weight Preparation
Download link for the Qwen3-Next-80B-A3B-Instruct model weights: Download model weight
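If you prefer to script the download, here is a minimal sketch assuming you pull the weights from the Hugging Face Hub with `huggingface_hub` (ModelScope's `snapshot_download` works analogously); the local directory is a hypothetical choice:

```python
# Minimal download sketch (assumption: weights are fetched from the
# Hugging Face Hub; ModelScope's snapshot_download works analogously).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-Next-80B-A3B-Instruct",
    local_dir="/root/.cache/Qwen3-Next-80B-A3B-Instruct",  # hypothetical target path
)
```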
## Deployment
If the machine is an Atlas 800I A3 (64G*16), the deployment approach is identical.
### Run docker container
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--shm-size=1g \
--name vllm-ascend-qwen3 \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
:::{note}
Qwen3-Next uses Triton Ascend, which is currently experimental. Future versions may bring behavioral changes related to stability, accuracy, and performance.
:::
### Install Triton Ascend
Triton Ascend is required to run Qwen3-Next. Follow the instructions below to install it and its dependencies.
Install the Ascend BiSheng toolkit:
```bash
BISHENG_NAME="Ascend-BiSheng-toolkit_$(uname -i)_20251225.run"
BISHENG_URL="https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${BISHENG_NAME}"
wget -O "${BISHENG_NAME}" "${BISHENG_URL}" && chmod a+x "${BISHENG_NAME}" && "./${BISHENG_NAME}" --install && rm "${BISHENG_NAME}"
source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh
```
Install Triton Ascend:
```bash
wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20251229-cp311-cp311-manylinux_2_27_$(uname -i).manylinux_2_28_$(uname -i).whl
pip install triton_ascend-3.2.0.dev20251229-cp311-cp311-manylinux_2_27_$(uname -i).manylinux_2_28_$(uname -i).whl
```
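As a quick sanity check, you can confirm the wheel is importable; this sketch assumes (not stated in this guide) that triton-ascend installs under the standard `triton` package name:

```python
# Hypothetical sanity check: verify the Triton Ascend wheel is importable.
# Assumption: triton-ascend ships under the standard `triton` package name.
import triton

print(triton.__version__)  # expected to report 3.2.0.dev20251229
```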
## Inference
Please make sure you have already sourced the BiSheng environment script installed above:

```bash
source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh
```
:::::{tab-set}

::::{tab-item} Online Inference
Run the following script to start the vLLM server on multi-NPU:
```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 32
}'
```
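Equivalently, you can query the server from Python through the OpenAI-compatible API; a minimal sketch, assuming the `openai` package is installed and the server above is reachable on localhost:8000 (`top_k` goes through `extra_body`, since the OpenAI client does not expose it natively):

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32,
    extra_body={"top_k": 20},  # top_k is a vLLM extension, not a native OpenAI field
)
print(resp.choices[0].message.content)
```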
::::
::::{tab-item} Offline Inference
Run the following script to execute offline inference on multi-NPU:
```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


if __name__ == '__main__':
    prompts = [
        "Who are you?",
    ]
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
    llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              distributed_executor_backend="mp",
              gpu_memory_utilization=0.7,
              max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```
If the script runs successfully, you will see output like the following:

```
Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am'
```
::::

:::::
## Accuracy Evaluation
### Using AISBench
- Refer to Using AISBench for details.
- After execution, you can get the result. The following result of Qwen3-Next-80B-A3B-Instruct in vllm-ascend:0.13.0rc1 is for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|---|---|---|---|---|
| gsm8k | - | accuracy | gen | 95.53 |
## Performance
### Using AISBench
Refer to Using AISBench for performance evaluation for details.
### Using vLLM Benchmark
This section runs a performance evaluation of Qwen3-Next as an example.
Refer to vllm benchmark for more details.
There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example. Run the command as follows.
```bash
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
```
After several minutes, you will get the performance evaluation result. A reference result:

- Hardware: A3-752T, 2 nodes
- Deployment: TP4 + Full Decode Only
- Input/Output: 2k / 2k
- Concurrency: 32
- Performance: 580 TPS, TPOT 54 ms
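As a rough consistency check on these numbers: 32 concurrent requests, each producing one token every 54 ms, gives about 32 / 0.054 ≈ 593 output tokens/s, which lines up with the reported 580 TPS.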