
# Qwen2.5-7B

## Introduction

Qwen2.5-7B-Instruct is the flagship instruction-tuned variant of Alibaba Cloud's Qwen2.5 LLM series. It supports a maximum context window of 128K tokens, can generate up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling.

This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users complete model deployment and validation quickly.

Qwen2.5-7B-Instruct has been supported since vllm-ascend v0.9.0.

## Supported Features

Refer to supported features for the model's feature support matrix.

Refer to the feature guide for each feature's configuration.

## Environment Preparation

### Model Weight

It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
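For example, here is a minimal download sketch using the Hugging Face CLI (an assumption for illustration; any download method that places the weights in the directory above works equally well):

```bash
# Sketch: download the weights with the Hugging Face CLI
# (pip install -U "huggingface_hub[cli]"); ModelScope is a common alternative.
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./Qwen2.5-7B-Instruct/
```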

### Installation

You can use our official Docker image and install the extra operators required for Qwen2.5-7B-Instruct.

:::::{tab-set}
:sync-group: install

::::{tab-item} A3 series
:sync: A3

Start the docker image on each node:

```{code-block} bash
:substitutions:

export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
    --name vllm-ascend \
    --shm-size=1g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```

::::

::::{tab-item} A2 series
:sync: A2

Start the docker image on each node:

```{code-block} bash
:substitutions:

export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend \
    --shm-size=1g \
    --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```

::::

:::::
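Once inside the container, you can optionally confirm that the NPU devices are visible before deploying. This is a minimal sanity check using the standard Ascend `npu-smi` tool (mounted into the container above), not a required step:

```bash
# Optional: list visible NPU devices and their health/utilization status.
npu-smi info
```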

## Deployment

### Single-node Deployment

Qwen2.5-7B-Instruct supports single-node, single-card deployment on the 910B4 platform. Follow these steps to start the inference service:

1. Prepare model weights: ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory.
2. Create and execute the deployment script (save as `deploy.sh`), as shown below:
```bash
#!/bin/sh
export ASCEND_RT_VISIBLE_DEVICES=0
export MODEL_PATH="./Qwen2.5-7B-Instruct"

vllm serve ${MODEL_PATH} \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen-2.5-7b-instruct \
    --trust-remote-code \
    --max-model-len 32768
```
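To launch the service and wait until it is ready, you can run the script and poll the server's `/health` endpoint. The loop below is a sketch that assumes `curl` is available and that the port matches the script above:

```bash
chmod +x deploy.sh
./deploy.sh > vllm.log 2>&1 &

# Poll the OpenAI-compatible server's health endpoint until it responds.
until curl -sf http://localhost:8000/health > /dev/null; do
    sleep 5
done
echo "vLLM server is ready."
```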

### Multi-node Deployment

Single-node deployment is recommended.

### Prefill-Decode Disaggregation

Not supported yet.

## Functional Verification

After starting the service, verify functionality using a curl request:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-2.5-7b-instruct",
        "prompt": "Beijing is a",
        "max_tokens": 5,
        "temperature": 0
    }'
```

A valid response (e.g., "Beijing is a vibrant and historic capital city") indicates successful deployment.
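The server also exposes the OpenAI-compatible chat endpoint. The equivalent chat request below is a sketch for comparison (the message content is illustrative):

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Introduce Beijing in one sentence."}],
        "max_tokens": 64,
        "temperature": 0
    }'
```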

## Accuracy Evaluation

### Using AISBench

Refer to Using AISBench for details.

Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below:

| dataset | version | metric   | mode | vllm-api-general-chat |
|---------|---------|----------|------|-----------------------|
| gsm8k   | -       | accuracy | gen  | 75.00                 |

## Performance

### Using AISBench

Refer to Using AISBench for performance evaluation for details.

### Using vLLM Benchmark

This section runs a performance evaluation of Qwen2.5-7B-Instruct as an example.

Refer to vllm benchmark for more details.

There are three `vllm bench` subcommands:

- `latency`: benchmarks the latency of a single batch of requests.
- `serve`: benchmarks online serving throughput.
- `throughput`: benchmarks offline inference throughput.

Take `serve` as an example. Run the following command:

```bash
vllm bench serve \
  --model ./Qwen2.5-7B-Instruct/ \
  --dataset-name random \
  --random-input-len 200 \
  --num-prompts 200 \
  --request-rate 1 \
  --save-result \
  --result-dir ./perf_results/
```

After several minutes, the performance evaluation results are available.
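With `--save-result`, the benchmark writes a JSON summary into the `--result-dir` directory. A quick way to inspect it (file names include the run's timestamp, and exact metric keys may vary across vLLM versions):

```bash
# List and dump the saved benchmark summaries (throughput, TTFT/TPOT latencies, etc.).
ls ./perf_results/
cat ./perf_results/*.json
```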