
# Qwen2.5-7B
## Introduction
Qwen2.5-7B-Instruct is the flagship instruction-tuned variant of Alibaba Cloud's Qwen2.5 LLM series. It supports a context window of up to 128K tokens, generates up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling.
This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation.
The `Qwen2.5-7B-Instruct` model has been supported since `vllm-ascend:v0.9.0`.
## Supported Features
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
## Environment Preparation
### Model Weight
- `Qwen2.5-7B-Instruct` (BF16 version): requires one Atlas 910B4 (32 GB × 1) card. [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)
It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
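If you have not downloaded the weights yet, the ModelScope CLI is one option. The command below is a minimal sketch and assumes a recent `modelscope` package; flag names may differ across versions:
```shell
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
```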
### Installation
You can use our official Docker image and install any extra operators required to support `Qwen2.5-7B-Instruct`.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3
1. Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
::::{tab-item} A2 series
:sync: A2
Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
:::::
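Once inside the container, you can optionally confirm that the NPU devices mapped above are visible before proceeding:
```shell
npu-smi info
```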
## Deployment
### Single-node Deployment
Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service:
1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory.
2. Create and execute the deployment script (save as `deploy.sh`):
```shell
#!/bin/sh
export ASCEND_RT_VISIBLE_DEVICES=0
export MODEL_PATH="./Qwen2.5-7B-Instruct/"  # local weight directory prepared in step 1
vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen-2.5-7b-instruct \
--trust-remote-code \
--max-model-len 32768
```
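The snippet below shows one way to launch the script and wait for the service to come up; the readiness check simply polls the OpenAI-compatible `/v1/models` endpoint and is an illustrative assumption rather than part of the deployment script:
```shell
# Launch the service in the background and poll until it is ready.
bash deploy.sh &
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  sleep 5
done
echo "vLLM service is ready"
```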
### Multi-node Deployment
Not required for this model; single-node deployment is recommended.
### Prefill-Decode Disaggregation
Not supported yet.
## Functional Verification
After starting the service, verify functionality using a `curl` request:
```shell
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-2.5-7b-instruct",
"prompt": "Beijing is a",
"max_completion_tokens": 5,
"temperature": 0
}'
```
A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment.
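Because `vllm serve` exposes the full OpenAI-compatible API, you can also verify the chat endpoint. The request below is a minimal sketch that reuses the served model name configured in `deploy.sh`:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b-instruct",
    "messages": [{"role": "user", "content": "Introduce Beijing in one sentence."}],
    "max_tokens": 64,
    "temperature": 0
  }'
```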
## Accuracy Evaluation
### Using AISBench
Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- |--------------|
| gsm8k | - | accuracy | gen | 75.00 |
## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
### Using vLLM Benchmark
The following runs a performance evaluation using `Qwen2.5-7B-Instruct` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
Take `serve` as an example and run the following command.
```shell
vllm bench serve \
--model ./Qwen2.5-7B-Instruct/ \
--dataset-name random \
--random-input-len 200 \
--num-prompts 200 \
--request-rate 1 \
--save-result \
--result-dir ./perf_results/
```
After a few minutes, the performance evaluation results are saved to `./perf_results/`.
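If you also want an offline throughput number, a `throughput` run could look like the sketch below; the flag names are taken from the upstream vLLM benchmark documentation and may vary between versions:
```shell
vllm bench throughput \
  --model ./Qwen2.5-7B-Instruct/ \
  --input-len 200 \
  --output-len 200 \
  --num-prompts 200
```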