# Qwen
This document shows how to build and run a Qwen model in XTRT-LLM on both single-XPU and single-node multi-XPU setups.
Qwen1.5 models are supported as well.
## Overview
The XTRT-LLM Qwen example code is located in [`qwen`](./). There is one main file:
* [`build.py`](./build.py) to build the XTRT-LLM engine(s) needed to run the Qwen model.
In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:
* [`../run.py`](../run.py) to run the inference on an input text;
* [`../summarize.py`](../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.
## Support Matrix
* FP16
* INT8 Weight-Only
* Tensor Parallel
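
INT8 weight-only quantization stores each weight matrix as INT8 values plus a floating-point scale per output channel, while activations stay in FP16. A minimal numpy sketch of the idea, illustrative only and not the XTRT-LLM implementation, using symmetric per-channel scales:

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Quantize a [out, in] weight matrix to int8 with one scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-row symmetric scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # At inference time the int8 weights are rescaled back to float on the fly.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step per channel.
assert np.max(np.abs(w - w_hat)) <= np.max(scale) * 0.5 + 1e-6
```

This roughly halves weight storage relative to FP16, which is why the engine built with `--use_weight_only --weight_only_precision int8` below is smaller and often faster on memory-bound workloads.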
## Usage
The XTRT-LLM Qwen example code is located in [qwen](./). It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
### Build XTRT engine(s)
First, prepare the HF Qwen checkpoint by following the guides for [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) or [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat).
Create a `downloads` directory to store the weights downloaded from Hugging Face.
```bash
mkdir -p ./downloads
```
Store each downloaded checkpoint in its own subdirectory:
- for Qwen-7B-Chat
```bash
mv Qwen-7B-Chat ./downloads/qwen-7b/
```
- for Qwen-14B-Chat
```bash
mv Qwen-14B-Chat ./downloads/qwen-14b/
```
- for Qwen1.5-7B-Chat
```bash
mv Qwen1.5-7B-Chat ./downloads/Qwen1.5-7B-Chat/
```
XTRT-LLM builds the XTRT engine(s) from the HF checkpoint.
Normally `build.py` requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up engine building by adding the `--parallel_build` argument. Please note that the `parallel_build` feature currently only supports a single node.
**Notice: Qwen1.5 requires the argument `--version=1.5`.**
**Notice: run `pip install transformers-stream-generator` before the build phase.**
Here are some examples:
```bash
# Build a single-XPU float16 engine from HF weights.
# --use_gpt_attention_plugin is required for Qwen and is recommended for better performance.
# Try --use_gemm_plugin if you encounter accuracy issues.
# Build the Qwen 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/qwen-7b \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
# Build the Qwen1.5 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/Qwen1.5-7B-Chat \
--version 1.5 \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
# Build the Qwen 7B model using a single XPU and apply INT8 weight-only quantization.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--weight_only_precision int8 \
--output_dir ./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
# Build Qwen 7B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/ \
--world_size 2 \
--tp_size 2
# Build Qwen 14B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-14b/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/qwen-14b/trt_engines/fp16/2-XPU/ \
--world_size 2 \
--tp_size 2
```
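
For the two tensor-parallel builds above (`--world_size 2 --tp_size 2`), each XPU holds a shard of every weight matrix and computes a partial result. A minimal numpy sketch of 2-way column-parallel matmul, illustrative only; the engine handles the sharding and inter-XPU communication itself:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))        # [batch, hidden] activations, replicated on every rank
w = rng.standard_normal((16, 32))       # a projection weight, e.g. an MLP up-projection

tp_size = 2
shards = np.split(w, tp_size, axis=1)   # each rank holds one contiguous block of columns
partial = [x @ s for s in shards]       # per-rank matmul on its own shard
y_tp = np.concatenate(partial, axis=1)  # all-gather of the column-parallel outputs

y_ref = x @ w
assert np.allclose(y_tp, y_ref)         # identical result, half the weights per rank
```

This is why the number of engines matches the number of XPUs: each rank loads only its shard of the weights, and `mpirun -n 2` (shown in the Run section) launches one process per rank.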
#### SmoothQuant
SmoothQuant supports both Qwen v1 and Qwen v2. Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant needs INT8 weights that are pre-processed before building the engine.
Example:
```bash
python3 hf_qwen_convert.py -i ./downloads/qwen-7b/ -o ./downloads/qwen-7b/sq0.5/ -sq 0.5 --tensor-parallelism 1 --storage-type float16
```
Note that `hf_qwen_convert.py` runs with PyTorch, and:
1. `torch-cpu` generally has better accuracy than XPyTorch.
2. XPyTorch often uses more than 32 GB of memory, so more XPUs may be necessary to finish the conversion.
3. Add `-p=1` if running with XPyTorch.
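
The `-sq 0.5` value passed to `hf_qwen_convert.py` is the SmoothQuant smoothing strength alpha. The idea is to migrate activation outliers into the weights with a per-channel factor so the matrix product is unchanged, but the activations become easy to quantize to INT8. A minimal numpy sketch, illustrative only; `hf_qwen_convert.py` applies this to the real checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
x[:, 3] *= 50.0                        # one outlier activation channel, typical of LLMs
w = rng.standard_normal((8, 6))

alpha = 0.5                            # the -sq value
# Per-channel smoothing factor: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
x_s = x / s                            # smoothed activations
w_s = w * s[:, None]                   # scale folded into the weights

assert np.allclose(x_s @ w_s, x @ w)   # mathematically identical output
assert np.abs(x_s).max() < np.abs(x).max()  # activation outliers tamed
```

Because `(X / s) @ (s * W) == X @ W` exactly, the smoothing is free in FP; the gain comes afterward, when the flattened activations quantize to INT8 with far less error.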
[`build.py`](./build.py) adds new options to support INT8 inference of SmoothQuant models.
`--use_smooth_quant` enables INT8 inference. By default, it
runs the model in the _per-tensor_ mode.
`--per-token` and `--per-channel` are not supported yet.
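
"Per-tensor" means a single INT8 scale is shared by the whole tensor, as opposed to one scale per channel or per token. A minimal numpy sketch of symmetric per-tensor quantization, illustrative only:

```python
import numpy as np

t = np.random.default_rng(1).standard_normal((4, 8)).astype(np.float32)
scale = np.abs(t).max() / 127.0             # one scalar scale for the whole tensor
q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
t_hat = q.astype(np.float32) * scale        # dequantized approximation
assert np.max(np.abs(t - t_hat)) <= scale * 0.5 + 1e-6
```

A single scale is the cheapest option at runtime, but it is also the least precise when channels differ widely in magnitude, which is exactly the situation SmoothQuant's smoothing step is designed to mitigate.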
Examples of build invocations:
```bash
# Build the model for SmoothQuant in per-tensor mode.
python3 build.py --ft_dir_path=./downloads/qwen-7b/sq0.5/1-XPU/ \
--use_smooth_quant \
--hf_model_dir ./downloads/qwen-7b/ \
--output_dir ./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- Run inference with the SmoothQuant engine:
```bash
python3 ../run.py --input_text "你好,请问你叫什么?" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- Summarize with the SmoothQuant engine:
```bash
python ../summarize.py --test_trt_llm \
--tokenizer_dir ./downloads/qwen-7b/ \
--data_type fp16 \
--engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/ \
--max_input_length 2048 \
--output_len 2048
```
### Run
**Notice: run `pip install tiktoken` before the run phase.**
To run an XTRT-LLM Qwen model using the engines generated by `build.py`:
```bash
# With fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir=./downloads/qwen-7b/trt_engines/fp16/1-XPU/
# Qwen1.5 With fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/Qwen1.5-7B-Chat/ \
--engine_dir=./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
# With int8 weight only inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir=./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
# Run Qwen 7B model in FP16 using two XPUs.
mpirun -n 2 --allow-run-as-root \
python ../run.py --input_text "你好,请问你叫什么?答:" \
--tokenizer_dir ./downloads/qwen-7b/ \
--max_output_len=50 \
--engine_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/
```
**Demo output of run.py:**
```bash
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```
```
Loading engine from ./downloads/qwen-7b/trt_engines/fp16/1-XPU/qwen_float16_tp1_rank0.engine
Input: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output: "我是来自阿里云的大规模语言模型,我叫通义千问。"
```
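
The `Input` shown above is Qwen's ChatML-style prompt that `run.py` builds around the raw `--input_text`. A minimal sketch of that template, reconstructed from the demo output (the helper name is ours, not a `run.py` API):

```python
def build_chatml_prompt(user_text: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Wrap raw user text in Qwen's ChatML template, as seen in the demo output."""
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user_text}<|im_end|>\n"
            f"<|im_start|>assistant\n")

prompt = build_chatml_prompt("你好,请问你叫什么?")
assert prompt.startswith("<|im_start|>system")
assert prompt.endswith("<|im_start|>assistant\n")
```

The prompt ends after the `<|im_start|>assistant` marker so that the model's generated tokens form the assistant's reply, which is what appears in the `Output` line above.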