# Qwen

This document shows how to build and run a Qwen model in XTRT-LLM on both a single XPU and a single node with multiple XPUs. Qwen1.5 models are supported as well.

## Overview

The XTRT-LLM Qwen example code is located in [`qwen`](./). There is one main file:

* [`build.py`](./build.py) to build the XTRT-LLM engine(s) needed to run the Qwen model.

In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:

* [`../run.py`](../run.py) to run inference on an input text;
* [`../summarize.py`](../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.

## Support Matrix

* FP16
* INT8 Weight-Only
* Tensor Parallel

## Usage

The XTRT-LLM Qwen example code is located in [`qwen`](./). It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.

### Build XTRT engine(s)

First prepare the HF Qwen checkpoint by following the guides for [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) or [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat).

Create a `downloads` directory to store the weights downloaded from Hugging Face.

```bash
mkdir -p ./downloads
```

Store each checkpoint in its own subdirectory:

- for Qwen-7B-Chat

```bash
mv Qwen-7B-Chat ./downloads/qwen-7b/
```

- for Qwen-14B-Chat

```bash
mv Qwen-14B-Chat ./downloads/qwen-14b/
```

- for Qwen1.5-7B-Chat

```bash
mv Qwen1.5-7B-Chat ./downloads/Qwen1.5-7B-Chat/
```

XTRT-LLM Qwen builds the XTRT engine(s) from the HF checkpoint.

Normally `build.py` only requires a single XPU, but if you already have all the XPUs needed for inference, you can speed up the engine-building process by adding the `--parallel_build` argument to build in parallel, as shown below. Please note that the `parallel_build` feature currently only supports a single node.

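For instance, the 2-way tensor-parallel build from the examples below could use both XPUs to build at once. This is a sketch; every flag here appears elsewhere in this document:

```bash
# Build Qwen 7B with 2-way tensor parallelism, using both XPUs to build in parallel.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --world_size 2 \
                --tp_size 2 \
                --parallel_build \
                --output_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/
```
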
**Notice: Qwen1.5 requires the argument `--version 1.5`.**

**Notice: run `pip install transformers-stream-generator` before the build phase.**

Here are some examples:

```bash
# Build a single-XPU float16 engine from HF weights.
# use_gpt_attention_plugin is necessary for Qwen and is recommended for better performance.
# Try use_gemm_plugin to prevent accuracy issues.

# Build the Qwen 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/qwen-7b \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/

# Build the Qwen1.5 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/Qwen1.5-7B-Chat \
                --version 1.5 \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/

# Build the Qwen 7B model using a single XPU and apply INT8 weight-only quantization.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --weight_only_precision int8 \
                --output_dir ./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/

# Build Qwen 7B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2

# Build Qwen 14B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-14b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-14b/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2
```

#### SmoothQuant

SmoothQuant supports both Qwen v1 and Qwen v2. Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant needs to load INT8 weights, which must be pre-processed before building an engine.

Example:

```bash
python3 hf_qwen_convert.py -i ./downloads/qwen-7b/ -o ./downloads/qwen-7b/sq0.5/ -sq 0.5 --tensor-parallelism 1 --storage-type float16
```

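For a tensor-parallel SmoothQuant engine, the conversion would presumably be run with a matching `--tensor-parallelism` value. This is a sketch under the assumption that the converter then writes a `2-XPU` subdirectory analogous to the `1-XPU` one used below:

```bash
# Assumed 2-way variant of the conversion above; the output layout is an assumption.
python3 hf_qwen_convert.py -i ./downloads/qwen-7b/ -o ./downloads/qwen-7b/sq0.5/ -sq 0.5 --tensor-parallelism 2 --storage-type float16
```
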
Note that `hf_qwen_convert.py` runs with PyTorch, and:

1. `torch-cpu` generally has better accuracy than XPyTorch.
2. XPyTorch often uses more than 32GB of GM, so more XPUs are necessary to finish the conversion.
3. Add `-p=1` if running with XPyTorch, as shown in the sketch below.

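A minimal sketch of the XPyTorch variant, assuming the only change to the earlier conversion command is the added `-p=1` flag:

```bash
# Run the SmoothQuant conversion with XPyTorch.
python3 hf_qwen_convert.py -i ./downloads/qwen-7b/ -o ./downloads/qwen-7b/sq0.5/ -sq 0.5 --tensor-parallelism 1 --storage-type float16 -p=1
```
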
[`build.py`](./build.py) adds new options to support INT8 inference of SmoothQuant models.

`--use_smooth_quant` is the starting point of INT8 inference. By default, it runs the model in the _per-tensor_ mode.

`--per-token` and `--per-channel` are not supported yet.

Examples of build invocations:

```bash
# Build the model for SmoothQuant in the _per_tensor_ mode.
python3 build.py --ft_dir_path=./downloads/qwen-7b/sq0.5/1-XPU/ \
                 --use_smooth_quant \
                 --hf_model_dir ./downloads/qwen-7b/ \
                 --output_dir ./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```

- run

```bash
python3 ../run.py --input_text "你好,请问你叫什么?" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```

- summarize

```bash
python ../summarize.py --test_trt_llm \
                       --tokenizer_dir ./downloads/qwen-7b/ \
                       --data_type fp16 \
                       --engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/ \
                       --max_input_length 2048 \
                       --output_len 2048
```

### Run

**Notice: run `pip install tiktoken` before the run phase.**

To run an XTRT-LLM Qwen model using the engines generated by `build.py`:

```bash
# With fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/fp16/1-XPU/

# Qwen1.5 with fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/Qwen1.5-7B-Chat/ \
                  --engine_dir=./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/

# With int8 weight-only inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/

# Run the Qwen 7B model in FP16 using two XPUs.
mpirun -n 2 --allow-run-as-root \
    python ../run.py --input_text "你好,请问你叫什么?答:" \
                     --tokenizer_dir ./downloads/qwen-7b/ \
                     --max_output_len=50 \
                     --engine_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/
```

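Evaluation works the same way against any of these engines. For example, summarization against the single-XPU FP16 engine, mirroring the SmoothQuant invocation shown earlier:

```bash
python ../summarize.py --test_trt_llm \
                       --tokenizer_dir ./downloads/qwen-7b/ \
                       --data_type fp16 \
                       --engine_dir=./downloads/qwen-7b/trt_engines/fp16/1-XPU/ \
                       --max_input_length 2048 \
                       --output_len 2048
```
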
**Demo output of run.py:**

```bash
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```

```
Loading engine from ./downloads/qwen-7b/trt_engines/fp16/1-XPU/qwen_float16_tp1_rank0.engine
Input: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output: "我是来自阿里云的大规模语言模型,我叫通义千问。"
```

(The prompt asks "Hello, may I ask what your name is?"; the output translates to "I am a large-scale language model from Alibaba Cloud. My name is Tongyi Qianwen.")