# Qwen
This document shows how to build and run a Qwen model with XTRT-LLM on a single XPU and on a single node with multiple XPUs. Qwen1.5 models are supported as well.
## Overview
The XTRT-LLM Qwen example code is located in `qwen`. There is one main file:

- `build.py` to build the XTRT-LLM engine(s) needed to run the Qwen model.
In addition, there are two shared files in the parent folder `examples` for inference and evaluation:

- `../run.py` to run inference on an input text;
- `../summarize.py` to summarize the articles in the cnn_dailymail dataset.
## Support Matrix
- FP16
- INT8 Weight-Only
- SmoothQuant (INT8)
- Tensor Parallel
## Usage
The XTRT-LLM Qwen example code is located in `qwen`. It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
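For illustration, one engine file is produced per rank. The naming pattern below is an assumption based on the engine file shown in the demo output at the end of this document:

```python
# Sketch: one engine file per XPU rank. The pattern
# "qwen_<dtype>_tp<N>_rank<i>.engine" is assumed from the demo output
# and may differ between releases.
def engine_files(dtype: str, tp_size: int) -> list[str]:
    return [f"qwen_{dtype}_tp{tp_size}_rank{r}.engine" for r in range(tp_size)]

# A 2-way tensor-parallel build produces two engines, one per rank.
print(engine_files("float16", 2))
```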
### Build XTRT engine(s)
First, prepare the HF Qwen checkpoint by following the guides for Qwen-7B-Chat or Qwen-14B-Chat.

Create a `downloads` directory to store the weights downloaded from Hugging Face:
```bash
mkdir -p ./downloads
```
Store Qwen-7B-Chat, Qwen-14B-Chat, or Qwen1.5-7B-Chat in separate directories:

- for Qwen-7B-Chat

```bash
mv Qwen-7B-Chat ./downloads/qwen-7b/
```

- for Qwen-14B-Chat

```bash
mv Qwen-14B-Chat ./downloads/qwen-14b/
```

- for Qwen1.5-7B-Chat

```bash
mv Qwen1.5-7B-Chat ./downloads/Qwen1.5-7B-Chat/
```
XTRT-LLM Qwen builds XTRT engine(s) from the HF checkpoint. Normally `build.py` requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up the engine build by adding the `--parallel_build` argument to build the engines in parallel. Note that the `parallel_build` feature currently only supports a single node.
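Conceptually, `--parallel_build` launches the per-rank engine builds concurrently instead of one after another. A minimal sketch of that idea, using a hypothetical `build_rank` stand-in for the real per-rank build step:

```python
# Sketch of what --parallel_build does conceptually: with it, each
# rank's engine is built in its own worker; without it, the ranks are
# built serially. build_rank is a hypothetical placeholder for the
# real engine-build step.
from concurrent.futures import ThreadPoolExecutor

def build_rank(rank: int) -> str:
    return f"rank{rank}.engine"  # placeholder for the real build work

def build_all(world_size: int, parallel: bool) -> list[str]:
    ranks = range(world_size)
    if parallel:
        with ThreadPoolExecutor(max_workers=world_size) as pool:
            return list(pool.map(build_rank, ranks))  # concurrent builds
    return [build_rank(r) for r in ranks]  # serial fallback

print(build_all(2, parallel=True))
```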
**Note: Qwen1.5 requires the `--version=1.5` argument.**

**Note: run `pip install transformers-stream-generator` before the build phase.**
Here are some examples:
```bash
# Build a single-XPU float16 engine from HF weights.
# --use_gpt_attention_plugin is necessary for Qwen and is recommended
# for better performance; try --use_gemm_plugin to prevent accuracy issues.

# Build the Qwen 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/qwen-7b \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```
```bash
# Build the Qwen1.5 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/Qwen1.5-7B-Chat \
                --version 1.5 \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
```
```bash
# Build the Qwen 7B model using a single XPU and apply INT8 weight-only quantization.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --weight_only_precision int8 \
                --output_dir ./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
```
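Conceptually, INT8 weight-only quantization stores each weight channel as int8 values plus one floating-point scale and dequantizes on the fly at inference time. A minimal pure-Python sketch of the idea (illustration only, not the engine's actual kernel):

```python
# Sketch of INT8 weight-only quantization: weights are stored as int8
# plus a per-channel float scale, and dequantized back to floating
# point at run time. Illustration only.
def quantize_int8(channel: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in channel) / 127.0  # map max |w| to 127
    q = [round(w / scale) for w in channel]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, err)  # quantized values and round-trip error
```

Only the weights are quantized here; activations stay in FP16, which is why this mode needs no calibration data.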
```bash
# Build Qwen 7B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2
```
```bash
# Build Qwen 14B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-14b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-14b/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2
```
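The `--tp_size 2` builds above shard each weight matrix across two XPUs. A toy sketch of column-wise tensor parallelism (an assumption about the layout for illustration; the real partitioning is handled inside the engine builder):

```python
# Toy sketch of 2-way tensor parallelism: a weight matrix is split
# column-wise, each rank multiplies against its own shard, and the
# per-rank outputs are concatenated (an all-gather in practice).
def matmul(x, w):  # x: length-k vector, w: k x n matrix
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, tp_size):
    n = len(w[0]) // tp_size  # columns per rank
    return [[row[r * n:(r + 1) * n] for row in w] for r in range(tp_size)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
shards = split_columns(w, tp_size=2)
partials = [matmul(x, shard) for shard in shards]  # one result per rank
combined = partials[0] + partials[1]               # concatenate shard outputs
assert combined == matmul(x, w)                    # same result as one XPU
print(combined)
```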
### SmoothQuant
SmoothQuant supports both Qwen and Qwen1.5 models. Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant requires INT8 weights that must be pre-processed before building an engine.
Example:

```bash
python3 hf_qwen_convert.py -i ./downloads/qwen-7b/ \
                           -o ./downloads/qwen-7b/sq0.5/ \
                           -sq 0.5 \
                           --tensor-parallelism 1 \
                           --storage-type float16
```
Note that `hf_qwen_convert.py` runs with PyTorch:

- `torch-cpu` generally has better accuracy than XPyTorch.
- XPyTorch often uses more than 32GB of memory, so more XPUs are necessary to finish it.
- Add `-p=1` if running with XPyTorch.
`build.py` adds new options to support INT8 inference of SmoothQuant models. `--use_smooth_quant` is the starting point for INT8 inference; by default, it runs the model in per-tensor mode. `--per-token` and `--per-channel` are not supported yet.
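The `-sq 0.5` value passed to `hf_qwen_convert.py` above is the smoothing strength alpha. A minimal sketch of the SmoothQuant idea, which migrates per-channel activation outliers into the weights without changing the layer's output:

```python
# Sketch of the SmoothQuant smoothing step: each channel gets a scale
#   s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)
# and the layer computes (X / s) @ (s * W), which equals X @ W but with
# flatter activation ranges that quantize better. Illustration only.
def smooth(x_absmax, w_absmax, alpha=0.5):
    return [xa ** alpha / wa ** (1 - alpha) for xa, wa in zip(x_absmax, w_absmax)]

x_absmax = [8.0, 0.5]   # per-channel activation ranges (one outlier channel)
w_absmax = [0.5, 0.5]   # per-channel weight ranges
s = smooth(x_absmax, w_absmax, alpha=0.5)
# after smoothing, the outlier activation range shrinks toward the weights
smoothed_x = [xa / sj for xa, sj in zip(x_absmax, s)]
print(s, smoothed_x)
```

With `alpha=0.5` the difficulty is split evenly between activations and weights; larger alpha moves more of it into the weights.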
Examples of build invocations:

```bash
# Build the model for SmoothQuant in per-tensor mode.
python3 build.py --ft_dir_path=./downloads/qwen-7b/sq0.5/1-XPU/ \
                 --use_smooth_quant \
                 --hf_model_dir ./downloads/qwen-7b/ \
                 --output_dir ./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- run

```bash
python3 ../run.py --input_text "你好,请问你叫什么?" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- summarize

```bash
python ../summarize.py --test_trt_llm \
                       --tokenizer_dir ./downloads/qwen-7b/ \
                       --data_type fp16 \
                       --engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/ \
                       --max_input_length 2048 \
                       --output_len 2048
```
## Run
**Note: run `pip install tiktoken` before the run phase.**

To run an XTRT-LLM Qwen model using the engines generated by `build.py`:
```bash
# With fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```

```bash
# Qwen1.5 with fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/Qwen1.5-7B-Chat/ \
                  --engine_dir=./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
```

```bash
# With int8 weight-only inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
```

```bash
# Run the Qwen 7B model in FP16 using two XPUs.
mpirun -n 2 --allow-run-as-root \
    python ../run.py --input_text "你好,请问你叫什么?答:" \
                     --tokenizer_dir ./downloads/qwen-7b/ \
                     --max_output_len=50 \
                     --engine_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/
```
Demo output of `run.py`:

```bash
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```

```
Loading engine from ./downloads/qwen-7b/trt_engines/fp16/1-XPU/qwen_float16_tp1_rank0.engine
Input: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output: "我是来自阿里云的大规模语言模型,我叫通义千问。"
```
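As the demo shows, `run.py` wraps the raw input text in a ChatML-style template before tokenizing. A sketch of that wrapping, using a hypothetical `build_prompt` helper:

```python
# Sketch of the ChatML-style prompt format seen in the demo output:
# the user text is wrapped in <|im_start|>/<|im_end|> role blocks.
# build_prompt is a hypothetical helper for illustration; the real
# template lives inside run.py / the tokenizer config.
def build_prompt(user_text: str,
                 system: str = "You are a helpful assistant.") -> str:
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user_text}<|im_end|>\n"
            f"<|im_start|>assistant\n")

prompt = build_prompt("你好,请问你叫什么?")
print(prompt)
```

The trailing `<|im_start|>assistant\n` leaves the prompt open so the model generates the assistant's reply, which is what appears after `Output:` above.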