# Qwen
This document shows how to build and run a Qwen model with XTRT-LLM on a single XPU and on a single node with multiple XPUs. Qwen1.5 models are supported as well.
## Overview
The XTRT-LLM Qwen example code is located in `qwen`. There is one main file:

- `build.py` to build the XTRT-LLM engine(s) needed to run the Qwen model.
In addition, there are two shared files in the parent folder `examples` for inference and evaluation:

- `../run.py` to run inference on an input text;
- `../summarize.py` to summarize the articles in the cnn_dailymail dataset.
## Support Matrix
- FP16
- INT8 Weight-Only
- SmoothQuant (INT8)
- Tensor Parallel
## Usage
The XTRT-LLM Qwen example code is located in `qwen`. It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
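For illustration, one engine file is produced per rank. The naming pattern below is an assumption based on the engine file shown in the demo output at the end of this document:

```python
# Sketch: one engine file per XPU rank. The pattern
# "qwen_<dtype>_tp<N>_rank<i>.engine" is assumed from the demo output
# and may differ between releases.
def engine_files(dtype: str, tp_size: int) -> list[str]:
    return [f"qwen_{dtype}_tp{tp_size}_rank{r}.engine" for r in range(tp_size)]

# A 2-way tensor-parallel build produces two engines, one per rank.
print(engine_files("float16", 2))
```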
### Build XTRT engine(s)
First, prepare the HF Qwen checkpoint by following the guides for Qwen-7B-Chat or Qwen-14B-Chat.

Create a `downloads` directory to store the weights downloaded from Hugging Face:
```bash
mkdir -p ./downloads
```
Store Qwen-7B-Chat, Qwen-14B-Chat, or Qwen1.5-7B-Chat in separate directories:

- for Qwen-7B-Chat

```bash
mv Qwen-7B-Chat ./downloads/qwen-7b/
```

- for Qwen-14B-Chat

```bash
mv Qwen-14B-Chat ./downloads/qwen-14b/
```

- for Qwen1.5-7B-Chat

```bash
mv Qwen1.5-7B-Chat ./downloads/Qwen1.5-7B-Chat/
```
XTRT-LLM Qwen builds XTRT engine(s) from the HF checkpoint. Normally `build.py` requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up the engine build by adding the `--parallel_build` argument to build the engines in parallel. Note that the `parallel_build` feature currently only supports a single node.
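Conceptually, `--parallel_build` launches the per-rank engine builds concurrently instead of one after another. A minimal sketch of that idea, using a hypothetical `build_rank` stand-in for the real per-rank build step:

```python
# Sketch of what --parallel_build does conceptually: with it, each
# rank's engine is built in its own worker; without it, the ranks are
# built serially. build_rank is a hypothetical placeholder for the
# real engine-build step.
from concurrent.futures import ThreadPoolExecutor

def build_rank(rank: int) -> str:
    return f"rank{rank}.engine"  # placeholder for the real build work

def build_all(world_size: int, parallel: bool) -> list[str]:
    ranks = range(world_size)
    if parallel:
        with ThreadPoolExecutor(max_workers=world_size) as pool:
            return list(pool.map(build_rank, ranks))  # concurrent builds
    return [build_rank(r) for r in ranks]  # serial fallback

print(build_all(2, parallel=True))
```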
**Note: Qwen1.5 requires the `--version=1.5` argument.**

**Note: run `pip install transformers-stream-generator` before the build phase.**
Here are some examples:
```bash
# Build a single-XPU float16 engine from HF weights.
# --use_gpt_attention_plugin is necessary for Qwen and is recommended
# for better performance; try --use_gemm_plugin to prevent accuracy issues.

# Build the Qwen 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/qwen-7b \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```
```bash
# Build the Qwen1.5 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/Qwen1.5-7B-Chat \
                --version 1.5 \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
```
```bash
# Build the Qwen 7B model using a single XPU and apply INT8 weight-only quantization.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --weight_only_precision int8 \
                --output_dir ./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
```
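Conceptually, INT8 weight-only quantization stores each weight channel as int8 values plus one floating-point scale and dequantizes on the fly at inference time. A minimal pure-Python sketch of the idea (illustration only, not the engine's actual kernel):

```python
# Sketch of INT8 weight-only quantization: weights are stored as int8
# plus a per-channel float scale, and dequantized back to floating
# point at run time. Illustration only.
def quantize_int8(channel: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in channel) / 127.0  # map max |w| to 127
    q = [round(w / scale) for w in channel]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, err)  # quantized values and round-trip error
```

Only the weights are quantized here; activations stay in FP16, which is why this mode needs no calibration data.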
```bash
# Build Qwen 7B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2
```
```bash
# Build Qwen 14B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-14b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/qwen-14b/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2
```
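The `--tp_size 2` builds above shard each weight matrix across two XPUs. A toy sketch of column-wise tensor parallelism (an assumption about the layout for illustration; the real partitioning is handled inside the engine builder):

```python
# Toy sketch of 2-way tensor parallelism: a weight matrix is split
# column-wise, each rank multiplies against its own shard, and the
# per-rank outputs are concatenated (an all-gather in practice).
def matmul(x, w):  # x: length-k vector, w: k x n matrix
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, tp_size):
    n = len(w[0]) // tp_size  # columns per rank
    return [[row[r * n:(r + 1) * n] for row in w] for r in range(tp_size)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
shards = split_columns(w, tp_size=2)
partials = [matmul(x, shard) for shard in shards]  # one result per rank
combined = partials[0] + partials[1]               # concatenate shard outputs
assert combined == matmul(x, w)                    # same result as one XPU
print(combined)
```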
### SmoothQuant
SmoothQuant supports both Qwen and Qwen1.5 models. Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant requires INT8 weights that must be pre-processed before building an engine.
Example:

```bash
python3 hf_qwen_convert.py -i ./downloads/qwen-7b/ \
                           -o ./downloads/qwen-7b/sq0.5/ \
                           -sq 0.5 \
                           --tensor-parallelism 1 \
                           --storage-type float16
```
Note that `hf_qwen_convert.py` runs with PyTorch:

- `torch-cpu` generally has better accuracy than XPyTorch.
- XPyTorch often uses more than 32GB of memory, so more XPUs are necessary to finish it.
- Add `-p=1` if running with XPyTorch.
`build.py` adds new options to support INT8 inference of SmoothQuant models. `--use_smooth_quant` is the starting point for INT8 inference; by default, it runs the model in per-tensor mode. `--per-token` and `--per-channel` are not supported yet.
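The `-sq 0.5` value passed to `hf_qwen_convert.py` above is the smoothing strength alpha. A minimal sketch of the SmoothQuant idea, which migrates per-channel activation outliers into the weights without changing the layer's output:

```python
# Sketch of the SmoothQuant smoothing step: each channel gets a scale
#   s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)
# and the layer computes (X / s) @ (s * W), which equals X @ W but with
# flatter activation ranges that quantize better. Illustration only.
def smooth(x_absmax, w_absmax, alpha=0.5):
    return [xa ** alpha / wa ** (1 - alpha) for xa, wa in zip(x_absmax, w_absmax)]

x_absmax = [8.0, 0.5]   # per-channel activation ranges (one outlier channel)
w_absmax = [0.5, 0.5]   # per-channel weight ranges
s = smooth(x_absmax, w_absmax, alpha=0.5)
# after smoothing, the outlier activation range shrinks toward the weights
smoothed_x = [xa / sj for xa, sj in zip(x_absmax, s)]
print(s, smoothed_x)
```

With `alpha=0.5` the difficulty is split evenly between activations and weights; larger alpha moves more of it into the weights.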
Examples of build invocations:

```bash
# Build the model for SmoothQuant in per-tensor mode.
python3 build.py --ft_dir_path=./downloads/qwen-7b/sq0.5/1-XPU/ \
                 --use_smooth_quant \
                 --hf_model_dir ./downloads/qwen-7b/ \
                 --output_dir ./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- run

```bash
python3 ../run.py --input_text "你好,请问你叫什么?" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- summarize

```bash
python ../summarize.py --test_trt_llm \
                       --tokenizer_dir ./downloads/qwen-7b/ \
                       --data_type fp16 \
                       --engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/ \
                       --max_input_length 2048 \
                       --output_len 2048
```
## Run
**Note: run `pip install tiktoken` before the run phase.**

To run an XTRT-LLM Qwen model using the engines generated by `build.py`:
```bash
# With fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```

```bash
# Qwen1.5 with fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/Qwen1.5-7B-Chat/ \
                  --engine_dir=./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
```

```bash
# With int8 weight-only inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir=./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
```

```bash
# Run the Qwen 7B model in FP16 using two XPUs.
mpirun -n 2 --allow-run-as-root \
    python ../run.py --input_text "你好,请问你叫什么?答:" \
                     --tokenizer_dir ./downloads/qwen-7b/ \
                     --max_output_len=50 \
                     --engine_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/
```
Demo output of `run.py`:

```bash
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
                  --max_output_len=50 \
                  --tokenizer_dir ./downloads/qwen-7b/ \
                  --engine_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```

```
Loading engine from ./downloads/qwen-7b/trt_engines/fp16/1-XPU/qwen_float16_tp1_rank0.engine
Input: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output: "我是来自阿里云的大规模语言模型,我叫通义千问。"
```
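As the demo shows, `run.py` wraps the raw input text in a ChatML-style template before tokenizing. A sketch of that wrapping, using a hypothetical `build_prompt` helper:

```python
# Sketch of the ChatML-style prompt format seen in the demo output:
# the user text is wrapped in <|im_start|>/<|im_end|> role blocks.
# build_prompt is a hypothetical helper for illustration; the real
# template lives inside run.py / the tokenizer config.
def build_prompt(user_text: str,
                 system: str = "You are a helpful assistant.") -> str:
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user_text}<|im_end|>\n"
            f"<|im_start|>assistant\n")

prompt = build_prompt("你好,请问你叫什么?")
print(prompt)
```

The trailing `<|im_start|>assistant\n` leaves the prompt open so the model generates the assistant's reply, which is what appears after `Output:` above.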