BLOOM
This document shows how to build and run a BLOOM model in XTRT-LLM on a single XPU and on a single node with multiple XPUs.
Overview
The XTRT-LLM BLOOM example code is located in examples/bloom. There are several main files in that folder:
- build.py to build the XTRT engine(s) needed to run the BLOOM model,
- run.py to run the inference on an input text,
- summarize.py to summarize the articles in the cnn_dailymail dataset using the model.
Support Matrix
- FP16
- INT8 & INT4 Weight-Only
- Tensor Parallel
Usage
The XTRT-LLM BLOOM example code is located at examples/bloom. It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
Build XTRT engine(s)
First, prepare the HF BLOOM checkpoint by following the guide here: https://huggingface.co/docs/transformers/main/en/model_doc/bloom.
For example, to download BLOOM-560M:
# Setup git-lfs
git lfs install
rm -rf ./downloads/bloom/560M/
mkdir -p ./downloads/bloom/560M/ && git clone https://huggingface.co/bigscience/bloom-560m ./downloads/bloom/560M/
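After the clone finishes, it can be worth sanity-checking that the checkpoint files actually downloaded, since a git-lfs failure can leave small pointer stubs behind instead of real weights. A minimal sketch, assuming a typical HF checkpoint layout (the exact file list below is an assumption and varies by checkpoint):

```python
from pathlib import Path

# Hypothetical minimal file set for an HF BLOOM checkpoint; adjust to your checkpoint.
EXPECTED = ["config.json", "tokenizer.json"]

def missing_files(ckpt_dir):
    """Return the expected files that are absent from ckpt_dir."""
    d = Path(ckpt_dir)
    return [name for name in EXPECTED if not (d / name).is_file()]

print(missing_files("./downloads/bloom/560M/"))
```

An empty list means all expected files are present; anything else suggests the clone should be retried.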
XTRT-LLM BLOOM builds XTRT engine(s) from HF checkpoint.
Normally, build.py requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up the engine building process by adding the --parallel_build argument. Please note that the parallel_build feature currently only supports a single node.
Here are some examples:
# Build a single-XPU float16 engine from HF weights.
# Try use_gemm_plugin to prevent accuracy issues. TODO check this holds for BLOOM
# Single XPU on BLOOM 560M
python build.py --model_dir ./downloads/bloom/560M/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/bloom/560M/trt_engines/fp16/1-XPU/
# Build the BLOOM 560M using a single XPU and apply INT8 weight-only quantization.
python build.py --model_dir ./downloads/bloom/560M/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--weight_only_precision int8 \
--output_dir ./downloads/bloom/560M/trt_engines/int8_weight_only/1-XPU/
# Use 2-way tensor parallelism on BLOOM 560M
python build.py --model_dir ./downloads/bloom/560M/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/bloom/560M/trt_engines/fp16/2-XPU/ \
--world_size 2
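The INT8 and INT4 weight-only options above trade some accuracy for a smaller weight footprint. A back-of-the-envelope estimate of the weight storage for the 560M-parameter model (this ignores activations, KV cache, and engine overhead, so it is a lower bound on actual memory use):

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB for a given per-weight bit width."""
    return n_params * bits_per_weight / 8 / 2**30

n = 560_000_000  # BLOOM-560M parameter count
for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{label}: {weight_gib(n, bits):.2f} GiB")
```

INT8 weight-only halves the weight storage relative to FP16, and INT4 halves it again.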
SmoothQuant
Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant needs to load INT8 weights, which must be pre-processed before building an engine.
Example:
python3 hf_bloom_convert.py -i ./downloads/bloom/560M/ -o ./downloads/bloom-smooth/560M --smoothquant 0.5 --tensor-parallelism 1 --storage-type float16
Note that hf_bloom_convert.py runs with PyTorch:
- torch-cpu generally has better accuracy than XPyTorch.
- XPyTorch often uses more than 32GB of memory, so more XPUs are necessary to finish the conversion.
- Add -p=1 if running with XPyTorch.
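The --smoothquant 0.5 argument above sets the migration strength alpha from the SmoothQuant paper: per-channel scales move quantization difficulty from activations into weights while keeping the product mathematically unchanged. A toy sketch of the math only (plain Python on nested lists, not the hf_bloom_convert.py implementation):

```python
def smooth(activations, weights, alpha=0.5):
    """Per-channel smoothing: X' = X / s, W' = s * W, so X @ W == X' @ W'.

    activations: samples x channels (list of rows)
    weights:     channels x out_features (list of rows)
    """
    n_ch = len(weights)
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) for each input channel j
    s = []
    for j in range(n_ch):
        a_max = max(abs(row[j]) for row in activations)
        w_max = max(abs(w) for w in weights[j])
        s.append(a_max**alpha / w_max**(1 - alpha))
    x_smooth = [[row[j] / s[j] for j in range(n_ch)] for row in activations]
    w_smooth = [[w * s[j] for w in weights[j]] for j in range(n_ch)]
    return x_smooth, w_smooth
```

After smoothing, the activation outliers are smaller, which is what makes per-tensor INT8 quantization of the activations viable.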
build.py adds new options to support INT8 inference of SmoothQuant models.
--use_smooth_quant is the starting point of INT8 inference. By default, it
runs the model in per-tensor mode.
--per-token and --per-channel are not supported yet.
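Per-tensor mode means a single quantization scale is shared by the whole tensor, as opposed to the unsupported per-token and per-channel variants, which use one scale per row or per column. A minimal sketch of symmetric per-tensor INT8 quantization, assuming that is the scheme behind the per-tensor mode here:

```python
def quantize_per_tensor(values):
    """Symmetric per-tensor INT8 quantization: one scale for all values."""
    # Map the largest magnitude to 127; guard against all-zero input.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

The rounding error is bounded by half a scale step, which is why a single scale works poorly when a tensor has large outliers (the motivation for SmoothQuant's smoothing step).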
Examples of build invocations:
# Build model for SmoothQuant in the _per_tensor_ mode.
python3 build.py --bin_model_dir=./downloads/bloom-smooth/560M/1-XPU \
--use_smooth_quant \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/bloom-smooth/560M/trt_engines/fp16/1-XPU/
Note that the GPT attention plugin is currently required for SmoothQuant.
Note that we use --bin_model_dir instead of --model_dir because the SmoothQuant model needs INT8 weights and various scales from the binary files.
Run
# Summarize articles from the cnn_dailymail dataset with the FP16 engine
python ../summarize.py --test_trt_llm \
--hf_model_dir ./downloads/bloom/560M/ \
--data_type fp16 \
--engine_dir ./downloads/bloom/560M/trt_engines/fp16/1-XPU/
# Summarize with the INT8 weight-only engine
python ../summarize.py --test_trt_llm \
--hf_model_dir ./downloads/bloom/560M/ \
--data_type fp16 \
--engine_dir ./downloads/bloom/560M/trt_engines/int8_weight_only/1-XPU/
# Run inference with the FP16 engine
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom/560M/trt_engines/fp16/1-XPU/
# Run inference with the INT8 weight-only engine
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom/560M/trt_engines/int8_weight_only/1-XPU/
# Run inference with the SmoothQuant engine
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom-smooth/560M/trt_engines/fp16/1-XPU/
# Run inference with the 2-way tensor parallelism engine (requires 2 XPUs)
mpirun -n 2 --allow-run-as-root \
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom/560M/trt_engines/fp16/2-XPU/
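Conceptually, the 2-way tensor parallel engine built with --world_size 2 shards each large weight matrix across the two ranks: for a column split, each XPU computes its slice of the output, and the slices are gathered back together. A plain-Python sketch of the column-parallel case (real engines use collectives over the interconnect, not a serial loop):

```python
def matmul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def column_parallel_matmul(A, B, world_size=2):
    """Split B's columns across `world_size` ranks, then concatenate the
    partial outputs (the serial stand-in for an all-gather)."""
    cols = list(zip(*B))
    per_rank = (len(cols) + world_size - 1) // world_size
    out_cols = []
    for rank in range(world_size):
        shard = cols[rank * per_rank:(rank + 1) * per_rank]
        if not shard:
            continue
        B_shard = [list(r) for r in zip(*shard)]  # shard back to row-major
        partial = matmul(A, B_shard)              # this rank's output slice
        out_cols.extend(zip(*partial))
    return [list(r) for r in zip(*out_cols)]
```

Each rank holds only its shard of the weights, which is why tensor parallelism reduces per-XPU memory as well as compute.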