BLOOM
This document shows how to build and run a BLOOM model in XTRT-LLM on a single XPU and on a single node with multiple XPUs.
Overview
The XTRT-LLM BLOOM example code is located in examples/bloom. There are several main files in that folder:
- build.py to build the XTRT engine(s) needed to run the BLOOM model,
- run.py to run the inference on an input text,
- summarize.py to summarize the articles in the cnn_dailymail dataset using the model.
Support Matrix
- FP16
- INT8 & INT4 Weight-Only
- Tensor Parallel
Usage
The XTRT-LLM BLOOM example code is located at examples/bloom. It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
Build XTRT engine(s)
First, prepare the HF BLOOM checkpoint by following the guide here: https://huggingface.co/docs/transformers/main/en/model_doc/bloom.
For example, to download BLOOM-560M:
# Setup git-lfs
git lfs install
rm -rf ./downloads/bloom/560M/
mkdir -p ./downloads/bloom/560M/ && git clone https://huggingface.co/bigscience/bloom-560m ./downloads/bloom/560M/
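After the clone finishes, it can be worth sanity-checking that the checkpoint files actually downloaded, since a git-lfs failure can leave small pointer stubs behind instead of real weights. A minimal sketch, assuming a typical HF checkpoint layout (the exact file list below is an assumption and varies by checkpoint):

```python
from pathlib import Path

# Hypothetical minimal file set for an HF BLOOM checkpoint; adjust to your checkpoint.
EXPECTED = ["config.json", "tokenizer.json"]

def missing_files(ckpt_dir):
    """Return the expected files that are absent from ckpt_dir."""
    d = Path(ckpt_dir)
    return [name for name in EXPECTED if not (d / name).is_file()]

print(missing_files("./downloads/bloom/560M/"))
```

An empty list means all expected files are present; anything else suggests the clone should be retried.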
XTRT-LLM BLOOM builds XTRT engine(s) from HF checkpoint.
Normally, build.py requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up the engine building process by adding the --parallel_build argument. Please note that the parallel_build feature currently only supports a single node.
Here are some examples:
# Build a single-XPU float16 engine from HF weights.
# Try use_gemm_plugin to prevent accuracy issues. TODO check this holds for BLOOM
# Single XPU on BLOOM 560M
python build.py --model_dir ./downloads/bloom/560M/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/bloom/560M/trt_engines/fp16/1-XPU/
# Build the BLOOM 560M using a single XPU and apply INT8 weight-only quantization.
python build.py --model_dir ./downloads/bloom/560M/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--weight_only_precision int8 \
--output_dir ./downloads/bloom/560M/trt_engines/int8_weight_only/1-XPU/
# Use 2-way tensor parallelism on BLOOM 560M
python build.py --model_dir ./downloads/bloom/560M/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/bloom/560M/trt_engines/fp16/2-XPU/ \
--world_size 2
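The INT8 and INT4 weight-only options above trade some accuracy for a smaller weight footprint. A back-of-the-envelope estimate of the weight storage for the 560M-parameter model (this ignores activations, KV cache, and engine overhead, so it is a lower bound on actual memory use):

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB for a given per-weight bit width."""
    return n_params * bits_per_weight / 8 / 2**30

n = 560_000_000  # BLOOM-560M parameter count
for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{label}: {weight_gib(n, bits):.2f} GiB")
```

INT8 weight-only halves the weight storage relative to FP16, and INT4 halves it again.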
SmoothQuant
Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant needs to load INT8 weights, which must be pre-processed before building an engine.
Example:
python3 hf_bloom_convert.py -i ./downloads/bloom/560M/ -o ./downloads/bloom-smooth/560M --smoothquant 0.5 --tensor-parallelism 1 --storage-type float16
Note that hf_bloom_convert.py runs with PyTorch:
- torch-cpu generally has better accuracy than XPyTorch.
- XPyTorch often uses more than 32GB of memory, so more XPUs are necessary to finish the conversion.
- Add -p=1 if running with XPyTorch.
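The --smoothquant 0.5 argument above sets the migration strength alpha from the SmoothQuant paper: per-channel scales move quantization difficulty from activations into weights while keeping the product mathematically unchanged. A toy sketch of the math only (plain Python on nested lists, not the hf_bloom_convert.py implementation):

```python
def smooth(activations, weights, alpha=0.5):
    """Per-channel smoothing: X' = X / s, W' = s * W, so X @ W == X' @ W'.

    activations: samples x channels (list of rows)
    weights:     channels x out_features (list of rows)
    """
    n_ch = len(weights)
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) for each input channel j
    s = []
    for j in range(n_ch):
        a_max = max(abs(row[j]) for row in activations)
        w_max = max(abs(w) for w in weights[j])
        s.append(a_max**alpha / w_max**(1 - alpha))
    x_smooth = [[row[j] / s[j] for j in range(n_ch)] for row in activations]
    w_smooth = [[w * s[j] for w in weights[j]] for j in range(n_ch)]
    return x_smooth, w_smooth
```

After smoothing, the activation outliers are smaller, which is what makes per-tensor INT8 quantization of the activations viable.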
build.py adds new options to support INT8 inference of SmoothQuant models.
--use_smooth_quant is the starting point of INT8 inference. By default, it
runs the model in per-tensor mode.
--per-token and --per-channel are not supported yet.
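Per-tensor mode means a single quantization scale is shared by the whole tensor, as opposed to the unsupported per-token and per-channel variants, which use one scale per row or per column. A minimal sketch of symmetric per-tensor INT8 quantization, assuming that is the scheme behind the per-tensor mode here:

```python
def quantize_per_tensor(values):
    """Symmetric per-tensor INT8 quantization: one scale for all values."""
    # Map the largest magnitude to 127; guard against all-zero input.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

The rounding error is bounded by half a scale step, which is why a single scale works poorly when a tensor has large outliers (the motivation for SmoothQuant's smoothing step).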
Examples of build invocations:
# Build model for SmoothQuant in the _per_tensor_ mode.
python3 build.py --bin_model_dir=./downloads/bloom-smooth/560M/1-XPU \
--use_smooth_quant \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/bloom-smooth/560M/trt_engines/fp16/1-XPU/
Note that the GPT attention plugin is currently required for SmoothQuant.
Note that we use --bin_model_dir instead of --model_dir because the SmoothQuant model needs INT8 weights and various scales from the binary files.
Run
# Summarize articles from the cnn_dailymail dataset with the FP16 engine
python ../summarize.py --test_trt_llm \
--hf_model_dir ./downloads/bloom/560M/ \
--data_type fp16 \
--engine_dir ./downloads/bloom/560M/trt_engines/fp16/1-XPU/
# Summarize with the INT8 weight-only engine
python ../summarize.py --test_trt_llm \
--hf_model_dir ./downloads/bloom/560M/ \
--data_type fp16 \
--engine_dir ./downloads/bloom/560M/trt_engines/int8_weight_only/1-XPU/
# Run inference with the FP16 engine
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom/560M/trt_engines/fp16/1-XPU/
# Run inference with the INT8 weight-only engine
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom/560M/trt_engines/int8_weight_only/1-XPU/
# Run inference with the SmoothQuant engine
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom-smooth/560M/trt_engines/fp16/1-XPU/
# Run inference with the 2-way tensor parallelism engine (requires 2 XPUs)
mpirun -n 2 --allow-run-as-root \
python run.py --tokenizer_dir ./downloads/bloom/560M/ \
--max_output_len=50 \
--engine_dir ./downloads/bloom/560M/trt_engines/fp16/2-XPU/
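Conceptually, the 2-way tensor parallel engine built with --world_size 2 shards each large weight matrix across the two ranks: for a column split, each XPU computes its slice of the output, and the slices are gathered back together. A plain-Python sketch of the column-parallel case (real engines use collectives over the interconnect, not a serial loop):

```python
def matmul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def column_parallel_matmul(A, B, world_size=2):
    """Split B's columns across `world_size` ranks, then concatenate the
    partial outputs (the serial stand-in for an all-gather)."""
    cols = list(zip(*B))
    per_rank = (len(cols) + world_size - 1) // world_size
    out_cols = []
    for rank in range(world_size):
        shard = cols[rank * per_rank:(rank + 1) * per_rank]
        if not shard:
            continue
        B_shard = [list(r) for r in zip(*shard)]  # shard back to row-major
        partial = matmul(A, B_shard)              # this rank's output slice
        out_cols.extend(zip(*partial))
    return [list(r) for r in zip(*out_cols)]
```

Each rank holds only its shard of the weights, which is why tensor parallelism reduces per-XPU memory as well as compute.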