Baichuan

This document shows how to build and run Baichuan models (v1_7b, v1_13b, v2_7b, and v2_13b) in XTRT-LLM, on either a single XPU or multiple XPUs within a single node.

Overview

The XTRT-LLM Baichuan example code is located in examples/baichuan. The main files in that folder are:

build.py to build the XTRT engine(s) needed to run the Baichuan model.
run.py to run inference on an input text.

These scripts accept an argument named model_version, whose value must be one of v1_7b, v1_13b, v2_7b, or v2_13b. The default is v1_13b.
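As a minimal sketch of how an argument like this can be declared and validated, the snippet below uses Python's standard argparse module. It is an illustration only, not the actual code of build.py or run.py, and the make_parser name is hypothetical.

```python
import argparse

# The four versions listed in this document.
SUPPORTED_VERSIONS = ["v1_7b", "v1_13b", "v2_7b", "v2_13b"]


def make_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented --model_version behaviour."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_version",
        type=str,
        default="v1_13b",            # documented default
        choices=SUPPORTED_VERSIONS,  # argparse rejects any other value
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = make_parser().parse_args(["--model_version", "v2_13b"])
print(args.model_version)
```

Restricting the value with choices means a typo such as v2_13B fails immediately with a usage error rather than later during the build.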

Support Matrix

FP16
INT4 & INT8 Weight-Only

Usage

The XTRT-LLM Baichuan example code is located in examples/baichuan. It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.

Build XTRT engine(s)

You need to specify the HF Baichuan checkpoint path with --model_dir. For v1_13b, use either ./downloads/baichuan-13b or baichuan-inc/Baichuan-13B-Base. For v2_13b, use either baichuan-inc/Baichuan2-13B-Chat or baichuan-inc/Baichuan2-13B-Base. More Baichuan models can be found under baichuan-inc on Hugging Face.

XTRT-LLM Baichuan builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) with dummy weights.

Normally, build.py requires only a single XPU. If you already have all the XPUs needed for inference, you can build the engines in parallel by adding the --parallel_build argument, which makes the build faster. Note that the parallel_build feature currently supports only a single node.

Here are some examples that use v1_13b (v1_7b, v2_7b, and v2_13b are also supported):


# Build the Baichuan V1 13B model using a single XPU and FP16.
python build.py --model_version v1_13b \
                --model_dir ./downloads/baichuan-13b \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/baichuan-13b/fp16/tp1

# Build the Baichuan V1 13B model using a single XPU and apply INT8 weight-only quantization.
python build.py --model_version v1_13b \
                --model_dir ./downloads/baichuan-13b \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --output_dir ./downloads/baichuan-13b/int8/tp1

# Build the Baichuan V1 13B model using a single XPU and apply INT4 weight-only quantization.
python build.py --model_version v1_13b \
                --model_dir baichuan-inc/Baichuan-13B-Chat \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4 \
                --output_dir ./tmp/baichuan_v1_13b/trt_engines/int4_weight_only/1-gpu/

# Build Baichuan V1 13B using 2-way tensor parallelism and FP16.
python build.py --model_version v1_13b \
                --model_dir ./downloads/baichuan-13b \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/baichuan-13b/fp16/tp2 \
                --parallel_build \
                --world_size 2

# Build Baichuan V1 13B using 2-way tensor parallelism and apply INT8 weight-only quantization.
python build.py --model_version v1_13b \
                --model_dir ./downloads/baichuan-13b \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --output_dir ./downloads/baichuan-13b/int8/tp2 \
                --parallel_build \
                --world_size 2
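The build commands above differ only in a few options, so they can be assembled programmatically. The following is a minimal sketch, assuming only the flags that appear in the examples above. The build_command helper and its parameter names are hypothetical and are not part of XTRT-LLM.

```python
import shlex
from typing import List, Optional


def build_command(model_version: str, model_dir: str, output_dir: str,
                  weight_only_precision: Optional[str] = None,
                  world_size: int = 1,
                  parallel_build: bool = False) -> List[str]:
    """Assemble a build.py command line from the flags shown in this document."""
    cmd = ["python", "build.py",
           "--model_version", model_version,
           "--model_dir", model_dir,
           "--dtype", "float16",
           "--use_gemm_plugin", "float16",
           "--use_gpt_attention_plugin", "float16"]
    if weight_only_precision is not None:
        if weight_only_precision not in ("int8", "int4"):
            raise ValueError("weight_only_precision must be 'int8' or 'int4'")
        cmd.append("--use_weight_only")
        # The INT8 examples pass only --use_weight_only; INT4 adds the precision flag.
        if weight_only_precision == "int4":
            cmd += ["--weight_only_precision", "int4"]
    cmd += ["--output_dir", output_dir]
    if parallel_build:
        cmd.append("--parallel_build")
    if world_size > 1:
        cmd += ["--world_size", str(world_size)]
    return cmd


# Reproduces the 2-way tensor-parallel INT8 weight-only example as one line.
cmd = build_command("v1_13b", "./downloads/baichuan-13b",
                    "./downloads/baichuan-13b/int8/tp2",
                    weight_only_precision="int8",
                    world_size=2, parallel_build=True)
print(shlex.join(cmd))
```

Returning a list rather than a string lets the result be passed directly to subprocess.run without shell quoting concerns.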

Run

Before running the examples, make sure to set the following environment variables:

export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Prevent XPytorch from caching XPU memory.
export XMLIR_D_XPU_L3_SIZE=0           # Prevent XPytorch from using L3.

If you are running with multiple XPUs and no L3 space, you can set BKCL_CCIX_BUFFER_GM=1 to disable L3.
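When the examples are launched from a Python script rather than a shell, the same variables can be passed through the child process environment. This is a minimal sketch using only the variable names and values given above. The xtrt_env helper and its parameters are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def xtrt_env(multi_xpu_no_l3: bool = False,
             base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the documented variables set."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_NO_XPU_MEMORY_CACHING"] = "0"
    env["XMLIR_D_XPU_L3_SIZE"] = "0"
    if multi_xpu_no_l3:
        # Only for the multi-XPU, no-L3 case described above.
        env["BKCL_CCIX_BUFFER_GM"] = "1"
    return env


# Usage (not executed here):
#   import subprocess
#   subprocess.run(["python", "run.py", "--model_version", "v1_13b"],
#                  env=xtrt_env(), check=True)
```

Building a fresh dictionary instead of mutating os.environ keeps the settings scoped to the launched process.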

To run an XTRT-LLM Baichuan model, use the engines generated by build.py. Here are some examples:

# Generate summarization for a given input text
python summarize.py --model_version v2_13b \
                    --hf_model_location ./downloads/baichuan2-13b \
                    --engine_dir ./downloads/baichuan2-13b/fp16/tp1/ \
                    --log_level info

# With FP16 inference
python run.py --model_version v1_13b \
              --max_output_len=50 \
              --tokenizer_dir ./downloads/baichuan-13b \
              --log_level=info \
              --engine_dir=./downloads/baichuan-13b/fp16/tp1

# With INT8 weight-only quantization inference
python run.py --model_version v1_13b \
              --max_output_len=50 \
              --tokenizer_dir=./downloads/baichuan-13b \
              --log_level=info \
              --engine_dir=./downloads/baichuan-13b/int8/tp1

# With INT4 weight-only quantization inference
python run.py --model_version v1_13b \
              --max_output_len=50 \
              --tokenizer_dir=baichuan-inc/Baichuan-13B-Chat \
              --engine_dir=./tmp/baichuan_v1_13b/trt_engines/int4_weight_only/1-gpu/

# With FP16 and 2-way tensor parallelism inference
mpirun -n 2 --allow-run-as-root \
    python run.py --model_version v1_13b \
                  --max_output_len=50 \
                  --tokenizer_dir=./downloads/baichuan-13b \
                  --log_level=info \
                  --engine_dir=./downloads/baichuan-13b/fp16/tp2

# With INT8 weight-only and 2-way tensor parallelism inference
mpirun -n 2 --allow-run-as-root \
    python run.py --model_version v1_13b \
                  --max_output_len=50 \
                  --tokenizer_dir=./downloads/baichuan-13b \
                  --log_level=info \
                  --engine_dir=./downloads/baichuan-13b/int8/tp2
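The single-XPU and tensor-parallel run commands above differ only in the mpirun prefix, whose -n value matches the tensor-parallel degree the engines were built with. The sketch below makes that relationship explicit, assuming only the flags shown in the examples. The run_command helper is hypothetical.

```python
from typing import List


def run_command(model_version: str, tokenizer_dir: str, engine_dir: str,
                max_output_len: int = 50, world_size: int = 1) -> List[str]:
    """Assemble a run.py command line, prefixed with mpirun for multi-XPU runs."""
    cmd = ["python", "run.py",
           "--model_version", model_version,
           f"--max_output_len={max_output_len}",
           f"--tokenizer_dir={tokenizer_dir}",
           "--log_level=info",
           f"--engine_dir={engine_dir}"]
    if world_size > 1:
        # One MPI process per XPU, as in the 2-way tensor-parallel examples.
        cmd = ["mpirun", "-n", str(world_size), "--allow-run-as-root"] + cmd
    return cmd


single = run_command("v1_13b", "./downloads/baichuan-13b",
                     "./downloads/baichuan-13b/fp16/tp1")
tp2 = run_command("v1_13b", "./downloads/baichuan-13b",
                  "./downloads/baichuan-13b/fp16/tp2", world_size=2)
print(" ".join(tp2))
```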

Known Issues

The implementation of the Baichuan-7B model with INT8 Weight-Only and Tensor Parallelism greater than 2 might have accuracy issues. It is under investigation.
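A launcher script could warn before building a configuration that matches this known issue. The following is a minimal sketch. The hits_known_issue helper is hypothetical, and it assumes conservatively that both 7B versions (v1_7b and v2_7b) are affected, which the note above does not state explicitly.

```python
from typing import Optional


def hits_known_issue(model_version: str,
                     weight_only_precision: Optional[str],
                     tp_size: int) -> bool:
    """True for a 7B model with INT8 weight-only and tensor parallelism > 2.

    Assumption: any model_version ending in "_7b" counts as Baichuan-7B.
    """
    return (model_version.endswith("_7b")
            and weight_only_precision == "int8"
            and tp_size > 2)


if hits_known_issue("v1_7b", "int8", 4):
    print("Warning: this configuration might have accuracy issues.")
```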