EngineX-Kunlunxin/r200_8f_xtrt_llm

Yang Jun bf00e72fb2 add pkgs

2025-08-06 15:49:14 +08:00

README.md

LLaMA

This document shows how to build and run a LLaMA model in XTRT-LLM on a single XPU and on a single node with multiple XPUs.

Overview

The XTRT-LLM LLaMA example code is located in examples/llama. There are two main files in that folder:

build.py to build the engine(s) needed to run the LLaMA model,
run.py to run inference on an input text.

Support Matrix

FP16
INT8 & INT4 Weight-Only
Tensor Parallel
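
To make the "weight-only" entry concrete, here is a minimal sketch of symmetric per-output-channel INT8 weight quantization in plain Python. It only illustrates the general technique; the function names are hypothetical, and this is an assumption about the idea, not the kernel or storage layout XTRT-LLM actually uses.

```python
def quantize_weight_only_int8(weights):
    """Quantize each output-channel row to integers in [-127, 127].

    Hypothetical helper for illustration only. Returns the integer rows
    and one floating-point scale per row.
    """
    q_rows, scales = [], []
    for row in weights:
        absmax = max(abs(v) for v in row)
        # Avoid dividing by zero for an all-zero row.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from integers and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


example_weights = [[0.5, -1.27, 0.01], [0.0, 0.0, 0.0]]
q_rows, scales = quantize_weight_only_int8(example_weights)
print(q_rows)
print(dequantize(q_rows, scales))
```

Only the weights are stored in low precision in this scheme; activations stay in floating point, which is why the build commands still pass --dtype float16.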

Usage

The example code in examples/llama takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.

Build XTRT engine(s)

First, prepare the HF LLaMA checkpoint by following the guide at https://huggingface.co/docs/transformers/main/en/model_doc/llama.

XTRT-LLM LLaMA builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) with dummy weights.

Normally build.py requires only a single XPU. If you already have all the XPUs needed for inference, you can speed up engine building by adding the --parallel_build argument, which builds the engines in parallel. Note that --parallel_build currently supports only a single node.

Here are some examples:

# Build a single-XPU float16 engine from HF weights.
# use_gpt_attention_plugin is necessary in LLaMA.
# It is recommended to use --use_gpt_attention_plugin for better performance.

# Build the LLaMA 7B model using a single XPU and FP16.
python build.py --model_dir ./downloads/llama-7b-hf/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/llama-7b-hf/trt_engines/fp16/1-XPU/


# Build the LLaMA 7B model using a single XPU and apply INT8 weight-only quantization.
python build.py --model_dir ./downloads/llama-7b-hf/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --output_dir ./downloads/llama-7b-hf/trt_engines/weight_only/1-XPU/

# Build LLaMA 7B using 2-way tensor parallelism.
python build.py --model_dir ./downloads/llama-7b-hf/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/llama-7b-hf/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2 \
                --parallel_build


# Build LLaMA 13B using 2-way tensor parallelism.
python build.py --model_dir ./downloads/llama13b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/llama13b/trt_engines/fp16/2-XPU/ \
                --world_size 2 \
                --tp_size 2 \
                --parallel_build
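
The build commands above differ only in a few arguments, so they can be assembled programmatically. The following is a minimal sketch under stated assumptions: build_command is a hypothetical helper, it uses only flags that appear in the examples above, and it mirrors those examples by setting --world_size equal to --tp_size.

```python
import shlex
from typing import List


def build_command(model_dir: str, output_dir: str, tp_size: int = 1,
                  weight_only: bool = False, dtype: str = "float16") -> List[str]:
    """Assemble an argv list for build.py (hypothetical helper)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    cmd = [
        "python", "build.py",
        "--model_dir", model_dir,
        "--dtype", dtype,
        "--use_gpt_attention_plugin", dtype,
        "--output_dir", output_dir,
    ]
    if weight_only:
        cmd.append("--use_weight_only")
    if tp_size > 1:
        # As in the examples above: world_size == tp_size, built in parallel.
        cmd += ["--world_size", str(tp_size),
                "--tp_size", str(tp_size),
                "--parallel_build"]
    return cmd


print(shlex.join(build_command(
    "./downloads/llama-7b-hf/",
    "./downloads/llama-7b-hf/trt_engines/fp16/2-XPU/",
    tp_size=2,
)))
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting issues with paths.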

LLaMA v2 Updates

The 7B and 13B LLaMA v2 models are compatible with the LLaMA v1 implementation, so the commands above still work.

For LLaMA v2 70B, tensor parallelism is restricted: the number of KV heads must be divisible by the number of XPUs. For example, since the 70B model has 8 KV heads, you can run it with 2, 4 or 8 XPUs.

# Build LLaMA 70B using 8-way tensor parallelism.
python build.py --model_dir ./downloads/llama2-70b/ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./downloads/llama2-70b/trt_engines/fp16/8-XPU/ \
                --world_size 8 \
                --tp_size 8 \
                --parallel_build
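
The divisibility restriction on KV heads can be checked before starting a build. This is a small sketch; valid_tp_sizes is a hypothetical helper name and is not part of XTRT-LLM. A size of 1 passes the check trivially, whether or not the model fits on one XPU.

```python
from typing import List


def valid_tp_sizes(num_kv_heads: int, max_xpus: int) -> List[int]:
    """Return the XPU counts up to max_xpus that divide num_kv_heads evenly."""
    if num_kv_heads < 1 or max_xpus < 1:
        raise ValueError("num_kv_heads and max_xpus must be positive")
    return [n for n in range(1, max_xpus + 1) if num_kv_heads % n == 0]


# LLaMA v2 70B has 8 KV heads, as stated above.
print(valid_tp_sizes(num_kv_heads=8, max_xpus=8))  # [1, 2, 4, 8]
```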

The same instructions apply to fine-tuned versions of the LLaMA v2 models (e.g. 7Bf or llama-2-7b-chat).

Test with summarize.py (install the dependencies first):

pip install nltk rouge_score

python summarize.py --test_trt_llm \
                    --hf_model_location ./downloads/llama-7b-hf \
                    --data_type fp16 \
                    --engine_dir ./downloads/llama-7b-hf/trt_engines/fp16/1-XPU

SmoothQuant

SmoothQuant supports both LLaMA v1 and LLaMA v2. Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant needs INT8 weights that must be pre-processed before building an engine.

Example:

python3 hf_llama_convert.py -i ./downloads/llama-7b-hf -o ./downloads/smooth_llama_7B/sq0.8/ -sq 0.8 --tensor-parallelism 1 --storage-type fp16

Note that hf_llama_convert.py runs with PyTorch, and:

torch-cpu generally has better accuracy than XPyTorch.
XPyTorch often uses more than 32 GB of GM, so more XPUs are needed to finish the conversion.
Add -p=1 if running with XPyTorch.
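
For background on what the -sq value controls, here is a minimal sketch of the smoothing step from the SmoothQuant method, assuming -sq corresponds to the method's α parameter. Each input channel j gets a scale s_j = max|X_j|^α / max|W_j|^(1-α); activations are divided by it and weights multiplied by it, so the matrix product is unchanged while activation outliers shrink. All names and values below are illustrative, and this is not the code in hf_llama_convert.py.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha, eps=1e-5):
    """Per-input-channel scale: max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]


def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy data: 2 tokens x 2 input channels, and 2 input channels x 2 outputs.
X = [[1.0, -40.0], [0.5, 20.0]]   # channel 1 holds activation outliers
W = [[0.2, -0.1], [0.05, 0.3]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(2)]
weight_absmax = [max(abs(v) for v in W[j]) for j in range(2)]
s = smoothing_scales(act_absmax, weight_absmax, alpha=0.8)

X_s = [[v / s[j] for j, v in enumerate(row)] for row in X]  # smoothed activations
W_s = [[v * s[j] for v in W[j]] for j in range(2)]          # compensated weights

print(matmul(X, W))
print(matmul(X_s, W_s))  # same product up to floating-point rounding
```

A larger α moves more of the quantization difficulty from the activations to the weights.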

We offer converted data here for LLaMA-7B with an sq value of 0.6.

build.py adds new options to support INT8 inference of SmoothQuant models.

--use_smooth_quant is the starting point of INT8 inference. By default, it runs the model in per-tensor mode.

--per-token and --per-channel are not supported yet.

Examples of build invocations:

# Build model for SmoothQuant in the _per_tensor_ mode.
python3 build.py --ft_model_dir=./downloads/smooth_llama_7B/sq0.8/1-XPU/ \
                 --use_smooth_quant \
                 --output_dir ./downloads/smooth_llama_7B/sq0.8/trt_engines/fp16/1-XPU/

Note that we use --ft_model_dir instead of --model_dir and --meta_ckpt_dir, since a SmoothQuant model needs INT8 weights and various scales from the binary files.

Run

Before running the examples, make sure to set the following environment variables:

export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPyTorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0           # disable XPyTorch use of L3.

If you are running with multiple XPUs and no L3 space, you can set BKCL_CCIX_BUFFER_GM=1 to disable L3.
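
When the examples are launched from a Python script instead of a shell, the same variables can be placed in the child process environment. This is a minimal sketch; xtrt_run_env is a hypothetical helper, and the variable names and values are copied from the text above.

```python
import os
from typing import Dict, Optional


def xtrt_run_env(base: Optional[Dict[str, str]] = None,
                 multi_xpu_no_l3: bool = False) -> Dict[str, str]:
    """Return a copy of the environment with the variables listed above set."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_NO_XPU_MEMORY_CACHING"] = "0"
    env["XMLIR_D_XPU_L3_SIZE"] = "0"
    if multi_xpu_no_l3:
        # Only for multi-XPU runs with no L3 space, as described above.
        env["BKCL_CCIX_BUFFER_GM"] = "1"
    return env


env = xtrt_run_env(multi_xpu_no_l3=True)
print({k: env[k] for k in ("PYTORCH_NO_XPU_MEMORY_CACHING",
                           "XMLIR_D_XPU_L3_SIZE",
                           "BKCL_CCIX_BUFFER_GM")})
```

The resulting dictionary can be passed as the env argument of subprocess.run when invoking run.py, which leaves the parent process environment untouched.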

To run an XTRT-LLM LLaMA model using the engines generated by build.py:

# With fp16 inference
python3 run.py --max_output_len=50 \
               --tokenizer_dir ./downloads/llama-7b-hf/ \
               --engine_dir=./downloads/llama-7b-hf/trt_engines/fp16/1-XPU/

# With fp16 inference, SmoothQuant
python3 run.py --max_output_len=50 \
               --tokenizer_dir ./downloads/llama-7b-hf/ \
               --engine_dir=./downloads/smooth_llama_7B/sq0.8/trt_engines/fp16/1-XPU/

Summarization using the LLaMA model

# Run summarization using the LLaMA 7B model in FP16.
python summarize.py --test_trt_llm \
                    --hf_model_location ./downloads/llama-7b-hf/ \
                    --data_type fp16 \
                    --engine_dir ./downloads/llama-7b-hf/trt_engines/fp16/1-XPU/

# Run summarization using the LLaMA 7B model quantized to INT8.
python summarize.py --test_trt_llm \
                    --hf_model_location ./downloads/llama-7b-hf/ \
                    --data_type fp16 \
                    --engine_dir ./downloads/llama-7b-hf/trt_engines/weight_only/1-XPU/

# Run summarization using the LLaMA 7B model in FP16 using two XPUs.
mpirun -n 2 --allow-run-as-root \
    python summarize.py --test_trt_llm \
                        --hf_model_location ./downloads/llama-7b-hf/ \
                        --data_type fp16 \
                        --engine_dir ./downloads/llama-7b-hf/trt_engines/fp16/2-XPU/
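
The single-XPU and multi-XPU summarization commands above differ only in the mpirun prefix. The sketch below captures that rule, assuming one MPI rank per XPU as in the two-XPU example; summarize_command is a hypothetical helper that uses only the flags shown above.

```python
from typing import List


def summarize_command(hf_model_location: str, engine_dir: str,
                      num_xpus: int = 1, data_type: str = "fp16") -> List[str]:
    """Assemble an argv list for summarize.py, adding mpirun for multi-XPU runs."""
    if num_xpus < 1:
        raise ValueError("num_xpus must be at least 1")
    cmd = [
        "python", "summarize.py", "--test_trt_llm",
        "--hf_model_location", hf_model_location,
        "--data_type", data_type,
        "--engine_dir", engine_dir,
    ]
    if num_xpus > 1:
        # One rank per XPU, matching the two-XPU example above.
        cmd = ["mpirun", "-n", str(num_xpus), "--allow-run-as-root"] + cmd
    return cmd


print(summarize_command("./downloads/llama-7b-hf/",
                        "./downloads/llama-7b-hf/trt_engines/fp16/2-XPU/",
                        num_xpus=2))
```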