EngineX-Kunlunxin/r200_8f_xtrt_llm

Yang Jun bf00e72fb2 add pkgs

2025-08-06 15:49:14 +08:00

ChatGLM

This document explains how to build the ChatGLM-6B, ChatGLM2-6B, ChatGLM2-6B-32k, ChatGLM3-6B, ChatGLM3-6B-Base, and ChatGLM3-6B-32k models using XTRT-LLM and run them on a single XPU, a single node with multiple XPUs, or multiple nodes with multiple XPUs.

Overview

The XTRT-LLM ChatGLM implementation can be found in xtrt_llm/models/chatglm/model.py. The XTRT-LLM ChatGLM example code is located in examples/chatglm. There are two main files:

build.py to build the XTRT engine(s) needed to run the ChatGLM model.
run.py to run the inference on an input text.

Support Matrix

Model Name	FP16	FMHA	WO	TP
chatglm_6b	Y	Y	Y	Y
chatglm2_6b	Y	Y	Y	Y
chatglm2_6b_32k	Y	Y	Y	Y
chatglm3_6b	Y	Y	Y	Y
chatglm3_6b_base	Y	Y	Y	Y
chatglm3_6b_32k	Y	Y	Y	Y
glm_10b	Y	Y	Y	Y

Model Name: the name of the model, matching the name on HuggingFace with "-" replaced by "_"
FMHA: Fused MultiHead Attention (see introduction below)
WO: Weight Only Quantization (int8 / int4)
AWQ: Activation Aware Weight Quantization
SQ: Smooth Quantization
ST: Strongly Typed
TP: Tensor Parallel
PP: Pipeline Parallel
IFB: In-flight Batching (see introduction below)

Usage

The following sections describe how to build the engine and run the inference demo.

1. Download repo and weights from HuggingFace Transformers

pip install -r requirements.txt
apt-get update
apt-get install git-lfs
rm -rf chatglm*

# clone one or more models we want to build
git clone https://huggingface.co/THUDM/chatglm-6b       chatglm_6b
git clone https://huggingface.co/THUDM/chatglm2-6b      chatglm2_6b
git clone https://huggingface.co/THUDM/chatglm2-6b-32k  chatglm2_6b_32k
git clone https://huggingface.co/THUDM/chatglm3-6b      chatglm3_6b
git clone https://huggingface.co/THUDM/chatglm3-6b-base chatglm3_6b_base
git clone https://huggingface.co/THUDM/chatglm3-6b-32k  chatglm3_6b_32k
git clone https://huggingface.co/THUDM/glm-10b          glm_10b
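
The clone commands above follow one naming rule: the local directory is the HuggingFace repository name with "-" replaced by "_". A minimal Python sketch of that rule is shown here; `local_dir` and `clone_command` are hypothetical helpers, not part of this repository, and the script only prints the commands rather than running them.

```python
from typing import List

# Repository ids taken from the clone commands in this README.
HF_REPOS = [
    "THUDM/chatglm-6b",
    "THUDM/chatglm2-6b",
    "THUDM/chatglm2-6b-32k",
    "THUDM/chatglm3-6b",
    "THUDM/chatglm3-6b-base",
    "THUDM/chatglm3-6b-32k",
    "THUDM/glm-10b",
]


def local_dir(repo_id: str) -> str:
    """Hypothetical helper: map 'THUDM/chatglm2-6b-32k' to 'chatglm2_6b_32k'."""
    return repo_id.split("/")[-1].replace("-", "_")


def clone_command(repo_id: str) -> List[str]:
    """Build (but do not execute) the git clone command for one model."""
    return ["git", "clone", f"https://huggingface.co/{repo_id}", local_dir(repo_id)]


if __name__ == "__main__":
    for repo in HF_REPOS:
        print(" ".join(clone_command(repo)))
```

The resulting directory names are the same strings that are later passed to build.py and run.py through -m.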

2. Build XTRT engine(s)

This ChatGLM example in XTRT-LLM builds XTRT engine(s) directly from the HF checkpoint (rather than from FT checkpoints, as the GPT example does).
If no checkpoint directory is specified, XTRT-LLM will build engine(s) using dummy weights.
The build.py script requires a single XPU to build the XTRT engine(s).
If your system has more than one XPU of the same model, you can enable parallel builds to speed up engine building.
For parallel building, add the --parallel_build argument to the build command (this feature is limited to a single node).
The number of XTRT engines depends on the number of XPUs that will be used to run inference.
The --model_name/-m argument is required. It must be one of "chatglm_6b", "chatglm2_6b", "chatglm2_6b_32k", "chatglm3_6b", "chatglm3_6b_base", "chatglm3_6b_32k" or "glm_10b" (use "_" rather than "-"), selecting the ChatGLM-6B, ChatGLM2-6B, ChatGLM2-6B-32K, ChatGLM3-6B, ChatGLM3-6B-Base, ChatGLM3-6B-32K or GLM-10B model respectively.
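
Because the model name must use "_" rather than "-", a small pre-check before calling build.py can catch a mistyped name early. The sketch below is an illustration only: `normalize_model_name` is a hypothetical wrapper-side helper, and nothing is assumed about how build.py itself validates the argument.

```python
# Accepted values for --model_name/-m, as listed in this README.
SUPPORTED_MODELS = (
    "chatglm_6b",
    "chatglm2_6b",
    "chatglm2_6b_32k",
    "chatglm3_6b",
    "chatglm3_6b_base",
    "chatglm3_6b_32k",
    "glm_10b",
)


def normalize_model_name(name: str) -> str:
    """Hypothetical helper: lower-case the name, turn '-' into '_', then validate."""
    candidate = name.strip().lower().replace("-", "_")
    if candidate not in SUPPORTED_MODELS:
        raise ValueError(
            f"unsupported model name {name!r}; expected one of {', '.join(SUPPORTED_MODELS)}"
        )
    return candidate


if __name__ == "__main__":
    print(normalize_model_name("ChatGLM2-6B-32k"))
```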

Examples of build invocations

# Build a default engine of ChatGLM3-6B on a single XPU with FP16, GPT Attention plugin, Gemm plugin, RMS Normalization plugin
python3 build.py -m chatglm3_6b

# Build an engine on a single XPU with FMHA kernels (see introduction below), other configurations are the same as the default example
python3 build.py -m chatglm3_6b --enable_context_fmha  # or --enable_context_fmha_fp32_acc

# Build an engine on a single XPU with int8/int4 Weight-Only quantization, other configurations are the same as the default example
python3 build.py -m chatglm3_6b --use_weight_only  # or --use_weight_only --weight_only_precision int4

# Build an engine on a single XPU with paged_kv_cache and remove_input_padding, other configurations are the same as the default example
python3 build.py -m chatglm3_6b --paged_kv_cache --remove_input_padding

# Build an engine on two XPUs, other configurations are the same as the default example
python3 build.py -m chatglm3_6b --world_size 2

# Build an engine of ChatGLM-6B on a single XPU, other configurations are the same as the default example
python3 build.py -m chatglm_6b

# Build an engine of ChatGLM2-6B on a single XPU, other configurations are the same as the default example
python3 build.py -m chatglm2_6b

# Build an engine of ChatGLM2-6B-32k on a single XPU, other configurations are the same as the default example
python3 build.py -m chatglm2_6b_32k

# Build an engine of ChatGLM3-6B-Base on a single XPU, other configurations are the same as the default example
python3 build.py -m chatglm3_6b_base

# Build an engine of ChatGLM3-6B-32k on a single XPU, other configurations are the same as the default example
python3 build.py -m chatglm3_6b_32k

# Build an engine of GLM-10B on a single XPU, other configurations are the same as the default example
python3 build.py -m glm_10b
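
The invocations above differ only in which optional flags are appended. As a rough sketch, a script that drives several builds could assemble the argument list as follows. `build_command` is a hypothetical helper; it uses only flags that appear in this README and returns the command without executing it.

```python
from typing import List, Optional


def build_command(
    model_name: str,
    world_size: int = 1,
    weight_only_precision: Optional[str] = None,
    context_fmha: bool = False,
    paged_kv_cache: bool = False,
    remove_input_padding: bool = False,
    parallel_build: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble a build.py command line as a list of arguments."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    cmd = ["python3", "build.py", "-m", model_name]
    if world_size > 1:
        cmd += ["--world_size", str(world_size)]
    if parallel_build:
        cmd.append("--parallel_build")
    if context_fmha:
        cmd.append("--enable_context_fmha")
    if weight_only_precision is not None:
        if weight_only_precision not in ("int8", "int4"):
            raise ValueError("weight_only_precision must be 'int8' or 'int4'")
        cmd += ["--use_weight_only", "--weight_only_precision", weight_only_precision]
    if paged_kv_cache:
        cmd.append("--paged_kv_cache")
    if remove_input_padding:
        cmd.append("--remove_input_padding")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("chatglm3_6b", world_size=2, weight_only_precision="int4")))
```

Returning a list rather than a single string keeps the result directly usable with `subprocess.run` should the builds be automated.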

Enabled plugins

Use --use_gpt_attention_plugin <DataType> to configure the GPT Attention plugin (default: float16)
Use --use_gemm_plugin <DataType> to configure the GEMM plugin (default: float16)
Use --use_layernorm_plugin <DataType> (for ChatGLM-6B and GLM-10B models) to configure the layer normalization plugin (default: float16)
Use --use_rmsnorm_plugin <DataType> (for ChatGLM2-6B* and ChatGLM3-6B* models) to configure the RMS normalization plugin (default: float16)

Weight Only quantization

Use --use_weight_only to enable INT8 Weight-Only quantization, which significantly lowers latency and memory footprint.
To choose the data type of the weights explicitly, add --weight_only_precision int8 or --weight_only_precision int4.
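
To give a sense of scale for the memory saving, the sketch below estimates the storage taken by the weights alone at each precision. The parameter count of 6 billion is a nominal assumption for a "6B" model, and the estimate ignores quantization scales, any layers left unquantized, activations and the KV cache, so real footprints will differ.

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"float16": 16, "int8": 8, "int4": 4}

# Assumption: a nominal 6e9 parameters for a "6B" model (not an exact count).
NOMINAL_PARAMS = 6_000_000_000


def weight_bytes(num_params: int, precision: str) -> int:
    """Bytes needed to store num_params weights at the given precision."""
    return num_params * BITS_PER_WEIGHT[precision] // 8


if __name__ == "__main__":
    for precision in ("float16", "int8", "int4"):
        gigabytes = weight_bytes(NOMINAL_PARAMS, precision) / 1e9
        print(f"{precision}: {gigabytes:.1f} GB")
    # Prints:
    # float16: 12.0 GB
    # int8: 6.0 GB
    # int4: 3.0 GB
```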

In-flight batching

The engine must be built accordingly if in-flight batching in the C++ runtime will be used.
Use --use_inflight_batching to enable In-flight Batching.
The switches --use_gpt_attention_plugin=float16, --paged_kv_cache and --remove_input_padding are set automatically when In-flight Batching is enabled.
It is possible to use --use_gpt_attention_plugin float32 with In-flight Batching.
The size of a block in the paged KV cache can additionally be controlled with --tokens_per_block=N.
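
The effect of --tokens_per_block can be illustrated with generic paged-cache arithmetic: a sequence occupies a whole number of blocks, so the last block is usually only partly filled. The sketch below is not XTRT-LLM's allocator, and the block sizes used are illustrative values rather than defaults of build.py.

```python
def blocks_needed(num_tokens: int, tokens_per_block: int) -> int:
    """Number of KV cache blocks needed to hold num_tokens (ceiling division)."""
    if tokens_per_block <= 0:
        raise ValueError("tokens_per_block must be positive")
    return (num_tokens + tokens_per_block - 1) // tokens_per_block


def unused_slots(num_tokens: int, tokens_per_block: int) -> int:
    """Token slots left empty in the last block of the sequence."""
    return blocks_needed(num_tokens, tokens_per_block) * tokens_per_block - num_tokens


if __name__ == "__main__":
    # Illustrative values only.
    for n in (64, 128):
        print(n, blocks_needed(2049, n), unused_slots(2049, n))
    # Prints:
    # 64 33 63
    # 128 17 127
```

In this toy model, smaller blocks waste fewer slots per sequence but require more blocks to be tracked.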

3. Run

Single node, single XPU

# Run the default engine of ChatGLM3-6B on a single XPU; other model names can be used if their engines were built.
python3 run.py -m chatglm3_6b
# Run the default engine of ChatGLM3-6B on a single XPU with streaming output; other model names can be used if their engines were built.
# In this case only the first sample in the first batch is shown,
# but the outputs of all batches are actually available.
python3 run.py -m chatglm3_6b --streaming
# Run the default engine of GLM-10B on a single XPU; other model names can be used if their engines were built.
# The token "[MASK]", "[sMASK]" or "[gMASK]" must be included in the prompt, as the original model requires.
python3 run.py -m glm_10b --input_text "Peking University is [MASK] than Tsinghua University."

Single node, multi XPU

# Run the Tensor Parallel 2 engine of ChatGLM3-6B on two XPUs; other model names can be used if their engines were built.
mpirun -n 2 python3 run.py -m chatglm3_6b

--allow-run-as-root might be needed if using mpirun as root.
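
As a sketch of how the multi-XPU launch line is put together, the helper below builds the mpirun command for a given number of processes and adds --allow-run-as-root only on request. `mpirun_command` is a hypothetical helper, and it assumes the process count passed to -n equals the --world_size used when the engine was built.

```python
from typing import List


def mpirun_command(model_name: str, world_size: int, as_root: bool = False) -> List[str]:
    """Hypothetical helper: build (but do not run) the launch command for run.py."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    cmd = ["mpirun", "-n", str(world_size)]
    if as_root:
        # Needed by some MPI installations when the launching user is root.
        cmd.append("--allow-run-as-root")
    cmd += ["python3", "run.py", "-m", model_name]
    return cmd


if __name__ == "__main__":
    print(" ".join(mpirun_command("chatglm3_6b", 2, as_root=True)))
```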

Run comparison of performance and accuracy

# Run the summarization task with ChatGLM3-6B; other model names can be used if their engines were built.
python3 ../summarize.py --test_trt_llm --tokenizer_dir chatglm3_6b --max_input_length 2048

4. Note

When --paged_kv_cache is set, run vllm_test/test_llm_engine.py instead of run.py.
Accuracy of multi-batch chatglm2/3 is not available in padding mode.
--remove_input_padding is not available for chatglm_6b.
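
Two of the notes above can be encoded as a small pre-flight check in a wrapper script. The sketch below is one possible reading of those notes, and `select_runner` is a hypothetical helper rather than something shipped in this repository.

```python
def select_runner(
    model_name: str,
    paged_kv_cache: bool = False,
    remove_input_padding: bool = False,
) -> str:
    """Hypothetical helper: reject an unsupported combination and pick the script to run."""
    if remove_input_padding and model_name == "chatglm_6b":
        raise ValueError("--remove_input_padding is not available for chatglm_6b")
    if paged_kv_cache:
        # Engines built with --paged_kv_cache are run through the vllm_test script.
        return "vllm_test/test_llm_engine.py"
    return "run.py"


if __name__ == "__main__":
    print(select_runner("chatglm3_6b", paged_kv_cache=True, remove_input_padding=True))
```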