
# GPT-NeoX

This document explains how to build the GPT-NeoX model using XTRT-LLM and run it on a single node with multiple XPUs.

## Overview

The XTRT-LLM GPT-NeoX example code is located in `examples/gptneox`. There are two main files in that folder:

  • `build.py` to build the XTRT engine(s) needed to run the GPT-NeoX model.
  • `run.py` to run inference on an input text.

## Support Matrix

  • FP16
  • INT8 Weight-Only
  • Tensor Parallel
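INT8 weight-only mode stores the weights in 8 bits alongside a floating-point scale and dequantizes them on the fly inside the GEMM, while activations stay in higher precision. The NumPy sketch below illustrates the general idea of symmetric per-channel weight quantization; it is an illustration of the technique, not XTRT-LLM's actual implementation.

```python
import numpy as np

def quantize_weight_only_int8(w):
    """Symmetric per-output-channel INT8 quantization: q = round(w / scale)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one fp32 scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    """Recover an approximate fp32 weight for use in the GEMM."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # toy weight matrix
q, scale = quantize_weight_only_int8(w)
w_hat = dequantize(q, scale)  # close to w; error bounded by scale / 2 per element
```

Because only the weights are stored in INT8, this roughly halves weight memory relative to FP16 at a small accuracy cost.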

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# Weights & config
sh get_weights.sh
```

### 2. Build XTRT engine(s)

XTRT-LLM builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) with dummy weights.

Examples of build invocations:

```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to improve runtime performance; this also helps reduce build time.
python3 build.py --dtype=float16                    \
                 --log_level=verbose                \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16          \
                 --use_layernorm_plugin float16     \
                 --max_batch_size=16                \
                 --max_input_len=1024               \
                 --max_output_len=1024              \
                 --world_size=2                     \
                 --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/    \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```

```bash
# Build an engine using 2-way tensor parallelism and HF weights, with INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to improve runtime performance; this also helps reduce build time.
python3 build.py --dtype=float16                    \
                 --log_level=verbose                \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16          \
                 --use_layernorm_plugin float16     \
                 --max_batch_size=16                \
                 --max_input_len=1024               \
                 --max_output_len=1024              \
                 --world_size=2                     \
                 --use_weight_only                  \
                 --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/    \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```
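The `--world_size=2` flag requests 2-way tensor parallelism: each rank holds a shard of every weight matrix, computes a partial result, and the results are combined across XPUs by a collective. The toy NumPy sketch below shows the underlying column-parallel matmul idea on a single process (the real runtime runs one rank per device and uses an all-gather instead of `concatenate`):

```python
import numpy as np

# Toy column-parallel linear layer y = x @ w, "sharded" over 2 ranks.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 16)).astype(np.float32)  # activations (replicated on both ranks)
w = rng.standard_normal((16, 8)).astype(np.float32)  # full weight matrix

w0, w1 = np.split(w, 2, axis=1)       # each rank stores half of the output columns
y0 = x @ w0                           # rank 0 computes its partial output
y1 = x @ w1                           # rank 1 computes its partial output
y = np.concatenate([y0, y1], axis=1)  # gather of the partial outputs
```

Splitting by output columns keeps each partial GEMM independent, which is why the two ranks in the `mpirun -n 2` examples below can run concurrently.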

### 3. Run

Before running the examples, make sure to set the following environment variables:

```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch XPU memory caching.
export XMLIR_D_XPU_L3_SIZE=0           # Disable XPytorch L3 cache usage.
```

If NOT using an R480-X8, make sure to also set the following environment variable:

```bash
export BKCL_PCIE_RING=1
```

To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`:

```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/  \
    --tokenizer_dir=./downloads/gptneox_model
```

```bash
# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/  \
    --tokenizer_dir=./downloads/gptneox_model
```