GPT-J

This document explains how to build the GPT-J model with XTRT-LLM and run it on a single XPU.

Overview

The XTRT-LLM GPT-J example code is located in examples/gptj. The two main files in that folder are:

  • build.py, which builds the XTRT engine(s) needed to run the GPT-J model,
  • run.py, which runs inference on an input text.

Support Matrix

  • FP16

Usage

1. Download weights from HuggingFace (HF) Transformers

# 1. Weights & config
git clone https://huggingface.co/EleutherAI/gpt-j-6b ./downloads/gptj-6b
pushd ./downloads/gptj-6b && \
  rm -f pytorch_model.bin && \
  wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/pytorch_model.bin && \
popd

# 2. Vocab and merge table
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
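
Before moving on to the build, it can help to confirm that the downloads actually landed where the commands above put them. The following is a minimal sketch, not part of XTRT-LLM: the helper name find_missing is hypothetical, and the zero-byte check is only a rough guard against an interrupted wget. It assumes the commands were run from the current directory, so pytorch_model.bin sits in ./downloads/gptj-6b while vocab.json and merges.txt sit in the working directory.

```python
from pathlib import Path


def find_missing(directory, names):
    """Return the entries of `names` that are absent or zero bytes in `directory`.

    Hypothetical helper for illustration only; it is not part of XTRT-LLM.
    """
    base = Path(directory)
    missing = []
    for name in names:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Locations follow the download commands above (an assumption about
    # where they were executed from).
    checks = {
        "./downloads/gptj-6b": ["pytorch_model.bin"],
        ".": ["vocab.json", "merges.txt"],
    }
    for directory, names in checks.items():
        for name in find_missing(directory, names):
            print(f"missing or empty: {directory}/{name}")
```

If the script prints nothing, every file named in the download step is present and non-empty.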

2. Build XTRT engine(s)

XTRT-LLM builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) with dummy weights.

Examples of build invocations:

# Build a float16 engine using HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. This also reduces build time.

python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --output_dir=./downloads/gptj-6b/trt_engines/fp16/1-XPU/ \
                 --model_dir=./downloads/gptj-6b 2>&1 | tee build.log

# Build a float16 engine using dummy weights, useful for performance tests.
# Enable several XTRT-LLM plugins to increase runtime performance. This also reduces build time.

python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --output_dir=./downloads/gptj-6b/trt_engines/gptj_engine_dummy_weights 2>&1 | tee build.log
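
The two invocations above differ only in --output_dir and in whether --model_dir is passed. A small sketch of how that choice could be scripted is shown below. The function build_command is a hypothetical wrapper, not part of XTRT-LLM; it uses only the flags that appear in the examples above and leaves out the `2>&1 | tee build.log` redirection.

```python
import shlex


def build_command(output_dir, model_dir=None, max_batch_size=32,
                  max_input_len=1919, max_output_len=128):
    """Assemble the build.py argument list used in the examples above.

    Hypothetical helper: when model_dir is None, --model_dir is omitted,
    which per this README makes build.py use dummy weights.
    """
    cmd = [
        "python3", "build.py",
        "--dtype=float16",
        "--log_level=verbose",
        "--enable_context_fmha",
        "--use_gpt_attention_plugin", "float16",
        "--use_gemm_plugin", "float16",
        f"--max_batch_size={max_batch_size}",
        f"--max_input_len={max_input_len}",
        f"--max_output_len={max_output_len}",
        f"--output_dir={output_dir}",
    ]
    if model_dir is not None:
        cmd.append(f"--model_dir={model_dir}")
    return cmd


if __name__ == "__main__":
    # Print the HF-weights variant as a copy-pasteable shell line.
    print(shlex.join(build_command(
        "./downloads/gptj-6b/trt_engines/fp16/1-XPU/",
        model_dir="./downloads/gptj-6b",
    )))
```

The returned list can be handed to subprocess.run(cmd, check=True) from inside examples/gptj, or printed with shlex.join as above and pasted into a shell.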

3. Run

To run an XTRT-LLM GPT-J model:

python3 run.py --max_output_len=50 \
    --engine_dir=./downloads/gptj-6b/trt_engines/fp16/1-XPU/ \
    --hf_model_location=./downloads/gptj-6b
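
The engine is built with fixed upper bounds (--max_batch_size, --max_input_len, --max_output_len), so a request at run time presumably has to stay within them. The sketch below is an illustration under that assumption, not a description of how run.py validates its inputs; the name fits_engine and the default limits (taken from the build examples in this document) are hypothetical.

```python
def fits_engine(batch_size, input_len, output_len,
                max_batch_size=32, max_input_len=1919, max_output_len=128):
    """Return True if a request stays within the build-time limits.

    Hypothetical pre-flight check. The defaults mirror the build examples
    in this README; the actual enforcement inside XTRT-LLM may differ.
    """
    return (
        1 <= batch_size <= max_batch_size
        and 1 <= input_len <= max_input_len
        and 1 <= output_len <= max_output_len
    )


if __name__ == "__main__":
    # The run example above asks for 50 output tokens, which is below the
    # 128-token limit used at build time. The input length of 16 is an
    # arbitrary illustrative value.
    print(fits_engine(batch_size=1, input_len=16, output_len=50))
```

A request that fails this check would call for rebuilding the engine with larger limits rather than changing the run.py arguments.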