# GPT-J

This document explains how to build the [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) model using XTRT-LLM and run it on a single XPU.

## Overview

The XTRT-LLM GPT-J example code is located in [`examples/gptj`](./). There are two main files in that folder:

 * [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-J model,
 * [`run.py`](./run.py) to run inference on an input text.

## Support Matrix

 * FP16

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# 1. Weights & config
git clone https://huggingface.co/EleutherAI/gpt-j-6b ./downloads/gptj-6b
pushd ./downloads/gptj-6b && \
  rm -f pytorch_model.bin && \
  wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/pytorch_model.bin && \
popd

# 2. Vocab and merge table
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
```

### 2. Build XTRT engine(s)

XTRT-LLM builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) using dummy weights.

Examples of build invocations:

```bash
# Build a float16 engine using HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance;
# this can also reduce build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --output_dir=./downloads/gptj-6b/trt_engines/fp16/1-XPU/ \
                 --model_dir=./downloads/gptj-6b 2>&1 | tee build.log

# Build a float16 engine using dummy weights, useful for performance tests.
# Enable several XTRT-LLM plugins to increase runtime performance;
# this can also reduce build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --output_dir=./downloads/gptj-6b/trt_engines/gptj_engine_dummy_weights 2>&1 | tee build.log
```

### 3. Run

To run an XTRT-LLM GPT-J model:

```bash
python3 run.py --max_output_len=50 \
               --engine_dir=./downloads/gptj-6b/trt_engines/fp16/1-XPU/ \
               --hf_model_location=./downloads/gptj-6b
```
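
The dummy-weights engine from step 2 can be exercised with the same runner. The generated text is meaningless, but the run should be representative for the performance tests that dummy weights are intended for. A minimal sketch, reusing the flags and paths shown above:

```bash
# Point run.py at the dummy-weights engine built in step 2.
# Output text is not meaningful; only timing/throughput is of interest here.
python3 run.py --max_output_len=50 \
               --engine_dir=./downloads/gptj-6b/trt_engines/gptj_engine_dummy_weights \
               --hf_model_location=./downloads/gptj-6b
```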
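
Note that the engines above were built with `--max_input_len=1919`, so a prompt longer than that token budget exceeds what the engine was built to handle. A quick way to count prompt tokens with the HF tokenizer before running (this assumes the `transformers` Python package is installed and the checkpoint path from step 1):

```bash
# Tokenize a sample prompt with the downloaded GPT-J tokenizer and
# print its length in tokens; compare against --max_input_len=1919.
python3 -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('./downloads/gptj-6b')
prompt = 'Once upon a time'
print(len(tok.encode(prompt)))
"
```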
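
If the build or run steps fail, one quick way to rule out a bad download is to check that the checkpoint from step 1 loads at all. A sketch, again assuming the `transformers` package is available:

```bash
# Load only the model config (cheap; does not load the 6B weights) and
# confirm it looks like GPT-J-6B: 28 layers, hidden size 4096.
python3 -c "
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained('./downloads/gptj-6b')
print(cfg.model_type, cfg.n_layer, cfg.n_embd)  # expect: gptj 28 4096
"
```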