# GPT-J

This document explains how to build the [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b) model using XTRT-LLM and run it on a single XPU.

## Overview

The XTRT-LLM GPT-J example code is located in [`examples/gptj`](./). There are several main files in that folder:

* [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-J model,
* [`run.py`](./run.py) to run inference on an input text.

## Support Matrix

* FP16

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# 1. Weights & config
git clone https://huggingface.co/EleutherAI/gpt-j-6b ./downloads/gptj-6b
pushd ./downloads/gptj-6b && \
  rm -f pytorch_model.bin && \
  wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/pytorch_model.bin && \
  popd

# 2. Vocab and merge table
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/vocab.json
wget https://huggingface.co/EleutherAI/gpt-j-6b/resolve/main/merges.txt
```
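
The download step above deletes `pytorch_model.bin` after cloning and fetches it again with `wget`. One plausible reason is that a clone made without Git LFS leaves only a small text pointer in place of the real weights. The following is a minimal sketch for checking which of the two is on disk; `is_lfs_pointer` and `check_weights` are hypothetical helpers written for this illustration, not part of XTRT-LLM.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of XTRT-LLM).

# Succeeds if the file looks like a Git LFS pointer stub: a tiny text file
# whose first line starts with "version https://git-lfs".
is_lfs_pointer() {
    local file="$1"
    head -c 200 -- "$file" 2>/dev/null | grep -q '^version https://git-lfs'
}

# Prints "missing", "lfs-pointer", or "ok" for the given weights file.
check_weights() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "missing"
    elif is_lfs_pointer "$file"; then
        echo "lfs-pointer"
    else
        echo "ok"
    fi
}

# When called with a path argument, report its status.
if [ "$#" -gt 0 ]; then
    check_weights "$1"
fi
```

For example, saving this as a script and passing `./downloads/gptj-6b/pytorch_model.bin` as the argument prints one of the three states. If it reports `lfs-pointer`, the `wget` step has not replaced the stub yet.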

### 2. Build XTRT engine(s)

XTRT-LLM builds XTRT engine(s) using an HF checkpoint. If no checkpoint directory is specified (that is, `--model_dir` is omitted), XTRT-LLM will build engine(s) using dummy weights.

Examples of build invocations:

```bash
# Build a float16 engine using HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.

python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --output_dir=./downloads/gptj-6b/trt_engines/fp16/1-XPU/ \
                 --model_dir=./downloads/gptj-6b 2>&1 | tee build.log

# Build a float16 engine using dummy weights, useful for performance tests.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.

python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --max_batch_size=32 \
                 --max_input_len=1919 \
                 --max_output_len=128 \
                 --output_dir=./downloads/gptj-6b/trt_engines/gptj_engine_dummy_weights 2>&1 | tee build.log
```
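
The build commands above set `--max_input_len=1919` and `--max_output_len=128`, which sum to 2047 tokens. The sketch below works through that arithmetic and checks a planned request against the build-time limits. It assumes the relevant ceiling is GPT-J's 2048-position context (`n_positions` in the HF `config.json`) and that requests beyond the build-time limits are not served by the engine; `fits_engine` is a hypothetical helper, not an XTRT-LLM function.

```bash
#!/usr/bin/env bash
# Limits copied from the build commands above.
MAX_INPUT_LEN=1919
MAX_OUTPUT_LEN=128
# Assumption: GPT-J-6B's context window, as given by n_positions in config.json.
GPTJ_N_POSITIONS=2048

# Hypothetical helper: succeeds if a request with the given prompt length and
# generation length stays within the limits the engine was built with.
fits_engine() {
    local input_len="$1" output_len="$2"
    [ "$input_len" -le "$MAX_INPUT_LEN" ] && [ "$output_len" -le "$MAX_OUTPUT_LEN" ]
}

echo "sequence budget: $((MAX_INPUT_LEN + MAX_OUTPUT_LEN)) of ${GPTJ_N_POSITIONS} positions"

# Example: a 1000-token prompt with 50 generated tokens.
if fits_engine 1000 50; then
    echo "request fits the engine limits"
else
    echo "request exceeds the engine limits" >&2
fi
```

Under these assumptions, raising either build flag would require lowering the other to stay within the model's context window.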

### 3. Run

To run an XTRT-LLM GPT-J model:

```bash
python3 run.py --max_output_len=50 \
               --engine_dir=./downloads/gptj-6b/trt_engines/fp16/1-XPU/ \
               --hf_model_location=./downloads/gptj-6b
```
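
The run command above expects `--engine_dir` to point at the output of an earlier build. A minimal pre-flight sketch follows, assuming only that a successful build leaves at least one file in that directory. The exact file names are not stated here, so the check does not look for any. `engine_ready` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the directory exists and is not empty.
engine_ready() {
    local dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A -- "$dir" 2>/dev/null)" ]
}

ENGINE_DIR=./downloads/gptj-6b/trt_engines/fp16/1-XPU/

if engine_ready "$ENGINE_DIR"; then
    echo "engine directory looks populated: $ENGINE_DIR"
else
    echo "nothing found in $ENGINE_DIR; run build.py first" >&2
fi
```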