# GPT-NeoX

This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using XTRT-LLM and how to run it on a single node with multiple XPUs.

## Overview

The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). There are several main files in that folder:

* [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-NeoX model,
* [`run.py`](./run.py) to run inference on an input text.
## Support Matrix

* FP16
* INT8 Weight-Only
* Tensor Parallel

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# Weights & config
sh get_weights.sh
```
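If `get_weights.sh` is not usable in your environment, the same checkpoint can be fetched directly from the Hugging Face Hub. A minimal sketch, assuming `git-lfs` is installed and targeting the `./downloads/gptneox_model` path that the build commands below expect (note that the checkpoint is tens of gigabytes):

```bash
# Hypothetical alternative to get_weights.sh: clone the HF checkpoint directly.
# Assumes git-lfs is installed; the target path matches the build commands below.
git lfs install
git clone https://huggingface.co/EleutherAI/gpt-neox-20b ./downloads/gptneox_model
```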
### 2. Build XTRT engine(s)

XTRT-LLM builds XTRT engine(s) using an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM will build the engine(s) using dummy weights.
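For instance, to exercise the build flow without downloading the checkpoint first, `--model_dir` can simply be left out; a minimal sketch, where the output directory name is an assumption for illustration:

```bash
# Hypothetical dummy-weights build: with no --model_dir, XTRT-LLM builds the
# engine from dummy weights (useful for checking the build flow and timing).
python3 build.py --dtype=float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --output_dir=./gptneox_dummy_engine/
```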
Examples of build invocations:
```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enabling several XTRT-LLM plugins improves runtime performance; it also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_layernorm_plugin float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --world_size=2 \
                 --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log

# Build an engine using 2-way tensor parallelism and HF weights, applying INT8 weight-only quantization.
# Enabling several XTRT-LLM plugins improves runtime performance; it also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_layernorm_plugin float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --world_size=2 \
                 --use_weight_only \
                 --output_dir=./downloads/gptneox_model/trt_engines/int8/2-XPU/ \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```
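After a build completes, it is worth sanity-checking that engine files were actually written to the chosen `--output_dir` (exact file names vary with the XTRT-LLM version):

```bash
# Sanity check: the output directories should now contain the serialized
# engine file(s) plus the build configuration.
ls -lh ./downloads/gptneox_model/trt_engines/fp16/2-XPU/
ls -lh ./downloads/gptneox_model/trt_engines/int8/2-XPU/
```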
### 3. Run

Before running the examples, make sure to set the following environment variables:
```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0           # Disable XPytorch use of L3.
```
If NOT using R480-X8, make sure to set this environment variable as well:

```bash
export BKCL_PCIE_RING=1
```
To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`:
```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
            --max_output_len=50 \
            --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
            --tokenizer_dir=./downloads/gptneox_model

# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
            --max_output_len=50 \
            --engine_dir=./downloads/gptneox_model/trt_engines/int8/2-XPU/ \
            --tokenizer_dir=./downloads/gptneox_model
```
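For comparison, an engine built with `--world_size=1` should not need `mpirun`; a minimal sketch, where the `fp16/1-XPU` engine directory is an assumed name for such a single-XPU build:

```bash
# Hypothetical single-XPU run (assumes an engine built with --world_size=1
# and a matching --output_dir).
python3 run.py \
        --max_output_len=50 \
        --engine_dir=./downloads/gptneox_model/trt_engines/fp16/1-XPU/ \
        --tokenizer_dir=./downloads/gptneox_model
```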