# GPT-NeoX

This document describes how to build and run the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model with Kunlunxin XTRT-LLM on a single node with multiple XPUs.

## Overview

The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). The folder contains the following main files:

* [`build.py`](./build.py) builds the XTRT engine(s) needed to run the GPT-NeoX model.
* [`run.py`](./run.py) runs inference on an input text.

## Support Matrix

* FP16
* INT8 Weight-Only
* Tensor Parallel

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# Weights & config
sh get_weights.sh
```

### 2. Build the XTRT engine

XTRT-LLM builds the XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) with dummy weights.

Example build invocations:

```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log

# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --use_weight_only \
    --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```
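
The dummy-weights path mentioned above can be sketched as follows. This is a minimal sketch, assuming that omitting `--model_dir` is how the checkpoint directory is left unspecified and that the remaining flags behave as in the FP16 command. The `build_dummy_engine` function, the `PYTHON` override, and the `dummy/2-XPU` output directory are hypothetical names introduced for illustration.

```bash
# Hypothetical wrapper: same flags as the FP16 build, but without --model_dir,
# so the engine is expected to be built with dummy weights (assumption based on
# the note in this section). The output directory name "dummy" is made up.
build_dummy_engine() {
    # PYTHON is a hypothetical override; it defaults to python3.
    "${PYTHON:-python3}" build.py --dtype=float16 \
        --log_level=verbose \
        --use_gpt_attention_plugin float16 \
        --use_gemm_plugin float16 \
        --use_layernorm_plugin float16 \
        --max_batch_size=16 \
        --max_input_len=1024 \
        --max_output_len=1024 \
        --world_size=2 \
        --output_dir=./downloads/gptneox_model/trt_engines/dummy/2-XPU/
}
```

Calling `build_dummy_engine` runs the build, while `PYTHON=echo build_dummy_engine` only prints the arguments so they can be reviewed first.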

### 3. Run

Before running the examples, make sure the following environment variables are set:

```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # Disable XPytorch use of L3.
```

If you are not using a Kunlunxin R480-X8 product, also set the following environment variable:

```bash
export BKCL_PCIE_RING=1
```
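
Because a missing variable is easy to overlook, a small pre-flight check can help. The sketch below is not part of XTRT-LLM: `check_env_set` is a hypothetical helper that only reports whether the named variables are exported, without validating their values.

```bash
# Hypothetical helper: succeed only if every named variable is exported.
check_env_set() {
    missing=0
    for var in "$@"; do
        # printenv exits non-zero when the variable is not in the environment.
        if ! printenv "$var" >/dev/null; then
            echo "environment variable $var is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Variables named in this section; add BKCL_PCIE_RING when it applies.
check_env_set PYTORCH_NO_XPU_MEMORY_CACHING XMLIR_D_XPU_L3_SIZE \
    || echo "set the variables listed above before running run.py" >&2
```

The helper returns a non-zero status when any variable is missing, so it can also guard a launch script with `check_env_set VAR1 VAR2 || exit 1`.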

To run the XTRT-LLM GPT-NeoX model with the engines generated by `build.py`, do the following:

```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
    --tokenizer_dir=./downloads/gptneox_model

# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
    --tokenizer_dir=./downloads/gptneox_model
```
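
The commands above pair `mpirun -n 2` with engines built using `--world_size=2`. To keep the two in step, the rank count can be derived from the engine directory name. This is a minimal sketch, assuming the `<N>-XPU` directory naming used in this document's examples. `ranks_from_engine_dir` is a hypothetical helper, not part of XTRT-LLM.

```bash
# Hypothetical helper: print N for an engine directory named "<N>-XPU".
ranks_from_engine_dir() {
    dir=${1%/}          # drop a trailing slash, if any
    name=${dir##*/}     # last path component, e.g. "2-XPU"
    case "$name" in
        *-XPU) n=${name%-XPU} ;;
        *)     n= ;;
    esac
    case "$n" in
        ''|*[!0-9]*)
            echo "cannot infer rank count from '$1'" >&2
            return 1
            ;;
        *)
            echo "$n"
            ;;
    esac
}
```

The result can be passed to `mpirun -n` in place of the hard-coded `2`, for example `mpirun -n "$(ranks_from_engine_dir "$engine_dir")"` with `engine_dir` set to the same path given to `--engine_dir`.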