# GPT-NeoX

This document describes how to build and run the GPT-NeoX model with Kunlunxin XTRT-LLM on a single node with multiple XPUs.

## Overview

The XTRT-LLM GPT-NeoX example code lives in `examples/gptneox`. The main files in that folder are:

* `build.py` builds the XTRT engines needed to run the GPT-NeoX model
* `run.py` runs inference on an input prompt

## Support Matrix

* FP16
* INT8 Weight-Only
* Tensor Parallel

## Usage

### 1. Download the HF Transformers weights from HuggingFace

```bash
# Weights & config
sh get_weights.sh
```
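Before building, it can help to confirm the download landed where the build step expects it. Below is a minimal sanity check; the directory path matches the `--model_dir` used in the build commands, but what `get_weights.sh` actually downloads (file names, layout) is an assumption here.

```shell
# Sketch of a pre-build sanity check (the checkpoint layout is an assumption):
# confirm the checkpoint directory exists and is non-empty before building.
MODEL_DIR=${MODEL_DIR:-./downloads/gptneox_model}
if [ -d "$MODEL_DIR" ] && [ -n "$(ls -A "$MODEL_DIR" 2>/dev/null)" ]; then
    echo "checkpoint directory looks ready: $MODEL_DIR"
else
    echo "checkpoint directory missing or empty: $MODEL_DIR" >&2
fi
```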

### 2. Build the XTRT engines

XTRT-LLM builds XTRT engines from the HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engines with dummy weights instead.

Example build invocations:

```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance; they also reduce build time.
python3 build.py --dtype=float16                    \
                 --log_level=verbose                \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16          \
                 --use_layernorm_plugin float16     \
                 --max_batch_size=16                \
                 --max_input_len=1024               \
                 --max_output_len=1024              \
                 --world_size=2                     \
                 --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/    \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log

# Build an engine using 2-way tensor parallelism and HF weights, applying INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance; they also reduce build time.
python3 build.py --dtype=float16                    \
                 --log_level=verbose                \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16          \
                 --use_layernorm_plugin float16     \
                 --max_batch_size=16                \
                 --max_input_len=1024               \
                 --max_output_len=1024              \
                 --world_size=2                     \
                 --use_weight_only                  \
                 --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/    \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```
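As noted above, omitting the checkpoint directory makes XTRT-LLM build with dummy weights, which can be useful for checking the build pipeline before the real weights are downloaded. A sketch (the FP16 invocation with `--model_dir` dropped; the `fp16-dummy` output directory name is an assumption, not from the original):

```shell
# Dummy-weight build (sketch): with no --model_dir, XTRT-LLM uses dummy
# weights, per the note above. All other flags mirror the FP16 example.
python3 build.py --dtype=float16                    \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16          \
                 --use_layernorm_plugin float16     \
                 --max_batch_size=16                \
                 --max_input_len=1024               \
                 --max_output_len=1024              \
                 --world_size=2                     \
                 --output_dir=./downloads/gptneox_model/trt_engines/fp16-dummy/2-XPU/
```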

### 3. Run

Before running the examples, make sure the following environment variables are set:

```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPytorch XPU memory caching
export XMLIR_D_XPU_L3_SIZE=0           # disable XPytorch L3 cache usage
```

If you are not using a Kunlunxin R480-X8 product, also set the following environment variable:

```bash
export BKCL_PCIE_RING=1
```
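An optional sketch (not part of the original workflow) to confirm these variables are in effect in the current shell before launching `mpirun`:

```shell
# Print each required variable, flagging any that is unset.
# BKCL_PCIE_RING is only needed when not on an R480-X8, per the note above.
for v in PYTORCH_NO_XPU_MEMORY_CACHING XMLIR_D_XPU_L3_SIZE BKCL_PCIE_RING; do
    echo "$v=${!v:-<unset>}"
done
```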

To run the XTRT-LLM GPT-NeoX model with the engines generated by `build.py`, execute:

```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/  \
    --tokenizer_dir=./downloads/gptneox_model

# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/  \
    --tokenizer_dir=./downloads/gptneox_model
```