# GPT-NeoX
This document describes how to build and run the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model on a single node with multiple XPUs using Kunlunxin XTRT-LLM.
## Overview
The XTRT-LLM GPT-NeoX example code lives in [`examples/gptneox`](./). The main files in this folder are:
* [`build.py`](./build.py) builds the XTRT engines needed to run the GPT-NeoX model
* [`run.py`](./run.py) runs inference on input text
## Support Matrix
* FP16
* INT8 Weight-Only
* Tensor Parallel
## Usage
### 1. Download weights from Hugging Face
```bash
# Weights & config
sh get_weights.sh
```
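The build step in the next section reads the checkpoint from `./downloads/gptneox_model` (the `--model_dir` passed to `build.py` below), so it can help to confirm the download actually landed there first. This is a hedged sketch, not part of the official scripts; the helper name and the reliance on `config.json` (standard in HF checkpoints) are assumptions:

```bash
# Hypothetical sanity check (not part of XTRT-LLM): verify that the
# downloaded checkpoint directory contains a config.json before building.
check_weights() {
  local dir="$1"
  if [ -f "$dir/config.json" ]; then
    echo "weights look ready in $dir"
  else
    echo "missing $dir/config.json; re-run get_weights.sh" >&2
    return 1
  fi
}

# Example: check_weights ./downloads/gptneox_model
```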
### 2. Build the XTRT engines
XTRT-LLM builds XTRT engines from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engines with dummy weights instead.
Example build invocations:
```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
# Build an engine using 2-way tensor parallelism and HF weights, applying INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --use_weight_only \
    --output_dir=./downloads/gptneox_model/trt_engines/int8/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2_int8.log
```
### 3. Run
Before running the example, make sure the following environment variables are set:
```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPytorch's XPU memory caching.
export XMLIR_D_XPU_L3_SIZE=0 # disable XPytorch's use of L3.
```
If you are not using a Kunlunxin R480-X8 product, make sure to set the following environment variable:
```bash
export BKCL_PCIE_RING=1
```
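Since a missing environment variable only shows up later as a runtime problem, a small pre-flight check can fail fast instead. This helper is a sketch of ours, not part of XTRT-LLM; it treats an empty value the same as an unset one:

```bash
# Hypothetical pre-flight helper (not part of XTRT-LLM): fail fast if any
# required environment variable is unset or empty before launching mpirun.
require_env() {
  local name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    if [ -z "${!name:-}" ]; then
      echo "ERROR: environment variable $name is not set" >&2
      return 1
    fi
  done
}

# Example: require_env PYTORCH_NO_XPU_MEMORY_CACHING XMLIR_D_XPU_L3_SIZE
```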
To run the XTRT-LLM GPT-NeoX model with the engines generated by `build.py`, execute the following:
```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
    --tokenizer_dir=./downloads/gptneox_model
# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/int8/2-XPU/ \
    --tokenizer_dir=./downloads/gptneox_model
```
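The two launch commands above differ only in the engine directory, so they can be folded into one parameterized wrapper. This is a hypothetical convenience, not part of XTRT-LLM, and it assumes the engine directory layout used in the build step (with the INT8 engines under an `int8` directory); `DRY_RUN=1` prints the command instead of executing it:

```bash
# Hypothetical wrapper (not part of XTRT-LLM): launch run.py for a given
# precision tag. Set DRY_RUN=1 to print the command instead of running it.
run_engine() {
  local precision="$1"  # fp16 or int8, matching the build output dirs above
  local cmd=(mpirun -n 2 --allow-run-as-root
    python3 run.py
    --max_output_len=50
    --engine_dir="./downloads/gptneox_model/trt_engines/${precision}/2-XPU/"
    --tokenizer_dir=./downloads/gptneox_model)
  if [ -n "${DRY_RUN:-}" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Example: run_engine fp16
#          DRY_RUN=1 run_engine int8
```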