# GPT-NeoX

This document explains how to build and run the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model with Kunlunxin XTRT-LLM on a single node with multiple XPUs.

## Overview

The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). The folder contains the following main files:

* [`build.py`](./build.py) builds the XTRT engine(s) needed to run the GPT-NeoX model.
* [`run.py`](./run.py) runs inference on input text.

## Support Matrix

* FP16
* INT8 Weight-Only
* Tensor Parallel

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# Weights & config
sh get_weights.sh
```

### 2. Build the XTRT engine

XTRT-LLM builds XTRT engines from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine with dummy weights.

Example build invocations:

```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_layernorm_plugin float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --world_size=2 \
                 --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log

# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
                 --log_level=verbose \
                 --use_gpt_attention_plugin float16 \
                 --use_gemm_plugin float16 \
                 --use_layernorm_plugin float16 \
                 --max_batch_size=16 \
                 --max_input_len=1024 \
                 --max_output_len=1024 \
                 --world_size=2 \
                 --use_weight_only \
                 --output_dir=./downloads/gptneox_model/trt_engines/int8/2-XPU/ \
                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```

### 3. Run

Before running the examples, make sure the following environment variables are set:

```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPytorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # disable XPytorch use of L3.
```

If you are not using the Kunlunxin R480-X8 product, also set the following environment variable:

```bash
export BKCL_PCIE_RING=1
```

To run the XTRT-LLM GPT-NeoX model with the engines generated by `build.py`, do the following:

```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
    --tokenizer_dir=./downloads/gptneox_model

# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
    --max_output_len=50 \
    --engine_dir=./downloads/gptneox_model/trt_engines/int8/2-XPU/ \
    --tokenizer_dir=./downloads/gptneox_model
```
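
In the examples above, the process count passed to `mpirun -n` matches the `--world_size` used at build time, and the engine directory encodes both the precision and the XPU count. The following is a minimal sketch of a wrapper that keeps those values in sync. `engine_dir` and `run_cmd` are hypothetical helper names, not part of XTRT-LLM, and the directory layout is assumed to be the one used in the build commands in this document.

```bash
#!/usr/bin/env bash
# Sketch only: engine_dir and run_cmd are hypothetical helpers, not part of XTRT-LLM.

# Print the engine directory for a precision label (e.g. fp16) and a tensor-parallel size.
# Assumes the ./downloads/gptneox_model/trt_engines/<precision>/<N>-XPU/ layout
# used by the build examples in this document.
engine_dir() {
    local precision="$1" world_size="$2"
    printf './downloads/gptneox_model/trt_engines/%s/%s-XPU/\n' "$precision" "$world_size"
}

# Print (do not execute) the mpirun command line for that engine,
# so it can be reviewed before it is run.
run_cmd() {
    local precision="$1" world_size="$2"
    printf 'mpirun -n %s --allow-run-as-root python3 run.py --max_output_len=50 --engine_dir=%s --tokenizer_dir=./downloads/gptneox_model\n' \
        "$world_size" "$(engine_dir "$precision" "$world_size")"
}
```

After sourcing the snippet, `run_cmd fp16 2` prints the FP16 command shown earlier on a single line, and `eval "$(run_cmd fp16 2)"` would execute it. The precision argument is simply the directory name chosen at build time.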