add pkgs

2025-08-06 15:49:14 +08:00
parent e80b916c52
commit bf00e72fb2
111 changed files with 21880 additions and 1 deletions
--- a/examples/gptneox/README.md
+++ b/examples/gptneox/README.md
@@ -0,0 +1,93 @@
+# GPT-NeoX
+
+This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using XTRT-LLM and run it on a single node with multiple XPUs.
+
+## Overview
+
+The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). There are two main files in that folder:
+
+ * [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-NeoX model,
+ * [`run.py`](./run.py) to run inference on an input text.
+
+
+## Support Matrix
+  * FP16
+  * INT8 Weight-Only
+  * Tensor Parallel
+
+## Usage
+
+### 1. Download weights from HuggingFace (HF) Transformers
+
+```bash
+# Weights & config
+sh get_weights.sh
+```
+
+### 2. Build XTRT engine(s)
+
+XTRT-LLM builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified with `--model_dir`, XTRT-LLM will build the engine(s) using dummy weights.
+
+Examples of build invocations:
+
+```bash
+# Build a float16 engine using 2-way tensor parallelism and HF weights.
+# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
+python3 build.py --dtype=float16                    \
+                 --log_level=verbose                \
+                 --use_gpt_attention_plugin float16 \
+                 --use_gemm_plugin float16          \
+                 --use_layernorm_plugin float16     \
+                 --max_batch_size=16                \
+                 --max_input_len=1024               \
+                 --max_output_len=1024              \
+                 --world_size=2                     \
+                 --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/    \
+                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
+
+# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
+# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
+python3 build.py --dtype=float16                    \
+                 --log_level=verbose                \
+                 --use_gpt_attention_plugin float16 \
+                 --use_gemm_plugin float16          \
+                 --use_layernorm_plugin float16     \
+                 --max_batch_size=16                \
+                 --max_input_len=1024               \
+                 --max_output_len=1024              \
+                 --world_size=2                     \
+                 --use_weight_only                  \
+                 --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/    \
+                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2_int8.log
+```
+
+### 3. Run
+
+Before running the examples, make sure to set the following environment variables:
+```bash
+export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPU memory caching in XPytorch.
+export XMLIR_D_XPU_L3_SIZE=0           # Prevent XPytorch from using L3.
+```
+
+If you are NOT using an R480-X8, also set the following environment variable:
+```bash
+export BKCL_PCIE_RING=1
+```
+
+To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`:
+
+```bash
+# For 2-way tensor parallelism, FP16
+mpirun -n 2 --allow-run-as-root \
+    python3 run.py \
+    --max_output_len=50 \
+    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/  \
+    --tokenizer_dir=./downloads/gptneox_model
+
+# For 2-way tensor parallelism, INT8 weight-only
+mpirun -n 2 --allow-run-as-root \
+    python3 run.py \
+    --max_output_len=50 \
+    --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/  \
+    --tokenizer_dir=./downloads/gptneox_model
+```
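
The build and run commands in this README repeat the same two values in several places: the precision directory (`fp16` or `in8`) and the parallelism degree, which appears as `--world_size=2`, as the `2-XPU` directory name, and as `mpirun -n 2`. A minimal sketch of keeping them in sync is shown below. The helper names `engine_dir` and `run_cmd` are hypothetical and not part of XTRT-LLM; the script only prints the command line and does not launch anything.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of XTRT-LLM): derive the engine path and the
# run command from one precision/world-size pair, following the README layout.

# engine_dir PRECISION WORLD_SIZE
# Prints the engine directory used by the README's build and run examples.
engine_dir() {
    local precision="$1" world_size="$2"
    printf './downloads/gptneox_model/trt_engines/%s/%s-XPU/\n' \
        "$precision" "$world_size"
}

# run_cmd PRECISION WORLD_SIZE
# Prints (does not execute) the mpirun command line for that engine.
# The process count passed to mpirun is the same value as WORLD_SIZE.
run_cmd() {
    local precision="$1" world_size="$2"
    printf 'mpirun -n %s --allow-run-as-root python3 run.py --max_output_len=50 --engine_dir=%s --tokenizer_dir=./downloads/gptneox_model\n' \
        "$world_size" "$(engine_dir "$precision" "$world_size")"
}

# Dry run: show the two invocations from the README.
run_cmd fp16 2
run_cmd in8 2
```

To actually launch a run, the printed line can be reviewed and then pasted into the shell once the environment variables from the Run step are set.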