add pkgs

2025-08-06 15:49:14 +08:00
parent e80b916c52
commit bf00e72fb2
111 changed files with 21880 additions and 1 deletions
--- a/examples/baichuan/README_CN.md
+++ b/examples/baichuan/README_CN.md
@@ -0,0 +1,127 @@
+# Baichuan
+
+This document describes how to build and run Baichuan models (`v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`) with Kunlunxin XTRT-LLM on a single XPU and on a single node with multiple XPUs.
+
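+## Overview
+
+The XTRT-LLM Baichuan example code is located in [`examples/baichuan`](./). The folder contains the following main files:
+
+ * [`build.py`](./build.py) builds the XTRT engine(s) needed to run the Baichuan model.
+ * [`run.py`](./run.py) runs inference on the input text.
+
+Both scripts accept a `model_version` argument whose value must be one of `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`; the default is `v1_13b`.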
+
+## Support Matrix
+
+  * FP16
+  * INT8 Weight-Only
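For readers unfamiliar with the term, the sketch below illustrates what "weight-only" INT8 quantization generally means: only the weights are stored as 8-bit integers plus a floating-point scale, and they are converted back before the matrix multiply. This is a generic, pure-Python illustration of one common scheme (symmetric, one scale per weight row). The README does not say which scheme XTRT-LLM uses, so the function names and the per-row choice are assumptions made for illustration only.

```python
def quantize_weight_only_int8(weight):
    """Symmetric per-row INT8 quantization of a 2-D weight (list of rows).

    Hypothetical helper for illustration; not part of XTRT-LLM.
    Returns (quantized_rows, scales) where each value lies in [-127, 127].
    """
    quantized, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 so that we never divide by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate floating-point weights from INT8 values and scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]]
    q, s = quantize_weight_only_int8(w)
    print(q, s)
    print(dequantize(q, s))
```

The reconstruction error per element is bounded by half a quantization step (`scale / 2`), which is why accuracy can degrade when a row contains a few very large weights.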
+
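+## Usage
+
+The XTRT-LLM Baichuan example code is located in [`examples/baichuan`](./). It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
+
+### Build XTRT engine(s)
+
+You need to specify the path to the HF Baichuan checkpoint. For `v1_13b`, use [./downloads/baichuan-13b](./downloads/baichuan-13b) or [baichuan-inc/Baichuan-13B-Base](https://huggingface.co/baichuan-inc/Baichuan-13B-Base). For `v2_13b`, use [baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) or [baichuan-inc/Baichuan2-13B-Base](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base). More Baichuan models are available at [baichuan-inc](https://huggingface.co/baichuan-inc).
+
+XTRT-LLM Baichuan builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) with dummy weights.
+
+`build.py` normally needs only one XPU. If all the XPUs required for inference are already available, you can add the `--parallel_build` argument to build the engines in parallel and speed up the build. Note that parallel build currently supports only a single node.
+
+Here are some examples that use `v1_13b` (`v1_7b`, `v2_7b`, and `v2_13b` are supported as well):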
+
+```bash
+# Build the Baichuan V1 13B model using a single XPU and FP16.
+python build.py --model_version v1_13b \
+                --model_dir ./downloads/baichuan-13b \
+                --dtype float16 \
+                --use_gemm_plugin float16 \
+                --use_gpt_attention_plugin float16 \
+                --output_dir ./downloads/baichuan-13b/fp16/tp1
+
+# Build the Baichuan V1 13B model using a single XPU and apply INT8 weight-only quantization.
+python build.py --model_version v1_13b \
+                --model_dir ./downloads/baichuan-13b \
+                --dtype float16 \
+                --use_gemm_plugin float16 \
+                --use_gpt_attention_plugin float16 \
+                --use_weight_only \
+                --output_dir ./downloads/baichuan-13b/int8/tp1
+
+# Build Baichuan V1 13B using 2-way tensor parallelism and FP16.
+python build.py --model_version v1_13b \
+                --model_dir ./downloads/baichuan-13b \
+                --dtype float16 \
+                --use_gemm_plugin float16 \
+                --use_gpt_attention_plugin float16 \
+                --output_dir ./downloads/baichuan-13b/fp16/tp2 \
+                --parallel_build \
+                --world_size 2
+
+# Build Baichuan V1 13B using 2-way tensor parallelism and apply INT8 weight-only quantization.
+python build.py --model_version v1_13b \
+                --model_dir ./downloads/baichuan-13b \
+                --dtype float16 \
+                --use_gemm_plugin float16 \
+                --use_gpt_attention_plugin float16 \
+                --use_weight_only \
+                --output_dir ./downloads/baichuan-13b/int8/tp2 \
+                --parallel_build \
+                --world_size 2
+
+```
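The four build commands above differ only in the quantization flag, the output directory, and the tensor-parallel options. As a minimal sketch, the hypothetical Python helper below assembles the same argument lists. It uses only flags that appear in the commands above; the function name, its parameters, and the `<model_dir>/<precision>/tp<N>` output layout (copied from the examples) are assumptions for illustration and are not part of XTRT-LLM.

```python
import shlex


def build_command(model_version="v1_13b",
                  model_dir="./downloads/baichuan-13b",
                  precision="fp16",
                  world_size=1,
                  parallel_build=True):
    """Return the build.py argument list for one of the README configurations.

    Hypothetical helper: it mirrors the example commands and adds no new flags.
    """
    if precision not in ("fp16", "int8"):
        raise ValueError("precision must be 'fp16' or 'int8'")
    if world_size < 1:
        raise ValueError("world_size must be at least 1")

    cmd = [
        "python", "build.py",
        "--model_version", model_version,
        "--model_dir", model_dir,
        "--dtype", "float16",
        "--use_gemm_plugin", "float16",
        "--use_gpt_attention_plugin", "float16",
    ]
    if precision == "int8":
        # INT8 weight-only quantization is enabled by a single switch.
        cmd.append("--use_weight_only")
    # Output layout follows the examples: <model_dir>/<precision>/tp<world_size>.
    cmd += ["--output_dir", f"{model_dir}/{precision}/tp{world_size}"]
    if world_size > 1:
        # --parallel_build is optional; it assumes all XPUs are available now.
        if parallel_build:
            cmd.append("--parallel_build")
        cmd += ["--world_size", str(world_size)]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_command(precision="int8", world_size=2)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the `examples/baichuan` directory; printing it with `shlex.join` first makes it easy to compare against the commands above.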
+
+
+
+### Run
+
+Before running the examples, make sure the following environment variables are set:
+
+```bash
+export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch caching of XPU memory.
+export XMLIR_D_XPU_L3_SIZE=0           # Disable XPytorch use of L3.
+```
+
+If you run on multiple XPUs without L3 space, you can set `BKCL_CCIX_BUFFER_GM=1` to disable L3.
+
+Run the XTRT-LLM Baichuan model with the engines generated by `build.py`:
+
+```bash
+# With fp16 inference
+python run.py --model_version v1_13b \
+              --max_output_len=50 \
+              --tokenizer_dir ./downloads/baichuan-13b \
+              --log_level=info \
+              --engine_dir=./downloads/baichuan-13b/fp16/tp1
+
+# With INT8 weight-only quantization inference
+python run.py --model_version v1_13b \
+              --max_output_len=50 \
+              --tokenizer_dir=./downloads/baichuan-13b \
+              --log_level=info \
+              --engine_dir=./downloads/baichuan-13b/int8/tp1
+
+# with fp16 and 2-way tensor parallelism inference
+mpirun -n 2 --allow-run-as-root \
+    python run.py --model_version v1_13b \
+                  --max_output_len=50 \
+                  --tokenizer_dir=./downloads/baichuan-13b \
+                  --log_level=info \
+                  --engine_dir=./downloads/baichuan-13b/fp16/tp2
+
+# with INT8 weight-only and 2-way tensor parallelism inference
+mpirun -n 2 --allow-run-as-root \
+    python run.py --model_version v1_13b \
+                  --max_output_len=50 \
+                  --tokenizer_dir=./downloads/baichuan-13b \
+                  --log_level=info \
+                  --engine_dir=./downloads/baichuan-13b/int8/tp2
+
+```
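The run commands above follow one pattern: the same `run.py` arguments, wrapped in `mpirun -n <N>` when the engine was built for more than one XPU, with the two environment variables from this section set beforehand. The sketch below captures that pattern in Python. The helper names and parameters are hypothetical; the flags, the `mpirun` options, and the environment variable values are taken verbatim from the README.

```python
import os
import shlex


def run_command(engine_dir,
                tokenizer_dir="./downloads/baichuan-13b",
                model_version="v1_13b",
                max_output_len=50,
                world_size=1):
    """Return the argument list that launches run.py (hypothetical helper)."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    cmd = [
        "python", "run.py",
        "--model_version", model_version,
        f"--max_output_len={max_output_len}",
        f"--tokenizer_dir={tokenizer_dir}",
        "--log_level=info",
        f"--engine_dir={engine_dir}",
    ]
    if world_size > 1:
        # One MPI rank per XPU, exactly as in the tensor-parallel examples.
        cmd = ["mpirun", "-n", str(world_size), "--allow-run-as-root"] + cmd
    return cmd


def run_environment(base_env=None):
    """Return a copy of the environment with the README's variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_NO_XPU_MEMORY_CACHING"] = "0"
    env["XMLIR_D_XPU_L3_SIZE"] = "0"
    return env


if __name__ == "__main__":
    cmd = run_command("./downloads/baichuan-13b/fp16/tp2", world_size=2)
    print(shlex.join(cmd))
    # To launch for real, from examples/baichuan:
    #   import subprocess
    #   subprocess.run(cmd, env=run_environment(), check=True)
```

The `world_size` passed here should match the value used when the engine in `engine_dir` was built.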
+
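+### Known Issues
+
+- The implementation of the Baichuan-7B model with INT8 weight-only quantization and tensor parallelism greater than 2 may have accuracy issues. This issue is under investigation.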
+
+
+