This commit is contained in:
2025-08-06 15:49:14 +08:00
parent e80b916c52
commit bf00e72fb2
111 changed files with 21880 additions and 1 deletions

146
examples/baichuan/README.md Normal file

@@ -0,0 +1,146 @@
# Baichuan
This document shows how to build and run Baichuan models (including `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`) in XTRT-LLM on a single XPU and on a single node with multiple XPUs.
## Overview
The XTRT-LLM Baichuan example code is located in [`examples/baichuan`](./). There are several main files in that folder:
* [`build.py`](./build.py) to build the XTRT engine(s) needed to run the Baichuan model,
* [`run.py`](./run.py) to run inference on an input text.
These scripts accept an argument named `model_version`, whose value must be one of `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`; the default is `v1_13b`.
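The option can be sketched with `argparse` as follows. The choices and default are the ones declared in `build.py` and `run.py`; the `make_parser` wrapper and the `MODEL_VERSIONS` constant are hypothetical names used only for this illustration.

```python
import argparse

# The four supported values, as declared in build.py and run.py.
MODEL_VERSIONS = ["v1_7b", "v1_13b", "v2_7b", "v2_13b"]


def make_parser() -> argparse.ArgumentParser:
    """Build a parser that only knows about --model_version."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_version",
                        type=str,
                        default="v1_13b",
                        choices=MODEL_VERSIONS)
    return parser


if __name__ == "__main__":
    parser = make_parser()
    # No flag given: the default applies.
    print(parser.parse_args([]).model_version)
    # Explicit value: must be one of MODEL_VERSIONS, otherwise argparse exits.
    print(parser.parse_args(["--model_version", "v2_7b"]).model_version)
```

Because `choices` is set, a value outside the list is rejected by `argparse` before any engine work starts.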
## Support Matrix
* FP16
* INT4 & INT8 Weight-Only
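These precisions are selected at build time through `--dtype`, `--use_weight_only`, and `--weight_only_precision`. The sketch below only illustrates how the flags combine; `weight_precision` is a hypothetical helper and not part of XTRT-LLM.

```python
def weight_precision(dtype: str = "float16",
                     use_weight_only: bool = False,
                     weight_only_precision: str = "int8") -> str:
    """Return the precision in which the GEMM weights would be stored."""
    if not use_weight_only:
        # Without --use_weight_only the weights stay in the --dtype precision,
        # and --weight_only_precision has no effect.
        return dtype
    if weight_only_precision not in ("int8", "int4"):
        raise ValueError(f"unsupported precision: {weight_only_precision}")
    return weight_only_precision


if __name__ == "__main__":
    print(weight_precision())
    print(weight_precision(use_weight_only=True))
    print(weight_precision(use_weight_only=True, weight_only_precision="int4"))
```

Note that `--weight_only_precision int4` on its own changes nothing; it only takes effect together with `--use_weight_only`.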
## Usage
The XTRT-LLM Baichuan example code is located in [examples/baichuan](./). It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
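To make the "one engine per XPU" relationship concrete, the sketch below lists the engine files to expect in the output directory. `get_engine_name` uses the same format string as `build.py`; `expected_engine_files` is a hypothetical helper added for illustration.

```python
def get_engine_name(model: str, dtype: str, tp_size: int, rank: int) -> str:
    # Same naming scheme as get_engine_name in build.py.
    return "{}_{}_tp{}_rank{}.engine".format(model, dtype, tp_size, rank)


def expected_engine_files(dtype: str, world_size: int) -> list:
    """One engine file per rank, i.e. per XPU used for inference."""
    return [
        get_engine_name("baichuan", dtype, world_size, rank)
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    for name in expected_engine_files("float16", 2):
        print(name)
    # baichuan_float16_tp2_rank0.engine
    # baichuan_float16_tp2_rank1.engine
```

The file name carries `--dtype` but not the quantization mode, so FP16 and weight-only builds of the same model produce identically named files. The examples in this README avoid collisions by writing them to different `--output_dir` paths.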
### Build XTRT engine(s)
You need to specify the HF Baichuan checkpoint path. For `v1_13b`, use either [./downloads/baichuan-13b](./downloads/baichuan-13b) or [baichuan-inc/Baichuan-13B-Base](https://huggingface.co/baichuan-inc/Baichuan-13B-Base). For `v2_13b`, use either [baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) or [baichuan-inc/Baichuan2-13B-Base](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base). More Baichuan models can be found at [baichuan-inc](https://huggingface.co/baichuan-inc).
XTRT-LLM Baichuan builds XTRT engine(s) from HF checkpoint. If no checkpoint directory is specified, XTRT-LLM will build engine(s) with dummy weights.
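When no checkpoint is available, `build.py` falls back to built-in hyperparameters for each model version. The sketch below restates those values; the numbers are copied from the defaults and the fallback branch in `parse_arguments`, while `FALLBACK_CONFIGS` and `fallback_config` are hypothetical names used only here.

```python
# Hyperparameters build.py uses when it cannot read them from an HF config.
_7B = {"inter_size": 11008, "n_embd": 4096, "n_head": 32, "n_layer": 32}
_13B = {"inter_size": 13696, "n_embd": 5120, "n_head": 40, "n_layer": 40}
_COMMON = {"n_positions": 4096, "hidden_act": "silu"}

FALLBACK_CONFIGS = {
    "v1_7b": {**_7B, **_COMMON, "vocab_size": 64000},
    "v1_13b": {**_13B, **_COMMON, "vocab_size": 64000},
    "v2_7b": {**_7B, **_COMMON, "vocab_size": 125696},
    "v2_13b": {**_13B, **_COMMON, "vocab_size": 125696},
}


def fallback_config(model_version: str) -> dict:
    """Return a copy of the built-in hyperparameters for a model version."""
    return dict(FALLBACK_CONFIGS[model_version])


if __name__ == "__main__":
    cfg = fallback_config("v2_13b")
    print(cfg["n_layer"], cfg["n_embd"], cfg["vocab_size"])
```

The only difference between the V1 and V2 entries of the same size is the vocabulary size.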
Normally `build.py` requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up engine building by adding the `--parallel_build` argument, which builds the engines in parallel. Please note that the `parallel_build` feature currently supports only a single node.
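The difference between the two modes comes down to which ranks each process builds. The sketch below mirrors the rank loop in `build.py`; `ranks_built_by` is a hypothetical helper written for illustration, not a function of the example.

```python
def ranks_built_by(process_rank: int, world_size: int,
                   parallel_build: bool) -> list:
    """Ranks whose engines a given build process is responsible for."""
    return [
        rank for rank in range(world_size)
        # With parallel build, every process skips all ranks but its own.
        if not parallel_build or rank == process_rank
    ]


if __name__ == "__main__":
    # Serial build: a single process (rank 0) builds every engine in turn.
    print(ranks_built_by(0, world_size=2, parallel_build=False))
    # Parallel build: one process per rank, each building one engine.
    print(ranks_built_by(0, world_size=2, parallel_build=True))
    print(ranks_built_by(1, world_size=2, parallel_build=True))
```

In the serial case the total build time grows with `--world_size`, while the parallel case needs all `--world_size` XPUs to be free at the same time.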
Here are some examples that use `v1_13b` (`v1_7b`, `v2_7b`, and `v2_13b` are also supported):
```bash
# Build the Baichuan V1 13B model using a single XPU and FP16.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/baichuan-13b/fp16/tp1
# Build the Baichuan V1 13B model using a single XPU and apply INT8 weight-only quantization.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--output_dir ./downloads/baichuan-13b/int8/tp1
# Build the Baichuan V1 13B model using a single XPU and apply INT4 weight-only quantization.
python build.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--weight_only_precision int4 \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/int4_weight_only/1-gpu/
# Build Baichuan V1 13B using 2-way tensor parallelism and FP16.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/baichuan-13b/fp16/tp2 \
--parallel_build \
--world_size 2
# Build Baichuan V1 13B using 2-way tensor parallelism and apply INT8 weight-only quantization.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--output_dir ./downloads/baichuan-13b/int8/tp2 \
--parallel_build \
--world_size 2
```
### Run
Before running the examples, make sure to set the following environment variables:
```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPytorch cache XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # disable XPytorch use L3.
```
If you are running with multiple XPUs and no L3 space, you can set `BKCL_CCIX_BUFFER_GM=1` to disable L3.
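If the scripts are launched from Python rather than from a shell, the same settings can be passed through the child process environment. This is a minimal sketch: `runtime_env` and `XPU_RUNTIME_ENV` are hypothetical names, and the values are copied verbatim from the settings above without any claim about their meaning beyond what this README states.

```python
import os

# Variable names and values taken from the export lines in this README.
XPU_RUNTIME_ENV = {
    "PYTORCH_NO_XPU_MEMORY_CACHING": "0",
    "XMLIR_D_XPU_L3_SIZE": "0",
}


def runtime_env(base=None, multi_xpu_no_l3: bool = False) -> dict:
    """Return a copy of the environment with the XPU settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(XPU_RUNTIME_ENV)
    if multi_xpu_no_l3:
        # Only for multi-XPU runs without L3 space, as described above.
        env["BKCL_CCIX_BUFFER_GM"] = "1"
    return env


if __name__ == "__main__":
    env = runtime_env(base={}, multi_xpu_no_l3=True)
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary can be handed to `subprocess.run(..., env=runtime_env())` so that the parent process environment is left untouched.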
To run an XTRT-LLM Baichuan model using the engines generated by `build.py`, use commands such as the following:
```bash
# Generate summarization for a given input text
python summarize.py --model_version v2_13b \
--hf_model_location ./downloads/baichuan2-13b \
--engine_dir ./downloads/baichuan2-13b/fp16/tp1/ \
--log_level info
# With fp16 inference
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir ./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/fp16/tp1
# With INT8 weight-only quantization inference
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir=./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/int8/tp1
# With INT4 weight-only quantization inference
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir=baichuan-inc/Baichuan-13B-Chat \
--engine_dir=./tmp/baichuan_v1_13b/trt_engines/int4_weight_only/1-gpu/
# with fp16 and 2-way tensor parallelism inference
mpirun -n 2 --allow-run-as-root \
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir=./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/fp16/tp2
# with INT8 weight-only and 2-way tensor parallelism inference
mpirun -n 2 --allow-run-as-root \
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir=./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/int8/tp2
```
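At run time, `run.py` reads `config.json` from the engine directory and divides the attention heads and hidden size evenly across the tensor-parallel ranks. The sketch below reproduces that arithmetic on an in-memory dictionary shaped like the `builder_config` section that `run.py` reads; `per_rank_shapes` is a hypothetical helper, and the sample values are the `v1_13b` defaults with 2-way tensor parallelism.

```python
def per_rank_shapes(config: dict) -> dict:
    """Compute the per-rank model dimensions the way run.py does."""
    builder = config["builder_config"]
    world_size = builder["tensor_parallel"]
    return {
        "world_size": world_size,
        # Heads and hidden size are split across ranks.
        "num_heads": builder["num_heads"] // world_size,
        "hidden_size": builder["hidden_size"] // world_size,
        # Vocabulary size and layer count are taken as-is.
        "vocab_size": builder["vocab_size"],
        "num_layers": builder["num_layers"],
    }


if __name__ == "__main__":
    sample = {
        "builder_config": {
            "tensor_parallel": 2,
            "num_heads": 40,
            "hidden_size": 5120,
            "vocab_size": 64000,
            "num_layers": 40,
        }
    }
    print(per_rank_shapes(sample))
```

This is also why the number of `mpirun` processes should match the `--world_size` used at build time: each process loads the engine for its own rank.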
### Known Issues
* The implementation of the Baichuan-7B model with INT8 Weight-Only and Tensor
Parallelism greater than 2 might have accuracy issues. It is under
investigation.


@@ -0,0 +1,127 @@
# Baichuan
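This document describes how to build and run Baichuan models (including `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`) with Kunlunxin XTRT-LLM on a single XPU and on a single node with multiple XPUs.
## Overview
The XTRT-LLM Baichuan example code is located in [`examples/baichuan`](./). This folder contains the following main files:
* [`build.py`](./build.py) builds the XTRT engine(s) needed to run the Baichuan model.
* [`run.py`](./run.py) runs inference on an input text.
These scripts accept an argument named `model_version`, whose value must be one of `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`; the default is `v1_13b`.
## Support Matrix
* FP16
* INT8 Weight-Only
## Usage
The XTRT-LLM Baichuan example code is located in [`examples/baichuan`](./). It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
### Build XTRT engine(s)
You need to specify the HF Baichuan checkpoint path. For `v1_13b`, use either [./downloads/baichuan-13b](./downloads/baichuan-13b) or [baichuan-inc/Baichuan-13B-Base](https://huggingface.co/baichuan-inc/Baichuan-13B-Base). For `v2_13b`, use either [baichuan-inc/Baichuan2-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) or [baichuan-inc/Baichuan2-13B-Base](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base). More Baichuan models can be found at [baichuan-inc](https://huggingface.co/baichuan-inc).
XTRT-LLM Baichuan builds XTRT engines from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engines with dummy weights.
Normally `build.py` requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up engine building by adding the `--parallel_build` argument, which builds the engines in parallel. Please note that the parallel build feature currently supports only a single node.
Here are some examples that use `v1_13b` (`v1_7b`, `v2_7b`, and `v2_13b` are also supported):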
```bash
# Build the Baichuan V1 13B model using a single XPU and FP16.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/baichuan-13b/fp16/tp1
# Build the Baichuan V1 13B model using a single XPU and apply INT8 weight-only quantization.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--output_dir ./downloads/baichuan-13b/int8/tp1
# Build Baichuan V1 13B using 2-way tensor parallelism and FP16.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/baichuan-13b/fp16/tp2 \
--parallel_build \
--world_size 2
# Build Baichuan V1 13B using 2-way tensor parallelism and apply INT8 weight-only quantization.
python build.py --model_version v1_13b \
--model_dir ./downloads/baichuan-13b \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--output_dir ./downloads/baichuan-13b/int8/tp2 \
--parallel_build \
--world_size 2
```
### Run
Before running the examples, make sure to set the following environment variables:
```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPytorch cache XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # disable XPytorch use L3.
```
If you are running with multiple XPUs and no L3 space, you can set `BKCL_CCIX_BUFFER_GM=1` to disable L3.
To run an XTRT-LLM Baichuan model using the engines generated by `build.py`:
```bash
# With fp16 inference
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir ./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/fp16/tp1
# With INT8 weight-only quantization inference
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir=./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/int8/tp1
# with fp16 and 2-way tensor parallelism inference
mpirun -n 2 --allow-run-as-root \
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir=./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/fp16/tp2
# with INT8 weight-only and 2-way tensor parallelism inference
mpirun -n 2 --allow-run-as-root \
python run.py --model_version v1_13b \
--max_output_len=50 \
--tokenizer_dir=./downloads/baichuan-13b \
--log_level=info \
--engine_dir=./downloads/baichuan-13b/int8/tp2
```
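### Known Issues
- The implementation of the Baichuan-7B model with INT8 weight-only quantization and tensor parallelism greater than 2 might have accuracy issues. This is under investigation.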

491
examples/baichuan/build.py Normal file

@@ -0,0 +1,491 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import onnx
import torch.multiprocessing as mp
import tvm as trt
from onnx import TensorProto, helper
from transformers import AutoConfig, AutoModelForCausalLM
import xtrt_llm
from xtrt_llm._utils import str_dtype_to_xtrt
from xtrt_llm.builder import Builder
from xtrt_llm.layers.attention import PositionEmbeddingType
from xtrt_llm.logger import logger
from xtrt_llm.mapping import Mapping
from xtrt_llm.models import BaichuanForCausalLM, weight_only_quantize
from xtrt_llm.network import net_guard
from xtrt_llm.plugin.plugin import ContextFMHAType
from xtrt_llm.quantization import QuantMode
from weight import load_from_hf_baichuan # isort:skip
# 2 routines: get_engine_name, serialize_engine
# are direct copy from gpt example, TODO: put in utils?
def trt_dtype_to_onnx(dtype):
    if dtype == trt.float16:
        return TensorProto.DataType.FLOAT16
    elif dtype == trt.float32:
        return TensorProto.DataType.FLOAT
    elif dtype == trt.int32:
        return TensorProto.DataType.INT32
    else:
        raise TypeError("%s is not supported" % dtype)


def to_onnx(network, path):
    inputs = []
    for i in range(network.num_inputs):
        network_input = network.get_input(i)
        inputs.append(
            helper.make_tensor_value_info(
                network_input.name, trt_dtype_to_onnx(network_input.dtype),
                list(network_input.shape)))
    outputs = []
    for i in range(network.num_outputs):
        network_output = network.get_output(i)
        outputs.append(
            helper.make_tensor_value_info(
                network_output.name, trt_dtype_to_onnx(network_output.dtype),
                list(network_output.shape)))
    nodes = []
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        layer_inputs = []
        for j in range(layer.num_inputs):
            ipt = layer.get_input(j)
            if ipt is not None:
                layer_inputs.append(layer.get_input(j).name)
        layer_outputs = [
            layer.get_output(j).name for j in range(layer.num_outputs)
        ]
        nodes.append(
            helper.make_node(str(layer.type),
                             name=layer.name,
                             inputs=layer_inputs,
                             outputs=layer_outputs,
                             domain="com.nvidia"))
    onnx_model = helper.make_model(helper.make_graph(nodes,
                                                     'attention',
                                                     inputs,
                                                     outputs,
                                                     initializer=None),
                                   producer_name='NVIDIA')
    onnx.save(onnx_model, path)


def get_engine_name(model, dtype, tp_size, rank):
    return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)


def serialize_engine(engine, path):
    logger.info(f'Serializing engine to {path}...')
    tik = time.time()
    # import pdb;pdb.set_trace()
    engine.serialize(path)
    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    logger.info(f'Engine serialized. Total time: {t}')
def parse_arguments():
parser = argparse.ArgumentParser()
parser.add_argument('--world_size',
type=int,
default=1,
help='world size, only support tensor parallelism now')
parser.add_argument('--model_dir',
type=str,
default='baichuan-inc/Baichuan-13B-Chat')
parser.add_argument('--model_version',
type=str,
default='v1_13b',
choices=['v1_7b', 'v1_13b', 'v2_7b', 'v2_13b'])
parser.add_argument('--dtype',
type=str,
default='float16',
choices=['float32', 'bfloat16', 'float16'])
parser.add_argument(
'--opt_memory_use',
default=True,
action="store_true",
help='Whether to use Host memory optimization for building engine')
    parser.add_argument(
        '--timing_cache',
        type=str,
        default='model.cache',
        help=
        'The path to read the timing cache from; ignored if the file does not exist'
    )
parser.add_argument('--log_level', type=str, default='info')
parser.add_argument('--pp_size', type=int, default=1)
parser.add_argument('--vocab_size', type=int, default=64000)
parser.add_argument('--n_layer', type=int, default=40)
parser.add_argument('--n_positions', type=int, default=4096)
parser.add_argument('--n_embd', type=int, default=5120)
parser.add_argument('--n_head', type=int, default=40)
parser.add_argument('--inter_size', type=int, default=13696)
parser.add_argument('--hidden_act', type=str, default='silu')
parser.add_argument('--max_batch_size', type=int, default=1)
parser.add_argument('--max_input_len', type=int, default=1024)
parser.add_argument('--max_output_len', type=int, default=1024)
parser.add_argument('--max_beam_width', type=int, default=1)
parser.add_argument('--use_gpt_attention_plugin',
nargs='?',
const='float16',
type=str,
default=True,
choices=['float16', 'bfloat16', 'float32'])
parser.add_argument('--use_gemm_plugin',
nargs='?',
const='float16',
type=str,
default=False,
choices=['float16', 'bfloat16', 'float32'])
parser.add_argument('--enable_context_fmha',
default=False,
action='store_true')
parser.add_argument('--enable_context_fmha_fp32_acc',
default=False,
action='store_true')
parser.add_argument('--parallel_build', default=False, action='store_true')
parser.add_argument('--visualize', default=False, action='store_true')
parser.add_argument('--enable_debug_output',
default=False,
action='store_true')
parser.add_argument('--gpus_per_node', type=int, default=8)
parser.add_argument(
'--output_dir',
type=str,
default='baichuan_outputs',
help=
'The path to save the serialized engine files, timing cache file and model configs'
)
parser.add_argument('--remove_input_padding',
default=False,
action='store_true')
    parser.add_argument(
        '--use_weight_only',
        default=False,
        action="store_true",
        help='Quantize weights for the various GEMMs to INT4/INT8. '
        'See --weight_only_precision to set the precision')
    parser.add_argument(
        '--weight_only_precision',
        const='int8',
        type=str,
        nargs='?',
        default='int8',
        choices=['int8', 'int4'],
        help=
        'Define the precision for the weights when using weight-only quantization. '
        'You must also use --use_weight_only for that argument to have an impact.'
    )
parser.add_argument(
'--use_inflight_batching',
action="store_true",
default=False,
help="Activates inflight batching mode of gptAttentionPlugin.")
parser.add_argument(
'--paged_kv_cache',
action="store_true",
default=False,
help=
'By default we use contiguous KV cache. By setting this flag you enable paged KV cache'
)
parser.add_argument('--tokens_per_block',
type=int,
default=64,
help='Number of tokens per block in paged KV cache')
parser.add_argument(
'--max_num_tokens',
type=int,
default=None,
help='Define the max number of tokens supported by the engine')
parser.add_argument('--gather_all_token_logits',
action='store_true',
default=False)
args = parser.parse_args()
if args.use_weight_only:
args.quant_mode = QuantMode.use_weight_only(
args.weight_only_precision == 'int4')
else:
args.quant_mode = QuantMode(0)
if args.use_inflight_batching:
if not args.use_gpt_attention_plugin:
args.use_gpt_attention_plugin = 'float16'
logger.info(
f"Using GPT attention plugin for inflight batching mode. Setting to default '{args.use_gpt_attention_plugin}'"
)
if not args.remove_input_padding:
args.remove_input_padding = True
logger.info(
"Using remove input padding for inflight batching mode.")
if not args.paged_kv_cache:
args.paged_kv_cache = True
logger.info("Using paged KV cache for inflight batching mode.")
if args.max_num_tokens is not None:
assert args.enable_context_fmha
if args.model_dir is not None:
hf_config = AutoConfig.from_pretrained(args.model_dir,
trust_remote_code=True)
# override the inter_size for Baichuan
args.inter_size = hf_config.intermediate_size
args.n_embd = hf_config.hidden_size
args.n_head = hf_config.num_attention_heads
args.n_layer = hf_config.num_hidden_layers
if args.model_version == 'v1_7b' or args.model_version == 'v2_7b':
args.n_positions = hf_config.max_position_embeddings
else:
args.n_positions = hf_config.model_max_length
args.vocab_size = hf_config.vocab_size
args.hidden_act = hf_config.hidden_act
else:
# default values are based on v1_13b, change them based on model_version
if args.model_version == 'v1_7b':
args.inter_size = 11008
args.n_embd = 4096
args.n_head = 32
args.n_layer = 32
args.n_positions = 4096
args.vocab_size = 64000
args.hidden_act = 'silu'
elif args.model_version == 'v2_7b':
args.inter_size = 11008
args.n_embd = 4096
args.n_head = 32
args.n_layer = 32
args.n_positions = 4096
args.vocab_size = 125696
args.hidden_act = 'silu'
elif args.model_version == 'v2_13b':
args.inter_size = 13696
args.n_embd = 5120
args.n_head = 40
args.n_layer = 40
args.n_positions = 4096
args.vocab_size = 125696
args.hidden_act = 'silu'
if args.dtype == 'bfloat16':
assert args.use_gemm_plugin, "Please use gemm plugin when dtype is bfloat16"
return args
def build_rank_engine(builder: Builder,
builder_config: xtrt_llm.builder.BuilderConfig,
engine_name, rank, args):
'''
@brief: Build the engine on the given rank.
@param rank: The rank to build the engine.
@param args: The cmd line arguments.
@return: The built engine.
'''
kv_dtype = str_dtype_to_xtrt(args.dtype)
if args.model_version == 'v1_7b' or args.model_version == 'v2_7b':
position_embedding_type = PositionEmbeddingType.rope_gpt_neox
else:
position_embedding_type = PositionEmbeddingType.alibi
# Initialize Module
xtrt_llm_baichuan = BaichuanForCausalLM(
num_layers=args.n_layer,
num_heads=args.n_head,
hidden_size=args.n_embd,
vocab_size=args.vocab_size,
hidden_act=args.hidden_act,
max_position_embeddings=args.n_positions,
position_embedding_type=position_embedding_type,
dtype=kv_dtype,
mlp_hidden_size=args.inter_size,
mapping=Mapping(world_size=args.world_size,
rank=rank,
tp_size=args.world_size),
gather_all_token_logits=args.gather_all_token_logits)
if args.use_weight_only and args.weight_only_precision == 'int8' and 0:
xtrt_llm_baichuan = weight_only_quantize(xtrt_llm_baichuan,
QuantMode.use_weight_only())
elif args.use_weight_only and args.weight_only_precision == 'int4' and 0:
xtrt_llm_baichuan = weight_only_quantize(
xtrt_llm_baichuan, QuantMode.use_weight_only(use_int4_weights=True))
if args.model_dir is not None:
logger.info(
f'Loading HF Baichuan {args.model_version} ... from {args.model_dir}'
)
tik = time.time()
hf_baichuan = AutoModelForCausalLM.from_pretrained(
args.model_dir,
device_map={
"model": "cpu",
"lm_head": "cpu"
}, # Load to CPU memory
torch_dtype="auto",
trust_remote_code=True)
tok = time.time()
t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
logger.info(f'HF Baichuan {args.model_version} loaded. Total time: {t}')
load_from_hf_baichuan(xtrt_llm_baichuan,
hf_baichuan,
args.model_version,
rank,
args.world_size,
dtype=args.dtype)
del hf_baichuan
# Module -> Network
network = builder.create_network()
network.trt_network.name = engine_name
if args.use_gpt_attention_plugin:
network.plugin_config.set_gpt_attention_plugin(
dtype=args.use_gpt_attention_plugin)
if args.use_gemm_plugin:
network.plugin_config.set_gemm_plugin(dtype=args.use_gemm_plugin)
assert not (args.enable_context_fmha and args.enable_context_fmha_fp32_acc)
if args.enable_context_fmha:
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
if args.enable_context_fmha_fp32_acc:
network.plugin_config.set_context_fmha(
ContextFMHAType.enabled_with_fp32_acc)
if args.use_weight_only:
network.plugin_config.set_weight_only_quant_matmul_plugin(
dtype='float16')
builder_config.trt_builder_config.use_weight_only = args.weight_only_precision
if args.world_size > 1:
network.plugin_config.set_nccl_plugin(args.dtype)
if args.remove_input_padding:
network.plugin_config.enable_remove_input_padding()
if args.paged_kv_cache:
network.plugin_config.enable_paged_kv_cache(args.tokens_per_block)
with net_guard(network):
# Prepare
network.set_named_parameters(xtrt_llm_baichuan.named_parameters())
# Forward
inputs = xtrt_llm_baichuan.prepare_inputs(args.max_batch_size,
args.max_input_len,
args.max_output_len, True,
args.max_beam_width,
args.max_num_tokens)
xtrt_llm_baichuan(*inputs)
if args.enable_debug_output:
# mark intermediate nodes' outputs
for k, v in xtrt_llm_baichuan.named_network_outputs():
v = v.trt_tensor
v.name = k
network.trt_network.mark_output(v)
v.dtype = kv_dtype
if args.visualize:
model_path = os.path.join(args.output_dir, 'test.onnx')
to_onnx(network.trt_network, model_path)
engine = None
# Network -> Engine
engine = builder.build_engine(network, builder_config, compiler="gr")
if rank == 0:
config_path = os.path.join(args.output_dir, 'config.json')
builder.save_config(builder_config, config_path)
if args.opt_memory_use:
return engine, network
return engine
def build(rank, args):
# torch.cuda.set_device(rank % args.gpus_per_node)
xtrt_llm.logger.set_level(args.log_level)
if not os.path.exists(args.output_dir):
os.makedirs(args.output_dir)
# when doing serializing build, all ranks share one engine
builder = Builder()
cache = None
model_name = 'baichuan'
for cur_rank in range(args.world_size):
# skip other ranks if parallel_build is enabled
if args.parallel_build and cur_rank != rank:
continue
builder_config = builder.create_builder_config(
name=model_name,
precision=args.dtype,
timing_cache=args.timing_cache if cache is None else cache,
tensor_parallel=args.world_size, # TP only
parallel_build=args.parallel_build,
pipeline_parallel=args.pp_size,
num_layers=args.n_layer,
num_heads=args.n_head,
hidden_size=args.n_embd,
inter_size=args.inter_size,
vocab_size=args.vocab_size,
hidden_act=args.hidden_act,
max_position_embeddings=args.n_positions,
max_batch_size=args.max_batch_size,
max_input_len=args.max_input_len,
max_output_len=args.max_output_len,
max_num_tokens=args.max_num_tokens,
int8=args.quant_mode.has_act_and_weight_quant(),
quant_mode=args.quant_mode,
fusion_pattern_list=["remove_dup_mask"],
gather_all_token_logits=args.gather_all_token_logits,
)
guard = xtrt_llm.fusion_patterns.FuseonPatternGuard()
print(guard)
engine_name = get_engine_name(model_name, args.dtype, args.world_size,
cur_rank)
if args.opt_memory_use:
engine, network = build_rank_engine(builder, builder_config,
engine_name, cur_rank, args)
else:
engine = build_rank_engine(builder, builder_config, engine_name,
cur_rank, args)
assert engine is not None, f'Failed to build engine for rank {cur_rank}'
serialize_engine(engine, os.path.join(args.output_dir, engine_name))
if __name__ == '__main__':
    args = parse_arguments()
    logger.set_level(args.log_level)
    tik = time.time()
    if args.parallel_build and args.world_size > 1:
        logger.warning(
            f'Building XTRT engines in parallel. Please make sure that all of the {args.world_size} XPUs are completely free.'
        )
        mp.spawn(build, nprocs=args.world_size, args=(args, ))
    else:
        args.parallel_build = False
        logger.info('Building XTRT engines serially.')
        build(0, args)
    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    logger.info(f'Total time of building all {args.world_size} engines: {t}')

99
examples/baichuan/build.sh Executable file

@@ -0,0 +1,99 @@
build_baichuan() {
get_path
cmd="XTCL_BUILD_DEBUG=1 python3 build.py ${tp_cmd} --model_version $model_name \
--model_dir ${model_home}/downloads/baichuan${model_version_num}-${model_size} \
--dtype float16 ${int8_cmd} \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--output_dir ${model_home}/engine/baichuan${model_version_num}-${model_size}/${precision}/${tp}"
echo "******************** cmd *********************"
echo $cmd
eval ${cmd} |& tee ${log_file}
}
get_path(){
model_home=/home/workspace
model_version=$(echo $model_name | cut -d "_" -f 1)
model_size=$(echo $model_name | cut -d "_" -f 2)
precision=$(echo $model_name | cut -d "_" -f 3)
tp=$(echo $model_name | cut -d "_" -f 4)
model_name=${model_version}_${model_size}
model_version_num=$(echo $model_version | grep -o '[0-9]\+')
if [[ "$model_version_num" == "1" ]]; then
model_version_num=""
fi
mpi_num=$(echo $tp | cut -d "p" -f 2)
if (( $mpi_num > 1 )); then
mpi_cmd="mpirun --allow-run-as-root -n $mpi_num"
tp_cmd="--parallel_build --world_size $mpi_num"
else
mpi_cmd=""
tp_cmd=""
fi
if [[ "$precision" == "int8" ]]; then
int8_cmd="--use_weight_only"
else
int8_cmd=""
fi
echo "------------------------------------------------------"
log_file=./logs/relay_${model_name}_"$(date '+%Y-%m-%d-%H:%M:%S')".log
echo "log file -> ${log_file} "
echo -e "\033[1;31m" # set red text
echo "Model version Model size Precision TP"
echo -e "\033[0m" # reset text color
echo "------------------------------------------------------"
echo -e "\033[0;32m"
echo "$model_version" " " "$model_size" " " "$precision" " " "$tp"
echo ""
}
if [ "$#" -ne 1 ]; then
    # The script reads the model name from the first positional argument.
    echo "Usage: $0 <model_name> (e.g. v1_13b_fp16_tp1)"
    exit 1
fi
model_name="$1"
case $model_name in
"v1_13b_fp16_tp1")
build_baichuan
;;
"v1_13b_int8_tp1")
build_baichuan
;;
"v1_13b_fp16_tp2")
build_baichuan
;;
"v1_13b_int8_tp2")
build_baichuan
;;
"v1_7b_fp16_tp1")
build_baichuan
;;
"v1_7b_int8_tp1")
build_baichuan
;;
"v2_13b_fp16_tp1")
build_baichuan
;;
"v2_13b_int8_tp1")
build_baichuan
;;
"v2_13b_fp16_tp2")
build_baichuan
;;
"v2_13b_int8_tp2")
build_baichuan
;;
*)
echo "Unknown model name: $model_name"
exit 1
;;
esac


@@ -0,0 +1,5 @@
datasets~=2.3.2
rouge_score~=0.1.2
sentencepiece~=0.1.99
cpm-kernels~=1.0.11
transformers_stream_generator~=0.0.4

283
examples/baichuan/run.py Normal file

@@ -0,0 +1,283 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import csv
import json
import os
from pathlib import Path
import numpy as np
import torch
from transformers import AutoTokenizer
from tvm.contrib.profiling import Profiler
import xtrt_llm
from xtrt_llm.runtime import ModelConfig, SamplingConfig
from build import get_engine_name # isort:skip
EOS_TOKEN = 2
PAD_TOKEN = 0
def parse_arguments():
parser = argparse.ArgumentParser()
parser.add_argument('--max_output_len', type=int, required=True)
parser.add_argument('--log_level', type=str, default='error')
parser.add_argument('--model_version',
type=str,
default='v1_13b',
choices=['v1_7b', 'v1_13b', 'v2_7b', 'v2_13b'])
parser.add_argument('--engine_dir', type=str, default='baichuan_outputs')
parser.add_argument('--tokenizer_dir',
type=str,
default="baichuan-inc/Baichuan-13B-Chat",
help="Directory containing the tokenizer.model.")
parser.add_argument('--input_text', type=str, default="解释一下“温故而知新”")
parser.add_argument(
'--input_tokens',
dest='input_file',
type=str,
help=
'CSV or Numpy file containing tokenized input. Alternative to text input.',
default=None)
parser.add_argument('--output_csv',
type=str,
help='CSV file where the tokenized output is stored.',
default=None)
parser.add_argument('--output_npy',
type=str,
help='Numpy file where the tokenized output is stored.',
default=None)
parser.add_argument('--num_beams',
type=int,
help="Use beam search if num_beams >1",
default=1)
parser.add_argument(
'--performance_test_scale',
type=str,
help=
"Scale for performance test. e.g., 8x1024x64 (batch_size, input_text_length, max_output_length)",
default="")
return parser.parse_args()
def generate(
max_output_len: int,
log_level: str = 'error',
model_version: str = 'v1_13b',
engine_dir: str = 'baichuan_outputs',
input_text: str = "解释一下“温故而知新”",
input_file: str = None,
output_csv: str = None,
output_npy: str = None,
tokenizer_dir: str = None,
num_beams: int = 1,
performance_test_scale: str = "",
):
xtrt_llm.logger.set_level(log_level)
config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
config = json.load(f)
use_gpt_attention_plugin = config['plugin_config']['gpt_attention_plugin']
remove_input_padding = config['plugin_config']['remove_input_padding']
paged_kv_cache = config['plugin_config']['paged_kv_cache']
tokens_per_block = config['plugin_config']['tokens_per_block']
dtype = config['builder_config']['precision']
world_size = config['builder_config']['tensor_parallel']
# assert world_size == xtrt_llm.mpi_world_size(), \
# f'Engine world size ({world_size}) != Runtime world size ({xtrt_llm.mpi_world_size()})'
num_heads = config['builder_config']['num_heads'] // world_size
hidden_size = config['builder_config']['hidden_size'] // world_size
vocab_size = config['builder_config']['vocab_size']
num_layers = config['builder_config']['num_layers']
builder_config = config['builder_config']
gather_all_token_logits = builder_config.get('gather_all_token_logits',
False)
runtime_rank = xtrt_llm.mpi_rank()
if world_size > 1:
os.environ["XCCL_GROUP_ID"] = str(runtime_rank // world_size)
os.environ["XCCL_NRANKS"] = str(world_size)
os.environ["XCCL_CUR_RANK"] = str(runtime_rank % world_size)
os.environ["XCCL_DEVICE_ID"] = str(runtime_rank)
os.environ["MP_RUN"] = str(1)
runtime_mapping = xtrt_llm.Mapping(world_size,
runtime_rank,
tp_size=world_size)
torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir,
use_fast=False,
trust_remote_code=True)
model_config = ModelConfig(num_heads=num_heads,
num_kv_heads=num_heads,
hidden_size=hidden_size,
vocab_size=vocab_size,
num_layers=num_layers,
gpt_attention_plugin=use_gpt_attention_plugin,
paged_kv_cache=paged_kv_cache,
tokens_per_block=tokens_per_block,
remove_input_padding=remove_input_padding,
dtype=dtype,
gather_all_token_logits=gather_all_token_logits)
    repetition_penalty = 1.1
    temperature = 0.3
    top_k = 5
    top_p = 0.85
    # Use the function argument rather than the module-level `args`,
    # so that generate() also works when called from other code.
    if model_version == 'v1_7b':
        temperature = 1
        top_k = 1
        top_p = 0
    elif model_version in ('v2_7b', 'v2_13b'):
        repetition_penalty = 1.05
sampling_config = SamplingConfig(end_id=EOS_TOKEN,
pad_id=PAD_TOKEN,
num_beams=num_beams,
repetition_penalty=repetition_penalty,
temperature=temperature,
top_k=top_k,
top_p=top_p)
engine_name = get_engine_name('baichuan', dtype, world_size, runtime_rank)
serialize_path = os.path.join(engine_dir, engine_name)
decoder = xtrt_llm.runtime.GenerationSession(model_config, serialize_path,
runtime_mapping)
    input_tokens = []
    if input_file is None:
        input_tokens.append(
            tokenizer.encode(input_text, add_special_tokens=False))
    else:
        if input_file.endswith('.csv'):
            with open(input_file, 'r') as csv_file:
                csv_reader = csv.reader(csv_file, delimiter=',')
                for line in csv_reader:
                    input_tokens.append(np.array(line, dtype='int32'))
        elif input_file.endswith('.npy'):
            inputs = np.load(input_file)
            for row in inputs:
                row = row[row != EOS_TOKEN]
                input_tokens.append(row)
        else:
            # Exit with a non-zero status so callers can detect the failure.
            raise SystemExit('Input file format not supported.')
    input_ids = None
    input_lengths = None
    if input_file is None:
        input_ids = torch.tensor(input_tokens, dtype=torch.int32, device='cuda')
        input_lengths = torch.tensor([input_ids.size(1)],
                                     dtype=torch.int32,
                                     device='cuda')
    else:
        input_lengths = torch.tensor([len(x) for x in input_tokens],
                                     dtype=torch.int32,
                                     device='cuda')
        if remove_input_padding:
            input_ids = np.concatenate(input_tokens)
            input_ids = torch.tensor(input_ids,
                                     dtype=torch.int32,
                                     device='cuda').unsqueeze(0)
        else:
            input_ids = torch.nested.to_padded_tensor(
                torch.nested.nested_tensor(input_tokens, dtype=torch.int32),
                EOS_TOKEN).cuda()
    if performance_test_scale != "":
        performance_test_scale_list = performance_test_scale.split("E")
        warmup_epochs = 3
        for scale in performance_test_scale_list:
            for i in range(warmup_epochs):
                xtrt_llm.logger.info(
                    f"Running performance test with scale {scale}")
                bs, seqlen, max_output_len = [int(x) for x in scale.split("x")]
                try:
                    _input_ids = torch.from_numpy(
                        np.zeros((bs, seqlen)).astype("int32")).cuda()
                    _input_lengths = torch.from_numpy(
                        np.full((bs, ), seqlen).astype("int32")).cuda()
                    max_input_length = torch.max(_input_lengths).item()
                    decoder.setup(_input_lengths.size(0), max_input_length,
                                  max_output_len, num_beams)
                    with Profiler(f'{bs}_{seqlen}_{max_output_len}_decode',
                                  show_report=True):
                        output_ids = decoder.decode(
                            _input_ids,
                            _input_lengths,
                            sampling_config,
                            stop_words_list=[tokenizer.eos_token_id])
                except Exception as e:
                    xtrt_llm.logger.info(
                        f"Error occurs in performance test: {e}.")
        exit(0)
    max_input_length = torch.max(input_lengths).item()
    decoder.setup(input_lengths.size(0),
                  max_input_length,
                  max_output_len,
                  beam_width=num_beams)
    output_ids = decoder.decode(input_ids,
                                input_lengths,
                                sampling_config,
                                stop_words_list=[tokenizer.eos_token_id])
    torch.cuda.synchronize()
    if runtime_rank == 0:
        if output_csv is None and output_npy is None:
            for b in range(input_lengths.size(0)):
                inputs = input_tokens[b]
                input_text = tokenizer.decode(inputs)
                print(f'Input: \"{input_text}\"')
                if num_beams <= 1:
                    output_begin = max_input_length
                    outputs = output_ids[b][0][output_begin:].tolist()
                    output_text = tokenizer.decode(outputs)
                    print(f'Output: \"{output_text}\"')
                else:
                    for beam in range(num_beams):
                        output_begin = input_lengths[b]
                        output_end = input_lengths[b] + max_output_len
                        outputs = output_ids[b][beam][
                            output_begin:output_end].tolist()
                        output_text = tokenizer.decode(outputs)
                        print(f'Output: \"{output_text}\"')
        output_ids = output_ids.reshape((-1, output_ids.size(2)))
        if output_csv is not None:
            output_file = Path(output_csv)
            output_file.parent.mkdir(exist_ok=True, parents=True)
            outputs = output_ids.tolist()
            with open(output_file, 'w') as csv_file:
                writer = csv.writer(csv_file, delimiter=',')
                writer.writerows(outputs)
        if output_npy is not None:
            output_file = Path(output_npy)
            output_file.parent.mkdir(exist_ok=True, parents=True)
            outputs = np.array(output_ids.cpu().contiguous(), dtype='int32')
            np.save(output_file, outputs)
    return
if __name__ == '__main__':
    args = parse_arguments()
    generate(**vars(args))
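
The `--performance_test_scale` value used above is a list of `<batch>x<input_len>x<output_len>` triples joined by `E`. The following is a minimal sketch of that parsing step in isolation; `PerfScale` and `parse_performance_test_scale` are hypothetical names that do not exist in `run.py`, and skipping empty chunks (for example after a trailing `E`) is an assumption of this sketch, not behavior of the script.

```python
from typing import List, NamedTuple


class PerfScale(NamedTuple):
    """One benchmark configuration (hypothetical helper type)."""
    batch_size: int
    input_len: int
    max_output_len: int


def parse_performance_test_scale(spec: str) -> List[PerfScale]:
    """Parse 'BxSxO' triples joined by 'E', e.g. '1x512x512E2x1024x64'."""
    scales = []
    for chunk in spec.split("E"):
        if not chunk:
            # Assumption of this sketch: ignore empty chunks, such as the
            # one produced by a trailing 'E'.
            continue
        parts = chunk.split("x")
        if len(parts) != 3:
            raise ValueError(f"expected <batch>x<input>x<output>, got {chunk!r}")
        scales.append(PerfScale(*(int(p) for p in parts)))
    return scales
```

Calling `parse_performance_test_scale("20x1024x1024E32x1024x1024")` yields two `PerfScale` entries that could then be fed to a benchmark loop one at a time.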

108
examples/baichuan/run.sh Executable file

@@ -0,0 +1,108 @@
get_path(){
model_home=/home/workspace
model_version=$(echo $model_name | cut -d "_" -f 1)
model_size=$(echo $model_name | cut -d "_" -f 2)
precision=$(echo $model_name | cut -d "_" -f 3)
tp=$(echo $model_name | cut -d "_" -f 4)
model_name=${model_version}_${model_size}
model_version_num=$(echo $model_version | grep -o '[0-9]\+')
if [[ "$model_version_num" == "1" ]]; then
model_version_num=""
fi
mpi_num=$(echo $tp | cut -d "p" -f 2)
if (( $mpi_num > 1 )); then
mpi_cmd="mpirun --allow-run-as-root -n $mpi_num"
# mpi_cmd="mpirun -n $mpi_num"
else
mpi_cmd=""
fi
echo -e "\033[1;31m" # set red font color
echo "Model version Model size Precision TP"
echo -e "\033[0m" # reset font color
echo "------------------------------------------------------"
echo -e "\033[0;32m"
echo "$model_version" " " "$model_size" " " "$precision" " " "$tp"
echo ""
}
run_baichuan(){
get_path
engine_dir=${model_home}/engine/baichuan${model_version_num}-${model_size}/${precision}/${tp}
tokenizer_dir=${model_home}/downloads/baichuan${model_version_num}-${model_size}
env_cmd="PYTORCH_NO_XPU_MEMORY_CACHING=0 XMLIR_D_XPU_L3_SIZE=0 "
required_cmd="--engine_dir=$engine_dir --tokenizer_dir=$tokenizer_dir"
options="--max_output_len=128 --log_level=info"
# inputs="--input_text='世界前五的高峰是?'"
prof_cmd="--performance_test_scale=\
1x512x512E1x1024x1024E1x2048x64E1x2048x2048E\
2x512x512E2x1024x1024E2x2048x64E2x2048x2048E\
4x512x512E4x1024x1024E4x2048x64E4x2048x2048E\
8x512x512E8x1024x1024E8x2048x64E8x2048x2048E\
16x2048x2048E\
32x128x2048E32x2048x128E\
64x128x128E"
prof_cmd="--performance_test_scale=20x1024x1024E32x1024x1024E48x1024x1024"
cmd="${env_cmd} $mpi_cmd python3 run.py ${required_cmd} ${options} ${inputs} ${prof_cmd}"
echo "==================== cmd ======================"
echo $cmd
eval $cmd
}
if [ "$#" -ne 2 ]; then
echo "Usage: $0 <model_name> <device_id>"
exit 1
fi
model_name="$1"
# device_id="$2"
# Make sure the log directory exists before tee writes to it.
mkdir -p ./logs
log_file=./logs/${model_name}_"$(date '+%Y-%m-%d-%H:%M:%S')".log
export XPU_VISIBLE_DEVICES=$2
# export XPU_VISIBLE_DEVICES=0,1
case $model_name in
"v1_13b_fp16_tp1")
run_baichuan |& tee $log_file 2>&1 &
;;
"v1_13b_int8_tp1")
run_baichuan |& tee $log_file 2>&1 &
;;
"v1_13b_fp16_tp2")
run_baichuan |& tee $log_file 2>&1 &
;;
"v1_13b_int8_tp2")
run_baichuan |& tee $log_file 2>&1 &
;;
"v1_7b_fp16_tp1")
run_baichuan |& tee $log_file 2>&1 &
;;
"v1_7b_int8_tp1")
run_baichuan |& tee $log_file 2>&1 &
;;
"v2_13b_fp16_tp1")
run_baichuan |& tee $log_file 2>&1 &
;;
"v2_13b_int8_tp1")
run_baichuan |& tee $log_file 2>&1 &
;;
"v2_13b_fp16_tp2")
run_baichuan |& tee $log_file 2>&1 &
;;
"v2_13b_int8_tp2")
run_baichuan |& tee $log_file 2>&1 &
;;
*)
echo "Unknown model name: $model_name"
exit 1
;;
esac
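
The `get_path` function in `run.sh` splits a name such as `v2_13b_int8_tp2` on underscores and derives the directory name and rank count from the pieces. As a rough illustration of the same rules, here is a Python sketch; `ModelSpec` and `parse_model_name` are hypothetical names introduced only for this example, and the input validation is an addition that the shell script does not perform.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    """Fields encoded in a run.sh model name (hypothetical helper type)."""
    version: str          # e.g. "v1"
    size: str             # e.g. "13b"
    precision: str        # e.g. "fp16"
    tensor_parallel: int  # number of ranks taken from "tp<N>"
    checkpoint_dir: str   # e.g. "baichuan-13b" or "baichuan2-13b"


def parse_model_name(name: str) -> ModelSpec:
    fields = name.split("_")
    if len(fields) != 4:
        raise ValueError(
            f"expected <version>_<size>_<precision>_tp<N>, got {name!r}")
    version, size, precision, tp_field = fields
    if not (tp_field.startswith("tp") and tp_field[2:].isdigit()):
        raise ValueError(f"bad tensor-parallel field: {tp_field!r}")
    tensor_parallel = int(tp_field[2:])
    # Version 1 directories carry no number ("baichuan-13b"); later
    # versions do ("baichuan2-13b"), mirroring the check in get_path.
    version_num = version.lstrip("v")
    suffix = "" if version_num == "1" else version_num
    return ModelSpec(version, size, precision, tensor_parallel,
                     f"baichuan{suffix}-{size}")
```

With this sketch, `parse_model_name("v1_7b_fp16_tp1").checkpoint_dir` is `"baichuan-7b"`, and a `tensor_parallel` value above 1 corresponds to the case where the script prefixes the command with `mpirun`.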


@@ -0,0 +1,392 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import copy
import json
import os
import numpy as np
import torch
from datasets import load_dataset, load_metric
from transformers import AutoModelForCausalLM, AutoTokenizer
import xtrt_llm
import xtrt_llm.profiler as profiler
from xtrt_llm.logger import logger
from build import get_engine_name # isort:skip
def TRTBaichuan(args, config):
    dtype = config['builder_config']['precision']
    world_size = config['builder_config']['tensor_parallel']
    assert world_size == xtrt_llm.mpi_world_size(), \
        f'Engine world size ({world_size}) != Runtime world size ({xtrt_llm.mpi_world_size()})'
    num_heads = config['builder_config']['num_heads'] // world_size
    hidden_size = config['builder_config']['hidden_size'] // world_size
    vocab_size = config['builder_config']['vocab_size']
    num_layers = config['builder_config']['num_layers']
    use_gpt_attention_plugin = bool(
        config['plugin_config']['gpt_attention_plugin'])
    remove_input_padding = config['plugin_config']['remove_input_padding']
    paged_kv_cache = config['plugin_config']['paged_kv_cache']
    tokens_per_block = config['plugin_config']['tokens_per_block']
    model_config = xtrt_llm.runtime.ModelConfig(
        vocab_size=vocab_size,
        num_layers=num_layers,
        num_heads=num_heads,
        num_kv_heads=num_heads,
        hidden_size=hidden_size,
        gpt_attention_plugin=use_gpt_attention_plugin,
        tokens_per_block=tokens_per_block,
        remove_input_padding=remove_input_padding,
        paged_kv_cache=paged_kv_cache,
        dtype=dtype)
    runtime_rank = xtrt_llm.mpi_rank()
    runtime_mapping = xtrt_llm.Mapping(world_size,
                                       runtime_rank,
                                       tp_size=world_size)
    if world_size > 1:
        os.environ["XCCL_GROUP_ID"] = str(runtime_rank // world_size)
        os.environ["XCCL_NRANKS"] = str(world_size)
        os.environ["XCCL_CUR_RANK"] = str(runtime_rank % world_size)
        os.environ["XCCL_DEVICE_ID"] = str(runtime_rank)
        os.environ["MP_RUN"] = str(1)
    torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
    engine_name = get_engine_name('baichuan', dtype, world_size, runtime_rank)
    serialize_path = os.path.join(args.engine_dir, engine_name)
    xtrt_llm.logger.set_level(args.log_level)
    profiler.start('load xtrt_llm engine')
    # with open(serialize_path, 'rb') as f:
    #     engine_buffer = f.read()
    decoder = xtrt_llm.runtime.GenerationSession(model_config, serialize_path,
                                                 runtime_mapping)
    profiler.stop('load xtrt_llm engine')
    xtrt_llm.logger.info(
        f'Load engine takes: {profiler.elapsed_time_in_sec("load xtrt_llm engine")} sec'
    )
    return decoder
def main(args):
    runtime_rank = xtrt_llm.mpi_rank()
    logger.set_level(args.log_level)
    test_hf = args.test_hf and runtime_rank == 0  # only run hf on rank 0
    test_trt_llm = args.test_trt_llm
    hf_model_location = args.hf_model_location
    profiler.start('load tokenizer')
    tokenizer = AutoTokenizer.from_pretrained(hf_model_location,
                                              use_fast=False,
                                              trust_remote_code=True)
    profiler.stop('load tokenizer')
    xtrt_llm.logger.info(
        f'Load tokenizer takes: {profiler.elapsed_time_in_sec("load tokenizer")} sec'
    )
    tokenizer.pad_token = tokenizer.eos_token
    dataset_cnn = load_dataset("ccdv/cnn_dailymail",
                               '3.0.0',
                               cache_dir=args.dataset_path)
    max_batch_size = args.batch_size
    # runtime parameters
    # repetition_penalty = 1
    top_k = args.top_k
    output_len = 100
    test_token_num = 923
    # top_p = 0.0
    # random_seed = 5
    temperature = 1
    num_beams = args.num_beams
    pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
    end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
    if test_trt_llm:
        config_path = os.path.join(args.engine_dir, 'config.json')
        with open(config_path, 'r') as f:
            config = json.load(f)
        xtrt_llm_baichuan = TRTBaichuan(args, config)
    if test_hf:
        profiler.start('load HF model')
        model = AutoModelForCausalLM.from_pretrained(hf_model_location,
                                                     trust_remote_code=True)
        profiler.stop('load HF model')
        xtrt_llm.logger.info(
            f'Load HF model takes: {profiler.elapsed_time_in_sec("load HF model")} sec'
        )
        if args.data_type == 'fp16':
            model.half()
        model.cuda()
    def summarize_xtrt_llm(datapoint):
        batch_size = len(datapoint['article'])
        line = copy.copy(datapoint['article'])
        line_encoded = []
        input_lengths = []
        for i in range(batch_size):
            line[i] = line[i] + ' TL;DR: '
            line[i] = line[i].strip()
            line[i] = line[i].replace(" n't", "n't")
            input_id = tokenizer.encode(line[i],
                                        return_tensors='pt').type(torch.int32)
            input_id = input_id[:, -test_token_num:]
            line_encoded.append(input_id)
            input_lengths.append(input_id.shape[-1])
        # do padding, should move outside the profiling to prevent the overhead
        max_length = max(input_lengths)
        if xtrt_llm_baichuan.remove_input_padding:
            line_encoded = [
                torch.tensor(t, dtype=torch.int32).cuda() for t in line_encoded
            ]
        else:
            # do padding, should move outside the profiling to prevent the overhead
            for i in range(batch_size):
                pad_size = max_length - input_lengths[i]
                pad = torch.ones([1, pad_size]).type(torch.int32) * pad_id
                line_encoded[i] = torch.cat(
                    [torch.tensor(line_encoded[i], dtype=torch.int32), pad],
                    axis=-1)
            line_encoded = torch.cat(line_encoded, axis=0).cuda()
            input_lengths = torch.tensor(input_lengths,
                                         dtype=torch.int32).cuda()
        sampling_config = xtrt_llm.runtime.SamplingConfig(end_id=end_id,
                                                          pad_id=pad_id,
                                                          top_k=top_k,
                                                          num_beams=num_beams)
        with torch.no_grad():
            xtrt_llm_baichuan.setup(batch_size,
                                    max_context_length=max_length,
                                    max_new_tokens=output_len,
                                    beam_width=num_beams)
            if xtrt_llm_baichuan.remove_input_padding:
                output_ids = xtrt_llm_baichuan.decode_batch(
                    line_encoded, sampling_config)
            else:
                output_ids = xtrt_llm_baichuan.decode(
                    line_encoded,
                    input_lengths,
                    sampling_config,
                )
            torch.cuda.synchronize()
        # Extract a list of tensors of shape beam_width x output_ids.
        if xtrt_llm_baichuan.mapping.is_first_pp_rank():
            output_beams_list = [
                tokenizer.batch_decode(output_ids[batch_idx, :,
                                                  input_lengths[batch_idx]:],
                                       skip_special_tokens=True)
                for batch_idx in range(batch_size)
            ]
            return output_beams_list, output_ids[:, :, max_length:].tolist()
        return [], []
    def summarize_hf(datapoint):
        batch_size = len(datapoint['article'])
        if batch_size > 1:
            logger.warning(
                f"HF does not support batch_size > 1 to verify correctness due to padding. Current batch size is {batch_size}"
            )
        line = copy.copy(datapoint['article'])
        for i in range(batch_size):
            line[i] = line[i] + ' TL;DR: '
            line[i] = line[i].strip()
            line[i] = line[i].replace(" n't", "n't")
        line_encoded = tokenizer(line,
                                 return_tensors='pt',
                                 padding=True,
                                 truncation=True)["input_ids"].type(torch.int64)
        line_encoded = line_encoded[:, -test_token_num:]
        line_encoded = line_encoded.cuda()
        with torch.no_grad():
            output = model.generate(line_encoded,
                                    max_new_tokens=output_len,
                                    top_k=top_k,
                                    temperature=temperature,
                                    eos_token_id=tokenizer.eos_token_id,
                                    pad_token_id=tokenizer.pad_token_id,
                                    num_beams=num_beams,
                                    num_return_sequences=num_beams,
                                    early_stopping=True)
        tokens_list = output[:, len(line_encoded[0]):].tolist()
        output = output.reshape([batch_size, num_beams, -1])
        output_lines_list = [
            tokenizer.batch_decode(output[:, i, len(line_encoded[0]):],
                                   skip_special_tokens=True)
            for i in range(num_beams)
        ]
        return output_lines_list, tokens_list
    if test_trt_llm:
        datapoint = dataset_cnn['test'][0:1]
        summary, _ = summarize_xtrt_llm(datapoint)
        if runtime_rank == 0:
            logger.info(
                "---------------------------------------------------------")
            logger.info("XTRT-LLM Generated : ")
            logger.info(f" Article : {datapoint['article']}")
            logger.info(f"\n Highlights : {datapoint['highlights']}")
            logger.info(f"\n Summary : {summary}")
            logger.info(
                "---------------------------------------------------------")
    if test_hf:
        datapoint = dataset_cnn['test'][0:1]
        summary, _ = summarize_hf(datapoint)
        logger.info("---------------------------------------------------------")
        logger.info("HF Generated : ")
        logger.info(f" Article : {datapoint['article']}")
        logger.info(f"\n Highlights : {datapoint['highlights']}")
        logger.info(f"\n Summary : {summary}")
        logger.info("---------------------------------------------------------")
    metric_xtrt_llm = [load_metric("rouge") for _ in range(num_beams)]
    metric_hf = [load_metric("rouge") for _ in range(num_beams)]
    for i in range(num_beams):
        metric_xtrt_llm[i].seed = 0
        metric_hf[i].seed = 0
    ite_count = 0
    data_point_idx = 0
    while (data_point_idx < len(dataset_cnn['test'])) and (ite_count <
                                                           args.max_ite):
        if runtime_rank == 0:
            logger.debug(
                f"run data_point {data_point_idx} ~ {data_point_idx + max_batch_size}"
            )
        datapoint = dataset_cnn['test'][data_point_idx:(data_point_idx +
                                                        max_batch_size)]
        if test_trt_llm:
            profiler.start('xtrt_llm')
            summary_xtrt_llm, tokens_xtrt_llm = summarize_xtrt_llm(datapoint)
            profiler.stop('xtrt_llm')
        if test_hf:
            profiler.start('hf')
            summary_hf, tokens_hf = summarize_hf(datapoint)
            profiler.stop('hf')
        if runtime_rank == 0:
            if test_trt_llm:
                for batch_idx in range(len(summary_xtrt_llm)):
                    for beam_idx in range(num_beams):
                        metric_xtrt_llm[beam_idx].add_batch(
                            predictions=[summary_xtrt_llm[batch_idx][beam_idx]],
                            references=[datapoint['highlights'][batch_idx]])
            if test_hf:
                for beam_idx in range(num_beams):
                    for batch_idx in range(len(summary_hf[beam_idx])):
                        metric_hf[beam_idx].add_batch(
                            predictions=[summary_hf[beam_idx][batch_idx]],
                            references=[datapoint['highlights'][batch_idx]])
            logger.debug('-' * 100)
            logger.debug(f"Article : {datapoint['article']}")
            if test_trt_llm:
                logger.debug(f'XTRT-LLM Summary: {summary_xtrt_llm}')
            if test_hf:
                logger.debug(f'HF Summary: {summary_hf}')
            logger.debug(f"highlights : {datapoint['highlights']}")
        data_point_idx += max_batch_size
        ite_count += 1
    if runtime_rank == 0:
        if test_trt_llm:
            np.random.seed(0)  # rouge score use sampling to compute the score
            logger.info(
                f'XTRT-LLM (total latency: {profiler.elapsed_time_in_sec("xtrt_llm")} sec)'
            )
            for beam_idx in range(num_beams):
                logger.info(f"XTRT-LLM beam {beam_idx} result")
                computed_metrics_xtrt_llm = metric_xtrt_llm[beam_idx].compute()
                for key in computed_metrics_xtrt_llm.keys():
                    logger.info(
                        f' {key} : {computed_metrics_xtrt_llm[key].mid[2]*100}'
                    )
                if args.check_accuracy and beam_idx == 0:
                    assert computed_metrics_xtrt_llm['rouge1'].mid[
                        2] * 100 > args.xtrt_llm_rouge1_threshold
        if test_hf:
            np.random.seed(0)  # rouge score use sampling to compute the score
            logger.info(
                f'Hugging Face (total latency: {profiler.elapsed_time_in_sec("hf")} sec)'
            )
            for beam_idx in range(num_beams):
                logger.info(f"HF beam {beam_idx} result")
                computed_metrics_hf = metric_hf[beam_idx].compute()
                for key in computed_metrics_hf.keys():
                    logger.info(
                        f' {key} : {computed_metrics_hf[key].mid[2]*100}')
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_version',
                        type=str,
                        default='v1_13b',
                        choices=['v1_7b', 'v1_13b', 'v2_7b', 'v2_13b'])
    parser.add_argument('--hf_model_location',
                        type=str,
                        default='baichuan-inc/Baichuan-13B-Chat')
    parser.add_argument('--test_hf', action='store_true')
    parser.add_argument('--test_trt_llm', action='store_true')
    parser.add_argument('--data_type',
                        type=str,
                        choices=['fp32', 'fp16'],
                        default='fp16')
    parser.add_argument('--dataset_path', type=str, default='')
    parser.add_argument('--log_level', type=str, default='info')
    parser.add_argument('--engine_dir', type=str, default='baichuan_outputs')
    parser.add_argument('--batch_size', type=int, default=1)
    parser.add_argument('--max_ite', type=int, default=20)
    parser.add_argument('--check_accuracy', action='store_true')
    parser.add_argument('--xtrt_llm_rouge1_threshold', type=float, default=15.0)
    parser.add_argument('--num_beams', type=int, default=1)
    parser.add_argument('--top_k', type=int, default=1)
    args = parser.parse_args()
    main(args)
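
In the padded path of `summarize_xtrt_llm`, each prompt keeps only its last `test_token_num` tokens and is then right-padded with `pad_id` to the longest prompt in the batch. The sketch below shows that step on plain Python lists rather than tensors; `truncate_and_pad` is a hypothetical helper name, and the argument checks are additions for this example.

```python
from typing import List, Tuple


def truncate_and_pad(batch: List[List[int]], max_tokens: int,
                     pad_id: int) -> Tuple[List[List[int]], List[int]]:
    """Keep the last `max_tokens` ids of each prompt, then right-pad.

    Returns the padded batch and the unpadded length of each prompt.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not batch:
        raise ValueError("batch must not be empty")
    # Left-truncate: the end of the prompt (with the 'TL;DR:' cue) is kept.
    truncated = [ids[-max_tokens:] for ids in batch]
    lengths = [len(ids) for ids in truncated]
    max_len = max(lengths)
    # Right-pad every prompt to the longest one in the batch.
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in truncated]
    return padded, lengths
```

The returned lengths play the role of `input_lengths` in the script: they tell the decoder where each real prompt ends inside the padded batch.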

173
examples/baichuan/weight.py Normal file

@@ -0,0 +1,173 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
import numpy as np
import torch
import xtrt_llm
from xtrt_llm._utils import str_dtype_to_torch, torch_to_numpy
from xtrt_llm.quantization import QuantMode
def extract_layer_idx(name):
    ss = name.split('.')
    for s in ss:
        if s.isdigit():
            return s
    return None


def split(v, tp_size, idx, dim=0):
    if tp_size == 1:
        return v
    if len(v.shape) == 1:
        return np.ascontiguousarray(np.split(v, tp_size)[idx])
    else:
        return np.ascontiguousarray(np.split(v, tp_size, axis=dim)[idx])
def load_from_hf_baichuan(xtrt_llm_baichuan,
                          hf_baichuan,
                          model_version,
                          rank=0,
                          tensor_parallel=1,
                          dtype="float32"):
    assert model_version is not None
    xtrt_llm.logger.info(f'Loading weights from HF Baichuan {model_version}...')
    tik = time.time()
    quant_mode = getattr(xtrt_llm_baichuan, 'quant_mode', QuantMode(0))
    if quant_mode.is_int8_weight_only():
        plugin_weight_only_quant_type = torch.int8
    elif quant_mode.is_int4_weight_only():
        plugin_weight_only_quant_type = torch.quint4x2
    use_weight_only = quant_mode.is_weight_only()
    model_params = dict(hf_baichuan.named_parameters())
    for k, v in model_params.items():
        torch_dtype = str_dtype_to_torch(dtype)
        v = torch_to_numpy(v.to(torch_dtype).detach().cpu())
        if 'model.embed_tokens.weight' in k:
            xtrt_llm_baichuan.vocab_embedding.weight.value = v
        elif 'model.norm.weight' in k:
            xtrt_llm_baichuan.ln_f.weight.value = v
        elif 'lm_head.weight' in k:
            if model_version.startswith('v2'):
                # baichuan v2 models use NormHead
                xtrt_llm.logger.info(
                    f'Normalizing lm_head.weight for {model_version}')
                original_v = model_params[k]
                v = torch_to_numpy(
                    torch.nn.functional.normalize(original_v).to(
                        torch_dtype).detach().cpu())
            xtrt_llm_baichuan.lm_head.weight.value = np.ascontiguousarray(
                split(v, tensor_parallel, rank))
        else:
            layer_idx = extract_layer_idx(k)
            if layer_idx is None:
                continue
            idx = int(layer_idx)
            if idx >= xtrt_llm_baichuan._num_layers:
                continue
            if 'input_layernorm.weight' in k:
                xtrt_llm_baichuan.layers[idx].input_layernorm.weight.value = v
            elif 'post_attention_layernorm.weight' in k:
                dst = xtrt_llm_baichuan.layers[idx].post_layernorm.weight
                dst.value = v
            elif 'self_attn.W_pack.weight' in k:
                dst = xtrt_llm_baichuan.layers[idx].attention.qkv.weight
                q_emb = v.shape[0] // 3
                model_emb = v.shape[1]
                v = v.reshape(3, q_emb, model_emb)
                split_v = split(v, tensor_parallel, rank, dim=1)
                split_v = split_v.reshape(3 * (q_emb // tensor_parallel),
                                          model_emb)
                if use_weight_only:
                    v = np.ascontiguousarray(split_v.transpose())
                    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
                        torch.tensor(v), plugin_weight_only_quant_type)
                    # workaround for trt not supporting int8 inputs in plugins currently
                    dst.value = processed_torch_weights.view(
                        dtype=torch.float32).numpy()
                    scales = xtrt_llm_baichuan.layers[
                        idx].attention.qkv.per_channel_scale
                    scales.value = torch_weight_scales.numpy()
                else:
                    dst.value = np.ascontiguousarray(split_v)
            elif 'self_attn.o_proj.weight' in k:
                dst = xtrt_llm_baichuan.layers[idx].attention.dense.weight
                split_v = split(v, tensor_parallel, rank, dim=1)
                if use_weight_only:
                    v = np.ascontiguousarray(split_v.transpose())
                    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
                        torch.tensor(v), plugin_weight_only_quant_type)
                    # workaround for trt not supporting int8 inputs in plugins currently
                    dst.value = processed_torch_weights.view(
                        dtype=torch.float32).numpy()
                    scales = xtrt_llm_baichuan.layers[
                        idx].attention.dense.per_channel_scale
                    scales.value = torch_weight_scales.numpy()
                else:
                    dst.value = np.ascontiguousarray(split_v)
            elif 'mlp.up_proj.weight' in k:
                dst = xtrt_llm_baichuan.layers[idx].mlp.gate.weight
                split_v = split(v, tensor_parallel, rank, dim=0)
                if use_weight_only:
                    v = np.ascontiguousarray(split_v.transpose())
                    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
                        torch.tensor(v), plugin_weight_only_quant_type)
                    # workaround for trt not supporting int8 inputs in plugins currently
                    dst.value = processed_torch_weights.view(
                        dtype=torch.float32).numpy()
                    scales = xtrt_llm_baichuan.layers[
                        idx].mlp.gate.per_channel_scale
                    scales.value = torch_weight_scales.numpy()
                else:
                    dst.value = np.ascontiguousarray(split_v)
            elif 'mlp.down_proj.weight' in k:
                dst = xtrt_llm_baichuan.layers[idx].mlp.proj.weight
                split_v = split(v, tensor_parallel, rank, dim=1)
                if use_weight_only:
                    v = np.ascontiguousarray(split_v.transpose())
                    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
                        torch.tensor(v), plugin_weight_only_quant_type)
                    # workaround for trt not supporting int8 inputs in plugins currently
                    dst.value = processed_torch_weights.view(
                        dtype=torch.float32).numpy()
                    scales = xtrt_llm_baichuan.layers[
                        idx].mlp.proj.per_channel_scale
                    scales.value = torch_weight_scales.numpy()
                else:
                    dst.value = np.ascontiguousarray(split_v)
            elif 'mlp.gate_proj.weight' in k:
                dst = xtrt_llm_baichuan.layers[idx].mlp.fc.weight
                split_v = split(v, tensor_parallel, rank, dim=0)
                if use_weight_only:
                    v = np.ascontiguousarray(split_v.transpose())
                    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
                        torch.tensor(v), plugin_weight_only_quant_type)
                    # workaround for trt not supporting int8 inputs in plugins currently
                    dst.value = processed_torch_weights.view(
                        dtype=torch.float32).numpy()
                    scales = xtrt_llm_baichuan.layers[
                        idx].mlp.fc.per_channel_scale
                    scales.value = torch_weight_scales.numpy()
                else:
                    dst.value = np.ascontiguousarray(split_v)
    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    xtrt_llm.logger.info(f'Weights loaded. Total time: {t}')
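
The `W_pack` branch above reshapes the fused weight to `(3, q_emb, model_emb)`, splits along the middle axis, and flattens back, so each rank keeps one slice from each of the three stacked blocks instead of one contiguous run of rows. The following sketch computes which rows of the original packed matrix a rank ends up with; `qkv_rows_for_rank` is a hypothetical helper written for illustration, and reading the three blocks as Q, K and V in that order is an assumption based on the reshape.

```python
from typing import List


def qkv_rows_for_rank(q_emb: int, tp_size: int, rank: int) -> List[int]:
    """Row indices of a packed (3 * q_emb, model_emb) matrix kept by `rank`."""
    if q_emb % tp_size != 0:
        raise ValueError("q_emb must be divisible by tp_size")
    if not 0 <= rank < tp_size:
        raise ValueError("rank must be in [0, tp_size)")
    shard = q_emb // tp_size
    rows: List[int] = []
    for block in range(3):  # assumed order: 0 = Q, 1 = K, 2 = V
        start = block * q_emb + rank * shard
        rows.extend(range(start, start + shard))
    return rows
```

For `q_emb=4` and `tp_size=2`, rank 0 keeps rows 0, 1, 4, 5, 8, 9 and rank 1 keeps the rest, which is why a plain `np.split` on the unreshaped matrix would not give the same result.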