2025-08-06 15:49:14 +08:00
parent e80b916c52
commit bf00e72fb2
111 changed files with 21880 additions and 1 deletions

3
examples/gptneox/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
__pycache__/
gptneox_model/
*.log


@@ -0,0 +1,93 @@
# GPT-NeoX
This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using XTRT-LLM and run it on a single node with multiple XPUs.
## Overview
The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). The main files in that folder are:
* [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-NeoX model,
* [`run.py`](./run.py) to run inference on an input text.
## Support Matrix
* FP16
* INT8 Weight-Only
* Tensor Parallel
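
Tensor parallelism splits the attention heads and the hidden dimension evenly across ranks. `build.py` asserts that both are divisible by `--world_size`, and `run.py` divides them by the world size to get the per-rank values. The sketch below reproduces that arithmetic with the default sizes from `build.py` (hidden size 6144, 64 heads). `per_rank_shapes` is a hypothetical helper name used only for illustration, not part of XTRT-LLM.

```python
from typing import Tuple


def per_rank_shapes(hidden_size: int, num_heads: int,
                    world_size: int) -> Tuple[int, int]:
    """Return (heads_per_rank, hidden_per_rank) for tensor parallelism."""
    if hidden_size % world_size != 0:
        raise ValueError("hidden size must be divisible by world size")
    if num_heads % world_size != 0:
        raise ValueError("number of heads must be divisible by world size")
    return num_heads // world_size, hidden_size // world_size


heads, hidden = per_rank_shapes(hidden_size=6144, num_heads=64, world_size=2)
print(heads, hidden)  # 32 3072
```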
## Usage
### 1. Download weights from HuggingFace (HF) Transformers
```bash
# Weights & config
bash get_weights.sh  # the script uses pushd/popd, which are bash builtins
```
### 2. Build XTRT engine(s)
XTRT-LLM builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) using dummy weights.
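
When `--model_dir` is omitted, the model shape comes from the argument defaults in `build.py` rather than from the checkpoint's config. As a rough illustration, the sketch below lists those defaults and derives the rotary embedding dimension the same way `build_rank_engine` does. The `DEFAULTS` dictionary and the `rotary_dim` helper are names introduced here for illustration only.

```python
# Default hyperparameters copied from build.py's argument parser.
DEFAULTS = {
    "vocab_size": 50432,
    "n_layer": 44,
    "n_positions": 2048,
    "n_embd": 6144,
    "n_head": 64,
    "rotary_pct": 0.25,
}


def rotary_dim(n_embd: int, n_head: int, rotary_pct: float) -> int:
    """Number of per-head dimensions that receive rotary embeddings."""
    head_size = n_embd // n_head
    return int(head_size * rotary_pct)


print(rotary_dim(DEFAULTS["n_embd"], DEFAULTS["n_head"],
                 DEFAULTS["rotary_pct"]))  # 24
```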
Examples of build invocations:
```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
--log_level=verbose \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_layernorm_plugin float16 \
--max_batch_size=16 \
--max_input_len=1024 \
--max_output_len=1024 \
--world_size=2 \
--output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
--model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
--log_level=verbose \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_layernorm_plugin float16 \
--max_batch_size=16 \
--max_input_len=1024 \
--max_output_len=1024 \
--world_size=2 \
--use_weight_only \
--output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
--model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```
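
Each build writes one engine file per rank into `--output_dir`, and rank 0 also saves a `config.json` there. The file names follow `get_engine_name` in `build.py`. The sketch below copies that function to show which engine files a 2-way `float16` build should produce.

```python
def get_engine_name(model, dtype, tp_size, rank):
    # Same format string as get_engine_name in build.py.
    return "{}_{}_tp{}_rank{}.engine".format(model, dtype, tp_size, rank)


world_size = 2
names = [
    get_engine_name("gptneox", "float16", world_size, rank)
    for rank in range(world_size)
]
print(names)
# ['gptneox_float16_tp2_rank0.engine', 'gptneox_float16_tp2_rank1.engine']
```

`run.py` rebuilds the same name from the saved `config.json` and its MPI rank, so the engine directory should not be renamed file by file.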
### 3. Run
Before running the examples, make sure to set the following environment variables:
```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # Disable XPytorch use of L3.
```
If you are not using an R480-X8, make sure to also set this environment variable:
```bash
export BKCL_PCIE_RING=1
```
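
`run.py` additionally sets several communication variables itself for each MPI rank when the engine was built with more than one rank, so they do not need to be exported by hand. The sketch below mirrors those assignments. `xccl_env` is a hypothetical helper used only for illustration, and the comments describe what the expressions compute, not documented XCCL semantics.

```python
def xccl_env(runtime_rank: int, world_size: int) -> dict:
    """Per-rank variables that run.py writes into os.environ."""
    return {
        "XCCL_GROUP_ID": str(runtime_rank // world_size),  # 0 on a single node
        "XCCL_NRANKS": str(world_size),
        "XCCL_CUR_RANK": str(runtime_rank % world_size),
        "XCCL_DEVICE_ID": str(runtime_rank),
        "MP_RUN": "1",
    }


for rank in range(2):
    print(rank, xccl_env(rank, world_size=2))
```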
To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`:
```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
python3 run.py \
--max_output_len=50 \
--engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
--tokenizer_dir=./downloads/gptneox_model
# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
python3 run.py \
--max_output_len=50 \
--engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
--tokenizer_dir=./downloads/gptneox_model
```
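
`run.py` also accepts an optional `--performance_test_scale` argument, which `run.sh` uses to time several batch sizes in one launch. Each scale is written as `batch_size x input_length x max_output_length`, and multiple scales are joined with `E`. The sketch below reproduces the parsing logic from `run.py`; `parse_scales` is a name introduced here for illustration.

```python
def parse_scales(spec: str):
    """Split an 'AxBxC' list joined by 'E' into integer triples."""
    scales = []
    for scale in spec.split("E"):
        bs, seqlen, max_output_len = [int(x) for x in scale.split("x")]
        scales.append((bs, seqlen, max_output_len))
    return scales


print(parse_scales("1x512x256E2x512x256"))
# [(1, 512, 256), (2, 512, 256)]
```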


@@ -0,0 +1,95 @@
# GPT-NeoX
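This document describes how to use Kunlunxin XTRT-LLM to build and run the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model on a single node with multiple XPUs.
## Overview
The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). The main files in this folder are:
* [`build.py`](./build.py) builds the XTRT engine(s) needed to run the GPT-NeoX model
* [`run.py`](./run.py) runs inference on an input text
## Support Matrix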
* FP16
* INT8 Weight-Only
* Tensor Parallel
## Usage
### 1. Download weights from HuggingFace (HF) Transformers
```bash
# Weights & config
bash get_weights.sh  # the script uses pushd/popd, which are bash builtins
```
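### 2. Build XTRT engine(s)
XTRT-LLM builds XTRT engine(s) from an HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) using dummy weights.
Examples of build invocations: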
```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
--log_level=verbose \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_layernorm_plugin float16 \
--max_batch_size=16 \
--max_input_len=1024 \
--max_output_len=1024 \
--world_size=2 \
--output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
--model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
--log_level=verbose \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_layernorm_plugin float16 \
--max_batch_size=16 \
--max_input_len=1024 \
--max_output_len=1024 \
--world_size=2 \
--use_weight_only \
--output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
--model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```
### 3. Run
Before running the examples, make sure to set the following environment variables:
```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # Disable XPytorch use of L3.
```
If you are not using the Kunlunxin R480-X8 product, make sure to also set this environment variable:
```bash
export BKCL_PCIE_RING=1
```
To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`, run the following:
```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
python3 run.py \
--max_output_len=50 \
--engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
--tokenizer_dir=./downloads/gptneox_model
# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
python3 run.py \
--max_output_len=50 \
--engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
--tokenizer_dir=./downloads/gptneox_model
```

442
examples/gptneox/build.py Normal file

@@ -0,0 +1,442 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import time
#import tensorrt as trt
import torch
import torch.multiprocessing as mp
from safetensors import safe_open
from transformers import AutoModelForCausalLM, GPTNeoXConfig
from weight import load_from_hf_gpt_neox
import xtrt_llm
from xtrt_llm._utils import str_dtype_to_xtrt
from xtrt_llm.builder import Builder
from xtrt_llm.logger import logger
from xtrt_llm.mapping import Mapping
from xtrt_llm.models import weight_only_groupwise_quantize, weight_only_quantize
from xtrt_llm.network import net_guard
from xtrt_llm.plugin.plugin import ContextFMHAType
from xtrt_llm.quantization import QuantMode
MODEL_NAME = "gptneox"
hf_gpt = None


class StateDict():

    def __init__(self, quant_ckpt_dir):
        self.model_state_dict = safe_open(quant_ckpt_dir,
                                          framework="pt",
                                          device=0)

    def get(self, k):
        return self.model_state_dict.get_tensor(k).cpu()


class GPTQModel():

    def __init__(self, model_dir, quant_ckpt_dir):
        with open(model_dir + '/config.json', 'r') as f:
            model_config = json.load(f)
        self.config = GPTNeoXConfig()
        self.config.vocab_size = model_config['vocab_size']
        self.config.hidden_size = model_config['hidden_size']
        self.config.num_hidden_layers = model_config['num_hidden_layers']
        self.config.num_attention_heads = model_config[
            'num_attention_heads']
        self.config.intermediate_size = model_config['intermediate_size']
        self.config.hidden_act = model_config['hidden_act']
        self.config.rotary_pct = model_config['rotary_pct']
        self.config.rotary_emb_base = model_config['rotary_emb_base']
        self.config.max_position_embeddings = model_config[
            'max_position_embeddings']
        self.config.initializer_range = model_config['initializer_range']
        self.config.layer_norm_eps = model_config['layer_norm_eps']
        self.config.use_cache = model_config['use_cache']
        self.config.bos_token_id = model_config['bos_token_id']
        self.config.eos_token_id = model_config['eos_token_id']
        self.config.tie_word_embeddings = model_config[
            'tie_word_embeddings']
        self.model_state_dict = StateDict(quant_ckpt_dir)

    def state_dict(self):
        return self.model_state_dict


def get_engine_name(model, dtype, tp_size, rank):
    return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)


def serialize_engine(engine, path):
    logger.info(f'Serializing engine to {path}...')
    tik = time.time()
    engine.serialize(path)
    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    logger.info(f'Engine serialized. Total time: {t}')


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--world_size',
                        type=int,
                        default=1,
                        help='world size, only support tensor parallelism now')
    parser.add_argument(
        '--model_dir',
        type=str,
        default=None,
        help='The path to HF GPT-NeoX model / checkpoints to read weights from')
    parser.add_argument('--dtype',
                        type=str,
                        default='float16',
                        choices=['float16', 'float32'])
    parser.add_argument(
        '--timing_cache',
        type=str,
        default='model.cache',
        help=
        'The path to read the timing cache from; ignored if the file does not exist'
    )
    parser.add_argument('--log_level', type=str, default='info')
    parser.add_argument('--vocab_size', type=int, default=50432)
    parser.add_argument('--n_layer', type=int, default=44)
    parser.add_argument('--n_positions', type=int, default=2048)
    parser.add_argument('--n_embd', type=int, default=6144)
    parser.add_argument('--n_head', type=int, default=64)
    parser.add_argument('--hidden_act', type=str, default='gelu')
    parser.add_argument(
        '--rotary_pct',
        type=float,
        default=0.25,
        help="Percentage of hidden dimensions to allocate to rotary embeddings."
    )
    parser.add_argument('--max_batch_size', type=int, default=64)
    parser.add_argument('--max_input_len', type=int, default=1024)
    parser.add_argument('--max_output_len', type=int, default=1024)
    parser.add_argument('--max_beam_width', type=int, default=1)
    parser.add_argument('--use_gpt_attention_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16', 'float32'])
    parser.add_argument('--use_gemm_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16', 'float32'])
    parser.add_argument('--use_weight_only_quant_matmul_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16'])
    parser.add_argument('--use_weight_only_groupwise_quant_matmul_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16'])
    parser.add_argument(
        '--groupwise_quant_safetensors_path',
        type=str,
        default=None,
        help=
        "The path to groupwise quantized GPT-NeoX model / checkpoints to read weights from."
    )
    parser.add_argument('--use_layernorm_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16', 'float32'])
    parser.add_argument('--parallel_build', default=False, action='store_true')
    parser.add_argument('--enable_context_fmha',
                        default=False,
                        action='store_true')
    parser.add_argument('--enable_context_fmha_fp32_acc',
                        default=False,
                        action='store_true')
    parser.add_argument('--gpus_per_node', type=int, default=8)
    parser.add_argument(
        '--output_dir',
        type=str,
        default='gpt_outputs',
        help=
        'The path to save the serialized engine files, timing cache file and model configs'
    )
    parser.add_argument('--remove_input_padding',
                        default=False,
                        action='store_true')
    parser.add_argument(
        '--use_parallel_embedding',
        action="store_true",
        default=False,
        help=
        'By default embedding parallelism is disabled. By setting this flag, embedding parallelism is enabled'
    )
    parser.add_argument(
        '--embedding_sharding_dim',
        type=int,
        default=1,  # Meta does TP on hidden dim
        choices=[0, 1],
        help=
        'By default the embedding lookup table is sharded along the hidden dimension (--embedding_sharding_dim=1). '
        'To shard it along the vocab dimension, set --embedding_sharding_dim=0. '
        'Note: embedding sharing is only enabled when --embedding_sharding_dim=0'
    )
    parser.add_argument(
        '--use_weight_only',
        default=False,
        action="store_true",
        help='Quantize weights for the various GEMMs to INT4/INT8. '
        'See --weight_only_precision to set the precision')
    parser.add_argument(
        '--weight_only_precision',
        const='int8',
        type=str,
        nargs='?',
        default='int8',
        choices=['int8', 'int4'],
        help=
        'Define the precision for the weights when using weight-only quantization. '
        'You must also use --use_weight_only for that argument to have an impact.'
    )
    parser.add_argument('--inter_size', type=int, default=None)
    args = parser.parse_args()
    xtrt_llm.logger.set_level(args.log_level)

    if args.model_dir is not None:
        global hf_gpt
        if not args.use_weight_only_groupwise_quant_matmul_plugin:
            logger.info(f'Loading HF GPT-NeoX model from {args.model_dir}...')
            hf_gpt = AutoModelForCausalLM.from_pretrained(args.model_dir)
            args.n_embd = hf_gpt.config.hidden_size
            args.n_head = hf_gpt.config.num_attention_heads
            args.n_layer = hf_gpt.config.num_hidden_layers
            args.n_positions = hf_gpt.config.max_position_embeddings
            args.vocab_size = hf_gpt.config.vocab_size
            args.rotary_pct = hf_gpt.config.rotary_pct
        else:
            assert (
                args.groupwise_quant_safetensors_path is not None
            ), 'Please set the path to the groupwise quantized GPT-NeoX checkpoints with --groupwise_quant_safetensors_path'
            logger.info(
                f'Loading GPTQ quantized HF GPT-NeoX model from {args.groupwise_quant_safetensors_path}...'
            )
            hf_gpt = GPTQModel(args.model_dir,
                               args.groupwise_quant_safetensors_path)
            args.n_embd = hf_gpt.config.hidden_size
            args.n_head = hf_gpt.config.num_attention_heads
            args.n_layer = hf_gpt.config.num_hidden_layers
            args.n_positions = hf_gpt.config.max_position_embeddings
            args.vocab_size = hf_gpt.config.vocab_size
            args.rotary_pct = hf_gpt.config.rotary_pct
        # Both config objects expose intermediate_size.
        args.inter_size = hf_gpt.config.intermediate_size

    if args.use_weight_only:
        args.quant_mode = QuantMode.use_weight_only(
            args.weight_only_precision == 'int4')
    else:
        args.quant_mode = QuantMode(0)
    return args


def build_rank_engine(builder: Builder,
                      builder_config: xtrt_llm.builder.BuilderConfig,
                      engine_name, rank, args):
    '''
       @brief: Build the engine on the given rank.
       @param rank: The rank to build the engine.
       @param args: The cmd line arguments.
       @return: The built engine.
    '''
    kv_dtype = str_dtype_to_xtrt(args.dtype)
    rotary_dim = int((args.n_embd // args.n_head) * args.rotary_pct)

    # Initialize Module
    xtrt_llm_gpt = xtrt_llm.models.GPTNeoXForCausalLM(
        num_layers=args.n_layer,
        num_heads=args.n_head,
        hidden_size=args.n_embd,
        vocab_size=args.vocab_size,
        hidden_act=args.hidden_act,
        max_position_embeddings=args.n_positions,
        rotary_dim=rotary_dim,
        dtype=kv_dtype,
        mapping=Mapping(world_size=args.world_size,
                        rank=rank,
                        tp_size=args.world_size),  # TP only
        apply_query_key_layer_scaling=builder_config.
        apply_query_key_layer_scaling,
        use_parallel_embedding=args.use_parallel_embedding,
        embedding_sharding_dim=args.embedding_sharding_dim)
    if args.use_weight_only_quant_matmul_plugin:
        xtrt_llm_gpt = weight_only_quantize(xtrt_llm_gpt)
    if args.use_weight_only_groupwise_quant_matmul_plugin:
        xtrt_llm_gpt = weight_only_groupwise_quantize(model=xtrt_llm_gpt,
                                                      quant_mode=QuantMode(0),
                                                      group_size=128,
                                                      zero=True)
    if args.model_dir is not None:
        assert hf_gpt is not None, f'Could not load weights from hf_gpt model as it is not loaded yet.'
        if args.world_size > 1:
            assert (
                args.n_embd % args.world_size == 0
            ), f'Embedding size/hidden size must be divisible by world size.'
            assert (
                args.n_head % args.world_size == 0
            ), f'Number of attention heads must be divisible by world size.'
        load_from_hf_gpt_neox(
            xtrt_llm_gpt, hf_gpt, args.dtype, rank, args.world_size,
            args.use_weight_only_groupwise_quant_matmul_plugin)

    # Module -> Network
    network = builder.create_network()
    network.trt_network.name = engine_name
    if args.use_gpt_attention_plugin:
        network.plugin_config.set_gpt_attention_plugin(
            dtype=args.use_gpt_attention_plugin)
    if args.use_gemm_plugin:
        network.plugin_config.set_gemm_plugin(dtype=args.use_gemm_plugin)
    if args.use_layernorm_plugin:
        network.plugin_config.set_layernorm_plugin(
            dtype=args.use_layernorm_plugin)
    assert not (args.enable_context_fmha and args.enable_context_fmha_fp32_acc)
    if args.enable_context_fmha:
        network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
    if args.enable_context_fmha_fp32_acc:
        network.plugin_config.set_context_fmha(
            ContextFMHAType.enabled_with_fp32_acc)
    if args.use_weight_only_quant_matmul_plugin:
        network.plugin_config.set_weight_only_quant_matmul_plugin(
            dtype=args.use_weight_only_quant_matmul_plugin)
    if args.use_weight_only_groupwise_quant_matmul_plugin:
        network.plugin_config.set_weight_only_groupwise_quant_matmul_plugin(
            dtype=args.use_weight_only_groupwise_quant_matmul_plugin)
    if args.quant_mode.is_weight_only():
        builder_config.trt_builder_config.use_weight_only = args.weight_only_precision
    if args.world_size > 1:
        network.plugin_config.set_nccl_plugin(args.dtype)
    if args.remove_input_padding:
        network.plugin_config.enable_remove_input_padding()

    with net_guard(network):
        # Prepare
        network.set_named_parameters(xtrt_llm_gpt.named_parameters())

        # Forward
        inputs = xtrt_llm_gpt.prepare_inputs(args.max_batch_size,
                                             args.max_input_len,
                                             args.max_output_len, True,
                                             args.max_beam_width)
        xtrt_llm_gpt(*inputs)

    #xtrt_llm.graph_rewriting.optimize(network)
    engine = None

    # Network -> Engine
    engine = builder.build_engine(network, builder_config, compiler="gr")
    if rank == 0:
        config_path = os.path.join(args.output_dir, 'config.json')
        builder.save_config(builder_config, config_path)
    return engine


def build(rank, args):
    #torch.cuda.set_device(rank % args.gpus_per_node)
    xtrt_llm.logger.set_level(args.log_level)
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    # when doing serializing build, all ranks share one engine
    apply_query_key_layer_scaling = False
    builder = Builder()

    cache = None
    for cur_rank in range(args.world_size):
        # skip other ranks if parallel_build is enabled
        if args.parallel_build and cur_rank != rank:
            continue
        builder_config = builder.create_builder_config(
            name=MODEL_NAME,
            precision=args.dtype,
            timing_cache=args.timing_cache if cache is None else cache,
            tensor_parallel=args.world_size,  # TP only
            parallel_build=args.parallel_build,
            num_layers=args.n_layer,
            num_heads=args.n_head,
            inter_size=args.inter_size,
            hidden_size=args.n_embd,
            vocab_size=args.vocab_size,
            hidden_act=args.hidden_act,
            max_position_embeddings=args.n_positions,
            apply_query_key_layer_scaling=apply_query_key_layer_scaling,
            max_batch_size=args.max_batch_size,
            max_input_len=args.max_input_len,
            max_output_len=args.max_output_len,
            fusion_pattern_list=["remove_dup_mask"])
        engine_name = get_engine_name(MODEL_NAME, args.dtype, args.world_size,
                                      cur_rank)
        engine = build_rank_engine(builder, builder_config, engine_name,
                                   cur_rank, args)
        assert engine is not None, f'Failed to build engine for rank {cur_rank}'

        # if cur_rank == 0:
        #     # Use in-memory timing cache for multiple builder passes.
        #     if not args.parallel_build:
        #         cache = builder_config.trt_builder_config.get_timing_cache()

        serialize_engine(engine, os.path.join(args.output_dir, engine_name))

    # if rank == 0:
    #     ok = builder.save_timing_cache(
    #         builder_config, os.path.join(args.output_dir, "model.cache"))
    #     assert ok, "Failed to save timing cache."


if __name__ == '__main__':
    args = parse_arguments()
    tik = time.time()
    if args.parallel_build and args.world_size > 1 and \
            torch.cuda.device_count() >= args.world_size:
        logger.warning(
            f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
        )
        mp.spawn(build, nprocs=args.world_size, args=(args, ))
    else:
        args.parallel_build = False
        logger.info('Serially build TensorRT engines.')
        build(0, args)

    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    logger.info(f'Total time of building all {args.world_size} engines: {t}')


@@ -0,0 +1,16 @@
mkdir -p downloads
pushd downloads
rm -rf gptneox_model
git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model
rm -f gptneox_model/model-*.safetensors
rm -f gptneox_model/model.safetensors.index.json
wget -q https://huggingface.co/EleutherAI/gpt-neox-20b/resolve/main/model.safetensors.index.json --directory-prefix gptneox_model
for i in $(seq -f %05g 46)
do
echo -n "Downloading $i of 00046..."
wget -q https://huggingface.co/EleutherAI/gpt-neox-20b/resolve/main/model-$i-of-00046.safetensors --directory-prefix gptneox_model
echo "Done"
done
popd


@@ -0,0 +1,9 @@
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git GPTQ-for-LLaMa
pip install -r ./GPTQ-for-LLaMa/requirements.txt
CUDA_VISIBLE_DEVICES=0 python3 GPTQ-for-LLaMa/neox.py ./gptneox_model \
wikitext2 \
--wbits 4 \
--groupsize 128 \
--save_safetensors ./gptneox_model/gptneox-20b-4bit-gs128.safetensors


@@ -0,0 +1,2 @@
datasets~=2.3.2
rouge_score~=0.1.2

141
examples/gptneox/run.py Normal file

@@ -0,0 +1,141 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import numpy as np
import torch
from transformers import AutoTokenizer
import xtrt_llm
from xtrt_llm.runtime import ModelConfig, SamplingConfig
from build import get_engine_name # isort:skip


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_output_len', type=int, required=True)
    parser.add_argument('--log_level', type=str, default='error')
    parser.add_argument('--engine_dir', type=str, default='gptneox_outputs')
    parser.add_argument('--tokenizer_dir',
                        type=str,
                        default="gptneox_model",
                        help="Directory containing the tokenizer.model.")
    parser.add_argument('--input_text',
                        type=str,
                        default='Born in north-east France, Soyer trained as a')
    parser.add_argument('--performance_test_scale',
                        type=str,
                        help=
                        "Scale(s) for the performance test, e.g., 8x1024x64 (batch_size, input_text_length, max_output_length). "
                        "Join several scales with 'E', e.g., 1x512x256E2x512x256",
                        default="")
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_arguments()
    xtrt_llm.logger.set_level(args.log_level)

    config_path = os.path.join(args.engine_dir, 'config.json')
    with open(config_path, 'r') as f:
        config = json.load(f)
    use_gpt_attention_plugin = config['plugin_config']['gpt_attention_plugin']
    remove_input_padding = config['plugin_config']['remove_input_padding']
    dtype = config['builder_config']['precision']
    world_size = config['builder_config']['tensor_parallel']
    assert world_size == xtrt_llm.mpi_world_size(), \
        f'Engine world size ({world_size}) != Runtime world size ({xtrt_llm.mpi_world_size()})'
    # Per-rank sizes under tensor parallelism.
    num_heads = config['builder_config']['num_heads'] // world_size
    hidden_size = config['builder_config']['hidden_size'] // world_size
    vocab_size = config['builder_config']['vocab_size']
    num_layers = config['builder_config']['num_layers']

    runtime_rank = xtrt_llm.mpi_rank()
    if world_size > 1:
        os.environ["XCCL_GROUP_ID"] = str(runtime_rank // world_size)
        os.environ["XCCL_NRANKS"] = str(world_size)
        os.environ["XCCL_CUR_RANK"] = str(runtime_rank % world_size)
        os.environ["XCCL_DEVICE_ID"] = str(runtime_rank)
        os.environ["MP_RUN"] = str(1)
    runtime_mapping = xtrt_llm.Mapping(world_size,
                                       runtime_rank,
                                       tp_size=world_size)
    torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)

    engine_name = get_engine_name('gptneox', dtype, world_size, runtime_rank)
    #serialize_path = os.path.join(args.engine_dir, engine_name)
    serialize_path = str(args.engine_dir) + "/" + engine_name

    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
    input_ids = torch.tensor(tokenizer.encode(args.input_text),
                             dtype=torch.int32).cuda().unsqueeze(0)

    model_config = ModelConfig(num_heads=num_heads,
                               num_kv_heads=num_heads,
                               hidden_size=hidden_size,
                               vocab_size=vocab_size,
                               num_layers=num_layers,
                               gpt_attention_plugin=use_gpt_attention_plugin,
                               remove_input_padding=remove_input_padding,
                               dtype=dtype)

    tokenizer.pad_token = tokenizer.eos_token
    end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
    pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]

    sampling_config = SamplingConfig(end_id=end_id, pad_id=pad_id)
    input_lengths = torch.tensor(
        [input_ids.size(1) for _ in range(input_ids.size(0))]).int().cuda()

    #with open(serialize_path, 'rb') as f:
    #    engine_buffer = f.read()
    decoder = xtrt_llm.runtime.GenerationSession(model_config,
                                                 serialize_path,
                                                 runtime_mapping,
                                                 debug_mode=False)

    # Optional timing runs on all-zero inputs, one per requested scale.
    if args.performance_test_scale != "":
        performance_test_scale_list = args.performance_test_scale.split("E")
        for scale in performance_test_scale_list:
            xtrt_llm.logger.info(f"Running performance test with scale {scale}")
            bs, seqlen, max_output_len = [int(x) for x in scale.split("x")]
            _input_ids = torch.from_numpy(
                np.zeros((bs, seqlen)).astype("int32")).cuda()
            _input_lengths = torch.from_numpy(
                np.full((bs, ), seqlen).astype("int32")).cuda()
            import time
            _t_begin = time.time()
            decoder.setup(_input_ids.size(0), _input_ids.size(1), max_output_len)
            _output_ids = decoder.decode(_input_ids,
                                         _input_lengths,
                                         sampling_config)
            _t_end = time.time()
            xtrt_llm.logger.info(
                f"Total latency: {(_t_end - _t_begin) * 1000:.3f} ms")

    if remove_input_padding:
        decoder.setup(1, torch.max(input_lengths).item(), args.max_output_len)
    else:
        decoder.setup(input_ids.size(0), input_ids.size(1), args.max_output_len)
    output_ids = decoder.decode(input_ids, input_lengths, sampling_config)
    torch.cuda.synchronize()

    # Keep only the newly generated tokens of the first batch entry and beam.
    output_ids = output_ids.tolist()[0][0][input_ids.size(1):]
    output_text = tokenizer.decode(output_ids)
    print(f'Input: \"{args.input_text}\"')
    print(f'Output: \"{output_text}\"')

8
examples/gptneox/run.sh Executable file

@@ -0,0 +1,8 @@
BKCL_PCIE_RING=1 PYTORCH_NO_XPU_MEMORY_CACHING=1 XMLIR_D_XPU_L3_SIZE=0 \
mpirun -n 2 --allow-run-as-root \
python3 run.py \
--max_output_len=50 \
--engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
--tokenizer_dir=./downloads/gptneox_model \
--performance_test_scale=1x512x256E2x512x256E4x512x256E8x512x256 \
--log_level=info


@@ -0,0 +1,373 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import copy
import json
import os
import numpy as np
import torch
from datasets import load_dataset, load_metric
from transformers import AutoModelForCausalLM, AutoTokenizer
import tensorrt_llm
import tensorrt_llm.profiler as profiler
from tensorrt_llm.logger import logger
from build import get_engine_name # isort:skip


def TRTGPTNeoX(args, config):
    dtype = config['builder_config']['precision']
    world_size = config['builder_config']['tensor_parallel']
    assert world_size == tensorrt_llm.mpi_world_size(), \
        f'Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})'

    world_size = config['builder_config']['tensor_parallel']
    num_heads = config['builder_config']['num_heads'] // world_size
    hidden_size = config['builder_config']['hidden_size'] // world_size
    vocab_size = config['builder_config']['vocab_size']
    num_layers = config['builder_config']['num_layers']
    use_gpt_attention_plugin = bool(
        config['plugin_config']['gpt_attention_plugin'])
    remove_input_padding = config['plugin_config']['remove_input_padding']

    model_config = tensorrt_llm.runtime.ModelConfig(
        vocab_size=vocab_size,
        num_layers=num_layers,
        num_heads=num_heads,
        num_kv_heads=num_heads,
        hidden_size=hidden_size,
        gpt_attention_plugin=use_gpt_attention_plugin,
        remove_input_padding=remove_input_padding,
        dtype=dtype)

    runtime_rank = tensorrt_llm.mpi_rank()
    runtime_mapping = tensorrt_llm.Mapping(world_size,
                                           runtime_rank,
                                           tp_size=world_size)
    torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)

    engine_name = get_engine_name('gptneox', dtype, world_size, runtime_rank)
    serialize_path = os.path.join(args.engine_dir, engine_name)

    tensorrt_llm.logger.set_level(args.log_level)

    with open(serialize_path, 'rb') as f:
        engine_buffer = f.read()
    decoder = tensorrt_llm.runtime.GenerationSession(model_config,
                                                     engine_buffer,
                                                     runtime_mapping)

    return decoder
def main(args):
runtime_rank = tensorrt_llm.mpi_rank()
logger.set_level(args.log_level)
test_hf = args.test_hf and runtime_rank == 0 # only run hf on rank 0
test_trt_llm = args.test_trt_llm
model_dir = args.model_dir
tokenizer = AutoTokenizer.from_pretrained(model_dir,
padding_side='left',
model_max_length=2048,
truncation=True)
tokenizer.pad_token = tokenizer.eos_token
dataset_cnn = load_dataset("ccdv/cnn_dailymail",
'3.0.0',
cache_dir=args.dataset_path)
config_path = os.path.join(args.engine_dir, 'config.json')
with open(config_path, 'r') as f:
config = json.load(f)
max_batch_size = args.batch_size
# runtime parameters
# repetition_penalty = 1
top_k = args.top_k
output_len = args.output_len
test_token_num = 923
# top_p = 0.0
# random_seed = 5
temperature = 1
num_beams = args.num_beams
pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
if test_trt_llm:
tensorrt_llm_gpt = TRTGPTNeoX(args, config)
if test_hf:
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.cuda()
if args.data_type == 'fp16':
model.half()
    def summarize_tensorrt_llm(datapoint):
        batch_size = len(datapoint['article'])

        line = copy.copy(datapoint['article'])
        line_encoded = []
        input_lengths = []
        for i in range(batch_size):
            line[i] = line[i] + ' TL;DR: '

            line[i] = line[i].strip()
            line[i] = line[i].replace(" n't", "n't")

            input_id = tokenizer.encode(line[i],
                                        return_tensors='pt').type(torch.int32)
            # Keep only the last test_token_num tokens of the prompt.
            input_id = input_id[:, -test_token_num:]

            line_encoded.append(input_id)
            input_lengths.append(input_id.shape[-1])

        # Longest prompt in the batch; also passed to setup() as max_context_length.
        max_length = max(input_lengths)
        if tensorrt_llm_gpt.remove_input_padding:
            # The inputs are already tensors, so convert them in place instead
            # of re-wrapping them with torch.tensor().
            line_encoded = [t.to(torch.int32).cuda() for t in line_encoded]
        else:
            # Do the padding here; ideally this moves outside the profiled
            # region to avoid the overhead.
            for i in range(batch_size):
                pad_size = max_length - input_lengths[i]

                pad = torch.ones([1, pad_size]).type(torch.int32) * pad_id
                line_encoded[i] = torch.cat(
                    [line_encoded[i].to(torch.int32), pad], axis=-1)

            line_encoded = torch.cat(line_encoded, axis=0).cuda()
            input_lengths = torch.tensor(input_lengths,
                                         dtype=torch.int32).cuda()

        sampling_config = tensorrt_llm.runtime.SamplingConfig(
            end_id=end_id, pad_id=pad_id, top_k=top_k, num_beams=num_beams)

        with torch.no_grad():
            tensorrt_llm_gpt.setup(batch_size,
                                   max_context_length=max_length,
                                   max_new_tokens=output_len,
                                   beam_width=num_beams)

            if tensorrt_llm_gpt.remove_input_padding:
                output_ids = tensorrt_llm_gpt.decode_batch(
                    line_encoded, sampling_config)
            else:
                output_ids = tensorrt_llm_gpt.decode(
                    line_encoded,
                    input_lengths,
                    sampling_config,
                )

            torch.cuda.synchronize()

        # For each batch entry, decode the generated tokens of every beam
        # (the prompt tokens are skipped).
        if tensorrt_llm_gpt.mapping.is_first_pp_rank():
            output_beams_list = [
                tokenizer.batch_decode(output_ids[batch_idx, :,
                                                  input_lengths[batch_idx]:],
                                       skip_special_tokens=True)
                for batch_idx in range(batch_size)
            ]
            return output_beams_list, output_ids[:, :, max_length:].tolist()
        return [], []

    def summarize_hf(datapoint):
        batch_size = len(datapoint['article'])
        if batch_size > 1:
            logger.warning(
                f"HF does not support batch_size > 1 to verify correctness due to padding. Current batch size is {batch_size}"
            )

        line = copy.copy(datapoint['article'])
        for i in range(batch_size):
            line[i] = line[i] + ' TL;DR: '

            line[i] = line[i].strip()
            line[i] = line[i].replace(" n't", "n't")

        line_encoded = tokenizer(line,
                                 return_tensors='pt',
                                 padding=True,
                                 truncation=True)["input_ids"].type(torch.int64)

        line_encoded = line_encoded[:, -test_token_num:]
        line_encoded = line_encoded.cuda()

        with torch.no_grad():
            output = model.generate(line_encoded,
                                    max_length=len(line_encoded[0]) +
                                    output_len,
                                    top_k=top_k,
                                    temperature=temperature,
                                    eos_token_id=tokenizer.eos_token_id,
                                    pad_token_id=tokenizer.pad_token_id,
                                    num_beams=num_beams,
                                    num_return_sequences=num_beams,
                                    early_stopping=True)

        tokens_list = output[:, len(line_encoded[0]):].tolist()
        output = output.reshape([batch_size, num_beams, -1])
        output_lines_list = [
            tokenizer.batch_decode(output[:, i, len(line_encoded[0]):],
                                   skip_special_tokens=True)
            for i in range(num_beams)
        ]

        return output_lines_list, tokens_list

    if test_trt_llm:
        datapoint = dataset_cnn['test'][0:1]
        summary, _ = summarize_tensorrt_llm(datapoint)
        if runtime_rank == 0:
            logger.info(
                "---------------------------------------------------------")
            logger.info("XTRT-LLM Generated : ")
            logger.info(f" Article : {datapoint['article']}")
            logger.info(f"\n Highlights : {datapoint['highlights']}")
            logger.info(f"\n Summary : {summary}")
            logger.info(
                "---------------------------------------------------------")

    if test_hf:
        datapoint = dataset_cnn['test'][0:1]
        summary, _ = summarize_hf(datapoint)
        logger.info("---------------------------------------------------------")
        logger.info("HF Generated : ")
        logger.info(f" Article : {datapoint['article']}")
        logger.info(f"\n Highlights : {datapoint['highlights']}")
        logger.info(f"\n Summary : {summary}")
        logger.info("---------------------------------------------------------")

    metric_tensorrt_llm = [load_metric("rouge") for _ in range(num_beams)]
    metric_hf = [load_metric("rouge") for _ in range(num_beams)]
    for i in range(num_beams):
        metric_tensorrt_llm[i].seed = 0
        metric_hf[i].seed = 0

    ite_count = 0
    data_point_idx = 0
    while (data_point_idx < len(dataset_cnn['test'])
           and ite_count < args.max_ite):
        if runtime_rank == 0:
            logger.debug(
                f"run data_point {data_point_idx} ~ {data_point_idx + max_batch_size}"
            )
        datapoint = dataset_cnn['test'][data_point_idx:(data_point_idx +
                                                        max_batch_size)]

        if test_trt_llm:
            profiler.start('tensorrt_llm')
            summary_tensorrt_llm, tokens_tensorrt_llm = summarize_tensorrt_llm(
                datapoint)
            profiler.stop('tensorrt_llm')

        if test_hf:
            profiler.start('hf')
            summary_hf, tokens_hf = summarize_hf(datapoint)
            profiler.stop('hf')

        if runtime_rank == 0:
            if test_trt_llm:
                for batch_idx in range(len(summary_tensorrt_llm)):
                    for beam_idx in range(num_beams):
                        metric_tensorrt_llm[beam_idx].add_batch(
                            predictions=[
                                summary_tensorrt_llm[batch_idx][beam_idx]
                            ],
                            references=[datapoint['highlights'][batch_idx]])
            if test_hf:
                for beam_idx in range(num_beams):
                    for i in range(len(summary_hf[beam_idx])):
                        metric_hf[beam_idx].add_batch(
                            predictions=[summary_hf[beam_idx][i]],
                            references=[datapoint['highlights'][i]])

            logger.debug('-' * 100)
            logger.debug(f"Article : {datapoint['article']}")
            if test_trt_llm:
                logger.debug(f'XTRT-LLM Summary: {summary_tensorrt_llm}')
            if test_hf:
                logger.debug(f'HF Summary: {summary_hf}')
            logger.debug(f"highlights : {datapoint['highlights']}")

        data_point_idx += max_batch_size
        ite_count += 1

    if runtime_rank == 0:
        if test_trt_llm:
            # The ROUGE score is computed with sampling, so fix the seed.
            np.random.seed(0)
            logger.info(
                f'XTRT-LLM (total latency: {profiler.elapsed_time_in_sec("tensorrt_llm")} sec)'
            )
            for beam_idx in range(num_beams):
                logger.info(f"XTRT-LLM beam {beam_idx} result")
                computed_metrics_tensorrt_llm = metric_tensorrt_llm[
                    beam_idx].compute()
                for key in computed_metrics_tensorrt_llm.keys():
                    logger.info(
                        f' {key} : {computed_metrics_tensorrt_llm[key].mid[2]*100}'
                    )

                if args.check_accuracy and beam_idx == 0:
                    assert computed_metrics_tensorrt_llm['rouge1'].mid[
                        2] * 100 > args.tensorrt_llm_rouge1_threshold
        if test_hf:
            # The ROUGE score is computed with sampling, so fix the seed.
            np.random.seed(0)
            logger.info(
                f'Hugging Face (total latency: {profiler.elapsed_time_in_sec("hf")} sec)'
            )
            for beam_idx in range(num_beams):
                logger.info(f"HF beam {beam_idx} result")
                computed_metrics_hf = metric_hf[beam_idx].compute()
                for key in computed_metrics_hf.keys():
                    logger.info(
                        f' {key} : {computed_metrics_hf[key].mid[2]*100}')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir',
                        type=str,
                        default='EleutherAI/gpt-neox-20b')
    parser.add_argument('--test_hf', action='store_true')
    parser.add_argument('--test_trt_llm', action='store_true')
    parser.add_argument('--data_type',
                        type=str,
                        choices=['fp32', 'fp16'],
                        default='fp32')
    parser.add_argument('--dataset_path', type=str, default='')
    parser.add_argument('--log_level', type=str, default='info')
    parser.add_argument('--engine_dir', type=str, default='gptneox_engine')
    parser.add_argument('--batch_size', type=int, default=1)
    parser.add_argument('--max_ite', type=int, default=20)
    parser.add_argument('--output_len', type=int, default=100)
    parser.add_argument('--check_accuracy', action='store_true')
    parser.add_argument('--tensorrt_llm_rouge1_threshold',
                        type=float,
                        default=15.0)
    parser.add_argument('--num_beams', type=int, default=1)
    parser.add_argument('--top_k', type=int, default=1)
    args = parser.parse_args()

    main(args)
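
The padded-input path of `summarize_tensorrt_llm` keeps only the last `test_token_num` tokens of each prompt and right-pads every prompt to the longest one in the batch. The batching step can be sketched without any framework, using plain Python lists in place of tensors. The helper name `truncate_and_pad` and its parameters are hypothetical and only illustrate the idea; they are not part of XTRT-LLM, and `max_tokens` is assumed to be positive.

```python
def truncate_and_pad(batch, max_tokens, pad_id):
    """Keep the last `max_tokens` ids of each prompt, then right-pad to equal length."""
    # Left-truncate: the most recent tokens are the ones that are kept.
    truncated = [ids[-max_tokens:] for ids in batch]
    lengths = [len(ids) for ids in truncated]
    max_length = max(lengths)
    # Right-pad each prompt with pad_id up to the longest prompt in the batch.
    padded = [
        ids + [pad_id] * (max_length - n) for ids, n in zip(truncated, lengths)
    ]
    return padded, lengths


padded, lengths = truncate_and_pad([[1, 2, 3, 4, 5], [6, 7]],
                                   max_tokens=3,
                                   pad_id=0)
print(padded)   # [[3, 4, 5], [6, 7, 0]]
print(lengths)  # [3, 2]
```

The unpadded lengths are returned alongside the padded batch because the decoding step needs them to tell prompt tokens from padding.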

464
examples/gptneox/weight.py Normal file

@@ -0,0 +1,464 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
from operator import attrgetter
import numpy as np
import torch
import xtrt_llm
from xtrt_llm._utils import pad_vocab_size, str_dtype_to_torch
from xtrt_llm.models import GPTNeoXForCausalLM
UINT4_TO_INT4_FLAG = 1
GPTQ_FLAG = 1
GROUP_SIZE = 128


def numpy_split(v, tp_size, idx, dim=0):
    if tp_size == 1:
        return v
    else:
        return np.ascontiguousarray(np.split(v, tp_size, axis=dim)[idx])


def torch_split(v, tp_size, idx, dim=0):
    if tp_size == 1:
        return v
    else:
        return (torch.split(v, v.shape[dim] // tp_size,
                            dim=dim)[idx]).contiguous()


def unpack_int32_into_int8(w_packed):
    # Unpack inputs packed in int32/float32 into uint4 and store them in int8 format
    w_packed_int4x2 = w_packed.contiguous().view(torch.uint8)
    w_unpacked = torch.zeros(w_packed_int4x2.shape[0],
                             w_packed_int4x2.shape[1] * 2,
                             dtype=torch.int8)
    # Low nibble goes to the even columns, high nibble to the odd columns.
    w_unpacked[:, ::2] = w_packed_int4x2 % 16
    w_unpacked[:, 1::2] = w_packed_int4x2 // 16
    return w_unpacked.contiguous()


def preprocess_groupwise_weight_params(qweight_unpacked_int8, scales_fp16,
                                       qzeros_unpacked_int8):
    packer = torch.ops.fastertransformer.pack_int8_tensor_to_packed_int4
    preprocessor = torch.ops.fastertransformer.preprocess_weights_for_mixed_gemm

    qweight_interleaved = preprocessor(packer(qweight_unpacked_int8),
                                       torch.quint4x2).view(torch.float32)

    # zeros = zeros * scales
    zeros_x_scales_fp16 = (-qzeros_unpacked_int8 + 8 * UINT4_TO_INT4_FLAG -
                           GPTQ_FLAG) * scales_fp16
    zeros_x_scales_fp16 = zeros_x_scales_fp16.half()

    # return processed interleaved weight, original scales and zeros * scales
    return qweight_interleaved.contiguous().numpy(), scales_fp16.contiguous(
    ).numpy(), zeros_x_scales_fp16.contiguous().numpy()


def load_from_hf_gpt_neox(xtrt_llm_gpt_neox: GPTNeoXForCausalLM,
                          hf_gpt_neox,
                          dtype="float32",
                          rank=0,
                          tp_size=1,
                          use_weight_only_groupwise_quant_matmul_plugin=False):

    hf_model_gptneox_block_names = [
        "input_layernorm.weight",
        "input_layernorm.bias",
        "post_attention_layernorm.weight",
        "post_attention_layernorm.bias",
    ]

    xtrt_llm_model_gptneox_block_names = [
        "input_layernorm.weight",
        "input_layernorm.bias",
        "post_attention_layernorm.weight",
        "post_attention_layernorm.bias",
    ]

    if not use_weight_only_groupwise_quant_matmul_plugin:
        hf_model_gptneox_block_names += [
            "attention.dense.weight",
            "attention.dense.bias",
            "mlp.dense_h_to_4h.weight",
            "mlp.dense_h_to_4h.bias",
            "mlp.dense_4h_to_h.weight",
            "mlp.dense_4h_to_h.bias",
        ]
        xtrt_llm_model_gptneox_block_names += [
            "attention.dense.weight",
            "attention.dense.bias",
            "mlp.fc.weight",
            "mlp.fc.bias",
            "mlp.proj.weight",
            "mlp.proj.bias",
        ]

    if not use_weight_only_groupwise_quant_matmul_plugin:
        xtrt_llm.logger.info('Loading weights from HF GPT-NeoX...')
    else:
        xtrt_llm.logger.info(
            'Loading weights from GPTQ quantized HF GPT-NeoX...')
    tik = time.time()

    torch_dtype = str_dtype_to_torch(dtype)
    hf_gpt_neox_state_dict = hf_gpt_neox.state_dict()

    # [vocab_size, hidden_size]
    v = hf_gpt_neox_state_dict.get('gpt_neox.embed_in.weight').to(
        torch_dtype).cpu().numpy()
    if xtrt_llm_gpt_neox._use_parallel_embedding:
        v = numpy_split(v, tp_size, rank,
                        xtrt_llm_gpt_neox._embedding_sharding_dim)
    xtrt_llm_gpt_neox.embedding.weight.value = v

    n_layer = hf_gpt_neox.config.num_hidden_layers

    for layer_idx in range(n_layer):
        prefix = "gpt_neox.layers." + str(layer_idx) + "."
        for idx, hf_attr in enumerate(hf_model_gptneox_block_names):
            v = hf_gpt_neox_state_dict.get(prefix + hf_attr).to(
                torch_dtype).cpu().numpy()

            layer = attrgetter(xtrt_llm_model_gptneox_block_names[idx])(
                xtrt_llm_gpt_neox.layers[layer_idx])

            if tp_size > 1:
                if 'dense.weight' in hf_attr:
                    # [n=hidden_size, k=hidden_size] ->
                    # [n=hidden_size, k=hidden_size // tp_size]
                    split_v = numpy_split(v, tp_size, rank, dim=1)
                elif 'dense_h_to_4h.weight' in hf_attr:
                    # [hidden_size * 4, hidden_size] ->
                    # [hidden_size * 4 // tp_size, hidden_size]
                    split_v = numpy_split(v, tp_size, rank, dim=0)
                elif 'dense_h_to_4h.bias' in hf_attr:
                    # [hidden_size * 4] -> [hidden_size * 4 // tp_size]
                    split_v = numpy_split(v, tp_size, rank, dim=0)
                elif 'dense_4h_to_h.weight' in hf_attr:
                    # [hidden_size, hidden_size * 4] ->
                    # [hidden_size, hidden_size * 4 // tp_size]
                    split_v = numpy_split(v, tp_size, rank, dim=1)
                else:
                    split_v = v
                setattr(layer, 'value', split_v)
            else:
                setattr(layer, 'value', v)

        num_heads = hf_gpt_neox.config.num_attention_heads
        hidden_size = hf_gpt_neox.config.hidden_size
        head_size = hidden_size // num_heads

        if not use_weight_only_groupwise_quant_matmul_plugin:
            # Attention QKV Linear
            # qkv_weights [num_heads x (q|k|v), hidden_size] ->
            # [(num_heads x q)|(num_heads x k)|(num_heads x v), hidden_size]
            qkv_weights = hf_gpt_neox_state_dict.get(
                prefix + "attention.query_key_value.weight")
            qkv_bias = hf_gpt_neox_state_dict.get(
                prefix + "attention.query_key_value.bias")

            new_qkv_weight_shape = torch.Size(
                [num_heads, 3, head_size * qkv_weights.size()[-1]])
            new_qkv_bias_shape = torch.Size([num_heads, 3, head_size])

            qkv_weights = qkv_weights.view(new_qkv_weight_shape).permute(
                1, 0, 2).reshape([hidden_size * 3, hidden_size])
            qkv_bias = qkv_bias.view(new_qkv_bias_shape).permute(
                1, 0, 2).reshape([hidden_size * 3])

            if tp_size > 1:
                qkv_weights = qkv_weights.reshape(
                    3, hidden_size, hidden_size).to(torch_dtype).cpu().numpy()
                split_qkv_weights = numpy_split(
                    qkv_weights, tp_size, rank,
                    dim=1).reshape(3 * (hidden_size // tp_size), hidden_size)
                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.weight.value = \
                    np.ascontiguousarray(split_qkv_weights)

                qkv_bias = qkv_bias.reshape(
                    3, hidden_size).to(torch_dtype).cpu().numpy()
                split_qkv_bias = numpy_split(qkv_bias, tp_size, rank,
                                             dim=1).reshape(
                                                 3 * (hidden_size // tp_size))
                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
                    np.ascontiguousarray(split_qkv_bias)
            else:
                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.weight.value = \
                    qkv_weights.to(torch_dtype).cpu().numpy()
                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
                    qkv_bias.to(torch_dtype).cpu().numpy()
        else:
            # use_weight_only_groupwise_quant_matmul_plugin

            qweight_int32 = hf_gpt_neox_state_dict.get(
                prefix + "attention.query_key_value.qweight")
            scales_fp16 = hf_gpt_neox_state_dict.get(
                prefix + "attention.query_key_value.scales")
            qzeros_int32 = hf_gpt_neox_state_dict.get(
                prefix + "attention.query_key_value.qzeros")
            biases_fp16 = hf_gpt_neox_state_dict.get(
                prefix + "attention.query_key_value.bias")

            # [hidden_size // 8, hidden_size * 3] -> [hidden_size * 3, hidden_size]
            qweight_unpacked_int8 = unpack_int32_into_int8(
                qweight_int32.T).contiguous() - 8
            # [hidden_size // GROUP_SIZE, hidden_size * 3 // 8] ->
            # [hidden_size // GROUP_SIZE, hidden_size * 3]
            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)

            # qkv_weights [num_heads x (q|k|v), hidden_size] ->
            # [(num_heads x q)|(num_heads x k)|(num_heads x v), hidden_size]
            new_qkv_weight_shape = torch.Size(
                [num_heads, 3, head_size * qweight_unpacked_int8.size()[-1]])
            # [hidden_size * 3, hidden_size]
            qweight_unpacked_int8 = qweight_unpacked_int8.view(
                new_qkv_weight_shape).permute(1, 0, 2).reshape(
                    [hidden_size * 3, hidden_size]).contiguous()

            new_qkv_scale_shape = torch.Size(
                [num_heads, 3, head_size * (hidden_size // GROUP_SIZE)])
            # [hidden_size * 3, hidden_size // GROUP_SIZE]
            scales_fp16 = scales_fp16.T.contiguous().view(
                new_qkv_scale_shape).permute(1, 0, 2).reshape(
                    [hidden_size * 3, hidden_size // GROUP_SIZE]).contiguous()

            new_qkv_zero_shape = torch.Size(
                [num_heads, 3, head_size * (hidden_size // GROUP_SIZE)])
            # [hidden_size * 3, hidden_size // GROUP_SIZE]
            qzeros_unpacked_int8 = qzeros_unpacked_int8.T.contiguous().view(
                new_qkv_zero_shape).permute(1, 0, 2).reshape(
                    [hidden_size * 3, hidden_size // GROUP_SIZE]).contiguous()

            new_qkv_bias_shape = torch.Size([num_heads, 3, head_size])
            biases_fp16 = biases_fp16.view(new_qkv_bias_shape).permute(
                1, 0, 2).reshape([hidden_size * 3]).numpy()

            if tp_size > 1:
                qweight_unpacked_int8 = qweight_unpacked_int8.reshape(
                    [3, hidden_size, hidden_size])
                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
                                                    tp_size,
                                                    rank,
                                                    dim=1)
                qweight_unpacked_int8 = qweight_unpacked_int8.reshape(
                    [3 * hidden_size // tp_size, hidden_size])

                scales_fp16 = scales_fp16.reshape(
                    [3, hidden_size, hidden_size // GROUP_SIZE])
                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=1)
                scales_fp16 = scales_fp16.reshape(
                    [3 * hidden_size // tp_size, hidden_size // GROUP_SIZE])

                qzeros_unpacked_int8 = qzeros_unpacked_int8.reshape(
                    [3, hidden_size, hidden_size // GROUP_SIZE])
                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
                                                   tp_size,
                                                   rank,
                                                   dim=1)
                qzeros_unpacked_int8 = qzeros_unpacked_int8.reshape(
                    [3 * hidden_size // tp_size, hidden_size // GROUP_SIZE])

                biases_fp16 = biases_fp16.reshape([3, hidden_size])
                biases_fp16 = numpy_split(biases_fp16, tp_size, rank, dim=1)
                biases_fp16 = biases_fp16.reshape([3 * hidden_size // tp_size])

            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
                qweight_unpacked_int8.T.contiguous(),
                scales_fp16.T.contiguous(), qzeros_unpacked_int8.T.contiguous())

            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.qweight.value = \
                qweight_fp32
            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.scale.value = \
                scales_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.zero.value = \
                zeros_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
                biases_fp16

            qweight_int32 = hf_gpt_neox_state_dict.get(
                prefix + "attention.dense.qweight")
            scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
                                                     "attention.dense.scales")
            qzeros_int32 = hf_gpt_neox_state_dict.get(prefix +
                                                      "attention.dense.qzeros")
            biases_fp16 = hf_gpt_neox_state_dict.get(
                prefix + "attention.dense.bias").numpy()

            # [k=hidden_size // 8, n=hidden_size] -> [n=hidden_size, k=hidden_size]
            qweight_unpacked_int8 = unpack_int32_into_int8(
                qweight_int32.T).contiguous() - 8
            # [n=hidden_size, k=hidden_size] -> [k=hidden_size, n=hidden_size]
            qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
            # [k=hidden_size // GROUP_SIZE, n=hidden_size // 8] ->
            # [k=hidden_size // GROUP_SIZE, n=hidden_size]
            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)

            if tp_size > 1:
                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
                                                    tp_size,
                                                    rank,
                                                    dim=0)
                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=0)
                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
                                                   tp_size,
                                                   rank,
                                                   dim=0)
                if rank > 0:
                    biases_fp16 = np.zeros_like(biases_fp16)

            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
                qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)

            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.qweight.value = \
                qweight_fp32
            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.scale.value = \
                scales_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.zero.value = \
                zeros_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.bias.value = \
                biases_fp16

            qweight_int32 = hf_gpt_neox_state_dict.get(
                prefix + "mlp.dense_h_to_4h.qweight")
            scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
                                                     "mlp.dense_h_to_4h.scales")
            qzeros_int32 = hf_gpt_neox_state_dict.get(
                prefix + "mlp.dense_h_to_4h.qzeros")
            biases_fp16 = hf_gpt_neox_state_dict.get(
                prefix + "mlp.dense_h_to_4h.bias").numpy()

            # [hidden_size // 8, hidden_size * 4] -> [hidden_size, hidden_size * 4]
            qweight_unpacked_int8 = unpack_int32_into_int8(
                qweight_int32.T).contiguous() - 8
            qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
            # [hidden_size // GROUP_SIZE, hidden_size * 4 // 8] ->
            # [hidden_size // GROUP_SIZE, hidden_size * 4]
            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)

            if tp_size > 1:
                # [hidden_size, hidden_size * 4] ->
                # [hidden_size, hidden_size * 4 // tp_size]
                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
                                                    tp_size,
                                                    rank,
                                                    dim=1)
                # [hidden_size // GROUP_SIZE, hidden_size * 4] ->
                # [hidden_size // GROUP_SIZE, hidden_size * 4 // tp_size]
                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=1)
                # [hidden_size // GROUP_SIZE, hidden_size * 4] ->
                # [hidden_size // GROUP_SIZE, hidden_size * 4 // tp_size]
                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
                                                   tp_size,
                                                   rank,
                                                   dim=1)
                # [hidden_size * 4] -> [hidden_size * 4 // tp_size]
                biases_fp16 = numpy_split(biases_fp16, tp_size, rank, dim=0)

            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
                qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)

            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.qweight.value = \
                qweight_fp32
            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.scale.value = \
                scales_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.zero.value = \
                zeros_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.bias.value = \
                biases_fp16

            qweight_int32 = hf_gpt_neox_state_dict.get(
                prefix + "mlp.dense_4h_to_h.qweight")
            scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
                                                     "mlp.dense_4h_to_h.scales")
            qzeros_int32 = hf_gpt_neox_state_dict.get(
                prefix + "mlp.dense_4h_to_h.qzeros")
            biases_fp16 = hf_gpt_neox_state_dict.get(
                prefix + "mlp.dense_4h_to_h.bias").numpy()

            # [hidden_size * 4 // 8, hidden_size] -> [hidden_size * 4, hidden_size]
            qweight_unpacked_int8 = unpack_int32_into_int8(
                qweight_int32.T).contiguous() - 8
            qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
            # [hidden_size * 4 // GROUP_SIZE, hidden_size // 8] ->
            # [hidden_size * 4 // GROUP_SIZE, hidden_size]
            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)

            if tp_size > 1:
                # [hidden_size * 4, hidden_size] ->
                # [hidden_size * 4 // tp_size, hidden_size]
                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
                                                    tp_size,
                                                    rank,
                                                    dim=0)
                # [hidden_size * 4 // GROUP_SIZE, hidden_size] ->
                # [hidden_size * 4 // GROUP_SIZE // tp_size, hidden_size]
                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=0)
                # [hidden_size * 4 // GROUP_SIZE, hidden_size] ->
                # [hidden_size * 4 // GROUP_SIZE // tp_size, hidden_size]
                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
                                                   tp_size,
                                                   rank,
                                                   dim=0)
                if rank > 0:
                    biases_fp16 = np.zeros_like(biases_fp16)

            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
                qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)

            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.qweight.value = \
                qweight_fp32
            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.scale.value = \
                scales_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.zero.value = \
                zeros_fp16
            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.bias.value = \
                biases_fp16

    v = hf_gpt_neox_state_dict.get('gpt_neox.final_layer_norm.weight')
    xtrt_llm_gpt_neox.ln_f.weight.value = v.to(torch_dtype).cpu().numpy()

    v = hf_gpt_neox_state_dict.get('gpt_neox.final_layer_norm.bias')
    xtrt_llm_gpt_neox.ln_f.bias.value = v.to(torch_dtype).cpu().numpy()

    v = hf_gpt_neox_state_dict.get('embed_out.weight').to(
        torch_dtype).cpu().numpy()
    if tp_size > 1:
        # [vocab_size, hidden_size] ->
        # [vocab_size // tp_size, hidden_size]
        if v.shape[0] % tp_size != 0:
            # padding
            vocab_size_padded = pad_vocab_size(v.shape[0], tp_size)
            pad_width = vocab_size_padded - v.shape[0]
            v = np.pad(v, ((0, pad_width), (0, 0)),
                       'constant',
                       constant_values=0)

        split_v = numpy_split(v, tp_size, rank, dim=0)
        xtrt_llm_gpt_neox.lm_head.weight.value = split_v
    else:
        xtrt_llm_gpt_neox.lm_head.weight.value = v

    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    xtrt_llm.logger.info(f'Weights loaded. Total time: {t}')
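
The QKV handling in `load_from_hf_gpt_neox` does two things: it turns the Hugging Face layout, where each head's q, k and v rows are interleaved, into three contiguous Q, K and V blocks, and it then gives every tensor-parallel rank an equal slice of each block. A minimal sketch of that row bookkeeping follows, using string labels instead of real weight rows. The helper names `regroup_qkv_rows` and `split_for_rank` are hypothetical and exist only for this illustration.

```python
def regroup_qkv_rows(rows, num_heads, head_size):
    """Reorder rows from [num_heads, 3, head_size] to [3, num_heads, head_size]."""
    assert len(rows) == num_heads * 3 * head_size
    out = []
    for part in range(3):  # 0 = q, 1 = k, 2 = v
        for head in range(num_heads):
            start = (head * 3 + part) * head_size
            out.extend(rows[start:start + head_size])
    return out


def split_for_rank(rows, tp_size, rank):
    """Give `rank` its contiguous share of each of the Q, K and V blocks."""
    block = len(rows) // 3  # rows per Q/K/V block
    shard = block // tp_size  # rows per rank inside one block
    out = []
    for part in range(3):
        start = part * block + rank * shard
        out.extend(rows[start:start + shard])
    return out


# Two heads with head_size 1: the HF order is q0 k0 v0 q1 k1 v1.
hf_rows = ["q0", "k0", "v0", "q1", "k1", "v1"]
grouped = regroup_qkv_rows(hf_rows, num_heads=2, head_size=1)
print(grouped)                        # ['q0', 'q1', 'k0', 'k1', 'v0', 'v1']
print(split_for_rank(grouped, 2, 0))  # ['q0', 'k0', 'v0']
print(split_for_rank(grouped, 2, 1))  # ['q1', 'k1', 'v1']
```

Splitting inside each block, rather than cutting the fused matrix into `tp_size` contiguous pieces, is what keeps a head's query, key and value rows on the same rank.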