add pkgs
3
examples/gptneox/.gitignore
vendored
Normal file
@@ -0,0 +1,3 @@
__pycache__/
gptneox_model/
*.log
93
examples/gptneox/README.md
Normal file
@@ -0,0 +1,93 @@
# GPT-NeoX

This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using XTRT-LLM and run it on a single node with multiple XPUs.

## Overview

The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). The folder contains two main files:

* [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-NeoX model,
* [`run.py`](./run.py) to run inference on an input text.


## Support Matrix
* FP16
* INT8 Weight-Only
* Tensor Parallel

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# Weights & config (the script uses pushd/popd, so run it with bash)
bash get_weights.sh
```

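As a sanity check after the download, a sketch like the one below can confirm that every weight shard is present. It assumes the shard naming used by `get_weights.sh` (`model-00001-of-00046.safetensors` through `model-00046-of-00046.safetensors`) and the `./downloads/gptneox_model` directory; the helper name `missing_shards` is hypothetical and not part of the example code.

```python
from pathlib import Path


def missing_shards(model_dir, num_shards=46):
    """Return the expected shard file names that are absent from model_dir.

    Hypothetical helper: the naming pattern is taken from get_weights.sh.
    """
    model_dir = Path(model_dir)
    expected = [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]
    return [name for name in expected if not (model_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_shards("./downloads/gptneox_model")
    print("all shards present" if not missing else f"missing: {missing}")
```

An empty result means all 46 shards were downloaded; any listed name can be fetched again with the same `wget` command the script uses.
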
### 2. Build XTRT engine(s)

XTRT-LLM builds XTRT engine(s) using a HF checkpoint. If no checkpoint directory is specified, XTRT-LLM will build engine(s) using dummy weights.

Examples of build invocations:

```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log

# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --use_weight_only \
    --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```

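The build writes one engine file per rank into `--output_dir`, together with a `config.json`. The sketch below reproduces the naming scheme of `get_engine_name` in `build.py` and the divisibility checks that `build.py` applies for tensor parallelism. The helper `expected_engine_files` is hypothetical, and the default sizes are simply the argparse defaults from `build.py` (hidden size 6144, 64 attention heads).

```python
def get_engine_name(model, dtype, tp_size, rank):
    # Same format string as get_engine_name in build.py.
    return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)


def expected_engine_files(world_size, dtype="float16", n_embd=6144, n_head=64):
    """List the engine files a tensor-parallel build is expected to produce.

    Hypothetical helper; the checks mirror the assertions in build.py.
    """
    if n_embd % world_size != 0:
        raise ValueError("hidden size must be divisible by world size")
    if n_head % world_size != 0:
        raise ValueError("number of attention heads must be divisible by world size")
    return [
        get_engine_name("gptneox", dtype, world_size, rank)
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    print(expected_engine_files(2))
    # ['gptneox_float16_tp2_rank0.engine', 'gptneox_float16_tp2_rank1.engine']
```

Because the file name depends only on `--dtype`, the FP16 and INT8 weight-only builds above produce identically named engine files, which is why the two examples write to separate output directories.
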
### 3. Run

Before running the examples, make sure to set the following environment variables:
```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPytorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # disable XPytorch use of L3.
```

If you are NOT using an R480-X8, make sure to also set:
```bash
export BKCL_PCIE_RING=1
```

To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`:

```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
        --max_output_len=50 \
        --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
        --tokenizer_dir=./downloads/gptneox_model

# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
        --max_output_len=50 \
        --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
        --tokenizer_dir=./downloads/gptneox_model
```
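
For orientation, the sketch below restates two pieces of logic from `run.py`: the per-rank `XCCL_*` variables it exports when the engine was built for more than one rank, and the way it splits the optional `--performance_test_scale` string (used in `run.sh`). The function names `xccl_env` and `parse_performance_scales` are hypothetical; `run.py` performs these steps inline.

```python
def xccl_env(runtime_rank, world_size):
    """Variables run.py exports for one MPI rank when world_size > 1."""
    return {
        "XCCL_GROUP_ID": str(runtime_rank // world_size),
        "XCCL_NRANKS": str(world_size),
        "XCCL_CUR_RANK": str(runtime_rank % world_size),
        "XCCL_DEVICE_ID": str(runtime_rank),
        "MP_RUN": "1",
    }


def parse_performance_scales(spec):
    """Split e.g. '1x512x256E2x512x256' into (batch, input_len, output_len) tuples."""
    if spec == "":
        return []
    return [tuple(int(x) for x in scale.split("x")) for scale in spec.split("E")]


if __name__ == "__main__":
    for rank in range(2):
        print(rank, xccl_env(rank, world_size=2))
    print(parse_performance_scales("1x512x256E2x512x256"))
```

Note that `run.py` asserts that the number of MPI ranks equals the `tensor_parallel` value stored in the engine's `config.json`, so the `-n` passed to `mpirun` has to match the `--world_size` used at build time.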
95
examples/gptneox/README_CN.md
Normal file
@@ -0,0 +1,95 @@
# GPT-NeoX

This document explains how to build and run the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model on a single node with multiple XPUs using Kunlunxin XTRT-LLM.

## Overview

The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). The folder contains the following main files:

* [`build.py`](./build.py) builds the XTRT engine(s) needed to run the GPT-NeoX model
* [`run.py`](./run.py) runs inference on an input text

## Support Matrix

* FP16
* INT8 Weight-Only
* Tensor Parallel

## Usage

### 1. Download weights from HuggingFace (HF) Transformers

```bash
# Weights & config (the script uses pushd/popd, so run it with bash)
bash get_weights.sh
```

### 2. Build XTRT engine(s)

XTRT-LLM builds XTRT engine(s) from a HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) using dummy weights.

Examples of build invocations:

```bash
# Build a float16 engine using 2-way tensor parallelism and HF weights.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log

# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
python3 build.py --dtype=float16 \
    --log_level=verbose \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_layernorm_plugin float16 \
    --max_batch_size=16 \
    --max_input_len=1024 \
    --max_output_len=1024 \
    --world_size=2 \
    --use_weight_only \
    --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
    --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
```

### 3. Run

Before running the examples, make sure to set the following environment variables:

```bash
export PYTORCH_NO_XPU_MEMORY_CACHING=0 # disable XPytorch caching of XPU memory.
export XMLIR_D_XPU_L3_SIZE=0 # disable XPytorch use of L3.
```

If you are not using the Kunlunxin R480-X8 product, make sure to also set the following environment variable:

```bash
export BKCL_PCIE_RING=1
```

To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`, run:

```bash
# For 2-way tensor parallelism, FP16
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
        --max_output_len=50 \
        --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
        --tokenizer_dir=./downloads/gptneox_model

# For 2-way tensor parallelism, INT8
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
        --max_output_len=50 \
        --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/ \
        --tokenizer_dir=./downloads/gptneox_model
```
442
examples/gptneox/build.py
Normal file
@@ -0,0 +1,442 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import time

#import tensorrt as trt
import torch
import torch.multiprocessing as mp
from safetensors import safe_open
from transformers import AutoModelForCausalLM, GPTNeoXConfig
from weight import load_from_hf_gpt_neox

import xtrt_llm
from xtrt_llm._utils import str_dtype_to_xtrt
from xtrt_llm.builder import Builder
from xtrt_llm.logger import logger
from xtrt_llm.mapping import Mapping
from xtrt_llm.models import weight_only_groupwise_quantize, weight_only_quantize
from xtrt_llm.network import net_guard
from xtrt_llm.plugin.plugin import ContextFMHAType
from xtrt_llm.quantization import QuantMode

MODEL_NAME = "gptneox"
hf_gpt = None


class StateDict():

    def __init__(self, quant_ckpt_dir):
        self.model_state_dict = safe_open(quant_ckpt_dir,
                                          framework="pt",
                                          device=0)

    def get(self, k):
        return self.model_state_dict.get_tensor(k).cpu()


class GPTQModel():

    def __init__(self, model_dir, quant_ckpt_dir):
        with open(model_dir + '/config.json', 'r') as f:
            model_config = json.load(f)
            self.config = GPTNeoXConfig()
            self.config.vocab_size = model_config['vocab_size']
            self.config.hidden_size = model_config['hidden_size']
            self.config.num_hidden_layers = model_config['num_hidden_layers']
            self.config.num_attention_heads = model_config[
                'num_attention_heads']
            self.config.intermediate_size = model_config['intermediate_size']
            self.config.hidden_act = model_config['hidden_act']
            self.config.rotary_pct = model_config['rotary_pct']
            self.config.rotary_emb_base = model_config['rotary_emb_base']
            self.config.max_position_embeddings = model_config[
                'max_position_embeddings']
            self.config.initializer_range = model_config['initializer_range']
            self.config.layer_norm_eps = model_config['layer_norm_eps']
            self.config.use_cache = model_config['use_cache']
            self.config.bos_token_id = model_config['bos_token_id']
            self.config.eos_token_id = model_config['eos_token_id']
            self.config.tie_word_embeddings = model_config[
                'tie_word_embeddings']
        self.model_state_dict = StateDict(quant_ckpt_dir)

    def state_dict(self):
        return self.model_state_dict


def get_engine_name(model, dtype, tp_size, rank):
    return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)


def serialize_engine(engine, path):
    logger.info(f'Serializing engine to {path}...')
    tik = time.time()
    engine.serialize(path)
    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    logger.info(f'Engine serialized. Total time: {t}')


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--world_size',
                        type=int,
                        default=1,
                        help='world size, only support tensor parallelism now')
    parser.add_argument(
        '--model_dir',
        type=str,
        default=None,
        help='The path to HF GPT-NeoX model / checkpoints to read weights from')
    parser.add_argument('--dtype',
                        type=str,
                        default='float16',
                        choices=['float16', 'float32'])
    parser.add_argument(
        '--timing_cache',
        type=str,
        default='model.cache',
        help=
        'The path to read timing cache from, will be ignored if the file does not exist'
    )
    parser.add_argument('--log_level', type=str, default='info')
    parser.add_argument('--vocab_size', type=int, default=50432)
    parser.add_argument('--n_layer', type=int, default=44)
    parser.add_argument('--n_positions', type=int, default=2048)
    parser.add_argument('--n_embd', type=int, default=6144)
    parser.add_argument('--n_head', type=int, default=64)
    parser.add_argument('--hidden_act', type=str, default='gelu')
    parser.add_argument(
        '--rotary_pct',
        type=float,
        default=0.25,
        help="Percentage of hidden dimensions to allocate to rotary embeddings."
    )
    parser.add_argument('--max_batch_size', type=int, default=64)
    parser.add_argument('--max_input_len', type=int, default=1024)
    parser.add_argument('--max_output_len', type=int, default=1024)
    parser.add_argument('--max_beam_width', type=int, default=1)
    parser.add_argument('--use_gpt_attention_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16', 'float32'])
    parser.add_argument('--use_gemm_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16', 'float32'])
    parser.add_argument('--use_weight_only_quant_matmul_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16'])
    parser.add_argument('--use_weight_only_groupwise_quant_matmul_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16'])
    parser.add_argument(
        '--groupwise_quant_safetensors_path',
        type=str,
        default=None,
        help=
        "The path to groupwise quantized GPT-NeoX model / checkpoints to read weights from."
    )
    parser.add_argument('--use_layernorm_plugin',
                        nargs='?',
                        const='float16',
                        type=str,
                        default=False,
                        choices=['float16', 'float32'])
    parser.add_argument('--parallel_build', default=False, action='store_true')
    parser.add_argument('--enable_context_fmha',
                        default=False,
                        action='store_true')
    parser.add_argument('--enable_context_fmha_fp32_acc',
                        default=False,
                        action='store_true')
    parser.add_argument('--gpus_per_node', type=int, default=8)
    parser.add_argument(
        '--output_dir',
        type=str,
        default='gpt_outputs',
        help=
        'The path to save the serialized engine files, timing cache file and model configs'
    )
    parser.add_argument('--remove_input_padding',
                        default=False,
                        action='store_true')
    parser.add_argument(
        '--use_parallel_embedding',
        action="store_true",
        default=False,
        help=
        'By default embedding parallelism is disabled. By setting this flag, embedding parallelism is enabled'
    )
    parser.add_argument(
        '--embedding_sharding_dim',
        type=int,
        default=1,  # Meta does TP on hidden dim
        choices=[0, 1],
        help=
        'By default the embedding lookup table is sharded along vocab dimension (--embedding_sharding_dim=0). '
        'To shard it along hidden dimension, set --embedding_sharding_dim=1. '
        'Note: embedding sharing is only enabled when --embedding_sharding_dim=0'
    )
    parser.add_argument(
        '--use_weight_only',
        default=False,
        action="store_true",
        help='Quantize weights for the various GEMMs to INT4/INT8. '
        'See --weight_only_precision to set the precision')
    parser.add_argument(
        '--weight_only_precision',
        const='int8',
        type=str,
        nargs='?',
        default='int8',
        choices=['int8', 'int4'],
        help=
        'Define the precision for the weights when using weight-only quantization. '
        'You must also use --use_weight_only for that argument to have an impact.'
    )
    parser.add_argument('--inter_size', type=int, default=None)

    args = parser.parse_args()

    xtrt_llm.logger.set_level(args.log_level)

    if args.model_dir is not None:
        global hf_gpt
        if not args.use_weight_only_groupwise_quant_matmul_plugin:
            logger.info(f'Loading HF GPT-NeoX model from {args.model_dir}...')
            hf_gpt = AutoModelForCausalLM.from_pretrained(args.model_dir)
            args.n_embd = hf_gpt.config.hidden_size
            args.n_head = hf_gpt.config.num_attention_heads
            args.n_layer = hf_gpt.config.num_hidden_layers
            args.n_positions = hf_gpt.config.max_position_embeddings
            args.vocab_size = hf_gpt.config.vocab_size
            args.rotary_pct = hf_gpt.config.rotary_pct
        else:
            assert (
                args.groupwise_quant_safetensors_path is not None
            ), f'Please set the path to the groupwise quantized GPT-NeoX checkpoints with --groupwise_quant_safetensors_path'
            logger.info(
                f'Loading GPTQ quantized HF GPT-NeoX model from {args.groupwise_quant_safetensors_path}...'
            )
            hf_gpt = GPTQModel(args.model_dir,
                               args.groupwise_quant_safetensors_path)
            args.n_embd = hf_gpt.config.hidden_size
            args.n_head = hf_gpt.config.num_attention_heads
            args.n_layer = hf_gpt.config.num_hidden_layers
            args.n_positions = hf_gpt.config.max_position_embeddings
            args.vocab_size = hf_gpt.config.vocab_size
            args.rotary_pct = hf_gpt.config.rotary_pct
        args.inter_size = hf_gpt.config.intermediate_size

    if args.use_weight_only:
        args.quant_mode = QuantMode.use_weight_only(
            args.weight_only_precision == 'int4')
    else:
        args.quant_mode = QuantMode(0)

    return args


def build_rank_engine(builder: Builder,
                      builder_config: xtrt_llm.builder.BuilderConfig,
                      engine_name, rank, args):
    '''
       @brief: Build the engine on the given rank.
       @param rank: The rank to build the engine.
       @param args: The cmd line arguments.
       @return: The built engine.
    '''
    kv_dtype = str_dtype_to_xtrt(args.dtype)
    rotary_dim = int((args.n_embd // args.n_head) * args.rotary_pct)

    # Initialize Module
    xtrt_llm_gpt = xtrt_llm.models.GPTNeoXForCausalLM(
        num_layers=args.n_layer,
        num_heads=args.n_head,
        hidden_size=args.n_embd,
        vocab_size=args.vocab_size,
        hidden_act=args.hidden_act,
        max_position_embeddings=args.n_positions,
        rotary_dim=rotary_dim,
        dtype=kv_dtype,
        mapping=Mapping(world_size=args.world_size,
                        rank=rank,
                        tp_size=args.world_size),  # TP only
        apply_query_key_layer_scaling=builder_config.
        apply_query_key_layer_scaling,
        use_parallel_embedding=args.use_parallel_embedding,
        embedding_sharding_dim=args.embedding_sharding_dim)

    if args.use_weight_only_quant_matmul_plugin:
        xtrt_llm_gpt = weight_only_quantize(xtrt_llm_gpt)

    if args.use_weight_only_groupwise_quant_matmul_plugin:
        xtrt_llm_gpt = weight_only_groupwise_quantize(model=xtrt_llm_gpt,
                                                      quant_mode=QuantMode(0),
                                                      group_size=128,
                                                      zero=True)

    if args.model_dir is not None:
        assert hf_gpt is not None, f'Could not load weights from hf_gpt model as it is not loaded yet.'

        if args.world_size > 1:
            assert (
                args.n_embd % args.world_size == 0
            ), f'Embedding size/hidden size must be divisible by world size.'
            assert (
                args.n_head % args.world_size == 0
            ), f'Number of attention heads must be divisible by world size.'

        load_from_hf_gpt_neox(
            xtrt_llm_gpt, hf_gpt, args.dtype, rank, args.world_size,
            args.use_weight_only_groupwise_quant_matmul_plugin)

    # Module -> Network
    network = builder.create_network()
    network.trt_network.name = engine_name
    if args.use_gpt_attention_plugin:
        network.plugin_config.set_gpt_attention_plugin(
            dtype=args.use_gpt_attention_plugin)
    if args.use_gemm_plugin:
        network.plugin_config.set_gemm_plugin(dtype=args.use_gemm_plugin)
    if args.use_layernorm_plugin:
        network.plugin_config.set_layernorm_plugin(
            dtype=args.use_layernorm_plugin)
    assert not (args.enable_context_fmha and args.enable_context_fmha_fp32_acc)
    if args.enable_context_fmha:
        network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
    if args.enable_context_fmha_fp32_acc:
        network.plugin_config.set_context_fmha(
            ContextFMHAType.enabled_with_fp32_acc)
    if args.use_weight_only_quant_matmul_plugin:
        network.plugin_config.set_weight_only_quant_matmul_plugin(
            dtype=args.use_weight_only_quant_matmul_plugin)
    if args.use_weight_only_groupwise_quant_matmul_plugin:
        network.plugin_config.set_weight_only_groupwise_quant_matmul_plugin(
            dtype=args.use_weight_only_groupwise_quant_matmul_plugin)
    if args.quant_mode.is_weight_only():
        builder_config.trt_builder_config.use_weight_only = args.weight_only_precision

    if args.world_size > 1:
        network.plugin_config.set_nccl_plugin(args.dtype)
    if args.remove_input_padding:
        network.plugin_config.enable_remove_input_padding()
    with net_guard(network):
        # Prepare
        network.set_named_parameters(xtrt_llm_gpt.named_parameters())

        # Forward
        inputs = xtrt_llm_gpt.prepare_inputs(args.max_batch_size,
                                             args.max_input_len,
                                             args.max_output_len, True,
                                             args.max_beam_width)
        xtrt_llm_gpt(*inputs)

    #xtrt_llm.graph_rewriting.optimize(network)

    engine = None

    # Network -> Engine
    engine = builder.build_engine(network, builder_config, compiler="gr")
    if rank == 0:
        config_path = os.path.join(args.output_dir, 'config.json')
        builder.save_config(builder_config, config_path)
    return engine


def build(rank, args):
    #torch.cuda.set_device(rank % args.gpus_per_node)
    xtrt_llm.logger.set_level(args.log_level)
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    # when doing serializing build, all ranks share one engine
    apply_query_key_layer_scaling = False
    builder = Builder()

    cache = None
    for cur_rank in range(args.world_size):
        # skip other ranks if parallel_build is enabled
        if args.parallel_build and cur_rank != rank:
            continue
        builder_config = builder.create_builder_config(
            name=MODEL_NAME,
            precision=args.dtype,
            timing_cache=args.timing_cache if cache is None else cache,
            tensor_parallel=args.world_size,  # TP only
            parallel_build=args.parallel_build,
            num_layers=args.n_layer,
            num_heads=args.n_head,
            inter_size=args.inter_size,
            hidden_size=args.n_embd,
            vocab_size=args.vocab_size,
            hidden_act=args.hidden_act,
            max_position_embeddings=args.n_positions,
            apply_query_key_layer_scaling=apply_query_key_layer_scaling,
            max_batch_size=args.max_batch_size,
            max_input_len=args.max_input_len,
            max_output_len=args.max_output_len,
            fusion_pattern_list=["remove_dup_mask"])

        engine_name = get_engine_name(MODEL_NAME, args.dtype, args.world_size,
                                      cur_rank)
        engine = build_rank_engine(builder, builder_config, engine_name,
                                   cur_rank, args)
        assert engine is not None, f'Failed to build engine for rank {cur_rank}'

        # if cur_rank == 0:
        #     # Use in-memory timing cache for multiple builder passes.
        #     if not args.parallel_build:
        #         cache = builder_config.trt_builder_config.get_timing_cache()

        serialize_engine(engine, os.path.join(args.output_dir, engine_name))

    # if rank == 0:
    #     ok = builder.save_timing_cache(
    #         builder_config, os.path.join(args.output_dir, "model.cache"))
    #     assert ok, "Failed to save timing cache."


if __name__ == '__main__':
    args = parse_arguments()
    tik = time.time()
    if args.parallel_build and args.world_size > 1 and \
            torch.cuda.device_count() >= args.world_size:
        logger.warning(
            f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
        )
        mp.spawn(build, nprocs=args.world_size, args=(args, ))
    else:
        args.parallel_build = False
        logger.info('Serially build TensorRT engines.')
        build(0, args)

    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    logger.info(f'Total time of building all {args.world_size} engines: {t}')
16
examples/gptneox/get_weights.sh
Normal file
@@ -0,0 +1,16 @@
mkdir -p downloads
pushd downloads
rm -rf gptneox_model
git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model

rm -f gptneox_model/model-*.safetensors
rm -f gptneox_model/model.safetensors.index.json
wget -q https://huggingface.co/EleutherAI/gpt-neox-20b/resolve/main/model.safetensors.index.json --directory-prefix gptneox_model

for i in $(seq -f %05g 46)
do
    echo -n "Downloading $i of 00046..."
    wget -q https://huggingface.co/EleutherAI/gpt-neox-20b/resolve/main/model-$i-of-00046.safetensors --directory-prefix gptneox_model
    echo "Done"
done
popd
9
examples/gptneox/gptq_convert.sh
Normal file
@@ -0,0 +1,9 @@
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git GPTQ-for-LLaMa

pip install -r ./GPTQ-for-LLaMa/requirements.txt

CUDA_VISIBLE_DEVICES=0 python3 GPTQ-for-LLaMa/neox.py ./gptneox_model \
    wikitext2 \
    --wbits 4 \
    --groupsize 128 \
    --save_safetensors ./gptneox_model/gptneox-20b-4bit-gs128.safetensors
2
examples/gptneox/requirements.txt
Normal file
@@ -0,0 +1,2 @@
datasets~=2.3.2
rouge_score~=0.1.2
141
examples/gptneox/run.py
Normal file
@@ -0,0 +1,141 @@
|
||||
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import numpy as np
|
||||
|
||||
import torch
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
import xtrt_llm
|
||||
from xtrt_llm.runtime import ModelConfig, SamplingConfig
|
||||
|
||||
from build import get_engine_name # isort:skip
|
||||
|
||||
|
||||
def parse_arguments():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--max_output_len', type=int, required=True)
|
||||
parser.add_argument('--log_level', type=str, default='error')
|
||||
parser.add_argument('--engine_dir', type=str, default='gptneox_outputs')
|
||||
parser.add_argument('--tokenizer_dir',
|
||||
type=str,
|
||||
default="gptneox_model",
|
||||
help="Directory containing the tokenizer.model.")
|
||||
parser.add_argument('--input_text',
|
||||
type=str,
|
||||
default='Born in north-east France, Soyer trained as a')
|
||||
parser.add_argument('--performance_test_scale',
|
||||
type=str,
|
||||
help=
|
||||
"Scale for performance test. e.g., 8x1024x64 (batch_size, input_text_length, max_output_length)",
|
||||
default="")
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
args = parse_arguments()
|
||||
xtrt_llm.logger.set_level(args.log_level)
|
||||
|
||||
config_path = os.path.join(args.engine_dir, 'config.json')
|
||||
with open(config_path, 'r') as f:
|
||||
config = json.load(f)
|
||||
use_gpt_attention_plugin = config['plugin_config']['gpt_attention_plugin']
|
||||
remove_input_padding = config['plugin_config']['remove_input_padding']
|
||||
dtype = config['builder_config']['precision']
|
||||
world_size = config['builder_config']['tensor_parallel']
|
||||
assert world_size == xtrt_llm.mpi_world_size(), \
|
||||
f'Engine world size ({world_size}) != Runtime world size ({xtrt_llm.mpi_world_size()})'
|
||||
num_heads = config['builder_config']['num_heads'] // world_size
|
||||
hidden_size = config['builder_config']['hidden_size'] // world_size
|
||||
vocab_size = config['builder_config']['vocab_size']
|
||||
num_layers = config['builder_config']['num_layers']
|
||||
|
||||
runtime_rank = xtrt_llm.mpi_rank()
|
||||
if world_size > 1:
|
||||
os.environ["XCCL_GROUP_ID"] = str(runtime_rank // world_size)
|
||||
os.environ["XCCL_NRANKS"] = str(world_size)
|
||||
os.environ["XCCL_CUR_RANK"] = str(runtime_rank % world_size)
|
||||
os.environ["XCCL_DEVICE_ID"] = str(runtime_rank)
|
||||
os.environ["MP_RUN"] = str(1)
|
||||
runtime_mapping = xtrt_llm.Mapping(world_size,
|
||||
runtime_rank,
|
||||
tp_size=world_size)
|
||||
torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
|
||||
|
||||
engine_name = get_engine_name('gptneox', dtype, world_size, runtime_rank)
|
||||
#serialize_path = os.path.join(args.engine_dir, engine_name)
|
||||
serialize_path = str(args.engine_dir) + "/" + engine_name
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
|
||||
input_ids = torch.tensor(tokenizer.encode(args.input_text),
|
||||
dtype=torch.int32).cuda().unsqueeze(0)
|
||||
|
||||
model_config = ModelConfig(num_heads=num_heads,
|
||||
num_kv_heads=num_heads,
|
||||
hidden_size=hidden_size,
|
||||
vocab_size=vocab_size,
|
||||
num_layers=num_layers,
|
||||
gpt_attention_plugin=use_gpt_attention_plugin,
|
||||
remove_input_padding=remove_input_padding,
|
||||
dtype=dtype)
|
||||
|
||||
tokenizer.pad_token = tokenizer.eos_token
|
||||
end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
|
||||
pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
|
||||
|
||||
sampling_config = SamplingConfig(end_id=end_id, pad_id=pad_id)
|
||||
input_lengths = torch.tensor(
|
||||
[input_ids.size(1) for _ in range(input_ids.size(0))]).int().cuda()
|
||||
|
||||
#with open(serialize_path, 'rb') as f:
|
||||
# engine_buffer = f.read()
|
||||
decoder = xtrt_llm.runtime.GenerationSession(model_config,
|
||||
serialize_path,
|
||||
runtime_mapping,
|
||||
debug_mode=False)
|
||||
|
||||
if args.performance_test_scale != "":
|
||||
performance_test_scale_list = args.performance_test_scale.split("E")
|
||||
for scale in performance_test_scale_list:
|
||||
xtrt_llm.logger.info(f"Running performance test with scale {scale}")
|
||||
bs, seqlen, max_output_len = [int(x) for x in scale.split("x")]
|
||||
_input_ids = torch.from_numpy(
|
||||
np.zeros((bs, seqlen)).astype("int32")).cuda()
|
||||
_input_lengths = torch.from_numpy(
|
||||
np.full((bs, ), seqlen).astype("int32")).cuda()
|
||||
|
||||
import time
|
||||
_t_begin = time.time()
|
||||
decoder.setup(_input_ids.size(0), _input_ids.size(1), max_output_len)
|
||||
_output_ids = decoder.decode(_input_ids,
|
||||
_input_lengths,
|
||||
sampling_config)
|
||||
_t_end = time.time()
|
||||
xtrt_llm.logger.info(
|
||||
f"Total latency: {(_t_end - _t_begin) * 1000:.3f} ms")
|
||||
|
||||
if remove_input_padding:
|
||||
decoder.setup(1, torch.max(input_lengths).item(), args.max_output_len)
|
||||
else:
|
||||
decoder.setup(input_ids.size(0), input_ids.size(1), args.max_output_len)
|
||||
output_ids = decoder.decode(input_ids, input_lengths, sampling_config)
|
||||
torch.cuda.synchronize()
|
||||
|
||||
output_ids = output_ids.tolist()[0][0][input_ids.size(1):]
|
||||
output_text = tokenizer.decode(output_ids)
|
||||
print(f'Input: \"{args.input_text}\"')
|
||||
print(f'Output: \"{output_text}\"')
|
||||
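The `--performance_test_scale` handling above splits the argument on `E` and then on `x`. A minimal sketch of that parsing, assuming each case has the form `BATCHxSEQLENxMAX_OUTPUT_LEN`; the helper name `parse_performance_scale` and its error handling are illustrative and not part of `run.py`:

```python
from typing import List, Tuple


def parse_performance_scale(spec: str) -> List[Tuple[int, int, int]]:
    """Parse 'BxSxO' cases joined by 'E' into (batch, seq_len, max_output_len)."""
    if spec == "":
        # run.py skips the performance test entirely for an empty string.
        return []
    cases = []
    for scale in spec.split("E"):
        parts = scale.split("x")
        if len(parts) != 3:
            raise ValueError(f"expected BATCHxSEQLENxOUTPUT, got {scale!r}")
        bs, seqlen, max_output_len = (int(p) for p in parts)
        cases.append((bs, seqlen, max_output_len))
    return cases


if __name__ == "__main__":
    for case in parse_performance_scale("1x512x256E2x512x256"):
        print(case)
```

Validating the number of fields up front turns a malformed case into a clear error message instead of an unpacking failure in the middle of the benchmark loop.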
8
examples/gptneox/run.sh
Executable file
@@ -0,0 +1,8 @@
BKCL_PCIE_RING=1 PYTORCH_NO_XPU_MEMORY_CACHING=1 XMLIR_D_XPU_L3_SIZE=0 \
mpirun -n 2 --allow-run-as-root \
    python3 run.py \
        --max_output_len=50 \
        --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
        --tokenizer_dir=./downloads/gptneox_model \
        --performance_test_scale=1x512x256E2x512x256E4x512x256E8x512x256 \
        --log_level=info
373
examples/gptneox/summarize.py
Normal file
@@ -0,0 +1,373 @@
|
||||
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import argparse
|
||||
import copy
|
||||
import json
|
||||
import os
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from datasets import load_dataset, load_metric
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
import tensorrt_llm
|
||||
import tensorrt_llm.profiler as profiler
|
||||
from tensorrt_llm.logger import logger
|
||||
|
||||
from build import get_engine_name # isort:skip
|
||||
|
||||
|
||||
def TRTGPTNeoX(args, config):
|
||||
dtype = config['builder_config']['precision']
|
||||
world_size = config['builder_config']['tensor_parallel']
|
||||
assert world_size == tensorrt_llm.mpi_world_size(), \
|
||||
f'Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})'
|
||||
|
||||
world_size = config['builder_config']['tensor_parallel']
|
||||
num_heads = config['builder_config']['num_heads'] // world_size
|
||||
hidden_size = config['builder_config']['hidden_size'] // world_size
|
||||
vocab_size = config['builder_config']['vocab_size']
|
||||
num_layers = config['builder_config']['num_layers']
|
||||
use_gpt_attention_plugin = bool(
|
||||
config['plugin_config']['gpt_attention_plugin'])
|
||||
remove_input_padding = config['plugin_config']['remove_input_padding']
|
||||
|
||||
model_config = tensorrt_llm.runtime.ModelConfig(
|
||||
vocab_size=vocab_size,
|
||||
num_layers=num_layers,
|
||||
num_heads=num_heads,
|
||||
num_kv_heads=num_heads,
|
||||
hidden_size=hidden_size,
|
||||
gpt_attention_plugin=use_gpt_attention_plugin,
|
||||
remove_input_padding=remove_input_padding,
|
||||
dtype=dtype)
|
||||
|
||||
runtime_rank = tensorrt_llm.mpi_rank()
|
||||
runtime_mapping = tensorrt_llm.Mapping(world_size,
|
||||
runtime_rank,
|
||||
tp_size=world_size)
|
||||
torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
|
||||
|
||||
engine_name = get_engine_name('gptneox', dtype, world_size, runtime_rank)
|
||||
serialize_path = os.path.join(args.engine_dir, engine_name)
|
||||
|
||||
tensorrt_llm.logger.set_level(args.log_level)
|
||||
|
||||
with open(serialize_path, 'rb') as f:
|
||||
engine_buffer = f.read()
|
||||
decoder = tensorrt_llm.runtime.GenerationSession(model_config,
|
||||
engine_buffer,
|
||||
runtime_mapping)
|
||||
|
||||
return decoder
|
||||
|
||||
|
||||
def main(args):
|
||||
runtime_rank = tensorrt_llm.mpi_rank()
|
||||
logger.set_level(args.log_level)
|
||||
|
||||
test_hf = args.test_hf and runtime_rank == 0 # only run hf on rank 0
|
||||
test_trt_llm = args.test_trt_llm
|
||||
model_dir = args.model_dir
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_dir,
|
||||
padding_side='left',
|
||||
model_max_length=2048,
|
||||
truncation=True)
|
||||
tokenizer.pad_token = tokenizer.eos_token
|
||||
|
||||
dataset_cnn = load_dataset("ccdv/cnn_dailymail",
|
||||
'3.0.0',
|
||||
cache_dir=args.dataset_path)
|
||||
|
||||
config_path = os.path.join(args.engine_dir, 'config.json')
|
||||
with open(config_path, 'r') as f:
|
||||
config = json.load(f)
|
||||
|
||||
max_batch_size = args.batch_size
|
||||
|
||||
# runtime parameters
|
||||
# repetition_penalty = 1
|
||||
top_k = args.top_k
|
||||
output_len = args.output_len
|
||||
test_token_num = 923
|
||||
# top_p = 0.0
|
||||
# random_seed = 5
|
||||
temperature = 1
|
||||
num_beams = args.num_beams
|
||||
|
||||
pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
|
||||
end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
|
||||
|
||||
if test_trt_llm:
|
||||
tensorrt_llm_gpt = TRTGPTNeoX(args, config)
|
||||
|
||||
if test_hf:
|
||||
model = AutoModelForCausalLM.from_pretrained(model_dir)
|
||||
model.cuda()
|
||||
if args.data_type == 'fp16':
|
||||
model.half()
|
||||
|
||||
def summarize_tensorrt_llm(datapoint):
|
||||
batch_size = len(datapoint['article'])
|
||||
|
||||
line = copy.copy(datapoint['article'])
|
||||
line_encoded = []
|
||||
input_lengths = []
|
||||
for i in range(batch_size):
|
||||
line[i] = line[i] + ' TL;DR: '
|
||||
|
||||
line[i] = line[i].strip()
|
||||
line[i] = line[i].replace(" n't", "n't")
|
||||
|
||||
input_id = tokenizer.encode(line[i],
|
||||
return_tensors='pt').type(torch.int32)
|
||||
input_id = input_id[:, -test_token_num:]
|
||||
|
||||
line_encoded.append(input_id)
|
||||
input_lengths.append(input_id.shape[-1])
|
||||
|
||||
# Longest prompt in the batch; used as the padding target and as max_context_length.
|
||||
max_length = max(input_lengths)
|
||||
if tensorrt_llm_gpt.remove_input_padding:
|
||||
line_encoded = [
|
||||
torch.tensor(t, dtype=torch.int32).cuda() for t in line_encoded
|
||||
]
|
||||
else:
|
||||
# Right-pad every prompt to max_length. Ideally this would sit outside the profiled region so padding overhead is not counted.
|
||||
for i in range(batch_size):
|
||||
pad_size = max_length - input_lengths[i]
|
||||
|
||||
pad = torch.ones([1, pad_size]).type(torch.int32) * pad_id
|
||||
line_encoded[i] = torch.cat(
|
||||
[torch.tensor(line_encoded[i], dtype=torch.int32), pad],
|
||||
axis=-1)
|
||||
|
||||
line_encoded = torch.cat(line_encoded, axis=0).cuda()
|
||||
input_lengths = torch.tensor(input_lengths,
|
||||
dtype=torch.int32).cuda()
|
||||
|
||||
sampling_config = tensorrt_llm.runtime.SamplingConfig(
|
||||
end_id=end_id, pad_id=pad_id, top_k=top_k, num_beams=num_beams)
|
||||
|
||||
with torch.no_grad():
|
||||
tensorrt_llm_gpt.setup(batch_size,
|
||||
max_context_length=max_length,
|
||||
max_new_tokens=output_len,
|
||||
beam_width=num_beams)
|
||||
|
||||
if tensorrt_llm_gpt.remove_input_padding:
|
||||
output_ids = tensorrt_llm_gpt.decode_batch(
|
||||
line_encoded, sampling_config)
|
||||
else:
|
||||
output_ids = tensorrt_llm_gpt.decode(
|
||||
line_encoded,
|
||||
input_lengths,
|
||||
sampling_config,
|
||||
)
|
||||
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# Extract a list of tensors of shape beam_width x output_ids.
|
||||
if tensorrt_llm_gpt.mapping.is_first_pp_rank():
|
||||
output_beams_list = [
|
||||
tokenizer.batch_decode(output_ids[batch_idx, :,
|
||||
input_lengths[batch_idx]:],
|
||||
skip_special_tokens=True)
|
||||
for batch_idx in range(batch_size)
|
||||
]
|
||||
return output_beams_list, output_ids[:, :, max_length:].tolist()
|
||||
return [], []
|
||||
|
||||
def summarize_hf(datapoint):
|
||||
batch_size = len(datapoint['article'])
|
||||
if batch_size > 1:
|
||||
logger.warning(
|
||||
f"HF does not support batch_size > 1 to verify correctness due to padding. Current batch size is {batch_size}"
|
||||
)
|
||||
|
||||
line = copy.copy(datapoint['article'])
|
||||
for i in range(batch_size):
|
||||
line[i] = line[i] + ' TL;DR: '
|
||||
|
||||
line[i] = line[i].strip()
|
||||
line[i] = line[i].replace(" n't", "n't")
|
||||
|
||||
line_encoded = tokenizer(line,
|
||||
return_tensors='pt',
|
||||
padding=True,
|
||||
truncation=True)["input_ids"].type(torch.int64)
|
||||
|
||||
line_encoded = line_encoded[:, -test_token_num:]
|
||||
line_encoded = line_encoded.cuda()
|
||||
|
||||
with torch.no_grad():
|
||||
output = model.generate(line_encoded,
|
||||
max_length=len(line_encoded[0]) +
|
||||
output_len,
|
||||
top_k=top_k,
|
||||
temperature=temperature,
|
||||
eos_token_id=tokenizer.eos_token_id,
|
||||
pad_token_id=tokenizer.pad_token_id,
|
||||
num_beams=num_beams,
|
||||
num_return_sequences=num_beams,
|
||||
early_stopping=True)
|
||||
|
||||
tokens_list = output[:, len(line_encoded[0]):].tolist()
|
||||
output = output.reshape([batch_size, num_beams, -1])
|
||||
output_lines_list = [
|
||||
tokenizer.batch_decode(output[:, i, len(line_encoded[0]):],
|
||||
skip_special_tokens=True)
|
||||
for i in range(num_beams)
|
||||
]
|
||||
|
||||
return output_lines_list, tokens_list
|
||||
|
||||
if test_trt_llm:
|
||||
datapoint = dataset_cnn['test'][0:1]
|
||||
summary, _ = summarize_tensorrt_llm(datapoint)
|
||||
if runtime_rank == 0:
|
||||
logger.info(
|
||||
"---------------------------------------------------------")
|
||||
logger.info("XTRT-LLM Generated : ")
|
||||
logger.info(f" Article : {datapoint['article']}")
|
||||
logger.info(f"\n Highlights : {datapoint['highlights']}")
|
||||
logger.info(f"\n Summary : {summary}")
|
||||
logger.info(
|
||||
"---------------------------------------------------------")
|
||||
|
||||
if test_hf:
|
||||
datapoint = dataset_cnn['test'][0:1]
|
||||
summary, _ = summarize_hf(datapoint)
|
||||
logger.info("---------------------------------------------------------")
|
||||
logger.info("HF Generated : ")
|
||||
logger.info(f" Article : {datapoint['article']}")
|
||||
logger.info(f"\n Highlights : {datapoint['highlights']}")
|
||||
logger.info(f"\n Summary : {summary}")
|
||||
logger.info("---------------------------------------------------------")
|
||||
|
||||
metric_tensorrt_llm = [load_metric("rouge") for _ in range(num_beams)]
|
||||
metric_hf = [load_metric("rouge") for _ in range(num_beams)]
|
||||
for i in range(num_beams):
|
||||
metric_tensorrt_llm[i].seed = 0
|
||||
metric_hf[i].seed = 0
|
||||
|
||||
ite_count = 0
|
||||
data_point_idx = 0
|
||||
while (data_point_idx < len(dataset_cnn['test'])) and (ite_count <
|
||||
args.max_ite):
|
||||
if runtime_rank == 0:
|
||||
logger.debug(
|
||||
f"run data_point {data_point_idx} ~ {data_point_idx + max_batch_size}"
|
||||
)
|
||||
datapoint = dataset_cnn['test'][data_point_idx:(data_point_idx +
|
||||
max_batch_size)]
|
||||
|
||||
if test_trt_llm:
|
||||
profiler.start('tensorrt_llm')
|
||||
summary_tensorrt_llm, tokens_tensorrt_llm = summarize_tensorrt_llm(
|
||||
datapoint)
|
||||
profiler.stop('tensorrt_llm')
|
||||
|
||||
if test_hf:
|
||||
profiler.start('hf')
|
||||
summary_hf, tokens_hf = summarize_hf(datapoint)
|
||||
profiler.stop('hf')
|
||||
|
||||
if runtime_rank == 0:
|
||||
if test_trt_llm:
|
||||
for batch_idx in range(len(summary_tensorrt_llm)):
|
||||
for beam_idx in range(num_beams):
|
||||
metric_tensorrt_llm[beam_idx].add_batch(
|
||||
predictions=[
|
||||
summary_tensorrt_llm[batch_idx][beam_idx]
|
||||
],
|
||||
references=[datapoint['highlights'][batch_idx]])
|
||||
if test_hf:
|
||||
for beam_idx in range(num_beams):
|
||||
for i in range(len(summary_hf[beam_idx])):
|
||||
metric_hf[beam_idx].add_batch(
|
||||
predictions=[summary_hf[beam_idx][i]],
|
||||
references=[datapoint['highlights'][i]])
|
||||
|
||||
logger.debug('-' * 100)
|
||||
logger.debug(f"Article : {datapoint['article']}")
|
||||
if test_trt_llm:
|
||||
logger.debug(f'XTRT-LLM Summary: {summary_tensorrt_llm}')
|
||||
if test_hf:
|
||||
logger.debug(f'HF Summary: {summary_hf}')
|
||||
logger.debug(f"highlights : {datapoint['highlights']}")
|
||||
|
||||
data_point_idx += max_batch_size
|
||||
ite_count += 1
|
||||
|
||||
if runtime_rank == 0:
|
||||
if test_trt_llm:
|
||||
np.random.seed(0)  # ROUGE aggregation uses random sampling; seed NumPy for reproducible scores
|
||||
logger.info(
|
||||
f'XTRT-LLM (total latency: {profiler.elapsed_time_in_sec("tensorrt_llm")} sec)'
|
||||
)
|
||||
for beam_idx in range(num_beams):
|
||||
logger.info(f"XTRT-LLM beam {beam_idx} result")
|
||||
computed_metrics_tensorrt_llm = metric_tensorrt_llm[
|
||||
beam_idx].compute()
|
||||
for key in computed_metrics_tensorrt_llm.keys():
|
||||
logger.info(
|
||||
f' {key} : {computed_metrics_tensorrt_llm[key].mid[2]*100}'
|
||||
)
|
||||
|
||||
if args.check_accuracy and beam_idx == 0:
|
||||
assert computed_metrics_tensorrt_llm['rouge1'].mid[
|
||||
2] * 100 > args.tensorrt_llm_rouge1_threshold
|
||||
if test_hf:
|
||||
np.random.seed(0)  # ROUGE aggregation uses random sampling; seed NumPy for reproducible scores
|
||||
logger.info(
|
||||
f'Hugging Face (total latency: {profiler.elapsed_time_in_sec("hf")} sec)'
|
||||
)
|
||||
for beam_idx in range(num_beams):
|
||||
logger.info(f"HF beam {beam_idx} result")
|
||||
computed_metrics_hf = metric_hf[beam_idx].compute()
|
||||
for key in computed_metrics_hf.keys():
|
||||
logger.info(
|
||||
f' {key} : {computed_metrics_hf[key].mid[2]*100}')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--model_dir',
|
||||
type=str,
|
||||
default='EleutherAI/gpt-neox-20b')
|
||||
parser.add_argument('--test_hf', action='store_true')
|
||||
parser.add_argument('--test_trt_llm', action='store_true')
|
||||
parser.add_argument('--data_type',
|
||||
type=str,
|
||||
choices=['fp32', 'fp16'],
|
||||
default='fp32')
|
||||
parser.add_argument('--dataset_path', type=str, default='')
|
||||
parser.add_argument('--log_level', type=str, default='info')
|
||||
parser.add_argument('--engine_dir', type=str, default='gptneox_engine')
|
||||
parser.add_argument('--batch_size', type=int, default=1)
|
||||
parser.add_argument('--max_ite', type=int, default=20)
|
||||
parser.add_argument('--output_len', type=int, default=100)
|
||||
parser.add_argument('--check_accuracy', action='store_true')
|
||||
parser.add_argument('--tensorrt_llm_rouge1_threshold',
|
||||
type=float,
|
||||
default=15.0)
|
||||
parser.add_argument('--num_beams', type=int, default=1)
|
||||
parser.add_argument('--top_k', type=int, default=1)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
main(args)
|
||||
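In `summarize.py`, the padded path of `summarize_tensorrt_llm` keeps only the last `test_token_num` tokens of each prompt and then right-pads every prompt to the longest one in the batch. A minimal sketch of that step on plain Python lists rather than tensors; `truncate_and_pad` is a hypothetical helper name and does not exist in the script:

```python
from typing import List, Tuple


def truncate_and_pad(batch: List[List[int]], max_tokens: int,
                     pad_id: int) -> Tuple[List[List[int]], List[int]]:
    """Keep the last max_tokens ids of each prompt, then right-pad to equal length.

    Assumes a non-empty batch and max_tokens > 0.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Mirrors input_id[:, -test_token_num:]: the tail of the prompt is kept.
    truncated = [ids[-max_tokens:] for ids in batch]
    lengths = [len(ids) for ids in truncated]
    max_length = max(lengths)
    # Mirrors torch.cat([ids, pad], axis=-1): padding goes on the right.
    padded = [ids + [pad_id] * (max_length - len(ids)) for ids in truncated]
    return padded, lengths


if __name__ == "__main__":
    padded, lengths = truncate_and_pad([[1, 2, 3, 4, 5], [6, 7]], 3, 0)
    print(padded, lengths)
```

The unpadded lengths are returned alongside the padded batch because the decoder needs them to know where each prompt ends and generation begins.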
464
examples/gptneox/weight.py
Normal file
@@ -0,0 +1,464 @@
|
||||
# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import time
|
||||
from operator import attrgetter
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
import xtrt_llm
|
||||
from xtrt_llm._utils import pad_vocab_size
|
||||
from xtrt_llm.models import GPTNeoXForCausalLM
|
||||
from xtrt_llm._utils import str_dtype_to_torch
|
||||
|
||||
UINT4_TO_INT4_FLAG = 1
|
||||
GPTQ_FLAG = 1
|
||||
GROUP_SIZE = 128
|
||||
|
||||
|
||||
def numpy_split(v, tp_size, idx, dim=0):
    if tp_size == 1:
        return v
    else:
        return np.ascontiguousarray(np.split(v, tp_size, axis=dim)[idx])
|
||||
|
||||
|
||||
def torch_split(v, tp_size, idx, dim=0):
    if tp_size == 1:
        return v
    else:
        return (torch.split(v, v.shape[dim] // tp_size,
                            dim=dim)[idx]).contiguous()
|
||||
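Both split helpers above hand each tensor-parallel rank one equal slice of a weight along a chosen dimension. A minimal sketch of the same idea on nested lists, with no NumPy or PyTorch dependency; `split_matrix` is a hypothetical stand-in for illustration, not a function from `weight.py`:

```python
from typing import List

Matrix = List[List[float]]


def split_matrix(v: Matrix, tp_size: int, idx: int, dim: int = 0) -> Matrix:
    """Return the idx-th of tp_size equal slices of v along rows (0) or columns (1)."""
    if tp_size == 1:
        return v
    size = len(v) if dim == 0 else len(v[0])
    if size % tp_size != 0:
        # np.split with an integer section count also rejects uneven splits.
        raise ValueError(f"dimension of size {size} is not divisible by {tp_size}")
    chunk = size // tp_size
    lo, hi = idx * chunk, (idx + 1) * chunk
    if dim == 0:
        return [row[:] for row in v[lo:hi]]
    return [row[lo:hi] for row in v]


if __name__ == "__main__":
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(split_matrix(w, 2, 0, dim=1))  # column shard for rank 0
    print(split_matrix(w, 2, 1, dim=0))  # row shard for rank 1
```

In `load_from_hf_gpt_neox`, `dense_h_to_4h` weights are split along dimension 0 and `dense_4h_to_h` weights along dimension 1, so each rank holds matching halves of the MLP.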
|
||||
|
||||
def unpack_int32_into_int8(w_packed):
    # Unpack inputs packed in int32/float32 into uint4 and store them in int8 format
    w_packed_int4x2 = w_packed.contiguous().view(torch.uint8)
    w_unpacked = torch.zeros(w_packed_int4x2.shape[0],
                             w_packed_int4x2.shape[1] * 2,
                             dtype=torch.int8)
    w_unpacked[:, ::2] = w_packed_int4x2 % 16
    w_unpacked[:, 1::2] = w_packed_int4x2 // 16
    return w_unpacked.contiguous()
|
||||
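`unpack_int32_into_int8` views each packed 32-bit word as four bytes and emits the low nibble of every byte followed by the high nibble. A minimal pure-Python sketch of that arithmetic for a single word, assuming little-endian byte order (the usual case for the `view(torch.uint8)` reinterpretation); the function name `unpack_word_to_nibbles` is illustrative only:

```python
from typing import List


def unpack_word_to_nibbles(word: int) -> List[int]:
    """Split one 32-bit word into eight unsigned 4-bit values, low nibble first."""
    raw = (word & 0xFFFFFFFF).to_bytes(4, "little")
    nibbles = []
    for byte in raw:
        nibbles.append(byte % 16)   # low nibble, like w_packed_int4x2 % 16
        nibbles.append(byte // 16)  # high nibble, like w_packed_int4x2 // 16
    return nibbles


if __name__ == "__main__":
    unsigned = unpack_word_to_nibbles(0x76543210)
    # The loader subtracts 8 from unpacked weights to move uint4 into signed int4 range.
    signed = [n - 8 for n in unsigned]
    print(unsigned, signed)
```

Masking with `0xFFFFFFFF` lets the sketch accept negative Python integers, which is how a signed int32 with its top bit set would appear.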
|
||||
|
||||
def preprocess_groupwise_weight_params(qweight_unpacked_int8, scales_fp16,
                                       qzeros_unpacked_int8):
    packer = torch.ops.fastertransformer.pack_int8_tensor_to_packed_int4
    preprocessor = torch.ops.fastertransformer.preprocess_weights_for_mixed_gemm

    qweight_interleaved = preprocessor(packer(qweight_unpacked_int8),
                                       torch.quint4x2).view(torch.float32)

    # zeros = zeros * scales
    zeros_x_scales_fp16 = (-qzeros_unpacked_int8 + 8 * UINT4_TO_INT4_FLAG -
                           GPTQ_FLAG) * scales_fp16
    zeros_x_scales_fp16 = zeros_x_scales_fp16.half()

    # return processed interleaved weight, original scales and zeros * scales
    return qweight_interleaved.contiguous().numpy(), scales_fp16.contiguous(
    ).numpy(), zeros_x_scales_fp16.contiguous().numpy()
|
||||
|
||||
|
||||
def load_from_hf_gpt_neox(xtrt_llm_gpt_neox: GPTNeoXForCausalLM,
|
||||
hf_gpt_neox,
|
||||
dtype="float32",
|
||||
rank=0,
|
||||
tp_size=1,
|
||||
use_weight_only_groupwise_quant_matmul_plugin=False):
|
||||
|
||||
hf_model_gptneox_block_names = [
|
||||
"input_layernorm.weight",
|
||||
"input_layernorm.bias",
|
||||
"post_attention_layernorm.weight",
|
||||
"post_attention_layernorm.bias",
|
||||
]
|
||||
|
||||
xtrt_llm_model_gptneox_block_names = [
|
||||
"input_layernorm.weight",
|
||||
"input_layernorm.bias",
|
||||
"post_attention_layernorm.weight",
|
||||
"post_attention_layernorm.bias",
|
||||
]
|
||||
|
||||
if not use_weight_only_groupwise_quant_matmul_plugin:
|
||||
hf_model_gptneox_block_names += [
|
||||
"attention.dense.weight",
|
||||
"attention.dense.bias",
|
||||
"mlp.dense_h_to_4h.weight",
|
||||
"mlp.dense_h_to_4h.bias",
|
||||
"mlp.dense_4h_to_h.weight",
|
||||
"mlp.dense_4h_to_h.bias",
|
||||
]
|
||||
xtrt_llm_model_gptneox_block_names += [
|
||||
"attention.dense.weight",
|
||||
"attention.dense.bias",
|
||||
"mlp.fc.weight",
|
||||
"mlp.fc.bias",
|
||||
"mlp.proj.weight",
|
||||
"mlp.proj.bias",
|
||||
]
|
||||
|
||||
if not use_weight_only_groupwise_quant_matmul_plugin:
|
||||
xtrt_llm.logger.info('Loading weights from HF GPT-NeoX...')
|
||||
else:
|
||||
xtrt_llm.logger.info(
|
||||
'Loading weights from GPTQ quantized HF GPT-NeoX...')
|
||||
|
||||
tik = time.time()
|
||||
|
||||
torch_dtype = str_dtype_to_torch(dtype)
|
||||
hf_gpt_neox_state_dict = hf_gpt_neox.state_dict()
|
||||
|
||||
# [vocab_size, hidden_size]
|
||||
v = hf_gpt_neox_state_dict.get('gpt_neox.embed_in.weight').to(
|
||||
torch_dtype).cpu().numpy()
|
||||
if xtrt_llm_gpt_neox._use_parallel_embedding:
|
||||
v = numpy_split(v, tp_size, rank,
|
||||
xtrt_llm_gpt_neox._embedding_sharding_dim)
|
||||
xtrt_llm_gpt_neox.embedding.weight.value = v
|
||||
|
||||
n_layer = hf_gpt_neox.config.num_hidden_layers
|
||||
|
||||
for layer_idx in range(n_layer):
|
||||
prefix = "gpt_neox.layers." + str(layer_idx) + "."
|
||||
for idx, hf_attr in enumerate(hf_model_gptneox_block_names):
|
||||
v = hf_gpt_neox_state_dict.get(prefix + hf_attr).to(
|
||||
torch_dtype).cpu().numpy()
|
||||
|
||||
layer = attrgetter(xtrt_llm_model_gptneox_block_names[idx])(
|
||||
xtrt_llm_gpt_neox.layers[layer_idx])
|
||||
|
||||
if tp_size > 1:
|
||||
if 'dense.weight' in hf_attr:
|
||||
# [n=hidden_size, k=hidden_size] ->
|
||||
# [n=hidden_size, k=hidden_size // tp_size]
|
||||
split_v = numpy_split(v, tp_size, rank, dim=1)
|
||||
elif 'dense_h_to_4h.weight' in hf_attr:
|
||||
# [hidden_size * 4, hidden_size] ->
|
||||
# [hidden_size * 4 // tp_size, hidden_size]
|
||||
split_v = numpy_split(v, tp_size, rank, dim=0)
|
||||
elif 'dense_h_to_4h.bias' in hf_attr:
|
||||
# [hidden_size * 4] -> [hidden_size * 4 // tp_size]
|
||||
split_v = numpy_split(v, tp_size, rank, dim=0)
|
||||
elif 'dense_4h_to_h.weight' in hf_attr:
|
||||
# [hidden_size, hidden_size * 4] ->
|
||||
# [hidden_size, hidden_size * 4 // tp_size]
|
||||
split_v = numpy_split(v, tp_size, rank, dim=1)
|
||||
else:
|
||||
split_v = v
|
||||
setattr(layer, 'value', split_v)
|
||||
else:
|
||||
setattr(layer, 'value', v)
|
||||
|
||||
num_heads = hf_gpt_neox.config.num_attention_heads
|
||||
hidden_size = hf_gpt_neox.config.hidden_size
|
||||
head_size = hidden_size // num_heads
|
||||
|
||||
if not use_weight_only_groupwise_quant_matmul_plugin:
|
||||
# Attention QKV Linear
|
||||
# qkv_weights [num_heads x (q|k|v), hidden_size] ->
|
||||
# [(num_heads x q)|(num_heads x k)|(num_heads x v), hidden_size]
|
||||
qkv_weights = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.query_key_value.weight")
|
||||
qkv_bias = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.query_key_value.bias")
|
||||
|
||||
new_qkv_weight_shape = torch.Size(
|
||||
[num_heads, 3, head_size * qkv_weights.size()[-1]])
|
||||
new_qkv_bias_shape = torch.Size([num_heads, 3, head_size])
|
||||
|
||||
qkv_weights = qkv_weights.view(new_qkv_weight_shape).permute(
|
||||
1, 0, 2).reshape([hidden_size * 3, hidden_size])
|
||||
qkv_bias = qkv_bias.view(new_qkv_bias_shape).permute(
|
||||
1, 0, 2).reshape([hidden_size * 3])
|
||||
|
||||
if tp_size > 1:
|
||||
qkv_weights = qkv_weights.reshape(
|
||||
3, hidden_size, hidden_size).to(torch_dtype).cpu().numpy()
|
||||
split_qkv_weights = numpy_split(
|
||||
qkv_weights, tp_size, rank,
|
||||
dim=1).reshape(3 * (hidden_size // tp_size), hidden_size)
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.weight.value = \
|
||||
np.ascontiguousarray(split_qkv_weights)
|
||||
|
||||
qkv_bias = qkv_bias.reshape(
|
||||
3, hidden_size).to(torch_dtype).cpu().numpy()
|
||||
split_qkv_bias = numpy_split(qkv_bias, tp_size, rank,
|
||||
dim=1).reshape(
|
||||
3 * (hidden_size // tp_size))
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
|
||||
np.ascontiguousarray(split_qkv_bias)
|
||||
else:
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.weight.value = \
|
||||
qkv_weights.to(torch_dtype).cpu().numpy()
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
|
||||
qkv_bias.to(torch_dtype).cpu().numpy()
|
||||
else:
|
||||
# use_weight_only_groupwise_quant_matmul_plugin
|
||||
|
||||
qweight_int32 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.query_key_value.qweight")
|
||||
scales_fp16 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.query_key_value.scales")
|
||||
qzeros_int32 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.query_key_value.qzeros")
|
||||
biases_fp16 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.query_key_value.bias")
|
||||
|
||||
# [hidden_size // 8, hidden_size * 3] -> [hidden_size * 3, hidden_size]
|
||||
qweight_unpacked_int8 = unpack_int32_into_int8(
|
||||
qweight_int32.T).contiguous() - 8
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 3 // 8] ->
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 3]
|
||||
qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
|
||||
|
||||
# qkv_weights [num_heads x (q|k|v), hidden_size] ->
|
||||
# [(num_heads x q)|(num_heads x k)|(num_heads x v), hidden_size]
|
||||
new_qkv_weight_shape = torch.Size(
|
||||
[num_heads, 3, head_size * qweight_unpacked_int8.size()[-1]])
|
||||
# [hidden_size * 3, hidden_size]
|
||||
qweight_unpacked_int8 = qweight_unpacked_int8.view(
|
||||
new_qkv_weight_shape).permute(1, 0, 2).reshape(
|
||||
[hidden_size * 3, hidden_size]).contiguous()
|
||||
|
||||
new_qkv_scale_shape = torch.Size(
|
||||
[num_heads, 3, head_size * (hidden_size // GROUP_SIZE)])
|
||||
# [hidden_size * 3, hidden_size // GROUP_SIZE]
|
||||
scales_fp16 = scales_fp16.T.contiguous().view(
|
||||
new_qkv_scale_shape).permute(1, 0, 2).reshape(
|
||||
[hidden_size * 3, hidden_size // GROUP_SIZE]).contiguous()
|
||||
|
||||
new_qkv_zero_shape = torch.Size(
|
||||
[num_heads, 3, head_size * (hidden_size // GROUP_SIZE)])
|
||||
# [hidden_size * 3, hidden_size // GROUP_SIZE]
|
||||
qzeros_unpacked_int8 = qzeros_unpacked_int8.T.contiguous().view(
|
||||
new_qkv_zero_shape).permute(1, 0, 2).reshape(
|
||||
[hidden_size * 3, hidden_size // GROUP_SIZE]).contiguous()
|
||||
|
||||
new_qkv_bias_shape = torch.Size([num_heads, 3, head_size])
|
||||
biases_fp16 = biases_fp16.view(new_qkv_bias_shape).permute(
|
||||
1, 0, 2).reshape([hidden_size * 3]).numpy()
|
||||
|
||||
if tp_size > 1:
|
||||
qweight_unpacked_int8 = qweight_unpacked_int8.reshape(
|
||||
[3, hidden_size, hidden_size])
|
||||
qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=1)
|
||||
qweight_unpacked_int8 = qweight_unpacked_int8.reshape(
|
||||
[3 * hidden_size // tp_size, hidden_size])
|
||||
|
||||
scales_fp16 = scales_fp16.reshape(
|
||||
[3, hidden_size, hidden_size // GROUP_SIZE])
|
||||
scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=1)
|
||||
scales_fp16 = scales_fp16.reshape(
|
||||
[3 * hidden_size // tp_size, hidden_size // GROUP_SIZE])
|
||||
|
||||
qzeros_unpacked_int8 = qzeros_unpacked_int8.reshape(
|
||||
[3, hidden_size, hidden_size // GROUP_SIZE])
|
||||
qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=1)
|
||||
qzeros_unpacked_int8 = qzeros_unpacked_int8.reshape(
|
||||
[3 * hidden_size // tp_size, hidden_size // GROUP_SIZE])
|
||||
|
||||
biases_fp16 = biases_fp16.reshape([3, hidden_size])
|
||||
biases_fp16 = numpy_split(biases_fp16, tp_size, rank, dim=1)
|
||||
biases_fp16 = biases_fp16.reshape([3 * hidden_size // tp_size])
|
||||
|
||||
qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
|
||||
qweight_unpacked_int8.T.contiguous(),
|
||||
scales_fp16.T.contiguous(), qzeros_unpacked_int8.T.contiguous())
|
||||
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.qweight.value = \
|
||||
qweight_fp32
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.scale.value = \
|
||||
scales_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.zero.value = \
|
||||
zeros_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
|
||||
biases_fp16
|
||||
|
||||
qweight_int32 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.dense.qweight")
|
||||
scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
|
||||
"attention.dense.scales")
|
||||
qzeros_int32 = hf_gpt_neox_state_dict.get(prefix +
|
||||
"attention.dense.qzeros")
|
||||
biases_fp16 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "attention.dense.bias").numpy()
|
||||
|
||||
# [k=hidden_size // 8, n=hidden_size] -> [n=hidden_size, k=hidden_size]
|
||||
qweight_unpacked_int8 = unpack_int32_into_int8(
|
||||
qweight_int32.T).contiguous() - 8
|
||||
# [n=hidden_size, k=hidden_size] -> [k=hidden_size, n=hidden_size]
|
||||
qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
|
||||
# [k=hidden_size // GROUP_SIZE, n=hidden_size // 8] ->
|
||||
# [k=hidden_size // GROUP_SIZE, n=hidden_size]
|
||||
qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
|
||||
|
||||
if tp_size > 1:
|
||||
qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=0)
|
||||
scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=0)
|
||||
qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=0)
|
||||
if rank > 0:
|
||||
biases_fp16 = np.zeros_like(biases_fp16)
|
||||
|
||||
qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
|
||||
qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)
|
||||
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.qweight.value = \
|
||||
qweight_fp32
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.scale.value = \
|
||||
scales_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.zero.value = \
|
||||
zeros_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.bias.value = \
|
||||
biases_fp16
|
||||
|
||||
qweight_int32 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "mlp.dense_h_to_4h.qweight")
|
||||
scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
|
||||
"mlp.dense_h_to_4h.scales")
|
||||
qzeros_int32 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "mlp.dense_h_to_4h.qzeros")
|
||||
biases_fp16 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "mlp.dense_h_to_4h.bias").numpy()
|
||||
|
||||
# [hidden_size // 8, hidden_size * 4] -> [hidden_size, hidden_size * 4]
|
||||
qweight_unpacked_int8 = unpack_int32_into_int8(
|
||||
qweight_int32.T).contiguous() - 8
|
||||
qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
|
||||
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 4 // 8] ->
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 4]
|
||||
qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
|
||||
|
||||
if tp_size > 1:
|
||||
# [hidden_size, hidden_size * 4] ->
|
||||
# [hidden_size, hidden_size * 4 // tp_size]
|
||||
qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=1)
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 4] ->
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 4 // tp_size]
|
||||
scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=1)
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 4] ->
|
||||
# [hidden_size // GROUP_SIZE, hidden_size * 4 // tp_size]
|
||||
qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=1)
|
||||
# [hidden_size * 4] -> [hidden_size * 4 // tp_size]
|
||||
biases_fp16 = numpy_split(biases_fp16, tp_size, rank, dim=0)
|
||||
|
||||
qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
|
||||
qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)
|
||||
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.qweight.value = \
|
||||
qweight_fp32
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.scale.value = \
|
||||
scales_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.zero.value = \
|
||||
zeros_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.bias.value = \
|
||||
biases_fp16
|
||||
|
||||
qweight_int32 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "mlp.dense_4h_to_h.qweight")
|
||||
scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
|
||||
"mlp.dense_4h_to_h.scales")
|
||||
qzeros_int32 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "mlp.dense_4h_to_h.qzeros")
|
||||
biases_fp16 = hf_gpt_neox_state_dict.get(
|
||||
prefix + "mlp.dense_4h_to_h.bias").numpy()
|
||||
|
||||
# [hidden_size * 4 // 8, hidden_size] -> [hidden_size * 4, hidden_size]
|
||||
qweight_unpacked_int8 = unpack_int32_into_int8(
|
||||
qweight_int32.T).contiguous() - 8
|
||||
qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
|
||||
|
||||
# [hidden_size * 4 // GROUP_SIZE, hidden_size // 8] ->
|
||||
# [hidden_size * 4 // GROUP_SIZE, hidden_size]
|
||||
qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
|
||||
|
||||
if tp_size > 1:
|
||||
# [hidden_size * 4, hidden_size] ->
|
||||
# [hidden_size * 4 // tp_size, hidden_size]
|
||||
qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=0)
|
||||
# [hidden_size * 4 // GROUP_SIZE, hidden_size] ->
|
||||
# [hidden_size * 4 // GROUP_SIZE // tp_size, hidden_size]
|
||||
scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=0)
|
||||
# [hidden_size * 4 // GROUP_SIZE, hidden_size] ->
|
||||
# [hidden_size * 4 // GROUP_SIZE // tp_size, hidden_size]
|
||||
qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
|
||||
tp_size,
|
||||
rank,
|
||||
dim=0)
|
||||
if rank > 0:
|
||||
biases_fp16 = np.zeros_like(biases_fp16)
|
||||
|
||||
qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
|
||||
qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)
|
||||
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.qweight.value = \
|
||||
qweight_fp32
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.scale.value = \
|
||||
scales_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.zero.value = \
|
||||
zeros_fp16
|
||||
xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.bias.value = \
|
||||
biases_fp16
|
||||
|
||||
v = hf_gpt_neox_state_dict.get('gpt_neox.final_layer_norm.weight')
|
||||
xtrt_llm_gpt_neox.ln_f.weight.value = v.to(torch_dtype).cpu().numpy()
|
||||
|
||||
v = hf_gpt_neox_state_dict.get('gpt_neox.final_layer_norm.bias')
|
||||
xtrt_llm_gpt_neox.ln_f.bias.value = v.to(torch_dtype).cpu().numpy()
|
||||
|
||||
v = hf_gpt_neox_state_dict.get('embed_out.weight').to(
|
||||
torch_dtype).cpu().numpy()
|
||||
if tp_size > 1:
|
||||
# [vocab_size, hidden_size] ->
|
||||
# [vocab_size // tp_size, hidden_size]
|
||||
if v.shape[0] % tp_size != 0:
|
||||
# padding
|
||||
vocab_size_padded = pad_vocab_size(v.shape[0], tp_size)
|
||||
pad_width = vocab_size_padded - v.shape[0]
|
||||
v = np.pad(v, ((0, pad_width), (0, 0)),
|
||||
'constant',
|
||||
constant_values=0)
|
||||
|
||||
split_v = numpy_split(v, tp_size, rank, dim=0)
|
||||
xtrt_llm_gpt_neox.lm_head.weight.value = split_v
|
||||
else:
|
||||
xtrt_llm_gpt_neox.lm_head.weight.value = v
|
||||
|
||||
tok = time.time()
|
||||
t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
|
||||
xtrt_llm.logger.info(f'Weights loaded. Total time: {t}')
|
||||
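The QKV handling in `weight.py` first reorders the fused Hugging Face weight from a per-head `q|k|v` layout into three contiguous blocks (all `q`, all `k`, all `v`), then gives each tensor-parallel rank its share of every block. A minimal sketch of both steps on a list of rows, where each row stands for one output row of the fused weight; the names `reorder_qkv_rows` and `shard_qkv_rows` are hypothetical and chosen only for this illustration:

```python
from typing import List, Sequence, TypeVar

Row = TypeVar("Row")


def reorder_qkv_rows(rows: Sequence[Row], num_heads: int,
                     head_size: int) -> List[Row]:
    """[num_heads, 3, head_size] row order -> [3, num_heads, head_size] row order."""
    if len(rows) != num_heads * 3 * head_size:
        raise ValueError("row count must equal num_heads * 3 * head_size")
    out: List[Row] = []
    for part in range(3):  # 0 = q, 1 = k, 2 = v
        for head in range(num_heads):
            start = (head * 3 + part) * head_size
            out.extend(rows[start:start + head_size])
    return out


def shard_qkv_rows(reordered: Sequence[Row], tp_size: int,
                   rank: int) -> List[Row]:
    """Take this rank's slice of the q, k and v blocks and concatenate them."""
    per_part = len(reordered) // 3
    if per_part % tp_size != 0:
        raise ValueError("each of q, k, v must divide evenly across ranks")
    chunk = per_part // tp_size
    out: List[Row] = []
    for part in range(3):
        start = part * per_part + rank * chunk
        out.extend(reordered[start:start + chunk])
    return out


if __name__ == "__main__":
    hf_rows = ["q0", "k0", "v0", "q1", "k1", "v1"]  # 2 heads, head_size 1
    reordered = reorder_qkv_rows(hf_rows, num_heads=2, head_size=1)
    print(reordered)
    print(shard_qkv_rows(reordered, tp_size=2, rank=1))
```

Reordering before sharding is what lets the loader split along a single dimension of a `[3, hidden_size, hidden_size]` view and still hand each rank whole attention heads.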