add pkgs

2025-08-06 15:49:14 +08:00
parent e80b916c52
commit bf00e72fb2
111 changed files with 21880 additions and 1 deletions
--- a/examples/gptneox/.gitignore
+++ b/examples/gptneox/.gitignore
@@ -0,0 +1,3 @@
+__pycache__/
+gptneox_model/
+*.log
--- a/examples/gptneox/README.md
+++ b/examples/gptneox/README.md
@@ -0,0 +1,93 @@
+# GPT-NeoX
+
+This document explains how to build the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model using XTRT-LLM and run it on a single node with multiple XPUs.
+
+## Overview
+
+The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). There are several main files in that folder:
+
+ * [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-NeoX model,
+ * [`run.py`](./run.py) to run inference on an input text.
+
+
+## Support Matrix
+  * FP16
+  * INT8 Weight-Only
+  * Tensor Parallel
+
+## Usage
+
+### 1. Download weights from HuggingFace (HF) Transformers
+
+```bash
+# Weights & config (the script uses pushd/popd, so run it with bash)
+bash get_weights.sh
+```
+
+### 2. Build XTRT engine(s)
+
+XTRT-LLM builds XTRT engine(s) using a HF checkpoint. If no checkpoint directory is specified, XTRT-LLM will build engine(s) using dummy weights.
+
+Examples of build invocations:
+
+```bash
+# Build a float16 engine using 2-way tensor parallelism and HF weights.
+# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
+python3 build.py --dtype=float16                    \
+                 --log_level=verbose                \
+                 --use_gpt_attention_plugin float16 \
+                 --use_gemm_plugin float16          \
+                 --use_layernorm_plugin float16     \
+                 --max_batch_size=16                \
+                 --max_input_len=1024               \
+                 --max_output_len=1024              \
+                 --world_size=2                     \
+                 --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/    \
+                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
+
+# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
+# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
+python3 build.py --dtype=float16                    \
+                 --log_level=verbose                \
+                 --use_gpt_attention_plugin float16 \
+                 --use_gemm_plugin float16          \
+                 --use_layernorm_plugin float16     \
+                 --max_batch_size=16                \
+                 --max_input_len=1024               \
+                 --max_output_len=1024              \
+                 --world_size=2                     \
+                 --use_weight_only                  \
+                 --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/    \
+                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
+```
+
+### 3. Run
+
+Before running the examples, make sure to set the following environment variables:
+```bash
+export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch caching of XPU memory.
+export XMLIR_D_XPU_L3_SIZE=0           # Disable XPytorch use of L3.
+```
+
+If you are NOT using an R480-X8, make sure to set the following environment variable:
+```bash
+export BKCL_PCIE_RING=1
+```
+
+To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`:
+
+```bash
+# For 2-way tensor parallelism, FP16
+mpirun -n 2 --allow-run-as-root \
+    python3 run.py \
+    --max_output_len=50 \
+    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/  \
+    --tokenizer_dir=./downloads/gptneox_model
+
+# For 2-way tensor parallelism, INT8
+mpirun -n 2 --allow-run-as-root \
+    python3 run.py \
+    --max_output_len=50 \
+    --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/  \
+    --tokenizer_dir=./downloads/gptneox_model
+```
--- a/examples/gptneox/README_CN.md
+++ b/examples/gptneox/README_CN.md
@@ -0,0 +1,95 @@
+# GPT-NeoX
+
+This document explains how to build and run the [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b) model with Kunlunxin XTRT-LLM on a single node with multiple XPUs.
+
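+## Overview
+
+The XTRT-LLM GPT-NeoX example code is located in [`examples/gptneox`](./). There are several main files in that folder:
+
+ * [`build.py`](./build.py) to build the XTRT engine(s) needed to run the GPT-NeoX model,
+ * [`run.py`](./run.py) to run inference on an input text.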
+
+## Support Matrix
+
+  * FP16
+  * INT8 Weight-Only
+  * Tensor Parallel
+
+## Usage
+
+### 1. Download weights from HuggingFace (HF) Transformers
+
+```bash
+# Weights & config (the script uses pushd/popd, so run it with bash)
+bash get_weights.sh
+```
+
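+### 2. Build XTRT engine(s)
+
+XTRT-LLM builds XTRT engine(s) from a HF checkpoint. If no checkpoint directory is specified, XTRT-LLM builds the engine(s) with dummy weights.
+
+Examples of build invocations: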
+
+```bash
+# Build a float16 engine using 2-way tensor parallelism and HF weights.
+# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
+python3 build.py --dtype=float16                    \
+                 --log_level=verbose                \
+                 --use_gpt_attention_plugin float16 \
+                 --use_gemm_plugin float16          \
+                 --use_layernorm_plugin float16     \
+                 --max_batch_size=16                \
+                 --max_input_len=1024               \
+                 --max_output_len=1024              \
+                 --world_size=2                     \
+                 --output_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/    \
+                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
+
+# Build an engine using 2-way tensor parallelism and HF weights. Apply INT8 weight-only quantization.
+# Enable several XTRT-LLM plugins to increase runtime performance. It also helps with build time.
+python3 build.py --dtype=float16                    \
+                 --log_level=verbose                \
+                 --use_gpt_attention_plugin float16 \
+                 --use_gemm_plugin float16          \
+                 --use_layernorm_plugin float16     \
+                 --max_batch_size=16                \
+                 --max_input_len=1024               \
+                 --max_output_len=1024              \
+                 --world_size=2                     \
+                 --use_weight_only                  \
+                 --output_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/    \
+                 --model_dir=./downloads/gptneox_model 2>&1 | tee build_tp2.log
+```
+
+### 3. Run
+
+Before running the examples, make sure to set the following environment variables:
+
+```bash
+export PYTORCH_NO_XPU_MEMORY_CACHING=0 # Disable XPytorch caching of XPU memory.
+export XMLIR_D_XPU_L3_SIZE=0           # Disable XPytorch use of L3.
+```
+
+If you are not using the Kunlunxin R480-X8, make sure to set the following environment variable:
+
+```bash
+export BKCL_PCIE_RING=1
+```
+
+To run an XTRT-LLM GPT-NeoX model using the engines generated by `build.py`:
+
+```bash
+# For 2-way tensor parallelism, FP16
+mpirun -n 2 --allow-run-as-root \
+    python3 run.py \
+    --max_output_len=50 \
+    --engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/  \
+    --tokenizer_dir=./downloads/gptneox_model
+
+# For 2-way tensor parallelism, INT8
+mpirun -n 2 --allow-run-as-root \
+    python3 run.py \
+    --max_output_len=50 \
+    --engine_dir=./downloads/gptneox_model/trt_engines/in8/2-XPU/  \
+    --tokenizer_dir=./downloads/gptneox_model
+```
--- a/examples/gptneox/build.py
+++ b/examples/gptneox/build.py
@@ -0,0 +1,442 @@
+# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import json
+import os
+import time
+
+#import tensorrt as trt
+import torch
+import torch.multiprocessing as mp
+from safetensors import safe_open
+from transformers import AutoModelForCausalLM, GPTNeoXConfig
+from weight import load_from_hf_gpt_neox
+
+import xtrt_llm
+from xtrt_llm._utils import str_dtype_to_xtrt
+from xtrt_llm.builder import Builder
+from xtrt_llm.logger import logger
+from xtrt_llm.mapping import Mapping
+from xtrt_llm.models import weight_only_groupwise_quantize, weight_only_quantize
+from xtrt_llm.network import net_guard
+from xtrt_llm.plugin.plugin import ContextFMHAType
+from xtrt_llm.quantization import QuantMode
+
+MODEL_NAME = "gptneox"
+hf_gpt = None
+
+
+class StateDict():
+
+    def __init__(self, quant_ckpt_dir):
+        self.model_state_dict = safe_open(quant_ckpt_dir,
+                                          framework="pt",
+                                          device=0)
+
+    def get(self, k):
+        return self.model_state_dict.get_tensor(k).cpu()
+
+
+class GPTQModel():
+
+    def __init__(self, model_dir, quant_ckpt_dir):
+        with open(model_dir + '/config.json', 'r') as f:
+            model_config = json.load(f)
+            self.config = GPTNeoXConfig()
+            self.config.vocab_size = model_config['vocab_size']
+            self.config.hidden_size = model_config['hidden_size']
+            self.config.num_hidden_layers = model_config['num_hidden_layers']
+            self.config.num_attention_heads = model_config[
+                'num_attention_heads']
+            self.config.intermediate_size = model_config['intermediate_size']
+            self.config.hidden_act = model_config['hidden_act']
+            self.config.rotary_pct = model_config['rotary_pct']
+            self.config.rotary_emb_base = model_config['rotary_emb_base']
+            self.config.max_position_embeddings = model_config[
+                'max_position_embeddings']
+            self.config.initializer_range = model_config['initializer_range']
+            self.config.layer_norm_eps = model_config['layer_norm_eps']
+            self.config.use_cache = model_config['use_cache']
+            self.config.bos_token_id = model_config['bos_token_id']
+            self.config.eos_token_id = model_config['eos_token_id']
+            self.config.tie_word_embeddings = model_config[
+                'tie_word_embeddings']
+        self.model_state_dict = StateDict(quant_ckpt_dir)
+
+    def state_dict(self):
+        return self.model_state_dict
+
+
+def get_engine_name(model, dtype, tp_size, rank):
+    return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)
+
+
+def serialize_engine(engine, path):
+    logger.info(f'Serializing engine to {path}...')
+    tik = time.time()
+    engine.serialize(path)
+    tok = time.time()
+    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
+    logger.info(f'Engine serialized. Total time: {t}')
+
+
+def parse_arguments():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--world_size',
+                        type=int,
+                        default=1,
+                        help='world size, only support tensor parallelism now')
+    parser.add_argument(
+        '--model_dir',
+        type=str,
+        default=None,
+        help='The path to HF GPT-NeoX model / checkpoints to read weights from')
+    parser.add_argument('--dtype',
+                        type=str,
+                        default='float16',
+                        choices=['float16', 'float32'])
+    parser.add_argument(
+        '--timing_cache',
+        type=str,
+        default='model.cache',
+        help=
+        'The path to read the timing cache from; ignored if the file does not exist'
+    )
+    parser.add_argument('--log_level', type=str, default='info')
+    parser.add_argument('--vocab_size', type=int, default=50432)
+    parser.add_argument('--n_layer', type=int, default=44)
+    parser.add_argument('--n_positions', type=int, default=2048)
+    parser.add_argument('--n_embd', type=int, default=6144)
+    parser.add_argument('--n_head', type=int, default=64)
+    parser.add_argument('--hidden_act', type=str, default='gelu')
+    parser.add_argument(
+        '--rotary_pct',
+        type=float,
+        default=0.25,
+        help="Percentage of hidden dimensions to allocate to rotary embeddings."
+    )
+    parser.add_argument('--max_batch_size', type=int, default=64)
+    parser.add_argument('--max_input_len', type=int, default=1024)
+    parser.add_argument('--max_output_len', type=int, default=1024)
+    parser.add_argument('--max_beam_width', type=int, default=1)
+    parser.add_argument('--use_gpt_attention_plugin',
+                        nargs='?',
+                        const='float16',
+                        type=str,
+                        default=False,
+                        choices=['float16', 'float32'])
+    parser.add_argument('--use_gemm_plugin',
+                        nargs='?',
+                        const='float16',
+                        type=str,
+                        default=False,
+                        choices=['float16', 'float32'])
+    parser.add_argument('--use_weight_only_quant_matmul_plugin',
+                        nargs='?',
+                        const='float16',
+                        type=str,
+                        default=False,
+                        choices=['float16'])
+    parser.add_argument('--use_weight_only_groupwise_quant_matmul_plugin',
+                        nargs='?',
+                        const='float16',
+                        type=str,
+                        default=False,
+                        choices=['float16'])
+    parser.add_argument(
+        '--groupwise_quant_safetensors_path',
+        type=str,
+        default=None,
+        help=
+        "The path to groupwise quantized GPT-NeoX model / checkpoints to read weights from."
+    )
+    parser.add_argument('--use_layernorm_plugin',
+                        nargs='?',
+                        const='float16',
+                        type=str,
+                        default=False,
+                        choices=['float16', 'float32'])
+    parser.add_argument('--parallel_build', default=False, action='store_true')
+    parser.add_argument('--enable_context_fmha',
+                        default=False,
+                        action='store_true')
+    parser.add_argument('--enable_context_fmha_fp32_acc',
+                        default=False,
+                        action='store_true')
+    parser.add_argument('--gpus_per_node', type=int, default=8)
+    parser.add_argument(
+        '--output_dir',
+        type=str,
+        default='gpt_outputs',
+        help=
+        'The path to save the serialized engine files, timing cache file and model configs'
+    )
+    parser.add_argument('--remove_input_padding',
+                        default=False,
+                        action='store_true')
+    parser.add_argument(
+        '--use_parallel_embedding',
+        action="store_true",
+        default=False,
+        help=
+        'By default embedding parallelism is disabled. By setting this flag, embedding parallelism is enabled'
+    )
+    parser.add_argument(
+        '--embedding_sharding_dim',
+        type=int,
+        default=1,  # Meta does TP on hidden dim
+        choices=[0, 1],
+        help=
+        'The dimension along which the embedding lookup table is sharded: the vocab dimension (--embedding_sharding_dim=0) '
+        'or the hidden dimension (--embedding_sharding_dim=1, the default). '
+        'Note: embedding sharing is only enabled when --embedding_sharding_dim=0'
+    )
+    parser.add_argument(
+        '--use_weight_only',
+        default=False,
+        action="store_true",
+        help='Quantize weights for the various GEMMs to INT4/INT8. '
+        'See --weight_only_precision to set the precision')
+    parser.add_argument(
+        '--weight_only_precision',
+        const='int8',
+        type=str,
+        nargs='?',
+        default='int8',
+        choices=['int8', 'int4'],
+        help=
+        'Define the precision for the weights when using weight-only quantization. '
+        'You must also use --use_weight_only for that argument to have an impact.'
+    )
+    parser.add_argument('--inter_size', type=int, default=None)
+
+    args = parser.parse_args()
+
+    xtrt_llm.logger.set_level(args.log_level)
+
+    if args.model_dir is not None:
+        global hf_gpt
+        if not args.use_weight_only_groupwise_quant_matmul_plugin:
+            logger.info(f'Loading HF GPT-NeoX model from {args.model_dir}...')
+            hf_gpt = AutoModelForCausalLM.from_pretrained(args.model_dir)
+            args.n_embd = hf_gpt.config.hidden_size
+            args.n_head = hf_gpt.config.num_attention_heads
+            args.n_layer = hf_gpt.config.num_hidden_layers
+            args.n_positions = hf_gpt.config.max_position_embeddings
+            args.vocab_size = hf_gpt.config.vocab_size
+            args.rotary_pct = hf_gpt.config.rotary_pct
+        else:
+            assert (
+                args.groupwise_quant_safetensors_path is not None
+            ), f'Please set the path to the groupwise quantized GPT-NeoX checkpoints with --groupwise_quant_safetensors_path'
+            logger.info(
+                f'Loading GPTQ quantized HF GPT-NeoX model from {args.groupwise_quant_safetensors_path}...'
+            )
+            hf_gpt = GPTQModel(args.model_dir,
+                               args.groupwise_quant_safetensors_path)
+            args.n_embd = hf_gpt.config.hidden_size
+            args.n_head = hf_gpt.config.num_attention_heads
+            args.n_layer = hf_gpt.config.num_hidden_layers
+            args.n_positions = hf_gpt.config.max_position_embeddings
+            args.vocab_size = hf_gpt.config.vocab_size
+            args.rotary_pct = hf_gpt.config.rotary_pct
+        args.inter_size = hf_gpt.config.intermediate_size
+
+    if args.use_weight_only:
+        args.quant_mode = QuantMode.use_weight_only(
+            args.weight_only_precision == 'int4')
+    else:
+        args.quant_mode = QuantMode(0)
+
+    return args
+
+
+def build_rank_engine(builder: Builder,
+                      builder_config: xtrt_llm.builder.BuilderConfig,
+                      engine_name, rank, args):
+    '''
+       @brief: Build the engine on the given rank.
+       @param rank: The rank to build the engine.
+       @param args: The cmd line arguments.
+       @return: The built engine.
+    '''
+    kv_dtype = str_dtype_to_xtrt(args.dtype)
+    rotary_dim = int((args.n_embd // args.n_head) * args.rotary_pct)
+
+    # Initialize Module
+    xtrt_llm_gpt = xtrt_llm.models.GPTNeoXForCausalLM(
+        num_layers=args.n_layer,
+        num_heads=args.n_head,
+        hidden_size=args.n_embd,
+        vocab_size=args.vocab_size,
+        hidden_act=args.hidden_act,
+        max_position_embeddings=args.n_positions,
+        rotary_dim=rotary_dim,
+        dtype=kv_dtype,
+        mapping=Mapping(world_size=args.world_size,
+                        rank=rank,
+                        tp_size=args.world_size),  # TP only
+        apply_query_key_layer_scaling=builder_config.
+        apply_query_key_layer_scaling,
+        use_parallel_embedding=args.use_parallel_embedding,
+        embedding_sharding_dim=args.embedding_sharding_dim)
+
+    if args.use_weight_only_quant_matmul_plugin:
+        xtrt_llm_gpt = weight_only_quantize(xtrt_llm_gpt)
+
+    if args.use_weight_only_groupwise_quant_matmul_plugin:
+        xtrt_llm_gpt = weight_only_groupwise_quantize(model=xtrt_llm_gpt,
+                                                      quant_mode=QuantMode(0),
+                                                      group_size=128,
+                                                      zero=True)
+
+    if args.model_dir is not None:
+        assert hf_gpt is not None, f'Could not load weights from hf_gpt model as it is not loaded yet.'
+
+        if args.world_size > 1:
+            assert (
+                args.n_embd % args.world_size == 0
+            ), f'Embedding size/hidden size must be divisible by world size.'
+            assert (
+                args.n_head % args.world_size == 0
+            ), f'Number of attention heads must be divisible by world size.'
+
+        load_from_hf_gpt_neox(
+            xtrt_llm_gpt, hf_gpt, args.dtype, rank, args.world_size,
+            args.use_weight_only_groupwise_quant_matmul_plugin)
+
+    # Module -> Network
+    network = builder.create_network()
+    network.trt_network.name = engine_name
+    if args.use_gpt_attention_plugin:
+        network.plugin_config.set_gpt_attention_plugin(
+            dtype=args.use_gpt_attention_plugin)
+    if args.use_gemm_plugin:
+        network.plugin_config.set_gemm_plugin(dtype=args.use_gemm_plugin)
+    if args.use_layernorm_plugin:
+        network.plugin_config.set_layernorm_plugin(
+            dtype=args.use_layernorm_plugin)
+    assert not (args.enable_context_fmha and args.enable_context_fmha_fp32_acc)
+    if args.enable_context_fmha:
+        network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
+    if args.enable_context_fmha_fp32_acc:
+        network.plugin_config.set_context_fmha(
+            ContextFMHAType.enabled_with_fp32_acc)
+    if args.use_weight_only_quant_matmul_plugin:
+        network.plugin_config.set_weight_only_quant_matmul_plugin(
+            dtype=args.use_weight_only_quant_matmul_plugin)
+    if args.use_weight_only_groupwise_quant_matmul_plugin:
+        network.plugin_config.set_weight_only_groupwise_quant_matmul_plugin(
+            dtype=args.use_weight_only_groupwise_quant_matmul_plugin)
+    if args.quant_mode.is_weight_only():
+        builder_config.trt_builder_config.use_weight_only = args.weight_only_precision
+
+    if args.world_size > 1:
+        network.plugin_config.set_nccl_plugin(args.dtype)
+    if args.remove_input_padding:
+        network.plugin_config.enable_remove_input_padding()
+    with net_guard(network):
+        # Prepare
+        network.set_named_parameters(xtrt_llm_gpt.named_parameters())
+
+        # Forward
+        inputs = xtrt_llm_gpt.prepare_inputs(args.max_batch_size,
+                                             args.max_input_len,
+                                             args.max_output_len, True,
+                                             args.max_beam_width)
+        xtrt_llm_gpt(*inputs)
+
+    #xtrt_llm.graph_rewriting.optimize(network)
+
+    engine = None
+
+    # Network -> Engine
+    engine = builder.build_engine(network, builder_config, compiler="gr")
+    if rank == 0:
+        config_path = os.path.join(args.output_dir, 'config.json')
+        builder.save_config(builder_config, config_path)
+    return engine
+
+
+def build(rank, args):
+    #torch.cuda.set_device(rank % args.gpus_per_node)
+    xtrt_llm.logger.set_level(args.log_level)
+    # exist_ok=True avoids a race when several ranks create the directory at once
+    os.makedirs(args.output_dir, exist_ok=True)
+
+    # When building serially, all ranks share one builder
+    apply_query_key_layer_scaling = False
+    builder = Builder()
+
+    cache = None
+    for cur_rank in range(args.world_size):
+        # skip other ranks if parallel_build is enabled
+        if args.parallel_build and cur_rank != rank:
+            continue
+        builder_config = builder.create_builder_config(
+            name=MODEL_NAME,
+            precision=args.dtype,
+            timing_cache=args.timing_cache if cache is None else cache,
+            tensor_parallel=args.world_size,  # TP only
+            parallel_build=args.parallel_build,
+            num_layers=args.n_layer,
+            num_heads=args.n_head,
+            inter_size=args.inter_size,
+            hidden_size=args.n_embd,
+            vocab_size=args.vocab_size,
+            hidden_act=args.hidden_act,
+            max_position_embeddings=args.n_positions,
+            apply_query_key_layer_scaling=apply_query_key_layer_scaling,
+            max_batch_size=args.max_batch_size,
+            max_input_len=args.max_input_len,
+            max_output_len=args.max_output_len,
+            fusion_pattern_list=["remove_dup_mask"])
+
+        engine_name = get_engine_name(MODEL_NAME, args.dtype, args.world_size,
+                                      cur_rank)
+        engine = build_rank_engine(builder, builder_config, engine_name,
+                                   cur_rank, args)
+        assert engine is not None, f'Failed to build engine for rank {cur_rank}'
+
+        # if cur_rank == 0:
+        #     # Use in-memory timing cache for multiple builder passes.
+        #     if not args.parallel_build:
+        #         cache = builder_config.trt_builder_config.get_timing_cache()
+
+        serialize_engine(engine, os.path.join(args.output_dir, engine_name))
+
+    # if rank == 0:
+    #     ok = builder.save_timing_cache(
+    #         builder_config, os.path.join(args.output_dir, "model.cache"))
+    #     assert ok, "Failed to save timing cache."
+
+
+if __name__ == '__main__':
+    args = parse_arguments()
+    tik = time.time()
+    if args.parallel_build and args.world_size > 1 and \
+            torch.cuda.device_count() >= args.world_size:
+        logger.warning(
+            f'Building XTRT engines in parallel. Please make sure that all {args.world_size} XPUs are completely free.'
+        )
+        mp.spawn(build, nprocs=args.world_size, args=(args, ))
+    else:
+        args.parallel_build = False
+        logger.info('Building XTRT engines serially.')
+        build(0, args)
+
+    tok = time.time()
+    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
+    logger.info(f'Total time of building all {args.world_size} engines: {t}')
--- a/examples/gptneox/get_weights.sh
+++ b/examples/gptneox/get_weights.sh
@@ -0,0 +1,16 @@
+mkdir -p downloads
+pushd downloads
+rm -rf gptneox_model
+git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model
+
+rm -f gptneox_model/model-*.safetensors
+rm -f gptneox_model/model.safetensors.index.json
+wget -q https://huggingface.co/EleutherAI/gpt-neox-20b/resolve/main/model.safetensors.index.json --directory-prefix gptneox_model
+
+for i in $(seq -f %05g 46)
+do
+  echo -n "Downloading $i of 00046..."
+  wget -q https://huggingface.co/EleutherAI/gpt-neox-20b/resolve/main/model-$i-of-00046.safetensors --directory-prefix gptneox_model
+  echo "Done"
+done
+popd
--- a/examples/gptneox/gptq_convert.sh
+++ b/examples/gptneox/gptq_convert.sh
@@ -0,0 +1,9 @@
+git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git GPTQ-for-LLaMa
+
+pip install -r ./GPTQ-for-LLaMa/requirements.txt
+
+CUDA_VISIBLE_DEVICES=0 python3 GPTQ-for-LLaMa/neox.py ./gptneox_model \
+wikitext2 \
+--wbits 4 \
+--groupsize 128 \
+--save_safetensors ./gptneox_model/gptneox-20b-4bit-gs128.safetensors
--- a/examples/gptneox/requirements.txt
+++ b/examples/gptneox/requirements.txt
@@ -0,0 +1,2 @@
+datasets~=2.3.2
+rouge_score~=0.1.2
--- a/examples/gptneox/run.py
+++ b/examples/gptneox/run.py
@@ -0,0 +1,141 @@
+# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import json
+import os
+import numpy as np
+
+import torch
+from transformers import AutoTokenizer
+
+import xtrt_llm
+from xtrt_llm.runtime import ModelConfig, SamplingConfig
+
+from build import get_engine_name  # isort:skip
+
+
+def parse_arguments():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--max_output_len', type=int, required=True)
+    parser.add_argument('--log_level', type=str, default='error')
+    parser.add_argument('--engine_dir', type=str, default='gptneox_outputs')
+    parser.add_argument('--tokenizer_dir',
+                        type=str,
+                        default="gptneox_model",
+                        help="Directory containing the tokenizer.model.")
+    parser.add_argument('--input_text',
+                        type=str,
+                        default='Born in north-east France, Soyer trained as a')
+    parser.add_argument('--performance_test_scale',
+                        type=str,
+                        help=
+                        "Scale(s) for the performance test, given as batch_size x input_text_length x max_output_length, e.g., 8x1024x64. Separate multiple scales with 'E', e.g., 8x1024x64E1x512x32.",
+                        default="")
+    return parser.parse_args()
+
+
+if __name__ == '__main__':
+    args = parse_arguments()
+    xtrt_llm.logger.set_level(args.log_level)
+
+    config_path = os.path.join(args.engine_dir, 'config.json')
+    with open(config_path, 'r') as f:
+        config = json.load(f)
+    use_gpt_attention_plugin = config['plugin_config']['gpt_attention_plugin']
+    remove_input_padding = config['plugin_config']['remove_input_padding']
+    dtype = config['builder_config']['precision']
+    world_size = config['builder_config']['tensor_parallel']
+    assert world_size == xtrt_llm.mpi_world_size(), \
+        f'Engine world size ({world_size}) != Runtime world size ({xtrt_llm.mpi_world_size()})'
+    num_heads = config['builder_config']['num_heads'] // world_size
+    hidden_size = config['builder_config']['hidden_size'] // world_size
+    vocab_size = config['builder_config']['vocab_size']
+    num_layers = config['builder_config']['num_layers']
+
+    runtime_rank = xtrt_llm.mpi_rank()
+    if world_size > 1:
+        os.environ["XCCL_GROUP_ID"] = str(runtime_rank // world_size)
+        os.environ["XCCL_NRANKS"] = str(world_size)
+        os.environ["XCCL_CUR_RANK"] = str(runtime_rank % world_size)
+        os.environ["XCCL_DEVICE_ID"] = str(runtime_rank)
+        os.environ["MP_RUN"] = str(1)
+    runtime_mapping = xtrt_llm.Mapping(world_size,
+                                           runtime_rank,
+                                           tp_size=world_size)
+    torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
+
+    engine_name = get_engine_name('gptneox', dtype, world_size, runtime_rank)
+    #serialize_path = os.path.join(args.engine_dir, engine_name)
+    serialize_path = str(args.engine_dir) + "/" + engine_name
+
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
+    input_ids = torch.tensor(tokenizer.encode(args.input_text),
+                             dtype=torch.int32).cuda().unsqueeze(0)
+
+    model_config = ModelConfig(num_heads=num_heads,
+                               num_kv_heads=num_heads,
+                               hidden_size=hidden_size,
+                               vocab_size=vocab_size,
+                               num_layers=num_layers,
+                               gpt_attention_plugin=use_gpt_attention_plugin,
+                               remove_input_padding=remove_input_padding,
+                               dtype=dtype)
+
+    tokenizer.pad_token = tokenizer.eos_token
+    end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
+    pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
+
+    sampling_config = SamplingConfig(end_id=end_id, pad_id=pad_id)
+    input_lengths = torch.tensor(
+        [input_ids.size(1) for _ in range(input_ids.size(0))]).int().cuda()
+
+    #with open(serialize_path, 'rb') as f:
+    #    engine_buffer = f.read()
+    decoder = xtrt_llm.runtime.GenerationSession(model_config,
+                                                 serialize_path,
+                                                 runtime_mapping,
+                                                 debug_mode=False)
+
+    if args.performance_test_scale != "":
+        performance_test_scale_list = args.performance_test_scale.split("E")
+        for scale in performance_test_scale_list:
+            xtrt_llm.logger.info(f"Running performance test with scale {scale}")
+            bs, seqlen, max_output_len = [int(x) for x in scale.split("x")]
+            _input_ids = torch.from_numpy(
+                np.zeros((bs, seqlen)).astype("int32")).cuda()
+            _input_lengths = torch.from_numpy(
+                np.full((bs, ), seqlen).astype("int32")).cuda()
+
+            import time
+            _t_begin = time.time()
+            decoder.setup(_input_ids.size(0), _input_ids.size(1), max_output_len)
+            _output_ids = decoder.decode(_input_ids,
+                                         _input_lengths,
+                                         sampling_config)
+            _t_end = time.time()
+            xtrt_llm.logger.info(
+                f"Total latency: {(_t_end - _t_begin) * 1000:.3f} ms")
+
+    if remove_input_padding:
+        decoder.setup(1, torch.max(input_lengths).item(), args.max_output_len)
+    else:
+        decoder.setup(input_ids.size(0), input_ids.size(1), args.max_output_len)
+    output_ids = decoder.decode(input_ids, input_lengths, sampling_config)
+    torch.cuda.synchronize()
+
+    output_ids = output_ids.tolist()[0][0][input_ids.size(1):]
+    output_text = tokenizer.decode(output_ids)
+    print(f'Input: "{args.input_text}"')
+    print(f'Output: "{output_text}"')
--- a/examples/gptneox/run.sh
+++ b/examples/gptneox/run.sh
@@ -0,0 +1,8 @@
+BKCL_PCIE_RING=1 PYTORCH_NO_XPU_MEMORY_CACHING=1 XMLIR_D_XPU_L3_SIZE=0 \
+mpirun -n 2 --allow-run-as-root \
+python3 run.py \
+--max_output_len=50 \
+--engine_dir=./downloads/gptneox_model/trt_engines/fp16/2-XPU/ \
+--tokenizer_dir=./downloads/gptneox_model \
+--performance_test_scale=1x512x256E2x512x256E4x512x256E8x512x256 \
+--log_level=info
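The `--performance_test_scale` value above is read by `run.py` as a list of cases separated by `E`, each written as batch size, input length and output length joined by `x`. A minimal sketch of that parsing, assuming the same separators; the helper name `parse_scales` is hypothetical and is not part of the example code:

```python
from typing import List, Tuple


def parse_scales(spec: str) -> List[Tuple[int, int, int]]:
    """Split a spec such as '1x512x256E2x512x256' into (bs, seqlen, out_len)."""
    cases: List[Tuple[int, int, int]] = []
    if spec == "":
        # run.py skips the performance test when the option is empty.
        return cases
    for scale in spec.split("E"):
        # Each case must have exactly three integer fields.
        bs, seqlen, max_output_len = (int(x) for x in scale.split("x"))
        cases.append((bs, seqlen, max_output_len))
    return cases


if __name__ == "__main__":
    for case in parse_scales("1x512x256E2x512x256E4x512x256E8x512x256"):
        print(case)
```

A malformed case (for example `1x512`) raises `ValueError` here, which matches the unpacking behaviour of the list comprehension in `run.py`.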
--- a/examples/gptneox/summarize.py
+++ b/examples/gptneox/summarize.py
@@ -0,0 +1,373 @@
+# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import copy
+import json
+import os
+
+import numpy as np
+import torch
+from datasets import load_dataset, load_metric
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+import tensorrt_llm
+import tensorrt_llm.profiler as profiler
+from tensorrt_llm.logger import logger
+
+from build import get_engine_name  # isort:skip
+
+
+def TRTGPTNeoX(args, config):
+    dtype = config['builder_config']['precision']
+    world_size = config['builder_config']['tensor_parallel']
+    assert world_size == tensorrt_llm.mpi_world_size(), \
+        f'Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})'
+
+    # Per-rank sizes: heads and hidden size are sharded across tensor-parallel ranks.
+    num_heads = config['builder_config']['num_heads'] // world_size
+    hidden_size = config['builder_config']['hidden_size'] // world_size
+    vocab_size = config['builder_config']['vocab_size']
+    num_layers = config['builder_config']['num_layers']
+    use_gpt_attention_plugin = bool(
+        config['plugin_config']['gpt_attention_plugin'])
+    remove_input_padding = config['plugin_config']['remove_input_padding']
+
+    model_config = tensorrt_llm.runtime.ModelConfig(
+        vocab_size=vocab_size,
+        num_layers=num_layers,
+        num_heads=num_heads,
+        num_kv_heads=num_heads,
+        hidden_size=hidden_size,
+        gpt_attention_plugin=use_gpt_attention_plugin,
+        remove_input_padding=remove_input_padding,
+        dtype=dtype)
+
+    runtime_rank = tensorrt_llm.mpi_rank()
+    runtime_mapping = tensorrt_llm.Mapping(world_size,
+                                           runtime_rank,
+                                           tp_size=world_size)
+    torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
+
+    engine_name = get_engine_name('gptneox', dtype, world_size, runtime_rank)
+    serialize_path = os.path.join(args.engine_dir, engine_name)
+
+    tensorrt_llm.logger.set_level(args.log_level)
+
+    with open(serialize_path, 'rb') as f:
+        engine_buffer = f.read()
+    decoder = tensorrt_llm.runtime.GenerationSession(model_config,
+                                                     engine_buffer,
+                                                     runtime_mapping)
+
+    return decoder
+
+
+def main(args):
+    runtime_rank = tensorrt_llm.mpi_rank()
+    logger.set_level(args.log_level)
+
+    test_hf = args.test_hf and runtime_rank == 0  # only run hf on rank 0
+    test_trt_llm = args.test_trt_llm
+    model_dir = args.model_dir
+
+    tokenizer = AutoTokenizer.from_pretrained(model_dir,
+                                              padding_side='left',
+                                              model_max_length=2048,
+                                              truncation=True)
+    tokenizer.pad_token = tokenizer.eos_token
+
+    dataset_cnn = load_dataset("ccdv/cnn_dailymail",
+                               '3.0.0',
+                               cache_dir=args.dataset_path)
+
+    config_path = os.path.join(args.engine_dir, 'config.json')
+    with open(config_path, 'r') as f:
+        config = json.load(f)
+
+    max_batch_size = args.batch_size
+
+    # runtime parameters
+    # repetition_penalty = 1
+    top_k = args.top_k
+    output_len = args.output_len
+    test_token_num = 923
+    # top_p = 0.0
+    # random_seed = 5
+    temperature = 1
+    num_beams = args.num_beams
+
+    pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
+    end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
+
+    if test_trt_llm:
+        tensorrt_llm_gpt = TRTGPTNeoX(args, config)
+
+    if test_hf:
+        model = AutoModelForCausalLM.from_pretrained(model_dir)
+        model.cuda()
+        if args.data_type == 'fp16':
+            model.half()
+
+    def summarize_tensorrt_llm(datapoint):
+        batch_size = len(datapoint['article'])
+
+        line = copy.copy(datapoint['article'])
+        line_encoded = []
+        input_lengths = []
+        for i in range(batch_size):
+            line[i] = line[i] + ' TL;DR: '
+
+            line[i] = line[i].strip()
+            line[i] = line[i].replace(" n't", "n't")
+
+            input_id = tokenizer.encode(line[i],
+                                        return_tensors='pt').type(torch.int32)
+            input_id = input_id[:, -test_token_num:]
+
+            line_encoded.append(input_id)
+            input_lengths.append(input_id.shape[-1])
+
+        # Longest prompt in the batch; used for padding and for setup().
+        max_length = max(input_lengths)
+        if tensorrt_llm_gpt.remove_input_padding:
+            line_encoded = [
+                t.to(torch.int32).cuda() for t in line_encoded
+            ]
+        else:
+            # do padding, should move outside the profiling to prevent the overhead
+            for i in range(batch_size):
+                pad_size = max_length - input_lengths[i]
+
+                pad = torch.full((1, pad_size), pad_id, dtype=torch.int32)
+                line_encoded[i] = torch.cat(
+                    [line_encoded[i].to(torch.int32), pad],
+                    dim=-1)
+
+            line_encoded = torch.cat(line_encoded, axis=0).cuda()
+            input_lengths = torch.tensor(input_lengths,
+                                         dtype=torch.int32).cuda()
+
+        sampling_config = tensorrt_llm.runtime.SamplingConfig(
+            end_id=end_id, pad_id=pad_id, top_k=top_k, num_beams=num_beams)
+
+        with torch.no_grad():
+            tensorrt_llm_gpt.setup(batch_size,
+                                   max_context_length=max_length,
+                                   max_new_tokens=output_len,
+                                   beam_width=num_beams)
+
+            if tensorrt_llm_gpt.remove_input_padding:
+                output_ids = tensorrt_llm_gpt.decode_batch(
+                    line_encoded, sampling_config)
+            else:
+                output_ids = tensorrt_llm_gpt.decode(
+                    line_encoded,
+                    input_lengths,
+                    sampling_config,
+                )
+
+            torch.cuda.synchronize()
+
+        # Decode every beam of each sample, skipping that sample's prompt tokens.
+        if tensorrt_llm_gpt.mapping.is_first_pp_rank():
+            output_beams_list = [
+                tokenizer.batch_decode(output_ids[batch_idx, :,
+                                                  input_lengths[batch_idx]:],
+                                       skip_special_tokens=True)
+                for batch_idx in range(batch_size)
+            ]
+            return output_beams_list, output_ids[:, :, max_length:].tolist()
+        return [], []
+
+    def summarize_hf(datapoint):
+        batch_size = len(datapoint['article'])
+        if batch_size > 1:
+            logger.warning(
+                f"HF does not support batch_size > 1 to verify correctness due to padding. Current batch size is {batch_size}"
+            )
+
+        line = copy.copy(datapoint['article'])
+        for i in range(batch_size):
+            line[i] = line[i] + ' TL;DR: '
+
+            line[i] = line[i].strip()
+            line[i] = line[i].replace(" n't", "n't")
+
+        line_encoded = tokenizer(line,
+                                 return_tensors='pt',
+                                 padding=True,
+                                 truncation=True)["input_ids"].type(torch.int64)
+
+        line_encoded = line_encoded[:, -test_token_num:]
+        line_encoded = line_encoded.cuda()
+
+        with torch.no_grad():
+            output = model.generate(line_encoded,
+                                    max_length=len(line_encoded[0]) +
+                                    output_len,
+                                    top_k=top_k,
+                                    temperature=temperature,
+                                    eos_token_id=tokenizer.eos_token_id,
+                                    pad_token_id=tokenizer.pad_token_id,
+                                    num_beams=num_beams,
+                                    num_return_sequences=num_beams,
+                                    early_stopping=True)
+
+        tokens_list = output[:, len(line_encoded[0]):].tolist()
+        output = output.reshape([batch_size, num_beams, -1])
+        output_lines_list = [
+            tokenizer.batch_decode(output[:, i, len(line_encoded[0]):],
+                                   skip_special_tokens=True)
+            for i in range(num_beams)
+        ]
+
+        return output_lines_list, tokens_list
+
+    if test_trt_llm:
+        datapoint = dataset_cnn['test'][0:1]
+        summary, _ = summarize_tensorrt_llm(datapoint)
+        if runtime_rank == 0:
+            logger.info(
+                "---------------------------------------------------------")
+            logger.info("XTRT-LLM Generated : ")
+            logger.info(f" Article : {datapoint['article']}")
+            logger.info(f"\n Highlights : {datapoint['highlights']}")
+            logger.info(f"\n Summary : {summary}")
+            logger.info(
+                "---------------------------------------------------------")
+
+    if test_hf:
+        datapoint = dataset_cnn['test'][0:1]
+        summary, _ = summarize_hf(datapoint)
+        logger.info("---------------------------------------------------------")
+        logger.info("HF Generated : ")
+        logger.info(f" Article : {datapoint['article']}")
+        logger.info(f"\n Highlights : {datapoint['highlights']}")
+        logger.info(f"\n Summary : {summary}")
+        logger.info("---------------------------------------------------------")
+
+    metric_tensorrt_llm = [load_metric("rouge") for _ in range(num_beams)]
+    metric_hf = [load_metric("rouge") for _ in range(num_beams)]
+    for i in range(num_beams):
+        metric_tensorrt_llm[i].seed = 0
+        metric_hf[i].seed = 0
+
+    ite_count = 0
+    data_point_idx = 0
+    while (data_point_idx < len(dataset_cnn['test'])) and (ite_count <
+                                                           args.max_ite):
+        if runtime_rank == 0:
+            logger.debug(
+                f"run data_point {data_point_idx} ~ {data_point_idx + max_batch_size}"
+            )
+        datapoint = dataset_cnn['test'][data_point_idx:(data_point_idx +
+                                                        max_batch_size)]
+
+        if test_trt_llm:
+            profiler.start('tensorrt_llm')
+            summary_tensorrt_llm, tokens_tensorrt_llm = summarize_tensorrt_llm(
+                datapoint)
+            profiler.stop('tensorrt_llm')
+
+        if test_hf:
+            profiler.start('hf')
+            summary_hf, tokens_hf = summarize_hf(datapoint)
+            profiler.stop('hf')
+
+        if runtime_rank == 0:
+            if test_trt_llm:
+                for batch_idx in range(len(summary_tensorrt_llm)):
+                    for beam_idx in range(num_beams):
+                        metric_tensorrt_llm[beam_idx].add_batch(
+                            predictions=[
+                                summary_tensorrt_llm[batch_idx][beam_idx]
+                            ],
+                            references=[datapoint['highlights'][batch_idx]])
+            if test_hf:
+                for beam_idx in range(num_beams):
+                    for i in range(len(summary_hf[beam_idx])):
+                        metric_hf[beam_idx].add_batch(
+                            predictions=[summary_hf[beam_idx][i]],
+                            references=[datapoint['highlights'][i]])
+
+            logger.debug('-' * 100)
+            logger.debug(f"Article : {datapoint['article']}")
+            if test_trt_llm:
+                logger.debug(f'XTRT-LLM Summary: {summary_tensorrt_llm}')
+            if test_hf:
+                logger.debug(f'HF Summary: {summary_hf}')
+            logger.debug(f"highlights : {datapoint['highlights']}")
+
+        data_point_idx += max_batch_size
+        ite_count += 1
+
+    if runtime_rank == 0:
+        if test_trt_llm:
+            np.random.seed(0)  # ROUGE scoring uses sampling, so fix the seed
+            logger.info(
+                f'XTRT-LLM (total latency: {profiler.elapsed_time_in_sec("tensorrt_llm")} sec)'
+            )
+            for beam_idx in range(num_beams):
+                logger.info(f"XTRT-LLM beam {beam_idx} result")
+                computed_metrics_tensorrt_llm = metric_tensorrt_llm[
+                    beam_idx].compute()
+                for key in computed_metrics_tensorrt_llm.keys():
+                    logger.info(
+                        f' {key} : {computed_metrics_tensorrt_llm[key].mid[2]*100}'
+                    )
+
+                if args.check_accuracy and beam_idx == 0:
+                    assert computed_metrics_tensorrt_llm['rouge1'].mid[
+                        2] * 100 > args.tensorrt_llm_rouge1_threshold
+        if test_hf:
+            np.random.seed(0)  # ROUGE scoring uses sampling, so fix the seed
+            logger.info(
+                f'Hugging Face (total latency: {profiler.elapsed_time_in_sec("hf")} sec)'
+            )
+            for beam_idx in range(num_beams):
+                logger.info(f"HF beam {beam_idx} result")
+                computed_metrics_hf = metric_hf[beam_idx].compute()
+                for key in computed_metrics_hf.keys():
+                    logger.info(
+                        f' {key} : {computed_metrics_hf[key].mid[2]*100}')
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--model_dir',
+                        type=str,
+                        default='EleutherAI/gpt-neox-20b')
+    parser.add_argument('--test_hf', action='store_true')
+    parser.add_argument('--test_trt_llm', action='store_true')
+    parser.add_argument('--data_type',
+                        type=str,
+                        choices=['fp32', 'fp16'],
+                        default='fp32')
+    parser.add_argument('--dataset_path', type=str, default='')
+    parser.add_argument('--log_level', type=str, default='info')
+    parser.add_argument('--engine_dir', type=str, default='gptneox_engine')
+    parser.add_argument('--batch_size', type=int, default=1)
+    parser.add_argument('--max_ite', type=int, default=20)
+    parser.add_argument('--output_len', type=int, default=100)
+    parser.add_argument('--check_accuracy', action='store_true')
+    parser.add_argument('--tensorrt_llm_rouge1_threshold',
+                        type=float,
+                        default=15.0)
+    parser.add_argument('--num_beams', type=int, default=1)
+    parser.add_argument('--top_k', type=int, default=1)
+
+    args = parser.parse_args()
+
+    main(args)
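`summarize_tensorrt_llm` above keeps only the last `test_token_num` tokens of each prompt and, when input padding is not removed, right-pads every prompt to the longest one with `pad_id`. A minimal sketch of that batching step using plain Python lists instead of tensors; the function name `truncate_and_pad` is hypothetical and the sketch assumes `max_tokens` is positive:

```python
from typing import List, Tuple


def truncate_and_pad(prompts: List[List[int]], max_tokens: int,
                     pad_id: int) -> Tuple[List[List[int]], List[int]]:
    """Keep the last max_tokens ids of each prompt, then right-pad to equal length."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Left-truncate: the tail of the article (with the ' TL;DR: ' suffix) is kept.
    truncated = [p[-max_tokens:] for p in prompts]
    lengths = [len(p) for p in truncated]
    max_length = max(lengths)
    # Right-pad so that all rows can be stacked into one [batch, max_length] tensor.
    padded = [p + [pad_id] * (max_length - n) for p, n in zip(truncated, lengths)]
    return padded, lengths


if __name__ == "__main__":
    batch, lens = truncate_and_pad([[1, 2, 3, 4], [5, 6]], max_tokens=3, pad_id=0)
    print(batch, lens)
```

The unpadded lengths are returned alongside the padded rows because the script later uses them to slice the prompt off each output sequence.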
--- a/examples/gptneox/weight.py
+++ b/examples/gptneox/weight.py
@@ -0,0 +1,464 @@
+# SPDX-FileCopyrightText: Copyright (c) 2022-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import time
+from operator import attrgetter
+
+import numpy as np
+import torch
+
+import xtrt_llm
+from xtrt_llm._utils import pad_vocab_size
+from xtrt_llm.models import GPTNeoXForCausalLM
+from xtrt_llm._utils import str_dtype_to_torch
+
+UINT4_TO_INT4_FLAG = 1
+GPTQ_FLAG = 1
+GROUP_SIZE = 128
+
+
+def numpy_split(v, tp_size, idx, dim=0):
+    if tp_size == 1:
+        return v
+    else:
+        return np.ascontiguousarray(np.split(v, tp_size, axis=dim)[idx])
+
+
+def torch_split(v, tp_size, idx, dim=0):
+    if tp_size == 1:
+        return v
+    else:
+        return (torch.split(v, v.shape[dim] // tp_size,
+                            dim=dim)[idx]).contiguous()
+
+
+def unpack_int32_into_int8(w_packed):
+    # Unpack inputs packed in int32/float32 into uint4 and store them in int8 format
+    w_packed_int4x2 = w_packed.contiguous().view(torch.uint8)
+    w_unpacked = torch.zeros(w_packed_int4x2.shape[0],
+                             w_packed_int4x2.shape[1] * 2,
+                             dtype=torch.int8)
+    w_unpacked[:, ::2] = w_packed_int4x2 % 16
+    w_unpacked[:, 1::2] = w_packed_int4x2 // 16
+    return w_unpacked.contiguous()
+
+
+def preprocess_groupwise_weight_params(qweight_unpacked_int8, scales_fp16,
+                                       qzeros_unpacked_int8):
+    packer = torch.ops.fastertransformer.pack_int8_tensor_to_packed_int4
+    preprocessor = torch.ops.fastertransformer.preprocess_weights_for_mixed_gemm
+
+    qweight_interleaved = preprocessor(packer(qweight_unpacked_int8),
+                                       torch.quint4x2).view(torch.float32)
+
+    # zeros_x_scales = (-qzeros + 8 * UINT4_TO_INT4_FLAG - GPTQ_FLAG) * scales
+    zeros_x_scales_fp16 = (-qzeros_unpacked_int8 + 8 * UINT4_TO_INT4_FLAG -
+                           GPTQ_FLAG) * scales_fp16
+    zeros_x_scales_fp16 = zeros_x_scales_fp16.half()
+
+    # return processed interleaved weight, original scales and zeros * scales
+    return qweight_interleaved.contiguous().numpy(), scales_fp16.contiguous(
+    ).numpy(), zeros_x_scales_fp16.contiguous().numpy()
+
+
+def load_from_hf_gpt_neox(xtrt_llm_gpt_neox: GPTNeoXForCausalLM,
+                          hf_gpt_neox,
+                          dtype="float32",
+                          rank=0,
+                          tp_size=1,
+                          use_weight_only_groupwise_quant_matmul_plugin=False):
+
+    hf_model_gptneox_block_names = [
+        "input_layernorm.weight",
+        "input_layernorm.bias",
+        "post_attention_layernorm.weight",
+        "post_attention_layernorm.bias",
+    ]
+
+    xtrt_llm_model_gptneox_block_names = [
+        "input_layernorm.weight",
+        "input_layernorm.bias",
+        "post_attention_layernorm.weight",
+        "post_attention_layernorm.bias",
+    ]
+
+    if not use_weight_only_groupwise_quant_matmul_plugin:
+        hf_model_gptneox_block_names += [
+            "attention.dense.weight",
+            "attention.dense.bias",
+            "mlp.dense_h_to_4h.weight",
+            "mlp.dense_h_to_4h.bias",
+            "mlp.dense_4h_to_h.weight",
+            "mlp.dense_4h_to_h.bias",
+        ]
+        xtrt_llm_model_gptneox_block_names += [
+            "attention.dense.weight",
+            "attention.dense.bias",
+            "mlp.fc.weight",
+            "mlp.fc.bias",
+            "mlp.proj.weight",
+            "mlp.proj.bias",
+        ]
+
+    if not use_weight_only_groupwise_quant_matmul_plugin:
+        xtrt_llm.logger.info('Loading weights from HF GPT-NeoX...')
+    else:
+        xtrt_llm.logger.info(
+            'Loading weights from GPTQ quantized HF GPT-NeoX...')
+
+    tik = time.time()
+
+    torch_dtype = str_dtype_to_torch(dtype)
+    hf_gpt_neox_state_dict = hf_gpt_neox.state_dict()
+
+    # [vocab_size, hidden_size]
+    v = hf_gpt_neox_state_dict.get('gpt_neox.embed_in.weight').to(
+        torch_dtype).cpu().numpy()
+    if xtrt_llm_gpt_neox._use_parallel_embedding:
+        v = numpy_split(v, tp_size, rank,
+                        xtrt_llm_gpt_neox._embedding_sharding_dim)
+    xtrt_llm_gpt_neox.embedding.weight.value = v
+
+    n_layer = hf_gpt_neox.config.num_hidden_layers
+
+    for layer_idx in range(n_layer):
+        prefix = "gpt_neox.layers." + str(layer_idx) + "."
+        for idx, hf_attr in enumerate(hf_model_gptneox_block_names):
+            v = hf_gpt_neox_state_dict.get(prefix + hf_attr).to(
+                torch_dtype).cpu().numpy()
+
+            layer = attrgetter(xtrt_llm_model_gptneox_block_names[idx])(
+                xtrt_llm_gpt_neox.layers[layer_idx])
+
+            if tp_size > 1:
+                if 'dense.weight' in hf_attr:
+                    # [n=hidden_size, k=hidden_size] ->
+                    # [n=hidden_size, k=hidden_size // tp_size]
+                    split_v = numpy_split(v, tp_size, rank, dim=1)
+                elif 'dense_h_to_4h.weight' in hf_attr:
+                    # [hidden_size * 4, hidden_size] ->
+                    # [hidden_size * 4 // tp_size, hidden_size]
+                    split_v = numpy_split(v, tp_size, rank, dim=0)
+                elif 'dense_h_to_4h.bias' in hf_attr:
+                    # [hidden_size * 4] -> [hidden_size * 4 // tp_size]
+                    split_v = numpy_split(v, tp_size, rank, dim=0)
+                elif 'dense_4h_to_h.weight' in hf_attr:
+                    # [hidden_size, hidden_size * 4] ->
+                    # [hidden_size, hidden_size * 4 // tp_size]
+                    split_v = numpy_split(v, tp_size, rank, dim=1)
+                else:
+                    split_v = v
+                setattr(layer, 'value', split_v)
+            else:
+                setattr(layer, 'value', v)
+
+        num_heads = hf_gpt_neox.config.num_attention_heads
+        hidden_size = hf_gpt_neox.config.hidden_size
+        head_size = hidden_size // num_heads
+
+        if not use_weight_only_groupwise_quant_matmul_plugin:
+            # Attention QKV Linear
+            # qkv_weights [num_heads x (q|k|v), hidden_size] ->
+            # [(num_heads x q)|(num_heads x k)|(num_heads x v), hidden_size]
+            qkv_weights = hf_gpt_neox_state_dict.get(
+                prefix + "attention.query_key_value.weight")
+            qkv_bias = hf_gpt_neox_state_dict.get(
+                prefix + "attention.query_key_value.bias")
+
+            new_qkv_weight_shape = torch.Size(
+                [num_heads, 3, head_size * qkv_weights.size()[-1]])
+            new_qkv_bias_shape = torch.Size([num_heads, 3, head_size])
+
+            qkv_weights = qkv_weights.view(new_qkv_weight_shape).permute(
+                1, 0, 2).reshape([hidden_size * 3, hidden_size])
+            qkv_bias = qkv_bias.view(new_qkv_bias_shape).permute(
+                1, 0, 2).reshape([hidden_size * 3])
+
+            if tp_size > 1:
+                qkv_weights = qkv_weights.reshape(
+                    3, hidden_size, hidden_size).to(torch_dtype).cpu().numpy()
+                split_qkv_weights = numpy_split(
+                    qkv_weights, tp_size, rank,
+                    dim=1).reshape(3 * (hidden_size // tp_size), hidden_size)
+                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.weight.value = \
+                    np.ascontiguousarray(split_qkv_weights)
+
+                qkv_bias = qkv_bias.reshape(
+                    3, hidden_size).to(torch_dtype).cpu().numpy()
+                split_qkv_bias = numpy_split(qkv_bias, tp_size, rank,
+                                             dim=1).reshape(
+                                                 3 * (hidden_size // tp_size))
+                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
+                    np.ascontiguousarray(split_qkv_bias)
+            else:
+                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.weight.value = \
+                    qkv_weights.to(torch_dtype).cpu().numpy()
+                xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
+                    qkv_bias.to(torch_dtype).cpu().numpy()
+        else:
+            # use_weight_only_groupwise_quant_matmul_plugin
+
+            qweight_int32 = hf_gpt_neox_state_dict.get(
+                prefix + "attention.query_key_value.qweight")
+            scales_fp16 = hf_gpt_neox_state_dict.get(
+                prefix + "attention.query_key_value.scales")
+            qzeros_int32 = hf_gpt_neox_state_dict.get(
+                prefix + "attention.query_key_value.qzeros")
+            biases_fp16 = hf_gpt_neox_state_dict.get(
+                prefix + "attention.query_key_value.bias")
+
+            # [hidden_size // 8, hidden_size * 3] -> [hidden_size * 3, hidden_size]
+            qweight_unpacked_int8 = unpack_int32_into_int8(
+                qweight_int32.T).contiguous() - 8
+            # [hidden_size // GROUP_SIZE, hidden_size * 3 // 8] ->
+            # [hidden_size // GROUP_SIZE, hidden_size * 3]
+            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
+
+            # qkv_weights [num_heads x (q|k|v), hidden_size] ->
+            # [(num_heads x q)|(num_heads x k)|(num_heads x v), hidden_size]
+            new_qkv_weight_shape = torch.Size(
+                [num_heads, 3, head_size * qweight_unpacked_int8.size()[-1]])
+            # [hidden_size * 3, hidden_size]
+            qweight_unpacked_int8 = qweight_unpacked_int8.view(
+                new_qkv_weight_shape).permute(1, 0, 2).reshape(
+                    [hidden_size * 3, hidden_size]).contiguous()
+
+            new_qkv_scale_shape = torch.Size(
+                [num_heads, 3, head_size * (hidden_size // GROUP_SIZE)])
+            # [hidden_size * 3, hidden_size // GROUP_SIZE]
+            scales_fp16 = scales_fp16.T.contiguous().view(
+                new_qkv_scale_shape).permute(1, 0, 2).reshape(
+                    [hidden_size * 3, hidden_size // GROUP_SIZE]).contiguous()
+
+            new_qkv_zero_shape = torch.Size(
+                [num_heads, 3, head_size * (hidden_size // GROUP_SIZE)])
+            # [hidden_size * 3, hidden_size // GROUP_SIZE]
+            qzeros_unpacked_int8 = qzeros_unpacked_int8.T.contiguous().view(
+                new_qkv_zero_shape).permute(1, 0, 2).reshape(
+                    [hidden_size * 3, hidden_size // GROUP_SIZE]).contiguous()
+
+            new_qkv_bias_shape = torch.Size([num_heads, 3, head_size])
+            biases_fp16 = biases_fp16.view(new_qkv_bias_shape).permute(
+                1, 0, 2).reshape([hidden_size * 3]).numpy()
+
+            if tp_size > 1:
+                qweight_unpacked_int8 = qweight_unpacked_int8.reshape(
+                    [3, hidden_size, hidden_size])
+                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
+                                                    tp_size,
+                                                    rank,
+                                                    dim=1)
+                qweight_unpacked_int8 = qweight_unpacked_int8.reshape(
+                    [3 * hidden_size // tp_size, hidden_size])
+
+                scales_fp16 = scales_fp16.reshape(
+                    [3, hidden_size, hidden_size // GROUP_SIZE])
+                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=1)
+                scales_fp16 = scales_fp16.reshape(
+                    [3 * hidden_size // tp_size, hidden_size // GROUP_SIZE])
+
+                qzeros_unpacked_int8 = qzeros_unpacked_int8.reshape(
+                    [3, hidden_size, hidden_size // GROUP_SIZE])
+                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
+                                                   tp_size,
+                                                   rank,
+                                                   dim=1)
+                qzeros_unpacked_int8 = qzeros_unpacked_int8.reshape(
+                    [3 * hidden_size // tp_size, hidden_size // GROUP_SIZE])
+
+                biases_fp16 = biases_fp16.reshape([3, hidden_size])
+                biases_fp16 = numpy_split(biases_fp16, tp_size, rank, dim=1)
+                biases_fp16 = biases_fp16.reshape([3 * hidden_size // tp_size])
+
+            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
+                qweight_unpacked_int8.T.contiguous(),
+                scales_fp16.T.contiguous(), qzeros_unpacked_int8.T.contiguous())
+
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.qweight.value = \
+                qweight_fp32
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.scale.value = \
+                scales_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.zero.value = \
+                zeros_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.qkv.bias.value = \
+                biases_fp16
+
+            qweight_int32 = hf_gpt_neox_state_dict.get(
+                prefix + "attention.dense.qweight")
+            scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
+                                                     "attention.dense.scales")
+            qzeros_int32 = hf_gpt_neox_state_dict.get(prefix +
+                                                      "attention.dense.qzeros")
+            biases_fp16 = hf_gpt_neox_state_dict.get(
+                prefix + "attention.dense.bias").numpy()
+
+            # [k=hidden_size // 8, n=hidden_size] -> [n=hidden_size, k=hidden_size]
+            qweight_unpacked_int8 = unpack_int32_into_int8(
+                qweight_int32.T).contiguous() - 8
+            # [n=hidden_size, k=hidden_size] -> [k=hidden_size, n=hidden_size]
+            qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
+            # [k=hidden_size // GROUP_SIZE, n=hidden_size // 8] ->
+            # [k=hidden_size // GROUP_SIZE, n=hidden_size]
+            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
+
+            if tp_size > 1:
+                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
+                                                    tp_size,
+                                                    rank,
+                                                    dim=0)
+                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=0)
+                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
+                                                   tp_size,
+                                                   rank,
+                                                   dim=0)
+                if rank > 0:  # keep the bias on rank 0 only
+                    biases_fp16 = np.zeros_like(biases_fp16)
+
+            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
+                qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)
+
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.qweight.value = \
+                qweight_fp32
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.scale.value = \
+                scales_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.zero.value = \
+                zeros_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].attention.dense.bias.value = \
+                biases_fp16
+
+            qweight_int32 = hf_gpt_neox_state_dict.get(
+                prefix + "mlp.dense_h_to_4h.qweight")
+            scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
+                                                     "mlp.dense_h_to_4h.scales")
+            qzeros_int32 = hf_gpt_neox_state_dict.get(
+                prefix + "mlp.dense_h_to_4h.qzeros")
+            biases_fp16 = hf_gpt_neox_state_dict.get(
+                prefix + "mlp.dense_h_to_4h.bias").numpy()
+
+            # [hidden_size // 8, hidden_size * 4] -> [hidden_size, hidden_size * 4]
+            qweight_unpacked_int8 = unpack_int32_into_int8(
+                qweight_int32.T).contiguous() - 8
+            qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
+
+            # [hidden_size // GROUP_SIZE, hidden_size * 4 // 8] ->
+            # [hidden_size // GROUP_SIZE, hidden_size * 4]
+            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
+
+            if tp_size > 1:
+                # [hidden_size, hidden_size * 4] ->
+                # [hidden_size, hidden_size * 4 // tp_size]
+                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
+                                                    tp_size,
+                                                    rank,
+                                                    dim=1)
+                # [hidden_size // GROUP_SIZE, hidden_size * 4] ->
+                # [hidden_size // GROUP_SIZE, hidden_size * 4 // tp_size]
+                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=1)
+                # [hidden_size // GROUP_SIZE, hidden_size * 4] ->
+                # [hidden_size // GROUP_SIZE, hidden_size * 4 // tp_size]
+                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
+                                                   tp_size,
+                                                   rank,
+                                                   dim=1)
+                # [hidden_size * 4] -> [hidden_size * 4 // tp_size]
+                biases_fp16 = numpy_split(biases_fp16, tp_size, rank, dim=0)
+
+            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
+                qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)
+
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.qweight.value = \
+                qweight_fp32
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.scale.value = \
+                scales_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.zero.value = \
+                zeros_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.fc.bias.value = \
+                biases_fp16
+
+            qweight_int32 = hf_gpt_neox_state_dict.get(
+                prefix + "mlp.dense_4h_to_h.qweight")
+            scales_fp16 = hf_gpt_neox_state_dict.get(prefix +
+                                                     "mlp.dense_4h_to_h.scales")
+            qzeros_int32 = hf_gpt_neox_state_dict.get(
+                prefix + "mlp.dense_4h_to_h.qzeros")
+            biases_fp16 = hf_gpt_neox_state_dict.get(
+                prefix + "mlp.dense_4h_to_h.bias").numpy()
+
+            # [hidden_size * 4 // 8, hidden_size] -> [hidden_size * 4, hidden_size]
+            qweight_unpacked_int8 = unpack_int32_into_int8(
+                qweight_int32.T).contiguous() - 8
+            qweight_unpacked_int8 = qweight_unpacked_int8.T.contiguous()
+
+            # [hidden_size * 4 // GROUP_SIZE, hidden_size // 8] ->
+            # [hidden_size * 4 // GROUP_SIZE, hidden_size]
+            qzeros_unpacked_int8 = unpack_int32_into_int8(qzeros_int32)
+
+            if tp_size > 1:
+                # [hidden_size * 4, hidden_size] ->
+                # [hidden_size * 4 // tp_size, hidden_size]
+                qweight_unpacked_int8 = torch_split(qweight_unpacked_int8,
+                                                    tp_size,
+                                                    rank,
+                                                    dim=0)
+                # [hidden_size * 4 // GROUP_SIZE, hidden_size] ->
+                # [hidden_size * 4 // GROUP_SIZE // tp_size, hidden_size]
+                scales_fp16 = torch_split(scales_fp16, tp_size, rank, dim=0)
+                # [hidden_size * 4 // GROUP_SIZE, hidden_size] ->
+                # [hidden_size * 4 // GROUP_SIZE // tp_size, hidden_size]
+                qzeros_unpacked_int8 = torch_split(qzeros_unpacked_int8,
+                                                   tp_size,
+                                                   rank,
+                                                   dim=0)
+                if rank > 0:  # keep the bias on rank 0 only
+                    biases_fp16 = np.zeros_like(biases_fp16)
+
+            qweight_fp32, scales_fp16, zeros_fp16 = preprocess_groupwise_weight_params(
+                qweight_unpacked_int8, scales_fp16, qzeros_unpacked_int8)
+
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.qweight.value = \
+                qweight_fp32
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.scale.value = \
+                scales_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.zero.value = \
+                zeros_fp16
+            xtrt_llm_gpt_neox.layers[layer_idx].mlp.proj.bias.value = \
+                biases_fp16
+
+    v = hf_gpt_neox_state_dict.get('gpt_neox.final_layer_norm.weight')
+    xtrt_llm_gpt_neox.ln_f.weight.value = v.to(torch_dtype).cpu().numpy()
+
+    v = hf_gpt_neox_state_dict.get('gpt_neox.final_layer_norm.bias')
+    xtrt_llm_gpt_neox.ln_f.bias.value = v.to(torch_dtype).cpu().numpy()
+
+    v = hf_gpt_neox_state_dict.get('embed_out.weight').to(
+        torch_dtype).cpu().numpy()
+    if tp_size > 1:
+        # [vocab_size, hidden_size] ->
+        # [vocab_size // tp_size, hidden_size]
+        if v.shape[0] % tp_size != 0:
+            # Zero-pad the vocab dim to a multiple of tp_size.
+            vocab_size_padded = pad_vocab_size(v.shape[0], tp_size)
+            pad_width = vocab_size_padded - v.shape[0]
+            v = np.pad(v, ((0, pad_width), (0, 0)),
+                       'constant',
+                       constant_values=0)
+
+        split_v = numpy_split(v, tp_size, rank, dim=0)
+        xtrt_llm_gpt_neox.lm_head.weight.value = split_v
+    else:
+        xtrt_llm_gpt_neox.lm_head.weight.value = v
+
+    tok = time.time()
+    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
+    xtrt_llm.logger.info(f'Weights loaded. Total time: {t}')