Initialize project; model provided by the ModelHub XC community

Model: TheBloke/AquilaChat2-34B-AWQ Source: Original Platform
2026-05-30 06:20:15 +08:00
commit a6a34f579b
20 changed files with 303121 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,37 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 model-00001-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text
 model-00002-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text
--- a/-Agreement.pdf
+++ b/-Agreement.pdf
--- a/0
+++ b/0
--- a/README.md
+++ b/README.md
@@ -0,0 +1,398 @@
 ---
 base_model: BAAI/AquilaChat2-34B
 inference: false
 license: other
 model_creator: Beijing Academy of Artificial Intelligence
 model_name: AquilaChat2 34B
 model_type: aquila
 prompt_template: 'System: A chat between a curious human and an artificial intelligence
  assistant. The assistant gives helpful, detailed, and polite answers to the human''s
  questions.
  Human: {prompt}
  Assistant:
  '
 quantized_by: TheBloke
 ---
 <!-- markdownlint-disable MD041 -->
 <!-- header start -->
 <!-- 200823 -->
 <div style="width: auto; margin-left: auto; margin-right: auto">
 <img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
 </div>
 <div style="display: flex; justify-content: space-between; width: 100%;">
    <div style="display: flex; flex-direction: column; align-items: flex-start;">
        <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/theblokeai">Chat & support: TheBloke's Discord server</a></p>
    </div>
    <div style="display: flex; flex-direction: column; align-items: flex-end;">
        <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
    </div>
 </div>
 <div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">TheBloke's LLM work is generously supported by a grant from <a href="https://a16z.com">andreessen horowitz (a16z)</a></p></div>
 <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
 <!-- header end -->
 # AquilaChat2 34B - AWQ
 - Model creator: [Beijing Academy of Artificial Intelligence](https://huggingface.co/BAAI)
 - Original model: [AquilaChat2 34B](https://huggingface.co/BAAI/AquilaChat2-34B)
 <!-- description start -->
 ## Description
 This repo contains AWQ model files for [Beijing Academy of Artificial Intelligence's AquilaChat2 34B](https://huggingface.co/BAAI/AquilaChat2-34B).
 These files were quantised using hardware kindly provided by [Massed Compute](https://massedcompute.com/).
 ### About AWQ
 AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality compared to the most commonly used GPTQ settings.
 It is supported by:
 - [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
 - [vLLM](https://github.com/vllm-project/vllm) - Llama and Mistral models only
 - [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
 - [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code
 <!-- description end -->
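As a rough illustration of what 4-bit, group-wise weight storage means, the sketch below quantises one group of 128 weights to 4-bit integers with a per-group scale and zero point, then dequantises it again. It is a simplified, pure-Python illustration only: it does not implement AWQ's activation-aware scaling, the function names are hypothetical and not part of AutoAWQ, and it assumes the group's values straddle zero.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric (zero-point) quantisation of one group of floats."""
    qmax = (1 << bits) - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 so a constant group does not divide by zero.
    scale = (w_max - w_min) / qmax or 1.0
    zero_point = max(0, min(qmax, round(-w_min / scale)))
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


random.seed(0)
group = [random.uniform(-1.0, 1.0) for _ in range(128)]  # one group of 128 weights
q, scale, zp = quantize_group(group)
restored = dequantize_group(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.4f} zero_point={zp} max_abs_error={max_err:.4f}")
```

Each group stores 128 small integers plus one scale and one zero point, which is where most of the size reduction relative to fp16 comes from.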
 <!-- repositories-available start -->
 ## Repositories available
 * [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/AquilaChat2-34B-AWQ)
 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/AquilaChat2-34B-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/AquilaChat2-34B-GGUF)
 * [Beijing Academy of Artificial Intelligence's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/BAAI/AquilaChat2-34B)
 <!-- repositories-available end -->
 <!-- prompt-template start -->
 ## Prompt template: AquilaChat
 ```
 System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
 Human: {prompt}
 Assistant:
 ```
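The template above can be filled in with a small helper. This is a minimal sketch: `build_prompt` and `SYSTEM_MESSAGE` are hypothetical names introduced here, and the trailing newline after `Assistant:` follows the Python examples later in this README rather than anything the template itself mandates.

```python
SYSTEM_MESSAGE = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def build_prompt(user_message: str, system_message: str = SYSTEM_MESSAGE) -> str:
    """Return a single-turn AquilaChat prompt for one user message."""
    return f"System: {system_message}\nHuman: {user_message}\nAssistant:\n"


print(build_prompt("Tell me about AI"))
```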
 <!-- prompt-template end -->
 <!-- README_AWQ.md-provided-files start -->
 ## Provided files, and AWQ parameters
 For my first release of AWQ models, I am releasing 128g models only. I will consider adding 32g as well if there is interest, and once I have done perplexity and evaluation comparisons, but at this time 32g models are still not fully tested with AutoAWQ and vLLM.
 Models are released as sharded safetensors files.
 | Branch | Bits | GS | AWQ Dataset | Seq Len | Size |
 | ------ | ---- | -- | ----------- | ------- | ---- |
 | [main](https://huggingface.co/TheBloke/AquilaChat2-34B-AWQ/tree/main) | 4 | 128 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 19.33 GB
 <!-- README_AWQ.md-provided-files end -->
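The size in the table can be cross-checked against the two shard sizes recorded in this repository's Git LFS pointer files. The sketch below assumes the table reports decimal gigabytes (10^9 bytes); the variable names are illustrative.

```python
# Shard sizes in bytes, as recorded in this repository's Git LFS pointer files.
shard_sizes = {
    "model-00001-of-00002.safetensors": 9_989_612_296,
    "model-00002-of-00002.safetensors": 9_335_846_128,
}

total_bytes = sum(shard_sizes.values())
size_gb = total_bytes / 1e9      # decimal gigabytes
size_gib = total_bytes / 2**30   # binary gibibytes, closer to what GPU memory tools report

print(f"{size_gb:.2f} GB (decimal), {size_gib:.2f} GiB (binary)")
```

The decimal figure agrees with the 19.33 GB listed above. Actual GPU memory use will be higher once activations and the KV cache are included.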
 <!-- README_AWQ.md-text-generation-webui start -->
 ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
 Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
 1. Click the **Model tab**.
 2. Under **Download custom model or LoRA**, enter `TheBloke/AquilaChat2-34B-AWQ`.
 3. Click **Download**.
 4. The model will start downloading. Once it's finished it will say "Done".
 5. In the top left, click the refresh icon next to **Model**.
 6. In the **Model** dropdown, choose the model you just downloaded: `AquilaChat2-34B-AWQ`
 7. Select **Loader: AutoAWQ**.
 8. Click Load, and the model will load and is now ready for use.

 9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
 10. Once you're ready, click the **Text Generation** tab and enter a prompt to get started!
 <!-- README_AWQ.md-text-generation-webui end -->
 <!-- README_AWQ.md-use-from-vllm start -->
 ## Multi-user inference server: vLLM
 Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
 - Please ensure you are using vLLM version 0.2 or later.
 - When using vLLM as a server, pass the `--quantization awq` parameter.
 For example:
 ```shell
 python3 -m vllm.entrypoints.api_server --model TheBloke/AquilaChat2-34B-AWQ --quantization awq
 ```
 - When using vLLM from Python code, again set `quantization="awq"`.
 For example:
 ```python
 from vllm import LLM, SamplingParams
 prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
 ]
 # Plain (non-f) string: {prompt} is filled in below by .format()
 prompt_template='''System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
 Human: {prompt}
 Assistant:
 '''
 prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM(model="TheBloke/AquilaChat2-34B-AWQ", quantization="awq", dtype="auto")
 outputs = llm.generate(prompts, sampling_params)
 # Print the outputs.
 for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
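Once the server from the shell command earlier in this section is running, it can be queried over HTTP. The sketch below uses only the standard library. The `/generate` path, port 8000, and the `text` field in the response are assumptions based on vLLM's demo API server and may differ between versions; the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, host="localhost", port=8000, **sampling):
    """Build a POST request for the (assumed) /generate endpoint."""
    payload = {"prompt": prompt, **sampling}
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the list of generated texts."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]


# Example (requires a running server):
# print(generate("Human: Tell me about AI\nAssistant:\n", temperature=0.8, top_p=0.95))
```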
 <!-- README_AWQ.md-use-from-vllm end -->
 <!-- README_AWQ.md-use-from-tgi start -->
 ## Multi-user inference server: Hugging Face Text Generation Inference (TGI)
 Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
 Example Docker parameters:
 ```shell
 --model-id TheBloke/AquilaChat2-34B-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
 ```
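The parameters above imply a simple token budget: a prompt may use at most 3696 tokens, and prompt plus generated tokens may not exceed 4096. A minimal sketch of that check follows; the helper name is hypothetical and the limits are copied from the example parameters.

```python
MAX_INPUT_LENGTH = 3696   # --max-input-length
MAX_TOTAL_TOKENS = 4096   # --max-total-tokens


def fits_budget(prompt_tokens: int, max_new_tokens: int) -> bool:
    """Return True if a request stays within both limits."""
    return (
        prompt_tokens <= MAX_INPUT_LENGTH
        and prompt_tokens + max_new_tokens <= MAX_TOTAL_TOKENS
    )


# Tokens left for generation when the prompt is as long as allowed.
headroom = MAX_TOTAL_TOKENS - MAX_INPUT_LENGTH
print(headroom, fits_budget(3696, 128), fits_budget(3696, 512))
```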
 Example Python code for interfacing with TGI (requires [huggingface-hub](https://github.com/huggingface/huggingface_hub) 0.17.0 or later):
 ```shell
 pip3 install huggingface-hub
 ```
 ```python
 from huggingface_hub import InferenceClient
 endpoint_url = "https://your-endpoint-url-here"
 prompt = "Tell me about AI"
 prompt_template=f'''System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
 Human: {prompt}
 Assistant:
 '''
 client = InferenceClient(endpoint_url)
 response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)
 print("Model output: ", response)
 ```
 <!-- README_AWQ.md-use-from-tgi end -->
 <!-- README_AWQ.md-use-from-python start -->
 ## Inference from Python code using AutoAWQ
 ### Install the AutoAWQ package
 Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.1 or later.
 ```shell
 pip3 install autoawq
 ```
 If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:
 ```shell
 pip3 uninstall -y autoawq
 git clone https://github.com/casper-hansen/AutoAWQ
 cd AutoAWQ
 pip3 install .
 ```
 ### AutoAWQ example code
 ```python
 from awq import AutoAWQForCausalLM
 from transformers import AutoTokenizer
 model_name_or_path = "TheBloke/AquilaChat2-34B-AWQ"
 # Load tokenizer
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
 # Load model
 model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=True, safetensors=True)
 prompt = "Tell me about AI"
 prompt_template=f'''System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
 Human: {prompt}
 Assistant:
 '''
 print("*** Running model.generate:")
 token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
 ).input_ids.cuda()
 # Generate output
 generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
 )
 # Get the tokens from the output, decode them, print them
 token_output = generation_output[0]
 text_output = tokenizer.decode(token_output)
 print("LLM output: ", text_output)
 """
 # Inference should be possible with transformers pipeline as well in future
 # But currently this is not yet supported by AutoAWQ (correct as of September 25th 2023)
 from transformers import pipeline
 print("*** Pipeline:")
 pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
 )
 print(pipe(prompt_template)[0]['generated_text'])
 """
 ```
 <!-- README_AWQ.md-use-from-python end -->
 <!-- README_AWQ.md-compatibility start -->
 ## Compatibility
 The files provided are tested to work with:
 - [text-generation-webui](https://github.com/oobabooga/text-generation-webui) using `Loader: AutoAWQ`.
 - [vLLM](https://github.com/vllm-project/vllm) version 0.2.0 and later.
 - [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.1.0 and later.
 - [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) version 0.1.1 and later.
 <!-- README_AWQ.md-compatibility end -->
 <!-- footer start -->
 <!-- 200823 -->
 ## Discord
 For further support, and discussions on these models and AI in general, join us at:
 [TheBloke AI's Discord server](https://discord.gg/theblokeai)
 ## Thanks, and how to contribute
 Thanks to the [chirper.ai](https://chirper.ai) team!
 Thanks to Clay from [gpus.llm-utils.org](llm-utils)!
 I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
 If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
 Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
 * Patreon: https://patreon.com/TheBlokeAI
 * Ko-Fi: https://ko-fi.com/TheBlokeAI
 **Special thanks to**: Aemon Algiz.
 **Patreon special mentions**: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius
 Thank you to all my generous patrons and donaters!
 And thank you again to a16z for their generous grant.
 <!-- footer end -->
 # Original model card: Beijing Academy of Artificial Intelligence's AquilaChat2 34B
 ![Aquila_logo](./log.jpeg)
 <h4 align="center">
    <p>
        <b>English</b> |
        <a href="https://huggingface.co/BAAI/AquilaChat2-34B/blob/main/README_zh.md">简体中文</a> 
    </p>
 </h4>
 <p align="center">
  <a href="https://github.com/FlagAI-Open/Aquila2" target="_blank">Github</a> • <a href="https://github.com/FlagAI-Open/Aquila2/blob/main/assets/wechat-qrcode.jpg" target="_blank">WeChat</a> <br>
 </p>
 We open-source our **Aquila2** series, which now includes **Aquila2**, the base language models (**Aquila2-7B** and **Aquila2-34B**); **AquilaChat2**, the chat models (**AquilaChat2-7B** and **AquilaChat2-34B**); and the long-text chat models (**AquilaChat2-7B-16k** and **AquilaChat2-34B-16k**).
 2023.10.25 🔥 **AquilaChat2-34B v1.2** is based on the previous **AquilaChat2-34B**. 
 The AquilaChat2-34B model is close to or exceeds the level of GPT3.5 in the subjective evaluation of 8 secondary ability dimensions.
 The additional details of the Aquila model will be presented in the official technical report. Please stay tuned for updates on official channels.
 ## Quick Start: AquilaChat2-34B (Chat model)
 ### 1. Inference
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from transformers import BitsAndBytesConfig
 import torch
 device = torch.device("cuda:0")
 model_info = "BAAI/AquilaChat2-34B"
 tokenizer = AutoTokenizer.from_pretrained(model_info, trust_remote_code=True)
 quantization_config=BitsAndBytesConfig(
                        load_in_4bit=True,
                        bnb_4bit_use_double_quant=True,
                        bnb_4bit_quant_type="nf4",
                        bnb_4bit_compute_dtype=torch.bfloat16,
                    )
 model = AutoModelForCausalLM.from_pretrained(model_info, trust_remote_code=True, torch_dtype=torch.bfloat16,
                                                # quantization_config=quantization_config, # Uncomment this line for 4bit quantization
                                                )
 model.eval()
 model.to(device)
 text = "请给出10个要到北京旅游的理由。"
 from predict import predict
 out = predict(model, text, tokenizer=tokenizer, max_gen_len=200, top_p=0.9,
              seed=123, topk=15, temperature=1.0, sft=True, device=device,
              model_name="AquilaChat2-34B")
 print(out)
 ```
 ## License
 The Aquila2 series of open-source models is licensed under the [BAAI Aquila Model Licence Agreement](https://huggingface.co/BAAI/AquilaChat2-34B/blob/main/BAAI-Aquila-Model-License%20-Agreement.pdf).
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -0,0 +1,10 @@
 {
  "</s>": 100007,
  "<|LDWANG|>": 100002,
  "<|endofpiece|>": 100001,
  "<|startofpiece|>": 100000,
  "[CLS]": 100006,
  "[MASK]": 100003,
  "[gMASK]": 100004,
  "[sMASK]": 100005
 }
--- a/config.json
+++ b/config.json
@@ -0,0 +1,38 @@
 {
  "_name_or_path": "/workspace/process/baai_aquilachat2-34b/source",
  "architectures": [
    "AquilaForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_aquila.AquilaConfig",
    "AutoModelForCausalLM": "modeling_aquila.AquilaForCausalLM"
  },
  "bos_token_id": 100006,
  "eos_token_id": 100007,
  "hidden_act": "silu",
  "hidden_size": 6144,
  "initializer_range": 0.02,
  "intermediate_size": 24576,
  "max_position_embeddings": 4096,
  "model_type": "aquila",
  "num_attention_heads": 48,
  "num_hidden_layers": 60,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.34.1",
  "use_cache": true,
  "vocab_size": 100008,
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
 }
--- a/configuration.json
+++ b/configuration.json
@@ -0,0 +1 @@
 {"framework": "pytorch", "task": "text-generation", "allow_remote": true}
--- a/configuration_aquila.py
+++ b/configuration_aquila.py
@@ -0,0 +1,128 @@
 # coding=utf-8
 # Copyright 2023 EleutherAI and the HuggingFace Inc. team. All rights reserved.
 #
 # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
 # and OPT implementations in this library. It has been modified from its
 # original forms to accommodate minor architectural differences compared
 # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """ Aquila model configuration"""
 from transformers import PretrainedConfig
 class AquilaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`AquilaModel`]. It is used to instantiate an Aquila
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the Aquila-7B.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        vocab_size (`int`, *optional*, defaults to 100008):
            Vocabulary size of the Aquila model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`AquilaModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-6):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings(`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings
        Example:
    ```python
    >>> from transformers import AquilaModel, AquilaConfig
    >>> # Initializing a Aquila aquila-7b style configuration
    >>> configuration = AquilaConfig()
    >>> # Initializing a model from the aquila-7b style configuration
    >>> model = AquilaModel(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "aquila"
    keys_to_ignore_at_inference = ["past_key_values"]
    def __init__(
        self,
        vocab_size=100008,
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=2048,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,7 @@
 {
  "_from_model_config": true,
  "bos_token_id": 100006,
  "eos_token_id": 100007,
  "pad_token_id": 0,
  "transformers_version": "4.31.0"
 }
--- a/merges.txt
+++ b/merges.txt
--- a/model-00001-of-00002.safetensors
+++ b/model-00001-of-00002.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:4acf105abdcb4a019af739302593c3f5ceb83f938001f7ee1868557646b9f305
 size 9989612296
--- a/model-00002-of-00002.safetensors
+++ b/model-00002-of-00002.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:4aa7c8535369a7e1c26739c5a0de308d2a9df5476809509ffabf8305a1640e87
 size 9335846128
--- a/model.safetensors.index.json
+++ b/model.safetensors.index.json
--- a/modeling_aquila.py
+++ b/modeling_aquila.py
--- a/predict.py
+++ b/predict.py
@@ -0,0 +1,446 @@
 """
 Copied from https://github.com/lm-sys/FastChat.
 Later we will contribute our changes into it.
 """
 import dataclasses
 from enum import auto, IntEnum
 from typing import Any, Dict, List, Optional, Tuple, Union
 import math
 import random
 import numpy as np
 import torch
 import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 from transformers.activations import ACT2FN
 from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
 from transformers.modeling_utils import PreTrainedModel
 from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
 from transformers import (
    LogitsProcessorList,
    MinLengthLogitsProcessor,
    TopKLogitsWarper,
    TemperatureLogitsWarper,
    TopPLogitsWarper,
    StoppingCriteriaList,
    MaxLengthCriteria,
    BitsAndBytesConfig,
 )
 class SeparatorStyle(IntEnum):
    """Separator styles."""
    ADD_COLON_SINGLE = auto()
    ADD_COLON_TWO = auto()
    ADD_COLON_SPACE_SINGLE = auto()
    NO_COLON_SINGLE = auto()
    NO_COLON_TWO = auto()
    ADD_NEW_LINE_SINGLE = auto()
 @dataclasses.dataclass
 class Conversation:
    """A class that manages prompt templates and keeps all conversation history."""
    # The name of this template
    name: str
    # The template of the system prompt
    system_template: str = "{system_message}"
    # The system message
    system_message: str = ""
    # The names of two roles
    roles: List[str] = (("USER", "ASSISTANT"),)
    # All messages. Each item is (role, message).
    messages: List[List[str]] = ()
    # The number of few shot examples
    offset: int = 0
    # The separator style and configurations
    sep_style: SeparatorStyle = SeparatorStyle.ADD_COLON_SINGLE
    sep: str = "\n"
    sep2: str = None
    # Stop criteria (the default one is EOS token)
    stop_str: str = None
    # Stops generation if meeting any token in this list
    stop_token_ids: List[int] = None
    def get_prompt(self) -> str:
        """Get the prompt for generation."""
        system_prompt = self.system_template.format(system_message=self.system_message)
        if self.sep_style == SeparatorStyle.ADD_COLON_SINGLE:
            ret = system_prompt + self.sep
            for role, message in self.messages:
                if message:
                    ret += role + ": " + message + self.sep
                else:
                    ret += role + ":"
            return ret
        elif self.sep_style == SeparatorStyle.ADD_COLON_TWO:
            seps = [self.sep, self.sep2]
            ret = system_prompt + seps[0]
            for i, (role, message) in enumerate(self.messages):
                if message:
                    ret += role + ": " + message + seps[i % 2]
                else:
                    ret += role + ":"
            return ret
        elif self.sep_style == SeparatorStyle.ADD_COLON_SPACE_SINGLE:
            ret = system_prompt + self.sep
            for role, message in self.messages:
                if message:
                    ret += role + ": " + message + self.sep
                else:
                    ret += role + ": "  # must be end with a space
            return ret
        elif self.sep_style == SeparatorStyle.ADD_NEW_LINE_SINGLE:
            ret = "" if system_prompt == "" else system_prompt + self.sep
            for role, message in self.messages:
                if message:
                    ret += role + "\n" + message + self.sep
                else:
                    ret += role + "\n"
            return ret
        elif self.sep_style == SeparatorStyle.NO_COLON_SINGLE:
            ret = system_prompt
            for role, message in self.messages:
                if message:
                    ret += role + message + self.sep
                else:
                    ret += role
            return ret
        elif self.sep_style == SeparatorStyle.NO_COLON_TWO:
            seps = [self.sep, self.sep2]
            ret = system_prompt
            for i, (role, message) in enumerate(self.messages):
                if message:
                    ret += role + message + seps[i % 2]
                else:
                    ret += role
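            return ret
        else:
            # Fail loudly instead of silently returning None for an unknown style.
            raise ValueError(f"Invalid style: {self.sep_style}")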
    def set_system_message(self, system_message: str):
        """Set the system message."""
        self.system_message = system_message
    def append_message(self, role: str, message: str):
        """Append a new message."""
        self.messages.append([role, message])
    def update_last_message(self, message: str):
        """Update the last output.
        The last message is typically set to be None when constructing the prompt,
        so we need to update it in-place after getting the response from a model.
        """
        self.messages[-1][1] = message
    def copy(self):
        return Conversation(
            name=self.name,
            system_template=self.system_template,
            system_message=self.system_message,
            roles=self.roles,
            messages=[[x, y] for x, y in self.messages],
            offset=self.offset,
            sep_style=self.sep_style,
            sep=self.sep,
            sep2=self.sep2,
            stop_str=self.stop_str,
            stop_token_ids=self.stop_token_ids,
        )
    def dict(self):
        return {
            "template_name": self.name,
            "system_message": self.system_message,
            "roles": self.roles,
            "messages": self.messages,
            "offset": self.offset,
        }
 # A global registry for all conversation templates
 conv_templates: Dict[str, Conversation] = {}
 def register_conv_template(template: Conversation, override: bool = False):
    """Register a new conversation template."""
    if not override:
        assert (
            template.name not in conv_templates
        ), f"{template.name} has been registered."
    conv_templates[template.name] = template
 def get_conv_template(name: str) -> Conversation:
    """Get a conversation template."""
    return conv_templates[name].copy()
 def get_conversation_template(model_path: str) -> Conversation:
    """Get the default conversation template."""
    if "aquila-v1" in model_path:
        return get_conv_template("aquila-v1")
    elif "aquila-chat" in model_path:
        return get_conv_template("aquila-chat")
    elif "aquila-legacy" in model_path:
        return get_conv_template("aquila-legacy")
    else:
        return get_conv_template("aquila")
 # AquilaChat default template
 # source: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-chat/cyg_conversation.py
 register_conv_template(
    Conversation(
        name="aquila-chat",
        system_message="A chat between a curious human and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the human's questions.",
        roles=("Human", "Assistant", "System"),
        messages=(),
        offset=0,
        sep_style=SeparatorStyle.ADD_COLON_SINGLE,
        sep="###",
        sep2="",
        stop_str=["###", "</s>", "[UNK]"],
    )
 )
 register_conv_template(
    Conversation(
        name="aquila-legacy",
        system_message="A chat between a curious human and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
        roles=("### Human: ", "### Assistant: ", "System"),
        messages=(),
        offset=0,
        sep_style=SeparatorStyle.NO_COLON_TWO,
        sep="\n",
        sep2="</s>",
        stop_str=["</s>", "[UNK]"],
    )
 )
 register_conv_template(
    Conversation(
        name="aquila",
        system_message="A chat between a curious human and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the human's questions.",
        roles=("Human", "Assistant", "System"),
        messages=(),
        offset=0,
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep="###",
        sep2="</s>",
        stop_str=["</s>", "[UNK]"],
    )
 )
 register_conv_template(
    Conversation(
        name="aquila-v1",
        roles=("<|startofpiece|>", "<|endofpiece|>", ""),
        messages=(),
        offset=0,
        sep_style=SeparatorStyle.NO_COLON_TWO,
        sep="",
        sep2="</s>",
        stop_str=["</s>", "<|endoftext|>"],
    )
 )
 if __name__ == "__main__":
    # Print a short demo prompt for each registered template.
    for template_name in ("aquila", "aquila-chat", "aquila-v1", "aquila-legacy"):
        print(f"{template_name} template:")
        conv = get_conv_template(template_name)
        conv.append_message(conv.roles[0], "Hello!")
        conv.append_message(conv.roles[1], "Hi!")
        conv.append_message(conv.roles[0], "How are you?")
        conv.append_message(conv.roles[1], None)
        print(conv.get_prompt())
        print("\n")
 def set_random_seed(seed):
    """Set the random seed for reproducibility."""
    if seed is not None and seed > 0:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
 def covert_prompt_to_input_ids_with_history(text, history, tokenizer, max_token, convo_template="aquila-chat"):
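    # "aquila-chat" is the default template.
    # Messages are appended in reverse order (current turn first, then history
    # entries popped from the end of the list) and flipped before rendering.
    # Note that `history` is consumed in place by the pops below.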
    conv = get_conv_template(convo_template)
    conv.append_message(conv.roles[1], None)
    conv.append_message(conv.roles[0], text)
    example = tokenizer.encode_plus(f"{conv.get_prompt()} ", None, max_length=None)['input_ids']
    if history is None or not isinstance(history, list):
        history = []
    while len(history) > 0 and len(example) < max_token:
        tmp = history.pop()
        if tmp[0] == 'ASSISTANT':
            conv.append_message(conv.roles[1], tmp[1])
        else:
            conv.append_message(conv.roles[0], tmp[1])
        example = tokenizer.encode_plus(f"{conv.get_prompt()} ", None, max_length=None)['input_ids']
    if len(example) >= max_token:
        conv.messages.pop()
    conv.messages = conv.messages[::-1]
    print('model in:', conv.get_prompt())
    example = tokenizer.encode_plus(f"{conv.get_prompt()} ", None, max_length=None)['input_ids']
    return example
 def predict(model, text, tokenizer=None,
            max_gen_len=200, top_p=0.95,
            seed=1234, topk=100,
            temperature=0.9,
            sft=True, convo_template="",
            device="cuda",
            model_name="AquilaChat2-7B",
            history=None,
            **kwargs):
    vocab = tokenizer.get_vocab()
    id2word = {v:k for k, v in vocab.items()}
    template_map = {"AquilaChat2-7B": "aquila-v1",
                    "AquilaChat2-34B": "aquila-legacy",
                    "AquilaChat2-7B-16K": "aquila",
                    "AquilaChat2-34B-16K": "aquila"}
    if not convo_template:
        convo_template=template_map.get(model_name, "aquila-chat")
    set_random_seed(seed)
    # A temperature of 0 requests greedy decoding: keep only the top-1 token.
    if temperature == 0:
        topk = 1
        temperature = 1.0
    if sft:
        tokens = covert_prompt_to_input_ids_with_history(text, history=history, tokenizer=tokenizer, max_token=2048, convo_template=convo_template)
        tokens = torch.tensor(tokens)[None,].to(device)
    else:
        tokens = tokenizer.encode_plus(text)["input_ids"]
        print(tokenizer.decode(tokens))
        tokens = torch.tensor(tokens)[None,].to(device)
    input_length = len(tokens[0])
    with torch.no_grad():
        # instantiate logits processors
        logits_processor = LogitsProcessorList(
            [
                MinLengthLogitsProcessor(1, eos_token_id=100007),
            ]
        )
        # instantiate logits warpers (top-p, top-k, temperature)
        logits_warper = LogitsProcessorList(
            [
                TopPLogitsWarper(top_p),
                TopKLogitsWarper(topk),
                TemperatureLogitsWarper(temperature),
            ]
        )
        stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=input_length + max_gen_len)])
        out = model.sample(
                            tokens,
                            logits_processor=logits_processor,
                            logits_warper=logits_warper,
                            stopping_criteria=stopping_criteria,
                            return_dict_in_generate=True, 
                            output_scores=True,
                        )
        # print(out)
        out_ids = out["sequences"][0][input_length:].cpu().numpy()
        out_scores = out["scores"]
        out_scores = torch.cat(out_scores, dim=0)
        out_scores = torch.nn.functional.softmax(out_scores, dim=-1).cpu().numpy()
        probs = []
        for i in range(len(out_ids)):
            probs.append(float(out_scores[i][out_ids[i]]))
        # print(f"probs is {probs}")
        convert_tokens = []
        for t in out_ids:
            if t == 100006:
                convert_tokens.append("[CLS]")
            else:
                convert_tokens.append(id2word.get(t, "[unknown_token]"))
        out_text = tokenizer.decode(out_ids.tolist())
        out = out_text
    if "[UNK]" in out:
        special_index = out.index("[UNK]")
        out = out[:special_index]
        token_length = len(tokenizer.encode_plus(out)["input_ids"])
        convert_tokens = convert_tokens[:token_length]
        probs = probs[:token_length]
    if "</s>" in out:
        special_index = out.index("</s>")
        out = out[: special_index]
        token_length = len(tokenizer.encode_plus(out)["input_ids"])
        convert_tokens = convert_tokens[:token_length]
        probs = probs[:token_length]
    if len(out) > 0 and out[0] == " ":
        out = out[1:]
        convert_tokens = convert_tokens[1:]
        probs = probs[1:]
    if isinstance(history, list):
        # Update history
        history.insert(0, ('ASSISTANT', out))
        history.insert(0, ('USER', text))
    return out 
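
The prompt layout produced by the `NO_COLON_TWO` branch with the `aquila-legacy` settings can be sketched in isolation. This is a minimal, stand-alone illustration and not the module's own code: `render_no_colon_two` is a hypothetical helper, and it assumes the system prompt is the system message verbatim (the `system_template` default is defined outside this part of the diff).

```python
# Hypothetical stand-alone re-implementation of the NO_COLON_TWO branch,
# for illustration only; it does not import the module above.
def render_no_colon_two(system_prompt, messages, sep, sep2):
    """Join (role, message) pairs, alternating between sep and sep2."""
    seps = [sep, sep2]
    out = system_prompt
    for i, (role, message) in enumerate(messages):
        if message:
            out += role + message + seps[i % 2]
        else:
            # An empty slot leaves the role tag open for the model to complete.
            out += role
    return out


# Values copied from the "aquila-legacy" registration.
# Assumption: the system prompt equals the system message unchanged.
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n"
)
HUMAN, ASSISTANT = "### Human: ", "### Assistant: "

prompt = render_no_colon_two(
    SYSTEM,
    [(HUMAN, "Hello!"), (ASSISTANT, "Hi!"), (HUMAN, "How are you?"), (ASSISTANT, None)],
    sep="\n",
    sep2="</s>",
)
print(prompt)
```

Human turns end with `sep` (a newline) and assistant turns end with `sep2` (`</s>`), so the rendered prompt closes on an open `### Assistant: ` tag for the model to continue.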
--- a/quant_config.json
+++ b/quant_config.json
@@ -0,0 +1,6 @@
 {
    "zero_point": true,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
 }
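
The fields above describe 4-bit weights (`w_bit`), quantized in groups of 128 (`q_group_size`) with a per-group zero point (`zero_point`). A rough sketch of that kind of group-wise, zero-point quantization follows. It is an assumption-laden illustration in plain Python: the helper names are hypothetical, and it does not reproduce AWQ's activation-aware scaling or the packed layout that the `version` field presumably selects.

```python
import random

W_BIT = 4          # mirrors "w_bit"
GROUP_SIZE = 128   # mirrors "q_group_size"


def quantize_group(values, w_bit=W_BIT):
    """Quantize one group to unsigned integers with its own scale and zero point."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-5) / q_max          # guard against a constant group
    zero = min(max(round(-lo / scale), 0), q_max)
    codes = [min(max(round(v / scale) + zero, 0), q_max) for v in values]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(2 * GROUP_SIZE)]  # toy weight row
groups = [row[i:i + GROUP_SIZE] for i in range(0, len(row), GROUP_SIZE)]

all_codes = []
worst_error_in_steps = 0.0
for group in groups:
    codes, scale, zero = quantize_group(group)
    restored = dequantize_group(codes, scale, zero)
    all_codes.extend(codes)
    group_error = max(abs(a - b) for a, b in zip(group, restored))
    # Express the error in units of this group's quantization step.
    worst_error_in_steps = max(worst_error_in_steps, group_error / scale)

print(len(groups), min(all_codes), max(all_codes), worst_error_in_steps)
```

In this sketch each group keeps 128 four-bit codes plus one scale and one zero point, which is why a smaller group size trades storage for accuracy.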
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,6 @@
 {
  "bos_token": "[CLS]",
  "eos_token": "</s>",
  "unk_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
 }
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,10 @@
 {
  "add_prefix_space": false,
  "bos_token": "[CLS]",
  "clean_up_tokenization_spaces": true,
  "eos_token": "</s>",
  "model_max_length": 4096,
  "padding_side": "right",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
 }
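
Both `special_tokens_map.json` and `tokenizer_config.json` declare special tokens, so a quick consistency check between them can be useful. The sketch below embeds the two JSON bodies from the hunks above as strings; the variable names are illustrative only, and nothing here loads the real tokenizer.

```python
import json

# Contents copied from the two hunks above.
special_tokens_map = json.loads("""
{
  "bos_token": "[CLS]",
  "eos_token": "</s>",
  "unk_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
}
""")
tokenizer_config = json.loads("""
{
  "add_prefix_space": false,
  "bos_token": "[CLS]",
  "clean_up_tokenization_spaces": true,
  "eos_token": "</s>",
  "model_max_length": 4096,
  "padding_side": "right",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
""")

# Keys declared in both files, and any values that disagree.
shared = sorted(set(special_tokens_map) & set(tokenizer_config))
mismatches = {
    key: (special_tokens_map[key], tokenizer_config[key])
    for key in shared
    if special_tokens_map[key] != tokenizer_config[key]
}
# Keys that only the special-tokens map declares.
only_in_map = sorted(set(special_tokens_map) - set(tokenizer_config))

print("shared:", shared)
print("mismatches:", mismatches)
print("only in special_tokens_map:", only_in_map)
```

For these two files the shared tokens agree, and `pad_token` is the only entry declared in the map alone.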
--- a/vocab.json
+++ b/vocab.json
@@ -0,0 +1 @@
 {"framework": "pytorch", "task": "text-generation", "allow_remote": true}