初始化项目,由ModelHub XC社区提供模型

Model: TheBloke/AquilaChat2-34B-AWQ
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-30 06:20:15 +08:00
commit a6a34f579b
20 changed files with 303121 additions and 0 deletions

37
.gitattributes vendored Normal file
View File

@@ -0,0 +1,37 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
model-00001-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text
model-00002-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text

Binary file not shown.

0
LICENSE Normal file
View File

398
README.md Normal file
View File

@@ -0,0 +1,398 @@
---
base_model: BAAI/AquilaChat2-34B
inference: false
license: other
model_creator: Beijing Academy of Artificial Intelligence
model_name: AquilaChat2 34B
model_type: aquila
prompt_template: 'System: A chat between a curious human and an artificial intelligence
assistant. The assistant gives helpful, detailed, and polite answers to the human''s
questions.
Human: {prompt}
Assistant:
'
quantized_by: TheBloke
---
<!-- markdownlint-disable MD041 -->
<!-- header start -->
<!-- 200823 -->
<div style="width: auto; margin-left: auto; margin-right: auto">
<img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
</div>
<div style="display: flex; justify-content: space-between; width: 100%;">
<div style="display: flex; flex-direction: column; align-items: flex-start;">
<p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/theblokeai">Chat & support: TheBloke's Discord server</a></p>
</div>
<div style="display: flex; flex-direction: column; align-items: flex-end;">
<p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
</div>
</div>
<div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">TheBloke's LLM work is generously supported by a grant from <a href="https://a16z.com">andreessen horowitz (a16z)</a></p></div>
<hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
<!-- header end -->
# AquilaChat2 34B - AWQ
- Model creator: [Beijing Academy of Artificial Intelligence](https://huggingface.co/BAAI)
- Original model: [AquilaChat2 34B](https://huggingface.co/BAAI/AquilaChat2-34B)
<!-- description start -->
## Description
This repo contains AWQ model files for [Beijing Academy of Artificial Intelligence's AquilaChat2 34B](https://huggingface.co/BAAI/AquilaChat2-34B).
These files were quantised using hardware kindly provided by [Massed Compute](https://massedcompute.com/).
### About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality compared to the most commonly used GPTQ settings.
It is supported by:
- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
- [vLLM](https://github.com/vllm-project/vllm) - Llama and Mistral models only
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code
<!-- description end -->
<!-- repositories-available start -->
## Repositories available
* [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/AquilaChat2-34B-AWQ)
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/AquilaChat2-34B-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/AquilaChat2-34B-GGUF)
* [Beijing Academy of Artificial Intelligence's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/BAAI/AquilaChat2-34B)
<!-- repositories-available end -->
<!-- prompt-template start -->
## Prompt template: AquilaChat
```
System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
Human: {prompt}
Assistant:
```
<!-- prompt-template end -->
<!-- README_AWQ.md-provided-files start -->
## Provided files, and AWQ parameters
For my first release of AWQ models, I am releasing 128g models only. I will consider adding 32g as well if there is interest, and once I have done perplexity and evaluation comparisons, but at this time 32g models are still not fully tested with AutoAWQ and vLLM.
Models are released as sharded safetensors files.
| Branch | Bits | GS | AWQ Dataset | Seq Len | Size |
| ------ | ---- | -- | ----------- | ------- | ---- |
| [main](https://huggingface.co/TheBloke/AquilaChat2-34B-AWQ/tree/main) | 4 | 128 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 19.33 GB
<!-- README_AWQ.md-provided-files end -->
<!-- README_AWQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/AquilaChat2-34B-AWQ`.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `AquilaChat2-34B-AWQ`
7. Select **Loader: AutoAWQ**.
8. Click Load, and the model will load and is now ready for use.
9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
10. Once you're ready, click the **Text Generation** tab and enter a prompt to get started!
<!-- README_AWQ.md-text-generation-webui end -->
<!-- README_AWQ.md-use-from-vllm start -->
## Multi-user inference server: vLLM
Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
- Please ensure you are using vLLM version 0.2 or later.
- When using vLLM as a server, pass the `--quantization awq` parameter.
For example:
```shell
python3 python -m vllm.entrypoints.api_server --model TheBloke/AquilaChat2-34B-AWQ --quantization awq
```
- When using vLLM from Python code, again set `quantization=awq`.
For example:
```python
from vllm import LLM, SamplingParams
prompts = [
"Tell me about AI",
"Write a story about llamas",
"What is 291 - 150?",
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
prompt_template=f'''System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
Human: {prompt}
Assistant:
'''
prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="TheBloke/AquilaChat2-34B-AWQ", quantization="awq", dtype="auto")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
<!-- README_AWQ.md-use-from-vllm start -->
<!-- README_AWQ.md-use-from-tgi start -->
## Multi-user inference server: Hugging Face Text Generation Inference (TGI)
Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
Example Docker parameters:
```shell
--model-id TheBloke/AquilaChat2-34B-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```
Example Python code for interfacing with TGI (requires [huggingface-hub](https://github.com/huggingface/huggingface_hub) 0.17.0 or later):
```shell
pip3 install huggingface-hub
```
```python
from huggingface_hub import InferenceClient
endpoint_url = "https://your-endpoint-url-here"
prompt = "Tell me about AI"
prompt_template=f'''System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
Human: {prompt}
Assistant:
'''
client = InferenceClient(endpoint_url)
response = client.text_generation(prompt,
max_new_tokens=128,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=40,
repetition_penalty=1.1)
print(f"Model output: ", response)
```
<!-- README_AWQ.md-use-from-tgi end -->
<!-- README_AWQ.md-use-from-python start -->
## Inference from Python code using AutoAWQ
### Install the AutoAWQ package
Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.1 or later.
```shell
pip3 install autoawq
```
If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:
```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```
### AutoAWQ example code
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name_or_path = "TheBloke/AquilaChat2-34B-AWQ"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
trust_remote_code=True, safetensors=True)
prompt = "Tell me about AI"
prompt_template=f'''System: A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
Human: {prompt}
Assistant:
'''
print("*** Running model.generate:")
token_input = tokenizer(
prompt_template,
return_tensors='pt'
).input_ids.cuda()
# Generate output
generation_output = model.generate(
token_input,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=40,
max_new_tokens=512
)
# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)
"""
# Inference should be possible with transformers pipeline as well in future
# But currently this is not yet supported by AutoAWQ (correct as of September 25th 2023)
from transformers import pipeline
print("*** Pipeline:")
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=40,
repetition_penalty=1.1
)
print(pipe(prompt_template)[0]['generated_text'])
"""
```
<!-- README_AWQ.md-use-from-python end -->
<!-- README_AWQ.md-compatibility start -->
## Compatibility
The files provided are tested to work with:
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui) using `Loader: AutoAWQ`.
- [vLLM](https://github.com/vllm-project/vllm) version 0.2.0 and later.
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.1.0 and later.
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) version 0.1.1 and later.
<!-- README_AWQ.md-compatibility end -->
<!-- footer start -->
<!-- 200823 -->
## Discord
For further support, and discussions on these models and AI in general, join us at:
[TheBloke AI's Discord server](https://discord.gg/theblokeai)
## Thanks, and how to contribute
Thanks to the [chirper.ai](https://chirper.ai) team!
Thanks to Clay from [gpus.llm-utils.org](llm-utils)!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
* Patreon: https://patreon.com/TheBlokeAI
* Ko-Fi: https://ko-fi.com/TheBlokeAI
**Special thanks to**: Aemon Algiz.
**Patreon special mentions**: Brandon Frisco, LangChain4j, Spiking Neurons AB, transmissions 11, Joseph William Delisle, Nitin Borwankar, Willem Michiel, Michael Dempsey, vamX, Jeffrey Morgan, zynix, jjj, Omer Bin Jawed, Sean Connelly, jinyuan sun, Jeromy Smith, Shadi, Pawan Osman, Chadd, Elijah Stavena, Illia Dulskyi, Sebastain Graf, Stephen Murray, terasurfer, Edmond Seymore, Celu Ramasamy, Mandus, Alex, biorpg, Ajan Kanaga, Clay Pascal, Raven Klaugh, 阿明, K, ya boyyy, usrbinkat, Alicia Loh, John Villwock, ReadyPlayerEmma, Chris Smitley, Cap'n Zoog, fincy, GodLy, S_X, sidney chen, Cory Kujawski, OG, Mano Prime, AzureBlack, Pieter, Kalila, Spencer Kim, Tom X Nguyen, Stanislav Ovsiannikov, Michael Levine, Andrey, Trailburnt, Vadim, Enrico Ros, Talal Aujan, Brandon Phillips, Jack West, Eugene Pentland, Michael Davis, Will Dee, webtim, Jonathan Leane, Alps Aficionado, Rooh Singh, Tiffany J. Kim, theTransient, Luke @flexchar, Elle, Caitlyn Gatomon, Ari Malik, subjectnull, Johann-Peter Hartmann, Trenton Dambrowitz, Imad Khwaja, Asp the Wyvern, Emad Mostaque, Rainer Wilmers, Alexandros Triantafyllidis, Nicholas, Pedro Madruga, SuperWojo, Harry Royden McLaughlin, James Bentley, Olakabola, David Ziegler, Ai Maven, Jeff Scroggin, Nikolai Manek, Deo Leter, Matthew Berman, Fen Risland, Ken Nordquist, Manuel Alberto Morcote, Luke Pendergrass, TL, Fred von Graf, Randy H, Dan Guido, NimbleBox.ai, Vitor Caleffi, Gabriel Tamborski, knownsqashed, Lone Striker, Erik Bjäreholt, John Detwiler, Leonard Tan, Iucharbius
Thank you to all my generous patrons and donaters!
And thank you again to a16z for their generous grant.
<!-- footer end -->
# Original model card: Beijing Academy of Artificial Intelligence's AquilaChat2 34B
![Aquila_logo](./log.jpeg)
<h4 align="center">
<p>
<b>English</b> |
<a href="https://huggingface.co/BAAI/AquilaChat2-34B/blob/main/README_zh.md">简体中文</a>
</p>
</h4>
<p align="center">
<a href="https://github.com/FlagAI-Open/Aquila2" target="_blank">Github</a><a href="https://github.com/FlagAI-Open/Aquila2/blob/main/assets/wechat-qrcode.jpg" target="_blank">WeChat</a> <br>
</p>
We opensource our **Aquila2** series, now including **Aquila2**, the base language models, namely **Aquila2-7B** and **Aquila2-34B**, as well as **AquilaChat2**, the chat models, namely **AquilaChat2-7B** and **AquilaChat2-34B**, as well as the long-text chat models, namely **AquilaChat2-7B-16k** and **AquilaChat2-34B-16k**
2023.10.25 🔥 **AquilaChat2-34B v1.2** is based on the previous **AquilaChat2-34B**.
The AquilaChat2-34B model is close to or exceeds the level of GPT3.5 in the subjective evaluation of 8 secondary ability dimensions.
The additional details of the Aquila model will be presented in the official technical report. Please stay tuned for updates on official channels.
## Quick Start AquilaChat2-34BChat model
### 1. Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import torch
device = torch.device("cuda:0")
model_info = "BAAI/AquilaChat2-34B"
tokenizer = AutoTokenizer.from_pretrained(model_info, trust_remote_code=True)
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_info, trust_remote_code=True, torch_dtype=torch.bfloat16,
# quantization_config=quantization_config, # Uncomment this line for 4bit quantization
)
model.eval()
model.to(device)
text = "请给出10个要到北京旅游的理由。"
from predict import predict
out = predict(model, text, tokenizer=tokenizer, max_gen_len=200, top_p=0.9,
seed=123, topk=15, temperature=1.0, sft=True, device=device,
model_name="AquilaChat2-34B")
print(out)
```
## License
Aquila2 series open-source model is licensed under [ BAAI Aquila Model Licence Agreement](https://huggingface.co/BAAI/AquilaChat2-34B/blob/main/BAAI-Aquila-Model-License%20-Agreement.pdf)

10
added_tokens.json Normal file
View File

@@ -0,0 +1,10 @@
{
"</s>": 100007,
"<|LDWANG|>": 100002,
"<|endofpiece|>": 100001,
"<|startofpiece|>": 100000,
"[CLS]": 100006,
"[MASK]": 100003,
"[gMASK]": 100004,
"[sMASK]": 100005
}

38
config.json Normal file
View File

@@ -0,0 +1,38 @@
{
"_name_or_path": "/workspace/process/baai_aquilachat2-34b/source",
"architectures": [
"AquilaForCausalLM"
],
"auto_map": {
"AutoConfig": "configuration_aquila.AquilaConfig",
"AutoModelForCausalLM": "modeling_aquila.AquilaForCausalLM"
},
"bos_token_id": 100006,
"eos_token_id": 100007,
"hidden_act": "silu",
"hidden_size": 6144,
"initializer_range": 0.02,
"intermediate_size": 24576,
"max_position_embeddings": 4096,
"model_type": "aquila",
"num_attention_heads": 48,
"num_hidden_layers": 60,
"num_key_value_heads": 8,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.34.1",
"use_cache": true,
"vocab_size": 100008,
"quantization_config": {
"quant_method": "awq",
"zero_point": true,
"group_size": 128,
"bits": 4,
"version": "gemm"
}
}

1
configuration.json Normal file
View File

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text-generation", "allow_remote": true}

128
configuration_aquila.py Normal file
View File

@@ -0,0 +1,128 @@
# coding=utf-8
# Copyright 2023 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Aquila model configuration"""
from transformers import PretrainedConfig
class AquilaConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`AquilaModel`]. It is used to instantiate an Aquila
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the Aquila-7B.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 32000):
Vocabulary size of the Aquila model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`AquilaModel`]
hidden_size (`int`, *optional*, defaults to 4096):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 11008):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the rms normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
tie_word_embeddings(`bool`, *optional*, defaults to `False`):
Whether to tie weight embeddings
Example:
```python
>>> from transformers import AquilaModel, AquilaConfig
>>> # Initializing a Aquila aquila-7b style configuration
>>> configuration = AquilaConfig()
>>> # Initializing a model from the aquila-7b style configuration
>>> model = AquilaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "aquila"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=100008,
hidden_size=4096,
intermediate_size=11008,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=None,
hidden_act="silu",
max_position_embeddings=2048,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
pretraining_tp=1,
tie_word_embeddings=False,
rope_theta=10000.0,
rope_scaling=None,
**kwargs,
):
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
# for backward compatibility
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.pretraining_tp = pretraining_tp
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)

7
generation_config.json Normal file
View File

@@ -0,0 +1,7 @@
{
"_from_model_config": true,
"bos_token_id": 100006,
"eos_token_id": 100007,
"pad_token_id": 0,
"transformers_version": "4.31.0"
}

99744
merges.txt Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4acf105abdcb4a019af739302593c3f5ceb83f938001f7ee1868557646b9f305
size 9989612296

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4aa7c8535369a7e1c26739c5a0de308d2a9df5476809509ffabf8305a1640e87
size 9335846128

1390
model.safetensors.index.json Normal file

File diff suppressed because it is too large Load Diff

1030
modeling_aquila.py Normal file

File diff suppressed because it is too large Load Diff

446
predict.py Normal file
View File

@@ -0,0 +1,446 @@
"""
Copied from https://github.com/lm-sys/FastChat.
Later we will contribute our changes into it.
"""
import dataclasses
from enum import auto, IntEnum
from typing import List, Any, Dict
import math
from typing import List, Optional, Tuple, Union
import random
import numpy as np
import torch
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
from transformers import (
LogitsProcessorList,
MinLengthLogitsProcessor,
TopKLogitsWarper,
TemperatureLogitsWarper,
TopPLogitsWarper,
StoppingCriteriaList,
MaxLengthCriteria,
BitsAndBytesConfig,
)
class SeparatorStyle(IntEnum):
"""Separator styles."""
ADD_COLON_SINGLE = auto()
ADD_COLON_TWO = auto()
ADD_COLON_SPACE_SINGLE = auto()
NO_COLON_SINGLE = auto()
NO_COLON_TWO = auto()
ADD_NEW_LINE_SINGLE = auto()
@dataclasses.dataclass
class Conversation:
"""A class that manages prompt templates and keeps all conversation history."""
# The name of this template
name: str
# The template of the system prompt
system_template: str = "{system_message}"
# The system message
system_message: str = ""
# The names of two roles
roles: List[str] = (("USER", "ASSISTANT"),)
# All messages. Each item is (role, message).
messages: List[List[str]] = ()
# The number of few shot examples
offset: int = 0
# The separator style and configurations
sep_style: SeparatorStyle = SeparatorStyle.ADD_COLON_SINGLE
sep: str = "\n"
sep2: str = None
# Stop criteria (the default one is EOS token)
stop_str: str = None
# Stops generation if meeting any token in this list
stop_token_ids: List[int] = None
def get_prompt(self) -> str:
"""Get the prompt for generation."""
system_prompt = self.system_template.format(system_message=self.system_message)
if self.sep_style == SeparatorStyle.ADD_COLON_SINGLE:
ret = system_prompt + self.sep
for role, message in self.messages:
if message:
ret += role + ": " + message + self.sep
else:
ret += role + ":"
return ret
elif self.sep_style == SeparatorStyle.ADD_COLON_TWO:
seps = [self.sep, self.sep2]
ret = system_prompt + seps[0]
for i, (role, message) in enumerate(self.messages):
if message:
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
return ret
elif self.sep_style == SeparatorStyle.ADD_COLON_SPACE_SINGLE:
ret = system_prompt + self.sep
for role, message in self.messages:
if message:
ret += role + ": " + message + self.sep
else:
ret += role + ": " # must be end with a space
return ret
elif self.sep_style == SeparatorStyle.ADD_NEW_LINE_SINGLE:
ret = "" if system_prompt == "" else system_prompt + self.sep
for role, message in self.messages:
if message:
ret += role + "\n" + message + self.sep
else:
ret += role + "\n"
return ret
elif self.sep_style == SeparatorStyle.NO_COLON_SINGLE:
ret = system_prompt
for role, message in self.messages:
if message:
ret += role + message + self.sep
else:
ret += role
return ret
elif self.sep_style == SeparatorStyle.NO_COLON_TWO:
seps = [self.sep, self.sep2]
ret = system_prompt
for i, (role, message) in enumerate(self.messages):
if message:
ret += role + message + seps[i % 2]
else:
ret += role
return ret
def set_system_message(self, system_message: str):
"""Set the system message."""
self.system_message = system_message
def append_message(self, role: str, message: str):
"""Append a new message."""
self.messages.append([role, message])
def update_last_message(self, message: str):
"""Update the last output.
The last message is typically set to be None when constructing the prompt,
so we need to update it in-place after getting the response from a model.
"""
self.messages[-1][1] = message
def copy(self):
return Conversation(
name=self.name,
system_template=self.system_template,
system_message=self.system_message,
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
offset=self.offset,
sep_style=self.sep_style,
sep=self.sep,
sep2=self.sep2,
stop_str=self.stop_str,
stop_token_ids=self.stop_token_ids,
)
def dict(self):
return {
"template_name": self.name,
"system_message": self.system_message,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
}
# A global registry for all conversation templates
conv_templates: Dict[str, Conversation] = {}
def register_conv_template(template: Conversation, override: bool = False):
"""Register a new conversation template."""
if not override:
assert (
template.name not in conv_templates
), f"{template.name} has been registered."
conv_templates[template.name] = template
def get_conv_template(name: str) -> Conversation:
"""Get a conversation template."""
return conv_templates[name].copy()
def get_conversation_template(model_path: str) -> Conversation:
"""Get the default conversation template."""
if "aquila-v1" in model_path:
return get_conv_template("aquila-v1")
elif "aquila-chat" in model_path:
return get_conv_template("aquila-chat")
elif "aquila-legacy" in model_path:
return get_conv_template("aquila-legacy")
else:
return get_conv_template("aquila")
# AquilaChat default template
# source: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-chat/cyg_conversation.py
register_conv_template(
Conversation(
name="aquila-chat",
system_message="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant", "System"),
messages=(),
offset=0,
sep_style=SeparatorStyle.ADD_COLON_SINGLE,
sep="###",
sep2="",
stop_str=["###", "</s>", "[UNK]"],
)
)
register_conv_template(
Conversation(
name="aquila-legacy",
system_message="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
roles=("### Human: ", "### Assistant: ", "System"),
messages=(),
offset=0,
sep_style=SeparatorStyle.NO_COLON_TWO,
sep="\n",
sep2="</s>",
stop_str=["</s>", "[UNK]"],
)
)
register_conv_template(
Conversation(
name="aquila",
system_message="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant", "System"),
messages=(),
offset=0,
sep_style=SeparatorStyle.ADD_COLON_TWO,
sep="###",
sep2="</s>",
stop_str=["</s>", "[UNK]"],
)
)
register_conv_template(
Conversation(
name="aquila-v1",
roles=("<|startofpiece|>", "<|endofpiece|>", ""),
messages=(),
offset=0,
sep_style=SeparatorStyle.NO_COLON_TWO,
sep="",
sep2="</s>",
stop_str=["</s>", "<|endoftext|>"],
)
)
if __name__ == "__main__":
print("aquila template:")
conv = get_conv_template("aquila")
conv.append_message(conv.roles[0], "Hello!")
conv.append_message(conv.roles[1], "Hi!")
conv.append_message(conv.roles[0], "How are you?")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())
print("\n")
print("aquila-chat template:")
conv = get_conv_template("aquila-chat")
conv.append_message(conv.roles[0], "Hello!")
conv.append_message(conv.roles[1], "Hi!")
conv.append_message(conv.roles[0], "How are you?")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())
print("\n")
print("aquila-v1 template:")
conv = get_conv_template("aquila-v1")
conv.append_message(conv.roles[0], "Hello!")
conv.append_message(conv.roles[1], "Hi!")
conv.append_message(conv.roles[0], "How are you?")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())
print("\n")
print("aquila-legacy template:")
conv = get_conv_template("aquila-legacy")
conv.append_message(conv.roles[0], "Hello!")
conv.append_message(conv.roles[1], "Hi!")
conv.append_message(conv.roles[0], "How are you?")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())
print("\n")
def set_random_seed(seed):
"""Set random seed for reproducability."""
if seed is not None and seed > 0:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
def covert_prompt_to_input_ids_with_history(text, history, tokenizer, max_token, convo_template="aquila-chat"):
# aquila-chat as default
conv = get_conv_template(convo_template)
conv.append_message(conv.roles[1], None)
conv.append_message(conv.roles[0], text)
example = tokenizer.encode_plus(f"{conv.get_prompt()} ", None, max_length=None)['input_ids']
if history is None or not isinstance(history, list):
history = []
while(len(history) > 0 and (len(example) < max_token)):
tmp = history.pop()
if tmp[0] == 'ASSISTANT':
conv.append_message(conv.roles[1], tmp[1])
else:
conv.append_message(conv.roles[0], tmp[1])
example = tokenizer.encode_plus(f"{conv.get_prompt()} ", None, max_length=None)['input_ids']
if len(example) >= max_token:
conv.messages.pop()
conv.messages = conv.messages[::-1]
print('model in:', conv.get_prompt())
example = tokenizer.encode_plus(f"{conv.get_prompt()} ", None, max_length=None)['input_ids']
return example
def predict(model, text, tokenizer=None,
max_gen_len=200, top_p=0.95,
seed=1234, topk=100,
temperature=0.9,
sft=True, convo_template = "",
device = "cuda",
model_name="AquilaChat2-7B",
history=None,
**kwargs):
vocab = tokenizer.get_vocab()
id2word = {v:k for k, v in vocab.items()}
template_map = {"AquilaChat2-7B": "aquila-v1",
"AquilaChat2-34B": "aquila-legacy",
"AquilaChat2-7B-16K": "aquila",
"AquilaChat2-34B-16K": "aquila"}
if not convo_template:
convo_template=template_map.get(model_name, "aquila-chat")
set_random_seed(seed)
if temperature == 0:
topk = 1
temperature = 1.0
if sft:
tokens = covert_prompt_to_input_ids_with_history(text, history=history, tokenizer=tokenizer, max_token=2048, convo_template=convo_template)
tokens = torch.tensor(tokens)[None,].to(device)
else :
tokens = tokenizer.encode_plus(text)["input_ids"]
print(tokenizer.decode(tokens))
tokens = torch.tensor(tokens)[None,].to(device)
input_length = len(tokens[0])
with torch.no_grad():
# instantiate logits processors
logits_processor = LogitsProcessorList(
[
MinLengthLogitsProcessor(1, eos_token_id=100007),
]
)
# instantiate logits processors
logits_warper = LogitsProcessorList(
[
TopPLogitsWarper(top_p),
TopKLogitsWarper(topk),
TemperatureLogitsWarper(temperature),
]
)
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=input_length + max_gen_len)])
out = model.sample(
tokens,
logits_processor=logits_processor,
logits_warper=logits_warper,
stopping_criteria=stopping_criteria,
return_dict_in_generate=True,
output_scores=True,
)
# print(out)
out_ids = out["sequences"][0][input_length:].cpu().numpy()
out_scores = out["scores"]
out_scores = torch.cat(out_scores, dim=0)
out_scores = torch.nn.functional.softmax(out_scores, dim=-1).cpu().numpy()
probs = []
for i in range(len(out_ids)):
probs.append(float(out_scores[i][out_ids[i]]))
# print(f"probs is {probs}")
convert_tokens = []
for t in out_ids:
if t == 100006:
convert_tokens.append("[CLS]")
else :
convert_tokens.append(id2word.get(t, "[unkonwn_token]"))
out_text = tokenizer.decode(out_ids.tolist())
out = out_text
if "[UNK]" in out:
special_index = out.index("[UNK]")
out = out[:special_index]
token_length = len(tokenizer.encode_plus(out)["input_ids"])
convert_tokens = convert_tokens[:token_length]
probs = probs[:token_length]
if "</s>" in out:
special_index = out.index("</s>")
out = out[: special_index]
token_length = len(tokenizer.encode_plus(out)["input_ids"])
convert_tokens = convert_tokens[:token_length]
probs = probs[:token_length]
if len(out) > 0 and out[0] == " ":
out = out[1:]
convert_tokens = convert_tokens[1:]
probs = probs[1:]
if isinstance(history, list):
# Update history
history.insert(0, ('ASSISTANT', out))
history.insert(0, ('USER', text))
return out

6
quant_config.json Normal file
View File

@@ -0,0 +1,6 @@
{
"zero_point": true,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}

6
special_tokens_map.json Normal file
View File

@@ -0,0 +1,6 @@
{
"bos_token": "[CLS]",
"eos_token": "</s>",
"unk_token": "<|endoftext|>",
"pad_token": "<|endoftext|>"
}

199863
tokenizer.json Normal file

File diff suppressed because it is too large Load Diff

10
tokenizer_config.json Normal file
View File

@@ -0,0 +1,10 @@
{
"add_prefix_space": false,
"bos_token": "[CLS]",
"clean_up_tokenization_spaces": true,
"eos_token": "</s>",
"model_max_length": 4096,
"padding_side": "right",
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<|endoftext|>"
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long