初始化项目,由ModelHub XC社区提供模型

Model: MediaTek-Research/Breeze-7B-32k-Base-v1_0
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-06-24 14:09:13 +08:00
commit 80468b210b
12 changed files with 140939 additions and 0 deletions

36
.gitattributes vendored Normal file
View File

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text

121
README.md Normal file
View File

@@ -0,0 +1,121 @@
---
pipeline_tag: text-generation
license: apache-2.0
language:
- zh
- en
---
# Model Card for MediaTek Research Breeze-7B-32k-Base-v1_0
MediaTek Research Breeze-7B (hereinafter referred to as Breeze-7B) is a language model family that builds on top of [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), specifically intended for Traditional Chinese use.
[Breeze-7B-Base](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v1_0) is the base model for the Breeze-7B series.
It is suitable for use if you have substantial fine-tuning data to tune it for your specific use case.
[Breeze-7B-Instruct](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v1_0) derives from the base model Breeze-7B-Base, making the resulting model amenable to be used as-is for commonly seen tasks.
[Breeze-7B-32k-Base](https://huggingface.co/MediaTek-Research/Breeze-7B-32k-Base-v1_0) is extended from the base model with more data, base change, and the disabling of the sliding window.
Roughly speaking, that is equivalent to 44k Traditional Chinese characters.
[Breeze-7B-32k-Instruct](https://huggingface.co/MediaTek-Research/Breeze-7B-32k-Instruct-v1_0) derives from the base model Breeze-7B-32k-Base, making the resulting model amenable to be used as-is for commonly seen tasks.
Practicality-wise:
- Breeze-7B-Base expands the original vocabulary with additional 30,000 Traditional Chinese tokens. With the expanded vocabulary, everything else being equal, Breeze-7B operates at twice the inference speed for Traditional Chinese to Mistral-7B and Llama 7B. [See [Inference Performance](#inference-performance).]
- Breeze-7B-Instruct can be used as is for common tasks such as Q&A, RAG, multi-round chat, and summarization.
- Breeze-7B-32k-Instruct can perform tasks at a document level (For Chinese, 20 ~ 40 pages).
*A project by the members (in alphabetical order): Chan-Jan Hsu 許湛然, Feng-Ting Liao 廖峰挺, Po-Chun Hsu 許博竣, Yi-Chang Chen 陳宜昌, and the supervisor Da-Shan Shiu 許大山.*
## Features
- Breeze-7B-32k-Base-v1_0
- Expanding the vocabulary dictionary size from 32k to 62k to better support Traditional Chinese
- 32k-token context length
- Breeze-7B-32k-Instruct-v1_0
- Expanding the vocabulary dictionary size from 32k to 62k to better support Traditional Chinese
- 32k-token context length
- Multi-turn dialogue (without special handling for harmfulness)
## Model Details
- Breeze-7B-32k-Base-v1_0
- Pretrained from: [Breeze-7B-Base](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v1_0)
- Model type: Causal decoder-only transformer language model
- Language: English and Traditional Chinese (zh-tw)
- Breeze-7B-32k-Instruct-v1_0
- Finetuned from: [Breeze-7B-32k-Base](https://huggingface.co/MediaTek-Research/Breeze-7B-32k-Base-v1_0)
- Model type: Causal decoder-only transformer language model
- Language: English and Traditional Chinese (zh-tw)
## Long-context Performance
#### Needle-in-a-haystack Performance
We use the passkey retrieval task to test the model's ability to attend to different various depths in a given sequence.
A key in placed within a long context distracting document for the model to retrieve.
The key position is binned into 16 bins, and there are 20 testcases for each bin.
Breeze-7B-32k-Base clears the tasks with 90+% accuracy, shown in the figure below.
![Needle-in-a-haystack Performance](https://huggingface.co/MediaTek-Research/Breeze-7B-32k-Base-v1_0/resolve/main/needle-in-a-haystack-performance.png)
#### Long-DRCD Performance
| **Model/Performance(EM)** | **DRCD** | **DRCD-16k** | **DRCD-32k** |
|---------------------------|----------|--------------|--------------|
| **Breeze-7B-32k-Instruct-v1\_0** | 76.9 | 54.82 | 44.26 |
| **Breeze-7B-32k-Base-v1\_0** | 79.73 | 69.68 | 61.55 |
| **Breeze-7B-Base-v1\_0** | 80.61 | 21.79 | 15.29 |
#### Short-Benchmark Performance
| **Model/Performance(EM)** | **TMMLU+** | **MMLU** | **TABLE** | **MT-Bench-tw** | **MT-Bench** |
|---------------------------|----------|--------------|--------------|-----|-----|
| **Breeze-7B-32k-Instruct-v1\_0** | 41.37 | 61.34 | 34 | 5.8 | 7.4 |
| **Breeze-7B-Instruct-v1\_0** | 42.67 | 62.73 | 39.58 | 6.0 | 7.4 |
## Use in Transformers
First, install direct dependencies:
```
pip install transformers torch accelerate
```
<p style="color:red;">Flash-attention2 is strongly recommended for long context scenarios.</p>
```bash
pip install packaging ninja
pip install flash-attn
```
Then load the model in transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"MediaTek-Research/Breeze-7B-32k-Base-v1_0",
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2" # optional but highly recommended
)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-32k-Base-v1_0")
tokenizer.tokenize("你好,我可以幫助您解決各種問題、提供資訊和協助您完成許多不同的任務。例如:回答技術問題、提供建議、翻譯文字、尋找資料或協助您安排行程等。請告訴我如何能幫助您。")
# Tokenized results
# ['▁', '你好', '', '我', '可以', '幫助', '您', '解決', '各種', '問題', '、', '提供', '資訊', '和', '協助', '您', '完成', '許多', '不同', '的', '任務', '。', '例如', '', '回答', '技術', '問題', '、', '提供', '建議', '、', '翻譯', '文字', '、', '尋找', '資料', '或', '協助', '您', '安排', '行程', '等', '。', '請', '告訴', '我', '如何', '能', '幫助', '您', '。']
```
## Citation
```
@article{MediaTek-Research2024breeze7b,
title={Breeze-7B Technical Report},
author={Chan-Jan Hsu and Chang-Le Liu and Feng-Ting Liao and Po-Chun Hsu and Yi-Chang Chen and Da-Shan Shiu},
year={2024},
eprint={2403.02712},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

4
added_tokens.json Normal file
View File

@@ -0,0 +1,4 @@
{
"<EOD>": 61873,
"<PAD>": 61874
}

26
config.json Normal file
View File

@@ -0,0 +1,26 @@
{
"architectures": [
"MistralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_theta": 1000000.0,
"sliding_window": 65536,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.37.2",
"use_cache": true,
"vocab_size": 62464
}

1
configuration.json Normal file
View File

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text-generation", "allow_remote": true}

6
generation_config.json Normal file
View File

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"transformers_version": "4.37.2"
}

3
model.safetensors Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7664752c98a6b0fa96298e32b638b3d2232fbb5c5311a592631098082f62a4b2
size 14982620440

Binary file not shown.

After

Width:  |  Height:  |  Size: 57 KiB

24
special_tokens_map.json Normal file
View File

@@ -0,0 +1,24 @@
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": "</s>",
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

140658
tokenizer.json Normal file

File diff suppressed because it is too large Load Diff

3
tokenizer.model Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9298e56c094f0d30431b0e52ad53287f0cadc99ac8ca17cc2144b0eb4753f130
size 911034

57
tokenizer_config.json Normal file
View File

@@ -0,0 +1,57 @@
{
"add_bos_token": true,
"add_eos_token": false,
"added_tokens_decoder": {
"0": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"61873": {
"content": "<EOD>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"61874": {
"content": "<PAD>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<s>",
"clean_up_tokenization_spaces": false,
"eos_token": "</s>",
"legacy": true,
"model_max_length": 1000000000000000019884624838656,
"pad_token": "</s>",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
}