Initialize project; model provided by the ModelHub XC community
Model: ibm-granite/granite-3.1-3b-a800m-base Source: Original Platform
37
.gitattributes
vendored
Normal file
@@ -0,0 +1,37 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
model-00001-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text
model-00002-of-00002.safetensors filter=lfs diff=lfs merge=lfs -text
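The attribute rules above route matching files through Git LFS. As a rough illustration only, the sketch below checks a few file names against a subset of those patterns. It uses Python's `fnmatch`, which only approximates gitattributes matching (it does not handle directory patterns such as `saved_model/**/*`); the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# A subset of the patterns from the .gitattributes file (directory patterns omitted).
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.onnx", "*.pt", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(filename: str) -> bool:
    """Return True if the bare file name matches any of the LFS patterns.

    Approximation: fnmatch is not the gitattributes matcher, so treat the
    result as a hint rather than what Git would decide.
    """
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


for name in ["model-00001-of-00002.safetensors", "config.json", "README.md"]:
    print(f"{name}: {'LFS' if is_lfs_tracked(name) else 'regular Git object'}")
```

Under this approximation the weight shards go through LFS, while the small JSON and Markdown files are stored as regular Git objects.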
323
README.md
Normal file
@@ -0,0 +1,323 @@
---
pipeline_tag: text-generation
inference: false
license: apache-2.0
library_name: transformers
tags:
- language
- granite-3.1
---

# Granite-3.1-3B-A800M-Base
|
||||
|
||||
**Model Summary:**
|
||||
Granite-3.1-3B-A800M-Base extends the context length of Granite-3.0-3B-A800M-Base from 4K to 128K using a progressive training strategy: the supported context length is increased in increments while RoPE theta is adjusted, until the model has adapted to the target length of 128K. This long-context pre-training stage used approximately 500B tokens.
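
To give a feel for why RoPE theta matters for long contexts, here is a minimal sketch that computes the wavelength of each rotary frequency pair using the standard RoPE formula. The head size of 64 and the final theta of 10,000,000 come from this model's architecture table and `config.json`; the baseline theta of 10,000 is an assumed common default, not a value stated in this model card, and the intermediate values used during training are not shown here.

```python
import math


def rope_wavelengths(head_dim: int, theta: float) -> list[float]:
    """Wavelength, in token positions, of each rotary frequency pair."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 64            # attention head size of this model
TARGET_CONTEXT = 131072  # 128K tokens

# 10,000 is an assumed baseline for comparison; 10,000,000 is the config value.
for theta in (10_000.0, 10_000_000.0):
    longest = max(rope_wavelengths(HEAD_DIM, theta))
    print(f"theta={theta:>12,.0f}  longest wavelength ~ {longest:,.0f} positions  "
          f"exceeds 128K: {longest > TARGET_CONTEXT}")
```

With the larger theta, the slowest-rotating pairs have wavelengths well beyond 128K positions, so distant positions still map to distinct rotation angles.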
|
||||
|
||||
- **Developers:** Granite Team, IBM
|
||||
- **GitHub Repository:** [ibm-granite/granite-3.1-language-models](https://github.com/ibm-granite/granite-3.1-language-models)
|
||||
- **Website**: [Granite Docs](https://www.ibm.com/granite/docs/)
|
||||
- **Paper:** [Granite 3.1 Language Models (coming soon)](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d)
|
||||
- **Release Date**: December 18th, 2024
|
||||
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
||||
|
||||
**Supported Languages:**
|
||||
English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.
|
||||
|
||||
**Intended Use:**
|
||||
Prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question-answering, and more. All Granite Base models are able to handle these tasks as they were trained on a large amount of data from various domains. Moreover, they can serve as a baseline for creating specialized models for specific application scenarios.
|
||||
|
||||
**Generation:**
|
||||
This is a simple example of how to use the Granite-3.1-3B-A800M-Base model.
|
||||
|
||||
Install the following libraries:
|
||||
|
||||
```shell
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
```
|
||||
Then, copy the code snippet below to run the example.
|
||||
|
||||
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-3b-a800m-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="auto" lets accelerate place the weights; drop it if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()
# change input text as desired
input_text = "Where is the Thomas J. Watson Research Center located?"
# tokenize the text and move the tensors to the device that holds the model
# ("auto" is a device_map setting, not a valid torch device for .to())
input_tokens = tokenizer(input_text, return_tensors="pt").to(model.device)
# generate output tokens
output = model.generate(**input_tokens, max_length=4000)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output)
```
|
||||
**Evaluation Results:**
|
||||
<table>
|
||||
<caption><b>HuggingFace Open LLM Leaderboard V1</b></caption>
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">ARC-Challenge</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">Hellaswag</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">MMLU</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">TruthfulQA</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">Winogrande</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">GSM8K</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">Avg</th>
|
||||
</tr></thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Granite-3.1-8B-Base</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">63.99</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">83.27</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">63.45</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">51.29</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">78.92</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">60.19</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">66.85</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">Granite-3.1-2B-Base</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">53.58</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">77.67</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">52.86</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">39.02</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">72.84</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">47.99</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">57.32</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #DAE8FF; color: #2D2D2D;">Granite-3.1-3B-A800M-Base</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">50.76</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">74.45</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">48.31</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">39.91</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">69.29</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">40.56</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">53.88</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">Granite-3.1-1B-A400M-Base</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">39.42</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">66.13</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">26.53</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">37.67</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">2.03</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">18.87</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">31.78</td>
|
||||
</tr>
|
||||
</tbody></table>
|
||||
|
||||
<table>
|
||||
<caption><b>HuggingFace Open LLM Leaderboard V2</b></caption>
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align:left; background-color: #001d6c; color: white;">Models</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">IFEval</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">BBH</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">MATH Lvl 5</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">GPQA</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">MUSR</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">MMLU-Pro</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">Avg</th>
|
||||
</tr></thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Granite-3.1-8B-Base</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">42.21</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">26.02</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">9.52</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">9.51</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8.36</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">24.8</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">20.07</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">Granite-3.1-2B-Base</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">35.22</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">16.84</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">5.59</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">3.69</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">3.9</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">13.9</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">13.19</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #DAE8FF; color: #2D2D2D;">Granite-3.1-3B-A800M-Base</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">29.96</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">11.91</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">4</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">3.69</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">1.11</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">8.81</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: #2D2D2D;">9.91</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: #2D2D2D;">Granite-3.1-1B-A400M-Base</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">25.19</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">6.43</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">2.19</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">0.22</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">1.76</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">1.55</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: #2D2D2D;">6.22</td>
|
||||
</tr>
|
||||
</tbody></table>
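
As a quick sanity check on the two tables, the sketch below recomputes the `Avg` column for the Granite-3.1-3B-A800M-Base row. It assumes `Avg` is the unweighted mean of the six benchmark scores rounded to two decimals, which is consistent with this row but is not stated in the model card.

```python
# Scores for Granite-3.1-3B-A800M-Base, copied from the two tables.
v1_scores = {"ARC-Challenge": 50.76, "Hellaswag": 74.45, "MMLU": 48.31,
             "TruthfulQA": 39.91, "Winogrande": 69.29, "GSM8K": 40.56}
v2_scores = {"IFEval": 29.96, "BBH": 11.91, "MATH Lvl 5": 4.0,
             "GPQA": 3.69, "MUSR": 1.11, "MMLU-Pro": 8.81}


def mean_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


print("Leaderboard V1 average:", mean_score(v1_scores))  # table reports 53.88
print("Leaderboard V2 average:", mean_score(v2_scores))  # table reports 9.91
```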
|
||||
|
||||
**Model Architecture:**
|
||||
Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss.
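
The routing idea behind a sparse MoE layer can be sketched in a few lines. The example below is a simplified, pure-Python illustration of top-k gating for a single token with 40 experts and 8 active experts (the values listed for this model); normalizing with a softmax over the selected logits is one common choice and is an assumption here, not a confirmed description of the Granite implementation. The function name `route_token` is hypothetical.

```python
import math
import random


def route_token(router_logits: list[float], top_k: int) -> dict[int, float]:
    """Pick the top_k experts for one token and softmax-normalize their gate weights."""
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[e] for e in chosen)  # subtract the max for numerical stability
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: weight / total for e, weight in exps.items()}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(40)]  # one router logit per expert
gates = route_token(logits, top_k=8)
print("selected experts:", sorted(gates))
print("gate weights sum to:", round(sum(gates.values()), 6))
```

Only the selected experts run for this token, and their outputs would be combined using the gate weights; the remaining 32 experts are skipped, which is why the active parameter count is much smaller than the total.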
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align:left; background-color: #001d6c; color: white;">Model</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">2B Dense</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">8B Dense</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">1B MoE</th>
|
||||
<th style="text-align:center; background-color: #001d6c; color: white;">3B MoE</th>
|
||||
</tr></thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Embedding size</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">2048</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">4096</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">1024</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">1536</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Number of layers</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">40</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">40</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">24</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">32</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Attention head size</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">64</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">128</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">64</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">64</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Number of attention heads</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">32</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">32</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">16</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">24</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Number of KV heads</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">8</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">MLP hidden size</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8192</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">12800</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">512</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">512</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">MLP activation</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">SwiGLU</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">SwiGLU</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">SwiGLU</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">SwiGLU</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Number of experts</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">—</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">—</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">32</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">40</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">MoE TopK</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">—</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">—</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">8</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Initialization std</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">0.1</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">0.1</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">0.1</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">0.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Sequence length</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">128K</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">128K</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">128K</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">128K</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;">Position embedding</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">RoPE</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">RoPE</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">RoPE</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">RoPE</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;"># Parameters</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">2.5B</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8.1B</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">1.3B</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">3.3B</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;"># Active parameters</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">2.5B</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">8.1B</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">400M</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">800M</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left; background-color: #FFFFFF; color: black;"># Training tokens</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">12T</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">12T</td>
|
||||
<td style="text-align:center; background-color: #FFFFFF; color: black;">10T</td>
|
||||
<td style="text-align:center; background-color: #DAE8FF; color: black;">10T</td>
|
||||
</tr>
|
||||
</tbody></table>
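
A few derived quantities follow directly from the 3B MoE column. The sketch below checks that the attention dimensions are consistent and estimates the key/value cache size at the full 128K context; it assumes a bf16 cache (2 bytes per value), a batch size of 1, and no cache compression, none of which are stated in the table.

```python
# Values from the "3B MoE" column of the architecture table.
num_layers = 32
num_attention_heads = 24
num_kv_heads = 8
head_dim = 64
embedding_size = 1536
context_length = 128 * 1024  # 128K tokens

bytes_per_value = 2  # assumption: keys and values cached in bf16

# Heads times head size should reproduce the embedding size.
assert num_attention_heads * head_dim == embedding_size

# Grouped-query attention: several query heads share one KV head.
queries_per_kv_head = num_attention_heads // num_kv_heads

# Keys and values (factor 2) for every layer and KV head, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_bytes_full_context = kv_bytes_per_token * context_length

print("query heads per KV head:", queries_per_kv_head)
print("KV cache per token (KiB):", kv_bytes_per_token / 1024)
print("KV cache at 128K tokens (GiB):", kv_bytes_full_context / 1024**3)
```

Under these assumptions, the cache for a single full-length sequence is larger than the model weights themselves, which is worth keeping in mind when planning long-context inference.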
|
||||
|
||||
**Training Data:**
|
||||
This model is trained on a mix of open source and proprietary data following a three-stage training strategy.
|
||||
* Stage 1 data: The data for stage 1 is sourced from diverse domains, such as: web, code, academic sources, books, and math data.
|
||||
* Stage 2 data: The data for stage 2 comprises a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the model’s performance on specific tasks.
|
||||
* Stage 3 data: The data for stage 3 consists of the original stage-2 pretraining data plus additional synthetic long-context data in the form of QA/summary pairs, where each response recites the related paragraph before giving the answer.
|
||||
|
||||
A detailed attribution of datasets can be found in the [Granite 3.0 Technical Report](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/paper.pdf), [Granite 3.1 Technical Report (coming soon)](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d), and [Accompanying Author List](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/author-ack.pdf).
|
||||
|
||||
**Infrastructure:**
|
||||
We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
|
||||
|
||||
**Ethical Considerations and Limitations:**
|
||||
The use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. The Granite-3.1-3B-A800M-Base model is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying text verbatim from the training dataset due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-3.1-3B-A800M-Base model with ethical intentions and in a responsible way.
|
||||
|
||||
**Resources**
|
||||
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
|
||||
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
|
||||
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
|
||||
|
||||
<!-- ## Citation
|
||||
```
|
||||
@misc{granite-models,
|
||||
author = {author 1, author2, ...},
|
||||
title = {},
|
||||
journal = {},
|
||||
volume = {},
|
||||
year = {2024},
|
||||
url = {https://arxiv.org/abs/0000.00000},
|
||||
}
|
||||
``` -->
|
||||
35
config.json
Normal file
@@ -0,0 +1,35 @@
{
  "architectures": [
    "GraniteMoeForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.1,
  "attention_multiplier": 0.015625,
  "bos_token_id": 0,
  "embedding_multiplier": 12.0,
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "logits_scaling": 6.0,
  "max_position_embeddings": 131072,
  "model_type": "granitemoe",
  "num_attention_heads": 24,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "num_local_experts": 40,
  "output_router_logits": false,
  "pad_token_id": 0,
  "residual_multiplier": 0.22,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000000.0,
  "router_aux_loss_coef": 0.001,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.0",
  "use_cache": true,
  "vocab_size": 49152
}
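The config values above are enough for a back-of-the-envelope parameter count. The sketch below is an estimate under stated assumptions, not an official breakdown: each expert is assumed to hold three `hidden_size x intermediate_size` matrices (gate, up, and down projections of a SwiGLU MLP), the tied embedding matrix is counted once, and a single final normalization vector is assumed. The helper name `count_parameters` is hypothetical.

```python
# Subset of config.json needed for the estimate.
cfg = {
    "hidden_size": 1536, "intermediate_size": 512, "num_attention_heads": 24,
    "num_key_value_heads": 8, "num_hidden_layers": 32, "num_local_experts": 40,
    "num_experts_per_tok": 8, "vocab_size": 49152,
}


def count_parameters(cfg: dict, experts_counted: int) -> int:
    """Estimate the parameter count when `experts_counted` experts per layer are included."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim
    attention = 2 * h * h + 2 * h * kv_dim     # q/o and k/v projections, no bias
    expert = 3 * h * cfg["intermediate_size"]  # assumed: gate + up + down matrices
    router = cfg["num_local_experts"] * h      # one logit per expert
    norms = 2 * h                              # two norm weight vectors per layer
    per_layer = attention + experts_counted * expert + router + norms
    embeddings = cfg["vocab_size"] * h         # counted once: embeddings are tied
    return cfg["num_hidden_layers"] * per_layer + embeddings + h  # + assumed final norm


total = count_parameters(cfg, cfg["num_local_experts"])
active = count_parameters(cfg, cfg["num_experts_per_tok"])
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"bf16 size (bytes): {total * 2:,}")
```

Under these assumptions the total is about 3.3B parameters, and the bf16 byte count equals the `total_size` of 6597577728 recorded in `model.safetensors.index.json`. The per-token estimate is roughly 0.88B when the embedding matrix is included, in the same range as the 800M active parameters quoted in the README.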
1
configuration.json
Normal file
@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text-generation", "allow_remote": true}
7
generation_config.json
Normal file
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "pad_token_id": 0,
  "transformers_version": "4.46.0"
}
3
model-00001-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bc37c61db3b08c0ea0fe5df20cef994060ee84801c2c843d920b88d616e1f425
size 4998539488
3
model-00002-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:13ea399d25de5d60cef80a58e3c7e490f8b317542e0967735d0d425bcc588369
size 1599073464
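The two files above are Git LFS pointers rather than the weights themselves: each records a spec version, a SHA-256 object ID, and the size in bytes of the real file. The sketch below parses both pointers and compares the combined shard size with the `total_size` value of 6597577728 from `model.safetensors.index.json`. The helper `parse_lfs_pointer` is hypothetical, and attributing the small difference to per-shard safetensors headers is an assumption.

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer file into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


shard_1 = """version https://git-lfs.github.com/spec/v1
oid sha256:bc37c61db3b08c0ea0fe5df20cef994060ee84801c2c843d920b88d616e1f425
size 4998539488
"""
shard_2 = """version https://git-lfs.github.com/spec/v1
oid sha256:13ea399d25de5d60cef80a58e3c7e490f8b317542e0967735d0d425bcc588369
size 1599073464
"""

INDEX_TOTAL_SIZE = 6597577728  # "total_size" from model.safetensors.index.json

shard_bytes = sum(int(parse_lfs_pointer(p)["size"]) for p in (shard_1, shard_2))
overhead = shard_bytes - INDEX_TOTAL_SIZE  # assumed to be file headers/metadata

print(f"combined shard size: {shard_bytes:,} bytes")
print(f"tensor data (index): {INDEX_TOTAL_SIZE:,} bytes")
print(f"difference:          {overhead:,} bytes")
```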
297
model.safetensors.index.json
Normal file
@@ -0,0 +1,297 @@
|
||||
{
|
||||
"metadata": {
|
||||
"total_size": 6597577728
|
||||
},
|
||||
"weight_map": {
|
||||
"model.embed_tokens.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
|
||||
"model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
|
||||
    "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.24.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.24.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.24.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.24.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.25.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.3.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.30.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.block_sparse_moe.output_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.block_sparse_moe.router.layer.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.4.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.block_sparse_moe.input_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.block_sparse_moe.output_linear.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.block_sparse_moe.router.layer.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.norm.weight": "model-00002-of-00002.safetensors"
  }
}
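The weight map above lists keys in string order (so layer 2 follows layer 19), and layer 24 is the one layer whose tensors are divided between the two shard files. A minimal sketch of how a reader might detect such split layers follows. It works on a small hand-copied excerpt of the map rather than the real index file, and the helper names `shards_by_layer` and `split_layers` are hypothetical, not part of any library.

```python
import re
from collections import defaultdict

# Hand-copied excerpt of the "weight_map" object above (not the full index).
weight_map = {
    "model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.24.block_sparse_moe.input_linear.weight": "model-00002-of-00002.safetensors",
    "model.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.norm.weight": "model-00002-of-00002.safetensors",
}


def shards_by_layer(mapping):
    """Return {layer_index: set of shard files holding that layer's tensors}."""
    layers = defaultdict(set)
    for name, shard in mapping.items():
        match = re.match(r"model\.layers\.(\d+)\.", name)
        if match:  # non-layer tensors such as model.norm.weight are skipped
            layers[int(match.group(1))].add(shard)
    return dict(layers)


def split_layers(mapping):
    """Return, in ascending order, the layers stored in more than one shard."""
    return sorted(i for i, shards in shards_by_layer(mapping).items() if len(shards) > 1)


print(split_layers(weight_map))
```

Parsing the layer index as an integer sidesteps the string ordering of the keys, so the result is sorted numerically.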
51
special_tokens_map.json
Normal file
@@ -0,0 +1,51 @@
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>"
  ],
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
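In special_tokens_map.json above, the `bos_token`, `eos_token`, `pad_token` and `unk_token` roles all carry the same content, `<|endoftext|>`. A short sketch of how one might confirm that kind of sharing programmatically follows. The dictionary is a trimmed copy of the file (only the `content` fields are kept), and `roles_by_token` is a hypothetical helper.

```python
from collections import defaultdict

# Trimmed copy of special_tokens_map.json: only the "content" fields are kept.
special_tokens_map = {
    "bos_token": {"content": "<|endoftext|>"},
    "eos_token": {"content": "<|endoftext|>"},
    "pad_token": {"content": "<|endoftext|>"},
    "unk_token": {"content": "<|endoftext|>"},
}


def roles_by_token(token_map):
    """Group role names (bos_token, eos_token, ...) by the token string they use."""
    grouped = defaultdict(list)
    for role, spec in token_map.items():
        grouped[spec["content"]].append(role)
    return {token: sorted(roles) for token, roles in grouped.items()}


print(roles_by_token(special_tokens_map))
```

Because padding and end-of-text share one token here, code that needs to tell padding from real tokens cannot rely on the token string alone.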
98258
tokenizer.json
Normal file
File diff suppressed because it is too large
187
tokenizer_config.json
Normal file
@@ -0,0 +1,187 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<fim_prefix>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<fim_middle>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<fim_suffix>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<fim_pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<commit_before>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<commit_msg>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "17": {
      "content": "<commit_after>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "18": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|endoftext|>",
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<empty_output>",
    "<commit_before>",
    "<commit_msg>",
    "<commit_after>",
    "<reponame>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 9223372036854775807,
  "pad_token": "<|endoftext|>",
  "padding_side": "left",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
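The configuration above sets `"padding_side": "left"` and reuses `<|endoftext|>` (id 0 in `added_tokens_decoder`) as the pad token. The following is a minimal pure-Python sketch of what left padding with a shared pad/end-of-text id implies for a batch. It does not call the tokenizer library, `left_pad` is a hypothetical helper, and the non-zero token ids in the example are made up for illustration.

```python
PAD_ID = 0  # id of "<|endoftext|>" according to added_tokens_decoder above


def left_pad(batch, pad_id=PAD_ID):
    """Left-pad token-id lists to a common length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    width = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in batch]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask


# The second sequence ends with a genuine end-of-text token (id 0).
ids, mask = left_pad([[5, 6, 7], [8, 0]])
print(ids)
print(mask)
```

In the second row, the leading 0 is padding and the trailing 0 is a real end-of-text token. Only the attention mask distinguishes them, which is why comparing ids against the pad id is not a reliable way to find padding with this configuration.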