Initialize project; model provided by the ModelHub XC community

Model: eryk-mazus/polka-1.1b
Source: Original Platform
ModelHub XC
2026-06-21 09:58:26 +08:00
commit 64222493cb
9 changed files with 123603 additions and 0 deletions

35
.gitattributes vendored Normal file

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

119
README.md Normal file

@@ -0,0 +1,119 @@
---
license: apache-2.0
base_model: eryk-mazus/tinyllama-with-custom-tokenizer
datasets:
- allenai/MADLAD-400
- eryk-mazus/polka-pretrain-en-pl-v1
language:
- pl
- en
pipeline_tag: text-generation
widget:
- text: "Wiedźmin 3 to fabularna gra akcji wyprodukowana"
  output:
    text: " przez studio CD Projekt RED. Akcja rozgrywa się w świecie fantasy, a jej bohaterem jest Geralt z Rivii,"
- text: "Gdy już będziecie w Warszawie, miejscem, które koniecznie musicie odwiedzić jest"
  output:
    text: " Muzeum Powstania Warszawskiego. To jedyne tego typu muzeum w Europie"
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/61bf0e11c88f3fd22f654059/EMSrPEzAFkjY9nvbaJoC3.png)
# polka-1.1b
`polka-1.1b` takes the [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) model and enhances it by continuing pretraining on an additional **5.7 billion Polish tokens**, primarily sourced from the [MADLAD-400](https://arxiv.org/abs/2309.04662) dataset. The tokens were sampled in a 10:1 ratio between Polish and English shards using [DSIR](https://github.com/p-lambda/dsir). Furthermore, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.

The training took 680 GPU hours on a single 8 x RTX 4090 machine with DeepSpeed ZeRO-2.

Context size: 2,048 tokens.
## Notes
This base model was initially developed as the foundation for instruction tuning, which resulted in [polka-1.1b-chat](https://huggingface.co/eryk-mazus/polka-1.1b-chat). Nonetheless, I'm sharing it with the community because I see potential value in its combination of relatively good performance and an efficient bilingual tokenizer.
The model is capable of producing coherent Polish text, but due to its size, it is likely to suffer from hallucination.
## Evaluation
The evaluation was performed by [OPI-PG](https://huggingface.co/OPI-PG), the authors of the Qra models.
### PolEval-2018
<table>
<thead>
<tr><th>Model</th><th>Perplexity</th></tr>
</thead>
<tr><td colspan="2"><strong>English models</strong></td></tr>
<tr><td>meta-llama/Llama-2-7b-hf</td><td>24.3</td></tr>
<tr><td>meta-llama/Llama-2-13b-hf</td><td>21.4</td></tr>
<tr><td>mistralai/Mistral-7B-v0.1</td><td>21.4</td></tr>
<tr><td>TinyLlama/TinyLlama-1.1B</td><td>40.4</td></tr>
<tr><td colspan="2"><strong>Polish models</strong></td></tr>
<tr><td>sdadas/polish-gpt2-small</td><td>134.4</td></tr>
<tr><td>sdadas/polish-gpt2-medium</td><td>100.8</td></tr>
<tr><td>sdadas/polish-gpt2-large</td><td>93.2</td></tr>
<tr><td>sdadas/polish-gpt2-xl</td><td>94.1</td></tr>
<tr><td>Azurro/APT3-275M-Base</td><td>129.8</td></tr>
<tr><td>Azurro/APT3-500M-Base</td><td>153.1</td></tr>
<tr><td>Azurro/APT3-1B-Base</td><td>106.8</td></tr>
<tr><td><b>eryk-mazus/polka-1.1b</b></td><td><b>18.1</b></td></tr>
<tr><td>szymonrucinski/Curie-7B-v1</td><td>13.5</td></tr>
<tr><td>OPI-PG/Qra-1b</td><td>14.7</td></tr>
</table>
### Long documents (2024)
Modern LLMs support contexts of thousands of tokens, and their practical applications usually involve processing long documents. Evaluating perplexity on a sentence-based dataset such as PolEval-2018 may therefore not be meaningful. In addition, the PolEval corpus has been publicly available on the internet for the past few years, which raises the possibility that some models' training sets have been contaminated by this data. For this reason, we have prepared a new collection of long documents published exclusively in 2024, which allows a more reliable test of model perplexity on new knowledge that was not available at training time. The corpus consists of 5,000 documents ranging from several hundred to about 20,000 tokens. Half of the set consists of press texts from Polish news portals from February 2024; the other half are scientific articles published since January 2024. Most of the documents exceed the context size of the evaluated models. To calculate perplexity for these documents, we divided them into chunks of size equal to the model's context length with a stride of 512 tokens, following [this example](https://huggingface.co/docs/transformers/en/perplexity).
<table>
<thead>
<tr><th>Model</th><th>Context</th><th>Perplexity</th></tr>
</thead>
<tr><td colspan="3"><strong>English models</strong></td></tr>
<tr><td>meta-llama/Llama-2-7b-hf</td><td>4096</td><td>5.9</td></tr>
<tr><td>meta-llama/Llama-2-13b-hf</td><td>4096</td><td>5.3</td></tr>
<tr><td>mistralai/Mistral-7B-v0.1</td><td>4096</td><td>4.9</td></tr>
<tr><td>TinyLlama/TinyLlama-1.1B</td><td>2048</td><td>9.6</td></tr>
<tr><td colspan="3"><strong>Polish models</strong></td></tr>
<tr><td>sdadas/polish-gpt2-small</td><td>2048</td><td>27.3</td></tr>
<tr><td>sdadas/polish-gpt2-medium</td><td>2048</td><td>20.3</td></tr>
<tr><td>sdadas/polish-gpt2-large</td><td>1536</td><td>18.0</td></tr>
<tr><td>sdadas/polish-gpt2-xl</td><td>1536</td><td>16.6</td></tr>
<tr><td>Azurro/APT3-275M-Base</td><td>2048</td><td>77.0</td></tr>
<tr><td>Azurro/APT3-500M-Base</td><td>2048</td><td>50.5</td></tr>
<tr><td>Azurro/APT3-1B-Base</td><td>2048</td><td>19.1</td></tr>
<tr><td><b>eryk-mazus/polka-1.1b</b></td><td><b>2048</b></td><td><b>6.9</b></td></tr>
<tr><td>szymonrucinski/Curie-7B-v1</td><td>4096</td><td>4.8</td></tr>
<tr><td>OPI-PG/Qra-1b</td><td>4096</td><td>6.1</td></tr>
</table>
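
The chunking scheme described for the long-document evaluation can be illustrated with a minimal sketch. It assumes the windowing logic of the linked Transformers perplexity guide: windows of `max_length` tokens advance by `stride`, and only the tokens not covered by the previous window count toward the loss. The helper names `sliding_windows` and `perplexity` are hypothetical, and this is not the evaluation code used by OPI-PG. In a real run the per-token negative log-likelihoods would come from the model; here a uniform toy distribution stands in for them.

```python
import math
from typing import List, Sequence, Tuple


def sliding_windows(n_tokens: int, max_length: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (begin, end, n_scored) for each evaluation window.

    Each window covers tokens [begin, end); only the last n_scored tokens
    are new, the rest act as context already scored by earlier windows.
    """
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical 5,000-token document, 2,048-token context, stride of 512.
windows = sliding_windows(n_tokens=5000, max_length=2048, stride=512)
for begin, end, n_scored in windows:
    print(f"window [{begin:4d}, {end:4d}) scores {n_scored} new tokens")

# Toy stand-in for model output: a uniform guess over 100 symbols.
toy_nlls = [math.log(100.0)] * 5000
print(round(perplexity(toy_nlls), 6))
```

Every token is scored exactly once, and after the first window each scored token sees at least `max_length - stride` tokens of context, except where the document end truncates the last window.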
## Sample code
```python
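import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "eryk-mazus/polka-1.1b"

# Left padding keeps each prompt adjacent to its generated continuation.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# 8-bit loading requires the bitsandbytes package and a CUDA GPU.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = """Przykładowe zapytanie do modelu"""  # "Example query to the model"
# Move the inputs to whichever device the model was placed on.
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        # penalty_alpha applies to contrastive search, which needs do_sample=False;
        # with do_sample=True this call performs top-k sampling.
        penalty_alpha=0.6,
        top_k=5,
    )

output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)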
```

28
config.json Normal file

@@ -0,0 +1,28 @@
{
  "_name_or_path": "eryk-mazus/tinyllama-with-custom-tokenizer",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.36.2",
  "use_cache": false,
  "vocab_size": 43904
}
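
As a sanity check on the values in `config.json`, the parameter count can be derived from the configuration alone. The sketch below assumes the standard Llama layer layout (bias-free q/k/v/o projections, a gated MLP with three weight matrices, two RMSNorm weight vectors per layer, and one final norm), untied input and output embeddings as indicated by `tie_word_embeddings: false`, and 2 bytes per parameter for `bfloat16`. It is an estimate under those assumptions, not a reading of the checkpoint itself.

```python
# Values copied from config.json.
hidden_size = 2048
intermediate_size = 5632
num_hidden_layers = 22
num_attention_heads = 32
num_key_value_heads = 4
vocab_size = 43904

# Grouped-query attention: K and V use fewer heads than Q.
head_dim = hidden_size // num_attention_heads   # size of one attention head
kv_dim = num_key_value_heads * head_dim         # width of the K and V projections

attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim  # q, o + k, v
mlp = 3 * hidden_size * intermediate_size       # gate, up and down projections
norms = 2 * hidden_size                         # two RMSNorm weights per layer
per_layer = attention + mlp + norms

embeddings = 2 * vocab_size * hidden_size       # input embedding + untied LM head
total_params = num_hidden_layers * per_layer + embeddings + hidden_size  # + final norm
bf16_bytes = 2 * total_params

print(f"head_dim={head_dim}, kv_dim={kv_dim}")
print(f"parameters: {total_params:,}")
print(f"bfloat16 size: {bf16_bytes:,} bytes")
```

Under these assumptions the count comes to 1,148,807,168 parameters, or 2,297,614,336 bytes in bfloat16. That is about 23 KB less than the 2,297,637,448 bytes recorded in the `model.safetensors` pointer in this commit, a gap that would be consistent with the file's metadata header.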

6
generation_config.json Normal file

@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "transformers_version": "4.36.2"
}

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:28de3b055185991453e08268918a5df89e1e1b5edf394cc51c76c4ebd4f73d47
size 2297637448

3
pytorch_model.bin Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aa585b9221266e354bf136da9aab42bc6d22e1ba7fa6f295e910e8a48adf5cb3
size 2297641916
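
The three-line entries for `model.safetensors` and `pytorch_model.bin` are Git LFS pointer files rather than the weights themselves: each line holds a key, a space, and a value. A minimal sketch of reading one follows; `parse_lfs_pointer` is a hypothetical helper written for illustration and is not part of any Git LFS tooling.

```python
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer contents copied from the pytorch_model.bin entry in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:aa585b9221266e354bf136da9aab42bc6d22e1ba7fa6f295e910e8a48adf5cb3
size 2297641916
"""

info = parse_lfs_pointer(pointer)
algorithm, _, digest = info["oid"].partition(":")
size_gib = int(info["size"]) / 2**30

print(f"hash algorithm: {algorithm}")
print(f"digest prefix:  {digest[:12]}")
print(f"object size:    {size_gib:.2f} GiB")
```

The `size` field gives the byte length of the real object, so the download size of a checkpoint can be read from the pointer without fetching the file.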

24
special_tokens_map.json Normal file

@@ -0,0 +1,24 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "</s>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

123344
tokenizer.json Normal file

File diff suppressed because it is too large

41
tokenizer_config.json Normal file

@@ -0,0 +1,41 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": false,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "</s>",
  "padding_side": "right",
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}