Initialize project; model provided by the ModelHub XC community

Model: uzlm/alloma-1B-Instruct
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-04-10 15:15:03 +08:00
commit c31dd5e3a1
9 changed files with 2481 additions and 0 deletions

.gitattributes vendored Normal file

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md Normal file

@@ -0,0 +1,184 @@
---
license: llama3.2
language:
- uz
- en
base_model: meta-llama/Llama-3.2-1B-Instruct
library_name: transformers
tags:
- llama
- uzbek
- uzbekllm
- uzbeknlp
- text-generation
- translation
- summarization
- question-answering
- tokenizer
datasets:
- HuggingFaceFW/fineweb-2
- tahrirchi/uz-crawl
- yakhyo/uz-wiki
- wikipedia
- tatsu-lab/alpaca
- behbudiy/alpaca-cleaned-uz
- UAzimov/uzbek-instruct-llm
- behbudiy/translation-instruction
metrics:
- bleu
- comet
- accuracy
pipeline_tag: text-generation
---
### Model Description
This is the 1B parameter version of our Uzbek-optimized Llama series. Also, check out our other models:
* **[alloma-3B-Instruct](https://huggingface.co/beruniy/Llama-3.2-3B-Instruct-Uz)**
* **[alloma-8B-Instruct](https://huggingface.co/beruniy/Llama-3.1-8B-Instruct-Uz)**
---
Our **alloma-1B-Instruct** model has been continually pretrained with a context length of 2048 tokens on 2.4B tokens (75% English, 25% Uzbek), then SFT fine-tuned. Our customized tokenizer averages 1.7 tokens per Uzbek word vs. ~3.5 in the original Llama models, meaning roughly 2x faster inference and a longer effective context length on Uzbek text. You'll be able to run this model on just 2 GB of VRAM (with quantization), making it a good fit for small GPUs, edge devices, or even mobile scenarios.
## Methodology: Efficient Vocabulary Adaptation for Uzbek
The primary motivation for our technical approach is to create a model with a more efficient tokenizer for the Uzbek language. This ensures both faster inference speeds and a longer effective context length when processing Uzbek text, as fewer tokens are needed to represent the same amount of information.
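To make the fertility claim concrete, here is a minimal sketch; the numbers are illustrative, taken from the figures above, not a measurement:

```python
# Tokenizer "fertility" = tokens emitted per word; lower is better.
# Illustrative counts from this model card: ~3.5 tokens/word for the
# original Llama tokenizer on Uzbek vs. ~1.7 for the customized one.
def fertility(num_tokens: int, num_words: int) -> float:
    return num_tokens / num_words

base = fertility(35, 10)   # original Llama tokenizer on a 10-word Uzbek sentence
ours = fertility(17, 10)   # customized Uzbek tokenizer on the same sentence
print(f"~{base / ours:.1f}x fewer tokens per Uzbek word")
```

Since autoregressive decoding cost scales roughly linearly with token count, halving the token count is where the "2x faster inference" figure comes from.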
To avoid the prohibitive cost of training from scratch, we adapted the powerful meta-llama/Llama-3.2 base model using an in-place vocabulary replacement strategy. We identified less relevant non-ASCII tokens in the original vocabulary and replaced them with our custom Uzbek tokens. This was performed without altering the model's architecture or total vocabulary size, carefully merging the new Uzbek BPE rules while preserving the original English ones.
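A toy sketch of the in-place replacement idea follows; the token strings and IDs are hypothetical, and the real procedure also merges BPE rules, which is omitted here:

```python
def replace_vocab_entries(vocab: dict, replaceable: list, new_tokens: list) -> dict:
    """Give each new Uzbek token the ID of a less relevant token it replaces,
    leaving the total vocabulary size unchanged."""
    assert len(new_tokens) <= len(replaceable)
    for old, new in zip(replaceable, new_tokens):
        vocab[new] = vocab.pop(old)  # new token inherits the old token's ID
    return vocab

# Hypothetical entries: English tokens keep their IDs; non-ASCII tokens
# deemed less relevant free up slots for Uzbek tokens.
vocab = {"the": 0, "салют": 1, "квартира": 2}
vocab = replace_vocab_entries(vocab, ["салют", "квартира"], ["salom", "o'qituvchi"])
print(vocab)
```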
To give the new tokens a meaningful starting point for training, we initialized their embeddings using subtoken averaging. Each new Uzbek token was broken down by the original tokenizer, and its new embedding was created by averaging the embeddings of its subtokens. This method allowed for highly efficient continual pretraining on our bilingual dataset, resulting in a model fully optimized for Uzbek.
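The initialization step can be sketched as follows, with toy pure-Python embeddings; the real code would operate on the model's embedding matrix:

```python
def init_by_subtoken_average(embeddings, new_id, subtoken_ids):
    """Initialize a repurposed token's embedding as the mean of the embeddings
    of the subtokens the *original* tokenizer split the new token into."""
    dim = len(embeddings[0])
    embeddings[new_id] = [
        sum(embeddings[i][d] for i in subtoken_ids) / len(subtoken_ids)
        for d in range(dim)
    ]
    return embeddings

# Toy 2-dimensional embedding table; slot 3 is being repurposed for a new
# Uzbek token that the original tokenizer split into subtokens 0 and 2.
emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
init_by_subtoken_average(emb, new_id=3, subtoken_ids=[0, 2])
print(emb[3])  # [1.5, 1.0]
```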
---
### Benchmarks 1B, 3B
| Model | BLEU Uz→En (Zero_shot) | BLEU En→Uz (Zero_shot) | COMET Uz→En | COMET En→Uz | Uzbek Sentiment Analysis | Uzbek News Classification | MMLU-uz (Zero_shot) | MMLU (English) (Zero_shot) |
| --------------------------------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| **[Llama-3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)** | 3.62 | 0.44 | 56.72 | 35.52 | 54.77 | 42.16 | 24.37 | 38.15 |
| **[alloma-1B-Instruct](https://huggingface.co/beruniy/Llama-3.2-1B-Instruct-uz)** | 16.64 | 10.20 | 81.42 | 82.73 | 63.49 | 10.75 | 26.27 | 26.29 |
| **[Llama-3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)** | 11.91 | 2.54 | 71.96 | 55.62 | 56.01 | 70.60 | 31.88 | 52.04 |
| **[alloma-3B-Instruct](https://huggingface.co/beruniy/Llama-3.2-3B-Instruct-Uz)** | 25.19 | 14.66 | 85.08 | 86.82 | 81.64 | 41.56 | 39.30 | 45.91 |
### Benchmarks 8B
| Model | BLEU Uz→En (Zero_shot) | BLEU En→Uz (Zero_shot) | COMET Uz→En | COMET En→Uz | Uzbek Sentiment Analysis | Uzbek News Classification | MMLU-uz (Zero_shot) | MMLU (English) (Zero_shot) |
| --------------------------------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| **[Llama-3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)** | 24.23 | 8.28 | 83.12 | 82.22 | 69.77 | 73.63 | 40.51 | 60.59 |
| **[Behbudiy Mistral 7B Uz](https://huggingface.co/behbudiy/Mistral-7B-Instruct-Uz)** | 28.09 | 15.96 | 86.26 | 88.42 | 83.41 | 55.51 | 36.56 | 47.09 |
| **[Behbudiy Llama 8B Uz](https://huggingface.co/behbudiy/Llama-3.1-8B-Instruct-Uz)** | 27.08 | 13.29 | 84.76 | 85.62 | 81.66 | 68.22 | 41.28 | 59.18 |
| **[alloma-8B-Instruct](https://huggingface.co/beruniy/Llama-3.1-8B-Instruct-Uz)** | 31.16 | 15.58 | 87.24 | 87.64 | 82.66 | 65.65 | 41.89 | 53.35 |
<!-- | **[Behbudiy Nemo 12B Uz](https://huggingface.co/behbudiy/Mistral-Nemo-Instruct-Uz)** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -->
The results show that our Uzbek-optimized models consistently outperform their base counterparts on the translation benchmarks (BLEU and COMET, measured on the FLORES+ Uz-En / En-Uz evaluation sets) and on sentiment analysis in Uzbek. However, on the MMLU benchmark, which measures general language understanding across multiple tasks in English, and on the news classification task, our Uzbek-optimized models showed a slight decline caused by catastrophic forgetting of the original English instruction-following ability. (The official Llama models' MMLU scores may differ from ours due to our evaluation method; refer to the evaluation details below.)
We're eager to see how these models will contribute to the Uzbek open-source ecosystem and be used by our Uzbek 🇺🇿 community. 🚀
## How to use
The uzlm/alloma-1B-Instruct model can be used with `transformers` as shown below. We recommend preprocessing Uzbek input by replacing apostrophe-like characters (') with the sequence `APST` to take advantage of our model's lower tokenizer fertility.
### Use with transformers
```python
import re, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import langid

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DTYPE = torch.bfloat16
MODEL_ID = "uzlm/alloma-1B-Instruct"
# Apostrophe-like characters to replace with APST before tokenization.
PATTERN = r"[’‘‚‛ʻʼʽʾʿˈˊˋˌˍ']"

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
tok.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
    device_map="auto",
)

EOT = "<|eot_id|>"
SYSTEM = (
    f"{tok.bos_token}<|start_header_id|>system<|end_header_id|>\n"
    "You are a helpful assistant<|eot_id|>"
)

def prompt(user: str) -> str:
    return (
        SYSTEM +
        "<|start_header_id|>user<|end_header_id|>\n" +
        f"{user}{EOT}" +
        "<|start_header_id|>assistant<|end_header_id|>"
    )

def generate(user: str, max_new: int = 256) -> str:
    lang, confidence = langid.classify(user)
    # Apply the APST substitution only to non-English input.
    clean_text = re.sub(PATTERN, "APST", user) if lang != "en" else user
    enc = tok(prompt(clean_text), return_tensors="pt").to(DEVICE)
    out = model.generate(
        **enc,
        max_new_tokens=max_new,
        bos_token_id=tok.bos_token_id,
        eos_token_id=tok.convert_tokens_to_ids(EOT),
        pad_token_id=tok.pad_token_id,
        do_sample=False,
    )
    txt = tok.decode(out[0], skip_special_tokens=False)
    txt = txt.split("<|start_header_id|>assistant<|end_header_id|>", 1)[1]
    # Strip the end-of-turn marker and undo the APST substitution.
    return txt.split(EOT, 1)[0].replace("APST", "'").strip()

print(generate("Menga Alisher Navoiy haqida aytib ber."))
```
## Information on Evaluation Method
To evaluate the translation task, we used the FLORES+ Uz-En / En-Uz datasets.
We used the following prompt to do zero-shot Uz-En evaluation both for the base model and Uzbek-optimized model (for En-Uz eval, we changed the positions of the words "English" and "Uzbek").
```python
prompt = (
    f"Input: {clean_text} \n\nYour task is to accurately translate the given Uzbek text into English.\n"
    "Output only the English translation, without any additional comments.\n"
    "\nPlease translate the following Uzbek text into English."
)
```
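Since the En→Uz variant only swaps the language names, both directions can come from a single helper; this parametrized builder is our own illustration, not the original evaluation code:

```python
def translation_prompt(clean_text: str, src: str = "Uzbek", tgt: str = "English") -> str:
    # Same template as above, with the language names parametrized.
    return (
        f"Input: {clean_text} \n\n"
        f"Your task is to accurately translate the given {src} text into {tgt}.\n"
        f"Output only the {tgt} translation, without any additional comments.\n"
        f"\nPlease translate the following {src} text into {tgt}."
    )

uz_en = translation_prompt("Salom dunyo")
en_uz = translation_prompt("Hello world", src="English", tgt="Uzbek")
```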
To assess the model's ability in Uzbek sentiment analysis, we used the **risqaliyevds/uzbek-sentiment-analysis** dataset (see also the **behbudiy/uzbek-sentiment-analysis** dataset).
We used the following prompt for the evaluation:
```python
prompt = f'''Input: {clean_text} \n\nGiven the following text, determine the sentiment as either 'Positive' or 'Negative'. Respond with only the word 'Positive' or 'Negative' without any additional text or explanation.
'''
```
For Uzbek news classification, we used the **risqaliyevds/uzbek-zero-shot-classification** dataset and asked the model to predict the category of the news using the following prompt:
```python
prompt = f'''Input: {clean_text}\n\nClassify the given news article in Uzbek.
0 - Siyosat - If the text is about politics.
1 - Iqtisodiyot - If the text is about the economy.
2 - Texnologiya - If the text is about technology.
3 - Sport - If the text is about sports.
4 - Madaniyat - If the text is about culture.
5 - Salomatlik - If the text is about health.
6 - Oila va Jamiyat - If the text is about family and society.
7 - TaAPSTlim - If the text is about education.
8 - Ekologiya - If the text is about ecology.
9 - Xorijiy Yangiliklar - If the text is about foreign news.
Print only one digit ID of the corresponding class.
'''
```
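Scoring this task only needs the one-digit class ID the model prints; a hypothetical post-processing helper might look like this:

```python
import re

def predict_category(model_output: str):
    """Return the first digit 0-9 found in the model's output, or None if absent."""
    m = re.search(r"[0-9]", model_output)
    return int(m.group()) if m else None

print(predict_category("3 - Sport"))  # 3
print(predict_category("no digit"))   # None
```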
On MMLU, we performed zero-shot evaluation using the following **template** and extracted the first token generated by the model to measure accuracy:
```python
template = "Given the above question and choices, choose the single best answer (A, B, C, or D). Respond with only one letter."
```
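First-token scoring can be sketched as follows; this is our own hypothetical helper, not the exact evaluation code, and it matches the first A-D letter in each output against the gold choice:

```python
def mmlu_accuracy(generations, answers):
    """Score a prediction correct when the first A-D letter in the model's
    output matches the gold answer letter."""
    correct = 0
    for gen, gold in zip(generations, answers):
        letters = [c for c in gen.upper() if c in "ABCD"]
        if letters and letters[0] == gold:
            correct += 1
    return correct / len(answers)

print(mmlu_accuracy(["A", "b)", "D."], ["A", "B", "C"]))  # 2 of 3 correct
```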
## Acknowledgements
This project was developed by the teams at **[Examy.me](https://examy.me/)** and **[Teamwork.uz](https://teamwork.uz/)**. Their collaboration and resources were essential to the creation and success of the `alloma` model series.
## More
For more details and examples, refer to the base model below:
https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

chat_template.jinja Normal file

@@ -0,0 +1,93 @@
{{- bos_token }}
{%- if custom_tools is defined %}
{%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
{%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
{%- if strftime_now is defined %}
{%- set date_string = strftime_now("%d %b %Y") %}
{%- else %}
{%- set date_string = "26 Jul 2024" %}
{%- endif %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = none %}
{%- endif %}
{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
{%- set system_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{%- set system_message = "" %}
{%- endif %}
{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
{{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- "Today Date: " + date_string + "\n\n" }}
{%- if tools is not none and not tools_in_user_message %}
{{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{%- endif %}
{{- system_message }}
{{- "<|eot_id|>" }}
{#- Custom tools are passed in a user message with some extra guidance #}
{%- if tools_in_user_message and not tools is none %}
{#- Extract the first user message so we can plug it in here #}
{%- if messages | length != 0 %}
{%- set first_user_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
{%- endif %}
{{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
{{- "Given the following functions, please respond with a JSON for a function call " }}
{{- "with its proper arguments that best answers the given prompt.\n\n" }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{{- first_user_message + "<|eot_id|>"}}
{%- endif %}
{%- for message in messages %}
{%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
{%- elif 'tool_calls' in message %}
{%- if not message.tool_calls|length == 1 %}
{{- raise_exception("This model only supports single tool-calls at once!") }}
{%- endif %}
{%- set tool_call = message.tool_calls[0].function %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
{{- '{"name": "' + tool_call.name + '", ' }}
{{- '"parameters": ' }}
{{- tool_call.arguments | tojson }}
{{- "}" }}
{{- "<|eot_id|>" }}
{%- elif message.role == "tool" or message.role == "ipython" %}
{{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
{%- if message.content is mapping or message.content is iterable %}
{{- message.content | tojson }}
{%- else %}
{{- message.content }}
{%- endif %}
{{- "<|eot_id|>" }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}

config.json Normal file

@@ -0,0 +1,36 @@
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128009,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 16,
"num_key_value_heads": 8,
"pad_token_id": 128256,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 32.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.1",
"use_cache": true,
"vocab_size": 128257
}

generation_config.json Normal file

@@ -0,0 +1,12 @@
{
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": [
128001,
128008,
128009
],
"temperature": 0.6,
"top_p": 0.9,
"transformers_version": "4.52.1"
}

model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:870aa69fb38c031705d45f5e4e4a9140f1769ac605255dd83d3cb7ad74ddedec
size 2471649704

special_tokens_map.json Normal file

@@ -0,0 +1,39 @@
{
"additional_special_tokens": [
{
"content": "<|start_header_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<|end_header_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
],
"bos_token": {
"content": "<|begin_of_text|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|eot_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<pad>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6dda8a876d08a5ae617ea46f34dfca2e9891ab05e4e20c22fa5e7720db451093
size 16239946

tokenizer_config.json Normal file (2075 lines)

File diff suppressed because it is too large