qwen3-4b-uzbek-v2/README.md
ModelHub XC d02bc0847b Initialize project; model provided by the ModelHub XC community
Model: inspirebek/qwen3-4b-uzbek-v2
Source: Original Platform
2026-05-05 14:31:43 +08:00

---
language:
- uz
- en
license: cc-by-nc-4.0
datasets:
- yakhyo/uz-wiki
- tahrirchi/uz-books-v2
- tahrirchi/uz-crawl
- saillab/alpaca_uzbek_taco
- behbudiy/alpaca-cleaned-uz
- UAzimov/uzbek-instruct-llm
- CohereLabs/aya_collection_language_split
- med-alex/qa_mt_ru_to_uzn
- med-alex/qa_mt_tr_to_uzn
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-4B
tags:
- uzbek
- qwen3
- lora
- merged
- sft
- continued-pretraining
---
# qwen3-4b-uzbek-v2
merged bf16 uzbek fine-tune of `Qwen/Qwen3-4B`. standalone — loadable without peft.
## usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inspirebek/qwen3-4b-uzbek-v2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # use torch_dtype= on older transformers releases
    device_map="auto",
)

messages = [{"role": "user", "content": "O'zbekiston poytaxti qayer?"}]
# return_dict=True yields input_ids plus attention_mask for generate()
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```
## training
a two-stage lora fine-tune of `Qwen/Qwen3-4B`. the hard parts weren't the training loop — they were figuring out why v1 failed, then engineering around a 16-hour compute timeout that couldn't fit the full run.
### the v1 lesson: why plain lora collapsed to random
v1 scored **26.92%** on mmlu-uz, statistically indistinguishable from the 25% random baseline. the adapter was training and the loss was descending, but the model learned nothing useful in uzbek. root cause: lora only touched the attention and mlp projections. for a base model with english-dominant pretraining, learning a new language requires **re-mapping the vocabulary itself**, which lives in `embed_tokens` (input embeddings) and `lm_head` (output projection). freezing both means the model can't reshape its token distribution over uzbek morphology; it can only nudge how it attends within the english-shaped geometry it already has.

v2 adds `embed_tokens` and `lm_head` to `target_modules`. this expands the adapter to ~2 gb (vs ~200 mb in v1) but finally lets the model learn uzbek. result: **mmlu-uz rose to 40.50%**, +13.58 pp over v1 and +15.50 pp over the 25% random baseline.
### recipe
**lora configuration** (`unsloth` + `peft`):
- `r=64`, `alpha=128` — alpha = 2·r (more aggressive updates than the conservative alpha=r default)
- `use_rslora=True`: the adapter scaling factor becomes `alpha/sqrt(r)` instead of `alpha/r` (16 vs 2 at r=64, alpha=128); without this the per-parameter update magnitude at r=64 is too small and training stagnates
- target modules: `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head`
- `use_gradient_checkpointing="unsloth"` to fit the expanded adapter on a single l4
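
to make the `use_rslora` point concrete, here is a small sketch of the two scaling rules applied to the values above. it only does the arithmetic; the function name `lora_scale` is made up for illustration and is not part of `peft` or `unsloth`.

```python
import math


def lora_scale(alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update B @ A (hypothetical helper)."""
    # standard lora divides alpha by r; rank-stabilized lora divides by sqrt(r)
    return alpha / math.sqrt(r) if use_rslora else alpha / r


standard = lora_scale(alpha=128, r=64, use_rslora=False)
rs = lora_scale(alpha=128, r=64, use_rslora=True)
print(standard, rs)  # 2.0 16.0
```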

**dual learning rate** (via `UnslothTrainer`):
- base lr for projection layers, `embedding_learning_rate` at 1/10th for `embed_tokens` + `lm_head`
- embeddings are the highest-leverage layer — a full-lr update can catastrophically drift the model's token geometry in the first few hundred steps
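
the idea behind the dual learning rate can be sketched as two optimizer parameter groups split by parameter name. this is a plain-python illustration, not the `UnslothTrainer` internals; `split_lr_groups` and the example parameter names are hypothetical.

```python
EMBEDDING_KEYS = ("embed_tokens", "lm_head")


def split_lr_groups(param_names, lr, embedding_lr):
    """Return optimizer-style groups where embeddings get a smaller lr."""
    embed = [n for n in param_names if any(k in n for k in EMBEDDING_KEYS)]
    rest = [n for n in param_names if n not in embed]
    # a real optimizer would receive tensors under "params", not names
    return [
        {"params": rest, "lr": lr},
        {"params": embed, "lr": embedding_lr},
    ]


# hypothetical parameter names, shaped like a peft-wrapped qwen3 model
names = [
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.layers.0.mlp.down_proj.lora_B.weight",
    "model.embed_tokens.weight",
    "lm_head.weight",
]
groups = split_lr_groups(names, lr=1e-4, embedding_lr=1e-5)  # stage b values
for group in groups:
    print(group["lr"], len(group["params"]))
```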

**stage a — continued pretraining** on native uzbek text (3 datasets, ~2m rows):
- `lr=5e-5`, `embedding_lr=5e-6`, 1 epoch, `packing=False`
- the goal is to reshape the token distribution so the model fluently predicts uzbek sequences, not to learn task behavior

**stage b — supervised fine-tune** on chat-formatted uzbek instructions (6 datasets, ~4.3m rows):
- loaded from the stage a adapter (continues training, not a fresh start)
- `lr=1e-4`, `embedding_lr=1e-5`, 1 epoch
- `train_on_responses_only=True`: the loss is masked on the user side of the qwen3 chat template so gradient only flows through assistant responses; prevents the model from memorizing prompt patterns and gives +12% accuracy empirically
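
response-only loss masking can be sketched on a toy token list. assumptions: the qwen3 template marks turns with `<|im_start|>role` and `<|im_end|>`, tokens are shown as strings instead of ids, and `-100` is the label value that pytorch's cross-entropy loss ignores by default. `mask_non_assistant` is a hypothetical helper, not the unsloth implementation.

```python
IGNORE_INDEX = -100  # label value skipped by torch.nn.CrossEntropyLoss by default
ASSISTANT_HEADER = ["<|im_start|>", "assistant", "\n"]


def mask_non_assistant(tokens):
    """Keep labels only for assistant content, including its closing <|im_end|>."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    i = 0
    while i < len(tokens):
        if tokens[i : i + 3] == ASSISTANT_HEADER:
            in_response = True
            i += 3  # the header itself is prompt, not a training target
            continue
        if in_response:
            labels[i] = tokens[i]
            if tokens[i] == "<|im_end|>":
                in_response = False
        i += 1
    return labels


tokens = [
    "<|im_start|>", "user", "\n", "salom", "<|im_end|>", "\n",
    "<|im_start|>", "assistant", "\n", "salom", "!", "<|im_end|>", "\n",
]
labels = mask_non_assistant(tokens)
print(labels)
```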

**stable batching**:
- `per_device_train_batch_size=1`, `gradient_accumulation_steps=16` — effective batch 16
- at r=64 with embedding lora, batch 2 oom's on 24gb. the accumulation trick keeps gradient signal while staying inside the vram budget

**optimization**: `adamw_8bit` (bitsandbytes) + cosine schedule + 3% warmup, weight decay 0.01, seed 3407.
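
the batching and schedule numbers above can be checked with a few lines of arithmetic. this is a sketch of linear warmup followed by cosine decay to zero, the usual shape of a cosine schedule with warmup; `lr_at` is a hypothetical helper, the 8242-step total is the stage b step count quoted in the infra section, and rounding the warmup up to a whole step is an assumption.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 1 * 16
total_steps = 8242
for step in (0, 248, 4245, 8242):  # start, end of warmup, midpoint, last step
    print(step, lr_at(step, total_steps, peak_lr=1e-4))
```

with these inputs the warmup covers the first 248 optimizer steps, and each optimizer step consumes 16 sequences.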
### infra: surviving a 16-hour timeout on serverless gpu
modal functions have a 16-hour hard cap. stage b doesn't fit in 16 hours on a single l4, and a mid-epoch restart on a fresh container would lose everything (the `/ckpts/` volume persists, but the container that owns the training state doesn't).

the fix: a `TrainerCallback` that pushes each checkpoint to a private hugging face repo (`inspirebek/qwen3-4b-uzbek-ckpts`) every time `save_steps` fires. when the container timed out at step 7258 / 8242 (88%), the next run loaded `checkpoint-7000` from the hf backup and resumed on a different modal account, after the first account's $30 credit ran out. account switches are invisible to hugging face; the backup was the portable state.

total stage b runtime: ~10 h first leg + ~3.2 h resume leg = ~13.2 h on a single l4.
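
a minimal sketch of such a backup callback, assuming `transformers` and `huggingface_hub` are installed. the class name `HubBackupCallback`, the injectable `upload_folder` argument, and the `path_in_repo` layout are illustrative choices, not the author's actual code. it relies on the trainer writing each checkpoint to `<output_dir>/checkpoint-<global_step>` before `on_save` fires.

```python
import os

try:
    from transformers import TrainerCallback
except ImportError:  # lets the sketch be read and tested without transformers
    TrainerCallback = object


class HubBackupCallback(TrainerCallback):
    """Upload every saved checkpoint folder to a hugging face repo."""

    def __init__(self, repo_id, upload_folder=None):
        self.repo_id = repo_id
        self._upload_folder = upload_folder  # injectable, e.g. for tests

    def _uploader(self):
        if self._upload_folder is None:
            from huggingface_hub import HfApi  # needs a token with write access

            self._upload_folder = HfApi().upload_folder
        return self._upload_folder

    def on_save(self, args, state, control, **kwargs):
        name = f"checkpoint-{state.global_step}"
        folder = os.path.join(args.output_dir, name)
        if os.path.isdir(folder):
            self._uploader()(
                repo_id=self.repo_id,
                folder_path=folder,
                path_in_repo=name,
                commit_message=f"backup {name}",
            )
        return control
```

resuming then amounts to downloading the latest `checkpoint-*` folder (for example with `huggingface_hub.snapshot_download`) and passing its local path to `trainer.train(resume_from_checkpoint=...)`.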
### evaluation
- **mmlu-uz** (murodbek/MMLU-uz, zero-shot, logit-based multiple choice): **40.50%** overall
  - social sciences 45.43%, other 45.07%, business 42.42%, stem 41.67%, medical 39.67%, humanities 35.60%
- **uzlib** (tahrirchi/uzlib, uzbek linguistic benchmark, sampled generation t=1.0 p=0.95): **33.42%** overall
  - correct_word 34.98%, fill_in 30.77%, meaning 26.69%, meaning_in_context 25.00%
  - 8.17% of responses didn't parse cleanly as `A/B/C/D` — a fraction of the score loss is format drift, not knowledge
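
the two scoring modes above differ in how an answer is read out. a sketch under assumptions: for logit-based scoring the model's next-token logits for the four option letters are compared directly, while for sampled generation the letter has to be parsed from free text and parsing can fail. `pick_by_logit`, `parse_choice`, and the regex are hypothetical, not the harness that produced the numbers above.

```python
import re

CHOICES = "ABCD"


def pick_by_logit(option_logits):
    """Logit-based multiple choice: the highest-scoring option letter wins."""
    return max(CHOICES, key=lambda c: option_logits[c])


def parse_choice(text):
    """Return the first standalone A/B/C/D in a generated answer, else None."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


def accuracy(predictions, gold):
    """Unparsed answers (None) simply count as wrong."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


best = pick_by_logit({"A": 1.2, "B": 3.4, "C": 0.1, "D": -2.0})
print(best)  # B

outputs = ["B", "Javob: C.", "bilmayman"]  # the last one has no parsable letter
preds = [parse_choice(o) for o in outputs]
score = accuracy(preds, ["B", "C", "A"])
print(preds, score)
```

with logit scoring every question yields an answer, whereas the parsing path can return `None`, which is the kind of format drift the 8.17% figure refers to.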
## datasets
**stage a — fluency (continued pretraining):**
- [`yakhyo/uz-wiki`](https://huggingface.co/datasets/yakhyo/uz-wiki) · MIT
- [`tahrirchi/uz-books-v2`](https://huggingface.co/datasets/tahrirchi/uz-books-v2) · MIT
- [`tahrirchi/uz-crawl`](https://huggingface.co/datasets/tahrirchi/uz-crawl) · Apache-2.0

**stage b — instruct (sft):**
- [`saillab/alpaca_uzbek_taco`](https://huggingface.co/datasets/saillab/alpaca_uzbek_taco) · CC-BY-NC-4.0
- [`behbudiy/alpaca-cleaned-uz`](https://huggingface.co/datasets/behbudiy/alpaca-cleaned-uz) · CC-BY-4.0
- [`UAzimov/uzbek-instruct-llm`](https://huggingface.co/datasets/UAzimov/uzbek-instruct-llm) · Apache-2.0
- [`CohereLabs/aya_collection_language_split`](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split) · Apache-2.0
- [`med-alex/qa_mt_ru_to_uzn`](https://huggingface.co/datasets/med-alex/qa_mt_ru_to_uzn) · unspecified
- [`med-alex/qa_mt_tr_to_uzn`](https://huggingface.co/datasets/med-alex/qa_mt_tr_to_uzn) · unspecified
> ⚠️ licensing note: `saillab/alpaca_uzbek_taco` is cc-by-nc-4.0, which restricts commercial use of derivative models. downstream users who need a fully permissive license should retrain without that subset.
## sibling formats
- [`inspirebek/qwen3-4b-uzbek-v2`](https://huggingface.co/inspirebek/qwen3-4b-uzbek-v2)
- [`inspirebek/qwen3-4b-uzbek-v2-lora`](https://huggingface.co/inspirebek/qwen3-4b-uzbek-v2-lora)
- [`inspirebek/qwen3-4b-uzbek-v2-bnb-4bit`](https://huggingface.co/inspirebek/qwen3-4b-uzbek-v2-bnb-4bit)
- [`inspirebek/qwen3-4b-uzbek-v2-awq`](https://huggingface.co/inspirebek/qwen3-4b-uzbek-v2-awq)
- [`inspirebek/qwen3-4b-uzbek-v2-GGUF`](https://huggingface.co/inspirebek/qwen3-4b-uzbek-v2-GGUF)
## intended use & limitations
uzbek-first chat assistant. capable in english as well. not aligned for safety — treat as a research artifact. knowledge cutoff inherits from `Qwen/Qwen3-4B`.