初始化项目，由ModelHub XC社区提供模型

Model: junaid008/qehwa-pashto-llm Source: Original Platform
2026-05-18 02:51:38 +08:00
commit c635798781
7 changed files with 585 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,36 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,455 @@
+---
+language:
+- ps
+- en
+- ur
+license: apache-2.0
+library_name: transformers
+tags:
+- pashto
+- peshawari
+- pakistani-pashto
+- causal-lm
+- qwen2
+- sft
+- cpt
+- unsloth
+- trl
+base_model: Qwen/Qwen2.5-7B
+pipeline_tag: text-generation
+---
+
+# ☕ Qehwa — Pashto's First LLM
+
+**The first and best Pakistani Pashto large language model — specifically trained on Peshawari dialect.**
+
+Built by a solo developer as a free and open resource for 60+ million Pashto speakers worldwide.
+
+> ⚠️ This model performs best on Pakistani/Peshawari Pashto. Performance may be lower on Afghan Pashto dialect.
+
+---
+
+## 🌟 Model Description
+
+**Qehwa** is a fully instruction-tuned Pashto language model built on top of Qwen2.5-7B. It is the result of two-stage training:
+
+1. **Continued Pre-Training (CPT)** on 3.4 million clean Pakistani Pashto documents
+2. **Supervised Fine-Tuning (SFT)** on 126,519 high-quality Peshawari Pashto instruction-response pairs 
+
+This is the **first dedicated Pakistani Pashto LLM** — no comparable model exists publicly. It specifically targets the **Peshawari/KPK dialect** rather than generic or Afghan Pashto.
+
+This repo contains the **fully merged model** — ready to use with standard transformers, no additional libraries required.
+
+---
+
+## ✨ Capabilities
+
+- ✅ Answers questions in pure Peshawari Pashto
+- ✅ Responds to English instructions in Pashto
+- ✅ Responds to Urdu instructions in Pashto
+- ✅ Natural Pashto conversation
+- ✅ Pashto creative writing and poetry
+- ✅ Islamic topics in Pashto
+- ✅ KPK history, culture, and geography
+- ✅ Pashtunwali traditions and ethics
+- ✅ Pashto grammar correction
+- ✅ English to Pashto translation
+- ✅ Correct Pashto-specific characters: ښ ږ ټ ډ ړ ځ
+
+---
+
+## 📊 Evaluation Results
+
+Qehwa was evaluated on a custom benchmark of **150 tests across 15 categories** — the first ever comprehensive Pashto LLM benchmark. Since no standard Pashto benchmark exists publicly, this evaluation was designed specifically for Pakistani Pashto.
+
+### Top Performing Categories
+
+| Category | Score |
+|---|---|
+| English → Pashto | **90%** 🔥🔥 |
+| Urdu → Pashto | **84%** 🔥🔥 |
+| Health & Daily Life in Pashto | **90%** 🔥🔥 |
+| Culture & History | **90%** 🔥 |
+| Geography & Nature | **90%** 🔥 |
+
+> **Overall Average Accuracy across all 15 benchmark categories: 85.3%**
+
+### Evaluation Methodology
+- 150 custom Pashto prompts across 15 categories
+- Evaluated on A100 40GB GPU
+- Human reviewed outputs for fluency, accuracy and dialect correctness
+- No existing Pashto benchmark was available — this is the first Pashto LLM benchmark
+
+---
+
+## 💻 Installation
+```bash
+pip install transformers accelerate torch
+```
+
+For faster inference:
+```bash
+pip install unsloth
+```
+
+For running locally on CPU or small GPU:
+```bash
+pip install transformers accelerate bitsandbytes
+```
+
+---
+
+## 🚀 How to Use
+
+### ✅ Method 1 — Transformers (Recommended)
+
+Best for: Research, production, standard usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_name = "junaid008/qehwa-pashto-llm"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model     = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype = torch.bfloat16,
+    device_map  = "auto",
+)
+
+ALPACA_TEMPLATE = """Below is an instruction in Pashto or English. Write a detailed response in Pashto.
+
+### Instruction:
+{}
+
+### Response:
+{}"""
+
+def generate(prompt):
+    inputs = tokenizer(
+        ALPACA_TEMPLATE.format(prompt, ""),
+        return_tensors = "pt",
+    ).to("cuda")
+
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens     = 500,
+        temperature        = 0.7,
+        do_sample          = True,
+        repetition_penalty = 1.1,
+        pad_token_id       = tokenizer.eos_token_id,
+    )
+
+    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    return response.split("### Response:")[-1].strip()
+
+# Pashto input
+print(generate("د پیښور تاریخ راته ووایه"))
+
+# English input
+print(generate("Tell me about Pashtunwali"))
+
+# Urdu input
+print(generate("پشاور کے بارے میں بتاؤ"))
+```
+
+---
+
+### ✅ Method 2 — 4-bit Quantization (Low VRAM)
+
+Best for: GPUs with 8GB VRAM or less
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+
+model_name = "junaid008/qehwa-pashto-llm"
+
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit              = True,
+    bnb_4bit_quant_type       = "nf4",
+    bnb_4bit_compute_dtype    = torch.bfloat16,
+    bnb_4bit_use_double_quant = True,
+)
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model     = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    quantization_config = bnb_config,
+    device_map          = "auto",
+)
+
+ALPACA_TEMPLATE = """Below is an instruction in Pashto or English. Write a detailed response in Pashto.
+
+### Instruction:
+{}
+
+### Response:
+{}"""
+
+def generate(prompt):
+    inputs = tokenizer(
+        ALPACA_TEMPLATE.format(prompt, ""),
+        return_tensors = "pt",
+    ).to("cuda")
+
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens     = 500,
+        temperature        = 0.7,
+        do_sample          = True,
+        repetition_penalty = 1.1,
+        pad_token_id       = tokenizer.eos_token_id,
+    )
+
+    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    return response.split("### Response:")[-1].strip()
+
+print(generate("پښتونولي تشریح کړه"))
+```
+
+---
+
+### ✅ Method 3 — Unsloth (2x Faster Inference)
+
+Best for: Speed-optimized usage, Colab, A100/H100
+```python
+from unsloth import FastLanguageModel
+
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name     = "junaid008/qehwa-pashto-llm",
+    max_seq_length = 2048,
+    dtype          = None,
+    load_in_4bit   = False,
+)
+FastLanguageModel.for_inference(model)
+
+ALPACA_TEMPLATE = """Below is an instruction in Pashto or English. Write a detailed response in Pashto.
+
+### Instruction:
+{}
+
+### Response:
+{}"""
+
+import torch
+inputs = tokenizer(
+    ALPACA_TEMPLATE.format("د پیښور تاریخ راته ووایه", ""),
+    return_tensors = "pt",
+).to("cuda")
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens     = 500,
+    temperature        = 0.7,
+    do_sample          = True,
+    repetition_penalty = 1.1,
+    pad_token_id       = tokenizer.pad_token_id,
+)
+
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response.split("### Response:")[-1].strip())
+```
+
+---
+
+### ✅ Method 4 — CPU Only (No GPU)
+
+Best for: Testing on laptop, no GPU available (slow but works)
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_name = "junaid008/qehwa-pashto-llm"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model     = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype = torch.float32,  # float32 for CPU
+    device_map  = "cpu",
+)
+
+ALPACA_TEMPLATE = """Below is an instruction in Pashto or English. Write a detailed response in Pashto.
+
+### Instruction:
+{}
+
+### Response:
+{}"""
+
+inputs = tokenizer(
+    ALPACA_TEMPLATE.format("پښتو ژبه د چا ده؟", ""),
+    return_tensors = "pt",
+)
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 200,
+    do_sample      = False,   # greedy for CPU speed
+    pad_token_id   = tokenizer.eos_token_id,
+)
+
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response.split("### Response:")[-1].strip())
+```
+
+---
+
+### ✅ Method 5 — Google Colab (Free)
+
+Best for: Trying without any local setup
+
+Open in Colab and run:
+```python
+# Install
+!pip install transformers accelerate -q
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("junaid008/qehwa-pashto-llm")
+model     = AutoModelForCausalLM.from_pretrained(
+    "junaid008/qehwa-pashto-llm",
+    torch_dtype = torch.bfloat16,
+    device_map  = "auto",
+)
+
+ALPACA_TEMPLATE = """Below is an instruction in Pashto or English. Write a detailed response in Pashto.
+
+### Instruction:
+{}
+
+### Response:
+{}"""
+
+def generate(prompt):
+    inputs  = tokenizer(ALPACA_TEMPLATE.format(prompt, ""), return_tensors="pt").to("cuda")
+    outputs = model.generate(**inputs, max_new_tokens=500, temperature=0.7,
+                              do_sample=True, pad_token_id=tokenizer.eos_token_id)
+    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[-1].strip()
+
+print(generate("Tell me about Peshawar"))
+print(generate("پښتونولي تشریح کړه"))
+print(generate("پشاور کا مشہور کھانا کیا ہے؟"))
+```
+
+---
+
+## ⚙️ Hardware Requirements
+
+| Method | VRAM | Speed |
+|---|---|---|
+| bfloat16 full | 16GB+ | ✅ Fast |
+| 4-bit quantized | 8GB+ | ✅ Good |
+| Unsloth | 16GB+ | 🔥 2x Faster |
+| CPU only | No GPU | ⚠️ Slow |
+
+---
+
+## 📊 Training Details
+
+### Stage 1 — Continued Pre-Training (CPT)
+
+| Parameter | Value |
+|---|---|
+| Base model | Qwen/Qwen2.5-7B |
+| Hardware | NVIDIA A100-SXM4-40GB |
+| Training steps | 5,000 |
+| Final CPT loss | ~1.8 |
+| Dataset size | 3,400,000 documents |
+| Sequence length | 2,048 tokens |
+| Precision | bfloat16 |
+| LoRA rank | 64 |
+| Learning rate | 5e-5 |
+| Effective batch size | 32 |
+
+### Stage 2 — Supervised Fine-Tuning (SFT)
+
+| Parameter | Value |
+|---|---|
+| Base model | junaid008/pashto-qwen2.5-7b-v3 (CPT) |
+| Hardware | NVIDIA A100-SXM4-40GB |
+| Training steps | 7,908 |
+| Final SFT loss | 0.455 |
+| Dataset size | 126,519 pairs |
+| Epochs | 2 |
+| Sequence length | 2,048 tokens |
+| Precision | bfloat16 |
+| LoRA rank | 64 |
+| Learning rate | 5e-5 |
+| Effective batch size | 32 |
+
+---
+
+## 📚 Dataset
+
+### CPT Dataset
+- 3.4 million Pakistani Pashto documents
+- Sources: news, books, religious texts, Wikipedia, web crawl
+- Custom cleaned with Pashto-specific Unicode normalization
+
+### SFT Dataset — 126,519 pairs:
+
+| Type | Description |
+|---|---|
+| Pashto → Pashto | Questions and answers in pure Peshawari Pashto |
+| English → Pashto | English instructions with Pashto responses |
+| Urdu → Pashto | Urdu instructions with Pashto responses |
+| Conversation | Natural dialogue in Peshawari dialect |
+| Islamic topics | Religious knowledge in Pashto |
+| Creative writing | Pashto poetry, stories, descriptions |
+| Grammar | Pashto language correction and explanation |
+| Translation | English to Pashto translation pairs |
+| Cultural | Pashtunwali, traditions, history, geography |
+
+---
+
+## 🗣️ Example Outputs
+
+**Pashto input:**
+```
+Prompt : د پیښور تاریخ راته ووایه
+Output : پیښور د نړۍ یو له تر ټولو پخوانیو ښارونو څخه دی چې تاریخ یې زرګونه
+         کاله پخوا ته رسیږي. د دې ښار پخوانی نوم پوشکلاوتي او بیا پروشاپور و.
+         د پیښور د بالاحصار کلا د مغلو او سکهانو د وخت یادګار دی...
+```
+
+**English input:**
+```
+Prompt : Tell me about Pashtunwali
+Output : پښتونولي د پښتنو د ژوند هغه لرغونی او سپېڅلی ضابطه اخلاق ده چې د دوی
+         ټولنیز ژوند یې تنظیم کړی دی. په دې کې مېلمه پالنه، غیرت، او د کمزورو
+         ساتنه تر ټولو لوی ارزښتونه ګڼل کېږي...
+```
+
+**Urdu input:**
+```
+Prompt : پشتونوالی کیا ہے؟
+Output : پښتونولي د پښتنو د ژوند هغه لرغونی او زرین ضابطه اخلاق ده...
+```
+
+---
+
+## 👤 Author
+
+**Junaid Aslam**
+- HuggingFace: [junaid008](https://huggingface.co/junaid008)
+- Built independently as a contribution to Pashto NLP
+
+---
+
+## 📜 License
+
+Apache 2.0 — free to use, modify, and distribute with attribution.
+
+---
+
+## 🤝 Citation
+```bibtex
+@misc{qehwa-pashto-llm,
+  author    = {Junaid Aslam},
+  title     = {Qehwa — Pashto's First LLM},
+  year      = {2026},
+  publisher = {HuggingFace},
+  url       = {https://huggingface.co/junaid008/qehwa-pashto-llm}
+}
+```
--- a/config.json
+++ b/config.json
@@ -0,0 +1,64 @@
+{
+  "architectures": [
+    "Qwen2ForCausalLM"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": null,
+  "dtype": "bfloat16",
+  "eos_token_id": 151643,
+  "hidden_act": "silu",
+  "hidden_size": 3584,
+  "initializer_range": 0.02,
+  "intermediate_size": 18944,
+  "layer_types": [
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention"
+  ],
+  "max_position_embeddings": 131072,
+  "max_window_layers": 28,
+  "model_type": "qwen2",
+  "num_attention_heads": 28,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 4,
+  "pad_token_id": 151665,
+  "rms_norm_eps": 1e-06,
+  "rope_parameters": {
+    "rope_theta": 1000000.0,
+    "rope_type": "default"
+  },
+  "sliding_window": null,
+  "tie_word_embeddings": false,
+  "transformers_version": "5.2.0",
+  "unsloth_fixed": true,
+  "unsloth_version": "2026.3.4",
+  "use_cache": false,
+  "use_mrope": false,
+  "use_sliding_window": false,
+  "vocab_size": 152064
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,9 @@
+{
+  "eos_token_id": [
+    151643
+  ],
+  "max_length": 131072,
+  "max_new_tokens": 2048,
+  "pad_token_id": 151665,
+  "transformers_version": "5.2.0"
+}
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1b14653034d5866af47bde6859adb272c70eb2475ff742a914bde1cd4287f39e
+size 15231272152
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bd5948af71b4f56cf697f7580814c7ce8b80595ef985544efcacf716126a2e31
+size 11422356
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,15 @@
+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|PAD_TOKEN|>",
+  "padding_side": "left",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}