---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- anuragshas/ur_opus100_processed_cv9
- mahwizzzz/urdu_alpaca_yc_filtered
- large-traversaal/urdu-instruct
- muhammadnoman76/lughaat-urdu-dataset-llm
tags:
- urdu
- nlp
- qwen
- unsloth
- instruct
- tts-ready
- ocr-optimized
- mega-dataset
language:
- ur
metrics:
- loss
library_name: transformers
pipeline_tag: text-generation
model_name: Qwen-Urdu-Shaheen-7B-Instruct-v1
---
"شاہین کا جہاں اور، کرگس کا جہاں اور..."
— علامہ اقبال کے افکار اور اردو ادب کی ترویج کے لیے ایک جدید لسانی ماڈل
🦅 Qwen-Urdu-Shaheen-7B-Instruct (v1.0) 🇵🇰
**Qwen-Urdu-Shaheen** is an Urdu language model fine-tuned on a curated corpus of **1.83 million rows**. It is designed to bridge the gap between classical Urdu intellectual heritage and modern conversational AI.

Built on the **Qwen 2.5 7B Instruct** architecture and fine-tuned with **Unsloth**, the model delivers efficient inference while preserving cultural and linguistic nuance.
---
## 🌟 Key Highlights
- **Massive Scale:** Fine-tuned on 1.83M curated Urdu records.
- **Literary Depth:** Specialized in the philosophy of **Allama Iqbal** and the poetry of **Ghalib** and **Ahmed Faraz**.
- **Instruction Master:** Optimized with the **Alif-Instruct** dataset for precise Urdu command following.
- **Modern Context:** Integrated with the **Lughat News Corpus** for contemporary vocabulary and news synthesis.
- **OCR-Aware:** Trained to process and generate couplets derived from Urdu poetry OCR datasets.
---
## 📊 Dataset Composition
The model was trained on a multi-domain Urdu corpus to ensure versatility:
| Category | Dataset Source | Description |
| :--- | :--- | :--- |
| **Literature** | Iqbaliyat & Ghazal Bank | Classical and contemporary poetry analysis. |
| **Instruction** | Alif-Instruct | Multi-turn Urdu dialogues and logic tasks. |
| **Current Affairs** | Lughat News | Modern Urdu prose and media vocabulary. |
| **Specialized** | Urdu-Poetry-OCR | Structural understanding of poetic couplets. |
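The exact mixing recipe, and how these categories map onto the Hub datasets listed in the card metadata, is not published. As a rough sketch, the listed sources can be inspected with the `datasets` library (the `train` split name is an assumption):

```python
# Illustrative sketch only: split names and any merging strategy are assumptions.
from datasets import load_dataset

SOURCES = [
    "anuragshas/ur_opus100_processed_cv9",
    "mahwizzzz/urdu_alpaca_yc_filtered",
    "large-traversaal/urdu-instruct",
    "muhammadnoman76/lughaat-urdu-dataset-llm",
]

for name in SOURCES:
    ds = load_dataset(name, split="train")
    print(f"{name}: {len(ds):,} rows, columns: {ds.column_names}")

# Building a single training corpus would additionally require mapping each
# source onto a shared schema, since column names differ across the sets.
```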
---
## 🛠️ Technical Specifications
- **Base Model:** `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- **Architecture:** Causal Language Model (Transformer)
- **Parameters:** 7 Billion
- **Training Tool:** Unsloth (2x faster finetuning)
- **Hardware:** NVIDIA GeForce RTX 4060 Ti 16GB
- **Quantization:** 4-bit (bitsandbytes)
- **Checkpoint:** 4500 Steps
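The training script itself is not part of this card. A minimal Unsloth sketch matching the specs above might look like the following; the LoRA rank, batch size, learning rate, and the `text` field name are illustrative assumptions, not the published configuration:

```python
# Hypothetical fine-tuning sketch; all hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the 4-bit base model named in the specs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,          # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Stand-in for the full 1.83M-row mixed corpus described above.
corpus = load_dataset("large-traversaal/urdu-instruct", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=corpus,
    dataset_text_field="text",   # assumed field name
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="shaheen-checkpoints",
        per_device_train_batch_size=2,   # sized for a 16 GB RTX 4060 Ti
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_steps=4500,                  # matches the released checkpoint
    ),
)
trainer.train()
```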
---
## 💎 Shaheen Highlights
| Feature | Capability |
| :--- | :--- |
| **Dataset Size** | 1.83 Million Urdu Rows |
| **Optimization** | Unsloth (4-bit LoRA) |
| **Primary Focus** | Iqbaliyat & Urdu Prose |
| **OCR Support** | Specialized for Nastaliq script couplets |
---
## 🚀 Quick Start (Inference)
```python
from unsloth import FastLanguageModel

# Load the 4-bit checkpoint and its tokenizer (requires a CUDA GPU).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

# Sample prompt: "Summarize Allama Iqbal's philosophy of Khudi (selfhood)."
prompt = "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"

# Qwen 2.5 uses the ChatML format, so wrap the prompt in its role markers.
inputs = tokenizer(
    [f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
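If Unsloth is not installed, the checkpoint should also load through plain `transformers`. This is an untested sketch that assumes `bitsandbytes` is available for 4-bit loading; `tokenizer.apply_chat_template` reproduces the ChatML framing built by hand above:

```python
# Alternative loading path; a sketch, not an officially tested recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Sample prompt: "Describe the characteristics of Mirza Ghalib's poetry."
messages = [{"role": "user", "content": "مرزا غالب کی شاعری کی خصوصیات بیان کریں۔"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=300, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```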