---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- anuragshas/ur_opus100_processed_cv9
- mahwizzzz/urdu_alpaca_yc_filtered
- large-traversaal/urdu-instruct
- muhammadnoman76/lughaat-urdu-dataset-llm
tags:
- urdu
- nlp
- qwen
- unsloth
- instruct
- tts-ready
- ocr-optimized
- mega-dataset
language:
- ur
metrics:
- loss
library_name: transformers
pipeline_tag: text-generation
model_name: Qwen-Urdu-Shaheen-7B-Instruct-v1
---
"شاہین کا جہاں اور، کرگس کا جہاں اور..."
— علامہ اقبال کے افکار اور اردو ادب کی ترویج کے لیے ایک جدید لسانی ماڈل
🦅 Qwen-Urdu-Shaheen-7B-Instruct (v1.0) 🇵🇰
**Qwen-Urdu-Shaheen** is an Urdu language model fine-tuned on a curated corpus of **1.83 million rows**. It is designed to bridge the gap between classical Urdu intellectual heritage and modern conversational AI.

Built on the **Qwen 2.5 7B Instruct** architecture and fine-tuned with **Unsloth**, the model delivers efficient inference while preserving cultural and linguistic nuance.
---
## 🌟 Key Highlights
- **Massive Scale:** Fine-tuned on 1.83M curated Urdu records.
- **Literary Depth:** Specialized in the philosophy of **Allama Iqbal** and the poetry of **Ghalib** and **Ahmed Faraz**.
- **Instruction Master:** Optimized with the **Alif-Instruct** dataset for precise Urdu command following.
- **Modern Context:** Integrated with the **Lughat News Corpus** for contemporary vocabulary and news synthesis.
- **OCR-Aware:** Trained to process and generate couplets derived from Urdu poetry OCR datasets.
---
## 📊 Dataset Composition
The model was trained on a multi-domain Urdu corpus to ensure versatility:
| Category | Dataset Source | Description |
| :--- | :--- | :--- |
| **Literature** | Iqbaliyat & Ghazal Bank | Classical and contemporary poetry analysis. |
| **Instruction** | Alif-Instruct | Multi-turn Urdu dialogues and logic tasks. |
| **Current Affairs** | Lughat News | Modern Urdu prose and media vocabulary. |
| **Specialized** | Urdu-Poetry-OCR | Structural understanding of poetic couplets. |
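The exact mixing recipe, and how these categories map onto the Hub datasets listed in the card metadata, is not published. As a rough sketch, the listed sources can be inspected with the `datasets` library (the `train` split name is an assumption):

```python
# Illustrative sketch only: split names and any merging strategy are assumptions.
from datasets import load_dataset

SOURCES = [
    "anuragshas/ur_opus100_processed_cv9",
    "mahwizzzz/urdu_alpaca_yc_filtered",
    "large-traversaal/urdu-instruct",
    "muhammadnoman76/lughaat-urdu-dataset-llm",
]

for name in SOURCES:
    ds = load_dataset(name, split="train")
    print(f"{name}: {len(ds):,} rows, columns: {ds.column_names}")

# Building a single training corpus would additionally require mapping each
# source onto a shared schema, since column names differ across the sets.
```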
---
## 🛠️ Technical Specifications
- **Base Model:** `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- **Architecture:** Causal Language Model (Transformer)
- **Parameters:** 7 Billion
- **Training Tool:** Unsloth (2x faster finetuning)
- **Hardware:** NVIDIA GeForce RTX 4060 Ti 16GB
- **Quantization:** 4-bit (bitsandbytes)
- **Checkpoint:** 4500 Steps
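The training script itself is not part of this card. A minimal Unsloth sketch matching the specs above might look like the following; the LoRA rank, batch size, learning rate, and the `text` field name are illustrative assumptions, not the published configuration:

```python
# Hypothetical fine-tuning sketch; all hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the 4-bit base model named in the specs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,          # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Stand-in for the full 1.83M-row mixed corpus described above.
corpus = load_dataset("large-traversaal/urdu-instruct", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=corpus,
    dataset_text_field="text",   # assumed field name
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="shaheen-checkpoints",
        per_device_train_batch_size=2,   # sized for a 16 GB RTX 4060 Ti
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_steps=4500,                  # matches the released checkpoint
    ),
)
trainer.train()
```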
---
## 💎 Shaheen Highlights
| Feature | Capability |
| :--- | :--- |
| **Dataset Size** | 1.83 Million Urdu Rows |
| **Optimization** | Unsloth (4-bit LoRA) |
| **Primary Focus** | Iqbaliyat & Urdu Prose |
| **OCR Support** | Specialized for Nastaliq script couplets |
---
## 🚀 Quick Start (Inference)
```python
from unsloth import FastLanguageModel

# Load the 4-bit checkpoint and its tokenizer (requires a CUDA GPU).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

# Sample prompt: "Summarize Allama Iqbal's philosophy of Khudi (selfhood)."
prompt = "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"

# Qwen 2.5 uses the ChatML format, so wrap the prompt in its role markers.
inputs = tokenizer(
    [f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
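If Unsloth is not installed, the checkpoint should also load through plain `transformers`. This is an untested sketch that assumes `bitsandbytes` is available for 4-bit loading; `tokenizer.apply_chat_template` reproduces the ChatML framing built by hand above:

```python
# Alternative loading path; a sketch, not an officially tested recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Sample prompt: "Describe the characteristics of Mirza Ghalib's poetry."
messages = [{"role": "user", "content": "مرزا غالب کی شاعری کی خصوصیات بیان کریں۔"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=300, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```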