---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- anuragshas/ur_opus100_processed_cv9
- mahwizzzz/urdu_alpaca_yc_filtered
- large-traversaal/urdu-instruct
- muhammadnoman76/lughaat-urdu-dataset-llm
tags:
- urdu
- nlp
- qwen
- unsloth
- instruct
- tts-ready
- ocr-optimized
- mega-dataset
language:
- ur
metrics:
- loss
library_name: transformers
pipeline_tag: text-generation
model_name: Qwen-Urdu-Shaheen-7B-Instruct-v1
---
<p align="center">
  <br>
  <b style="font-size: 24px;">"شاہین کا جہاں اور، کرگس کا جہاں اور..."</b> <br>
  <i>"The world of the shaheen is one; the world of the vulture, another..."</i> <br>
  <b style="font-size: 18px;">— A modern language model for promoting Allama Iqbal's thought and Urdu literature</b>
</p>

<p align="center">
  <img src="https://huggingface.co/Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1/resolve/main/images.jpeg" width="300" alt="Qwen-Urdu-Shaheen Logo">
</p>

<h1 align="center">🦅 Qwen-Urdu-Shaheen-7B-Instruct (v1.0) 🇵🇰</h1>

**Qwen-Urdu-Shaheen** is an Urdu language model fine-tuned on a corpus of **1.83 million rows**. It is designed to bridge the gap between classical Urdu intellectual heritage and modern conversational AI.

Built on the **Qwen 2.5 7B Instruct** architecture and fine-tuned with **Unsloth**, the model combines efficient inference with deep cultural and linguistic nuance.

---
## 🌟 Key Highlights

- **Massive scale:** Fine-tuned on 1.83M curated Urdu records.
- **Literary depth:** Specialized in the philosophy of **Allama Iqbal** and the poetry of **Ghalib** and **Ahmed Faraz**.
- **Instruction following:** Optimized with the **Alif-Instruct** dataset for precise Urdu command following.
- **Modern context:** Incorporates the **Lughat News Corpus** for contemporary vocabulary and news synthesis.
- **OCR synergy:** Trained to process and generate couplets derived from Urdu poetry OCR datasets.

---
## 📊 Dataset Composition

The model was trained on a multi-domain Urdu corpus to ensure versatility:

| Category | Dataset Source | Description |
| :--- | :--- | :--- |
| **Literature** | Iqbaliyat & Ghazal Bank | Classical and contemporary poetry analysis. |
| **Instruction** | Alif-Instruct | Multi-turn Urdu dialogues and logic tasks. |
| **Current Affairs** | Lughat News | Modern Urdu prose and media vocabulary. |
| **Specialized** | Urdu-Poetry-OCR | Structural understanding of poetic couplets. |
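Records from these sources can be normalized into Qwen's ChatML chat format before training. The helper below is an illustrative sketch only; the `instruction` and `response` field names are assumptions, not the actual column names of these datasets:

```python
def to_chatml(instruction: str, response: str) -> str:
    """Format one instruction/response pair in Qwen's ChatML layout.

    Note: `instruction` and `response` are illustrative field names;
    each source dataset may use its own schema.
    """
    return (
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>\n"
    )

# Example record (Urdu: "Who was Iqbal?" / "Allama Muhammad Iqbal was a
# great poet and thinker of the subcontinent.")
sample = to_chatml("اقبال کون تھے؟", "علامہ محمد اقبال برصغیر کے عظیم شاعر اور مفکر تھے۔")
```

This matches the `<|im_start|>` / `<|im_end|>` delimiters used in the Quick Start prompt below, so training and inference see the same chat layout.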
---
## 🛠️ Technical Specifications

- **Base Model:** `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- **Architecture:** Causal language model (transformer)
- **Parameters:** 7 billion
- **Training Tool:** Unsloth (up to 2x faster fine-tuning)
- **Hardware:** NVIDIA GeForce RTX 4060 Ti 16 GB
- **Quantization:** 4-bit (bitsandbytes)
- **Checkpoint:** 4,500 steps
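As a rough sanity check on why a 16 GB consumer GPU suffices for a 7B model, 4-bit weights cost about half a byte per parameter. The back-of-the-envelope arithmetic below is a sketch only; it ignores LoRA adapters, activations, KV cache, and optimizer state:

```python
params = 7_000_000_000      # 7 billion parameters
bytes_per_param = 0.5       # 4-bit quantization = 4 bits = 0.5 bytes
weights_gib = params * bytes_per_param / (1024 ** 3)
# About 3.3 GiB for the quantized weights alone, leaving headroom
# on a 16 GiB card for adapters, activations, and the KV cache.
```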
---
### 💎 Shaheen Highlights

| Feature | Capability |
| :--- | :--- |
| **Dataset Size** | 1.83 million Urdu rows |
| **Optimization** | Unsloth (4-bit LoRA) |
| **Primary Focus** | Iqbaliyat & Urdu prose |
| **OCR Support** | Specialized for Nastaliq-script couplets |
## 🚀 Quick Start (Inference)

```python
from unsloth import FastLanguageModel
import torch

# Load the 4-bit model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable fast inference mode

# Sample prompt (Urdu: "Summarize Allama Iqbal's philosophy of khudi/selfhood.")
prompt = "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"
inputs = tokenizer(
    [f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"],
    return_tensors="pt",
).to("cuda")

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```