---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- anuragshas/ur_opus100_processed_cv9
- mahwizzzz/urdu_alpaca_yc_filtered
- large-traversaal/urdu-instruct
- muhammadnoman76/lughaat-urdu-dataset-llm
tags:
- urdu
- nlp
- qwen
- unsloth
- instruct
- tts-ready
- ocr-optimized
- mega-dataset
language:
- ur
metrics:
- loss
library_name: transformers
pipeline_tag: text-generation
model_name: Qwen-Urdu-Shaheen-7B-Instruct-v1
---
<p align="center">
  <br>
  <b style="font-size: 24px;">"شاہین کا جہاں اور، کرگس کا جہاں اور..."</b> <br>
  <i>"The world of the shaheen is one; the world of the vulture, another..."</i> <br>
  <b style="font-size: 18px;">— A modern language model for promoting Allama Iqbal's thought and Urdu literature</b>
</p>

<p align="center">
  <img src="https://huggingface.co/Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1/resolve/main/images.jpeg" width="300" alt="Qwen-Urdu-Shaheen Logo">
</p>

<h1 align="center">🦅 Qwen-Urdu-Shaheen-7B-Instruct (v1.0) 🇵🇰</h1>

**Qwen-Urdu-Shaheen** is an Urdu language model fine-tuned on a corpus of **1.83 million rows**. It is designed to bridge the gap between classical Urdu intellectual heritage and modern conversational AI.

Built on the **Qwen 2.5 7B Instruct** architecture and fine-tuned with **Unsloth**, the model combines efficient inference with deep cultural and linguistic nuance.

---
## 🌟 Key Highlights

- **Massive scale:** Fine-tuned on 1.83M curated Urdu records.
- **Literary depth:** Specialized in the philosophy of **Allama Iqbal** and the poetry of **Ghalib** and **Ahmed Faraz**.
- **Instruction following:** Optimized with the **Alif-Instruct** dataset for precise Urdu command following.
- **Modern context:** Incorporates the **Lughat News Corpus** for contemporary vocabulary and news synthesis.
- **OCR synergy:** Trained to process and generate couplets derived from Urdu poetry OCR datasets.

---
## 📊 Dataset Composition

The model was trained on a multi-domain Urdu corpus to ensure versatility:

| Category | Dataset Source | Description |
| :--- | :--- | :--- |
| **Literature** | Iqbaliyat & Ghazal Bank | Classical and contemporary poetry analysis. |
| **Instruction** | Alif-Instruct | Multi-turn Urdu dialogues and logic tasks. |
| **Current Affairs** | Lughat News | Modern Urdu prose and media vocabulary. |
| **Specialized** | Urdu-Poetry-OCR | Structural understanding of poetic couplets. |
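Records from these sources can be normalized into Qwen's ChatML chat format before training. The helper below is an illustrative sketch only; the `instruction` and `response` field names are assumptions, not the actual column names of these datasets:

```python
def to_chatml(instruction: str, response: str) -> str:
    """Format one instruction/response pair in Qwen's ChatML layout.

    Note: `instruction` and `response` are illustrative field names;
    each source dataset may use its own schema.
    """
    return (
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>\n"
    )

# Example record (Urdu: "Who was Iqbal?" / "Allama Muhammad Iqbal was a
# great poet and thinker of the subcontinent.")
sample = to_chatml("اقبال کون تھے؟", "علامہ محمد اقبال برصغیر کے عظیم شاعر اور مفکر تھے۔")
```

This matches the `<|im_start|>` / `<|im_end|>` delimiters used in the Quick Start prompt below, so training and inference see the same chat layout.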
---
## 🛠️ Technical Specifications

- **Base Model:** `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- **Architecture:** Causal language model (transformer)
- **Parameters:** 7 billion
- **Training Tool:** Unsloth (up to 2x faster fine-tuning)
- **Hardware:** NVIDIA GeForce RTX 4060 Ti 16 GB
- **Quantization:** 4-bit (bitsandbytes)
- **Checkpoint:** 4,500 steps
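As a rough sanity check on why a 16 GB consumer GPU suffices for a 7B model, 4-bit weights cost about half a byte per parameter. The back-of-the-envelope arithmetic below is a sketch only; it ignores LoRA adapters, activations, KV cache, and optimizer state:

```python
params = 7_000_000_000      # 7 billion parameters
bytes_per_param = 0.5       # 4-bit quantization = 4 bits = 0.5 bytes
weights_gib = params * bytes_per_param / (1024 ** 3)
# About 3.3 GiB for the quantized weights alone, leaving headroom
# on a 16 GiB card for adapters, activations, and the KV cache.
```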
---
### 💎 Shaheen Highlights

| Feature | Capability |
| :--- | :--- |
| **Dataset Size** | 1.83 million Urdu rows |
| **Optimization** | Unsloth (4-bit LoRA) |
| **Primary Focus** | Iqbaliyat & Urdu prose |
| **OCR Support** | Specialized for Nastaliq-script couplets |
## 🚀 Quick Start (Inference)

```python
from unsloth import FastLanguageModel
import torch

# Load the 4-bit model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable fast inference mode

# Sample prompt (Urdu: "Summarize Allama Iqbal's philosophy of khudi/selfhood.")
prompt = "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"
inputs = tokenizer(
    [f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"],
    return_tensors="pt",
).to("cuda")

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```