---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- anuragshas/ur_opus100_processed_cv9
- mahwizzzz/urdu_alpaca_yc_filtered
- large-traversaal/urdu-instruct
- muhammadnoman76/lughaat-urdu-dataset-llm
tags:
- urdu
- nlp
- qwen
- unsloth
- instruct
- tts-ready
- ocr-optimized
- mega-dataset
language:
- ur
metrics:
- loss
library_name: transformers
pipeline_tag: text-generation
model_name: Qwen-Urdu-Shaheen-7B-Instruct-v1
---


"The falcon's world is one thing, the vulture's world another..."
A modern language model for promoting the thought of Allama Iqbal and Urdu literature


# 🦅 Qwen-Urdu-Shaheen-7B-Instruct (v1.0) 🇵🇰

**Qwen-Urdu-Shaheen** is an Urdu language model fine-tuned on a **1.83 million row** corpus. It is designed to bridge the gap between classical Urdu intellectual heritage and modern conversational AI. Built on **Qwen 2.5 7B Instruct** and fine-tuned with **Unsloth**, it targets efficient 4-bit inference while preserving cultural and linguistic nuance.

---

## 🌟 Key Highlights

- **Massive Scale:** Fine-tuned on 1.83M curated Urdu records.
- **Literary Depth:** Specialized in the philosophy of **Allama Iqbal** and the poetry of **Ghalib** and **Ahmed Faraz**.
- **Instruction Following:** Optimized with the **Alif-Instruct** dataset for precise Urdu command following.
- **Modern Context:** Trained on the **Lughat News Corpus** for contemporary vocabulary and news synthesis.
- **OCR-Derived Poetry:** Trained to process and generate couplets derived from Urdu Poetry OCR datasets.

---

## 📊 Dataset Composition

The model was trained on a multi-domain Urdu corpus to ensure versatility:

| Category | Dataset Source | Description |
| :--- | :--- | :--- |
| **Literature** | Iqbaliyat & Ghazal Bank | Classical and contemporary poetry analysis. |
| **Instruction** | Alif-Instruct | Multi-turn Urdu dialogues and logic tasks. |
| **Current Affairs** | Lughat News | Modern Urdu prose and media vocabulary. |
| **Specialized** | Urdu-Poetry-OCR | Structural understanding of poetic couplets. |

---

## 🛠️ Technical Specifications

- **Base Model:** `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- **Architecture:** Causal Language Model (Transformer)
- **Parameters:** 7 Billion
- **Training Tool:** Unsloth (advertised as 2x faster fine-tuning)
- **Hardware:** NVIDIA GeForce RTX 4060 Ti 16 GB
- **Quantization:** 4-bit (bitsandbytes)
- **Checkpoint:** 4500 Steps

---

### 💎 Shaheen Highlights

| Feature | Capability |
| :--- | :--- |
| **Dataset Size** | 1.83 Million Urdu Rows |
| **Optimization** | Unsloth (4-bit LoRA) |
| **Primary Focus** | Iqbaliyat & Urdu Prose |
| **OCR Support** | Specialized for Nastaliq script couplets |

## 🚀 Quick Start (Inference)

```python
from unsloth import FastLanguageModel
import torch

# Load the fine-tuned model and tokenizer in 4-bit precision
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode

# Sample prompt: "Summarize Allama Iqbal's philosophy of Khudi (selfhood)."
prompt = "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"

# Wrap the prompt in the ChatML turn markers used by Qwen 2.5 Instruct
inputs = tokenizer(
    [f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"],
    return_tensors="pt",
).to("cuda")

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
print(tokenizer.batch_decode(outputs)[0])
```
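
The Quick Start prompt is assembled by hand from the ChatML turn markers (`<|im_start|>`, `<|im_end|>`). As a minimal sketch of that format, the helper below builds a multi-turn prompt from a list of role/content pairs. The name `build_chatml_prompt` and the example system message are assumptions made for illustration; they are not part of this repository or of any library. In practice, `tokenizer.apply_chat_template` from `transformers` is the usual way to produce this string, and its output may differ, for example by inserting a default system message.

```python
# Hypothetical helper for illustration only; not part of this model's repository.
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role": ..., "content": ...} dicts as a ChatML string."""
    parts = []
    for message in messages:
        # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    # The system message below is an assumed example, not taken from the model card
    {"role": "system", "content": "You are a helpful Urdu assistant."},
    {"role": "user", "content": "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"},
]
prompt_text = build_chatml_prompt(messages)
print(prompt_text)
```

The resulting `prompt_text` can be passed to the tokenizer in place of the hand-written f-string, which keeps the turn markers in one place when a conversation has more than one turn.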