ModelHub XC d12ef646a6 Initialize project; model provided by the ModelHub XC community
Model: Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1
Source: Original Platform
2026-05-06 21:06:56 +08:00

license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets: anuragshas/ur_opus100_processed_cv9, mahwizzzz/urdu_alpaca_yc_filtered, large-traversaal/urdu-instruct, muhammadnoman76/lughaat-urdu-dataset-llm
tags: urdu, nlp, qwen, unsloth, instruct, tts-ready, ocr-optimized, mega-dataset
language: ur
metrics: loss
library_name: transformers
pipeline_tag: text-generation
model_name: Qwen-Urdu-Shaheen-7B-Instruct-v1


"شاہین کا جہاں اور، کرگس کا جہاں اور..."
("The falcon's world is one, the vulture's world another...")
A modern language model for promoting the thought of Allama Iqbal and Urdu literature.

🦅 Qwen-Urdu-Shaheen-7B-Instruct (v1.0) 🇵🇰

Qwen-Urdu-Shaheen is an Urdu language model fine-tuned on a corpus of 1.83 million rows. It is designed to bridge the gap between classical Urdu intellectual heritage and modern conversational AI.

Fine-tuned from Qwen2.5-7B-Instruct with Unsloth, the model is intended for fast 4-bit inference while retaining Urdu cultural and linguistic nuance.


🌟 Key Highlights

  • Massive Scale: Fine-tuned on 1.83M curated Urdu records.
  • Literary Depth: Specialized in the philosophy of Allama Iqbal and the poetry of Ghalib and Ahmed Faraz.
  • Instruction Following: Tuned on the Alif-Instruct dataset to follow Urdu instructions precisely.
  • Modern Context: Includes the Lughat News Corpus for contemporary vocabulary and news synthesis.
  • OCR-Derived Data: Trained to process and generate couplets drawn from Urdu poetry OCR datasets.

📊 Dataset Composition

The model was trained on a multi-domain Urdu corpus to ensure versatility:

| Category | Dataset Source | Description |
|---|---|---|
| Literature | Iqbaliyat & Ghazal Bank | Classical and contemporary poetry analysis. |
| Instruction | Alif-Instruct | Multi-turn Urdu dialogues and logic tasks. |
| Current Affairs | Lughat News | Modern Urdu prose and media vocabulary. |
| Specialized | Urdu-Poetry-OCR | Structural understanding of poetic couplets. |

🛠️ Technical Specifications

  • Base Model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit (the 4-bit bitsandbytes build of Qwen/Qwen2.5-7B-Instruct)
  • Architecture: Causal Language Model (Transformer)
  • Parameters: 7 Billion
  • Training Tool: Unsloth (2x faster finetuning)
  • Hardware: NVIDIA GeForce RTX 4060 Ti 16GB
  • Quantization: 4-bit (bitsandbytes)
  • Checkpoint: 4500 Steps
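
As a rough illustration of why 4-bit loading matters on a 16 GB card, the sketch below estimates weight memory from the nominal 7 billion parameter count listed above. It counts weights only: quantization metadata, layers kept in higher precision, activations, and the KV cache are all ignored, so real usage will be higher. The function name is illustrative and not part of any library.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / (1024 ** 3)


# "7 Billion" from the spec list above; the exact parameter count may differ.
NOMINAL_PARAMS = 7e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

Under these assumptions the 16-bit weights alone come to about 13 GiB, which leaves little headroom on a 16 GB GPU, while the 4-bit weights need a little over 3 GiB.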

💎 Shaheen Highlights

| Feature | Capability |
|---|---|
| Dataset Size | 1.83 Million Urdu Rows |
| Optimization | Unsloth (4-bit LoRA) |
| Primary Focus | Iqbaliyat & Urdu Prose |
| OCR Support | Specialized for Nastaliq script couplets |
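
"4-bit LoRA" typically means the quantized base weights stay frozen while small low-rank adapter matrices are trained on top of them. The card does not state the adapter rank or which layers were targeted, so the rank and layer shape in this sketch are purely hypothetical. It only shows how few trainable parameters an adapter adds relative to the layer it wraps.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA learns an update B @ A, with A of shape (rank, d_in) and
    B of shape (d_out, rank), while the original weight stays frozen.
    """
    return rank * d_in + d_out * rank


# Hypothetical values for illustration only; not taken from the model card.
D_IN, D_OUT, RANK = 4096, 4096, 16

full = D_IN * D_OUT
adapter = lora_param_count(D_IN, D_OUT, RANK)
print(f"full layer:   {full:,} params")
print(f"LoRA adapter: {adapter:,} params ({100 * adapter / full:.2f}% of the layer)")
```

With these example numbers the adapter holds under 1% of the parameters of the layer it modifies, which is what makes fine-tuning a 7B model on a single consumer GPU plausible.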

🚀 Quick Start (Inference)

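from unsloth import FastLanguageModel
import torch

# Load the fine-tuned checkpoint with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Khurram123/Qwen-Urdu-Shaheen-7B-Instruct-v1",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

# Sample prompt: "Summarize Allama Iqbal's philosophy of Khudi (selfhood)."
prompt = "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"
# Wrap the prompt in a ChatML user turn followed by an open assistant turn
inputs = tokenizer([f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"], return_tensors="pt").to("cuda")

# Enable sampling explicitly so that temperature takes effect
outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, without the prompt or special tokens
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0])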
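
The Quick Start writes the ChatML markers by hand for a single user turn. For multi-turn conversations, a small helper along these lines can produce the same format. This is a minimal sketch: `build_chatml_prompt` is a hypothetical name, and it assumes the model expects exactly the turn layout shown in the Quick Start.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]]) -> str:
    """Render a list of {"role", "content"} turns as a ChatML prompt.

    The result ends with an open assistant turn so the model continues from there.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Same single-turn prompt as in the Quick Start above.
messages = [
    {"role": "user", "content": "علامہ اقبال کے فلسفہء خودی کا خلاصہ پیش کریں۔"},
]
prompt_text = build_chatml_prompt(messages)
print(prompt_text)
```

In practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` from transformers does the same job using the template shipped with the tokenizer. Its output may differ from this sketch, for example if the template inserts a default system message.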