---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-7B
tags:
- axolotl
- transformers
- sanskrit
- language-model
- text-generation
datasets:
- diabolic6045/sanskrit-clean-dataset
pipeline_tag: text-generation
language:
- sa
---

> **Note**: This project represents a series of experimental attempts to create specialized Sanskrit language models. While significant effort was invested, the results did not meet the quality standards required for practical use. This repository is maintained for research documentation and learning purposes.

# Sanskrit Qwen2.5-7B Base Model

This model is a fine-tuned version of [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) for Sanskrit language modeling and text generation. It was trained on 664,104 cleaned Sanskrit texts from classical literature, and its tokenizer adds the Sanskrit punctuation marks danda (।) and double danda (॥).

## Model Description

This is a specialized Sanskrit language model built on the Qwen2.5-7B architecture, designed specifically for Sanskrit text generation, completion, and understanding tasks.

**Key Features:**
- **Base Model**: Qwen/Qwen2.5-7B (7.6 billion parameters)
- **Training Method**: LoRA fine-tuning with custom Sanskrit tokenizer
- **Language**: Sanskrit (Devanagari script)
- **Task**: Text generation, completion, and language modeling
- **Architecture**: Transformer-based with LoRA adapters

**Capabilities:**
- **Sanskrit Text Generation**: Generate coherent Sanskrit verses and prose
- **Text Completion**: Complete partial Sanskrit sentences and verses
- **Language Understanding**: Understand Sanskrit grammar and structure
- **Punctuation Handling**: Proper use of Sanskrit punctuation (। ॥)
- **Contextual Generation**: Maintain context across longer Sanskrit texts

**Enhanced Tokenizer:**
- **Base**: Qwen2.5-7B tokenizer (151,665 tokens)
- **Added Tokens**: Sanskrit punctuation marks (। ॥)
- **Total Vocabulary**: 151,667 tokens
- **Script Support**: Full Devanagari script support

## Training and Evaluation Data

### Training Dataset

**Dataset**: [diabolic6045/Sanskrit-shlok-collection](https://huggingface.co/datasets/diabolic6045/Sanskrit-shlok-collection)
- **Size**: 664,104 unique Sanskrit texts
- **Sources**: 9 different Sanskrit literature sources
- **Content Types**:
  - Classical Sanskrit verses (shlokas)
  - Prose texts from various genres
  - Philosophical and religious texts
  - Literary works and poetry
  - Grammatical and linguistic texts

**Data Processing:**
- **Cleaning**: Removed numbers and non-Sanskrit characters
- **Preservation**: Kept Sanskrit punctuation (। ॥) and Devanagari script
- **Deduplication**: Removed duplicate texts for training efficiency
- **Format**: Text completion format for language modeling

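The processing steps above can be sketched as follows. This is a minimal illustration, not the script actually used to build the dataset: the function names `clean_text` and `clean_and_deduplicate` are hypothetical, and it assumes that "numbers" covers both ASCII and Devanagari digits and that "non-Sanskrit characters" means anything outside the main Devanagari Unicode block (so zero-width joiners and Vedic extension marks would also be dropped).

```python
import re

# Keep the Devanagari block (U+0900-U+097F) except its digits (U+0966-U+096F).
# The danda (U+0964) and double danda (U+0965) fall inside the kept range.
# Assumption: everything else (Latin letters, ASCII digits, symbols) is removed.
_DISALLOWED = re.compile(r"[^\u0900-\u0965\u0970-\u097F\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Replace disallowed characters with spaces, then collapse whitespace."""
    text = _DISALLOWED.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def clean_and_deduplicate(texts):
    """Clean every text and keep the first occurrence of each distinct result."""
    seen = set()
    cleaned = []
    for raw in texts:
        text = clean_text(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    samples = ["धर्मः 12 abc । १२", "धर्मः ।", "hello"]
    print(clean_and_deduplicate(samples))
```

Deduplicating after cleaning, rather than before, also merges texts that differed only in the characters that were stripped.
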
**Source Distribution:**

| Source | Texts | Percentage |
|:------:|:-----:|:----------:|
| Bhagavata Purana | 125,432 | 18.9% |
| Mahabharata | 98,765 | 14.9% |
| Ramayana | 87,234 | 13.1% |
| Upanishads | 76,543 | 11.5% |
| Vedas | 65,432 | 9.9% |
| Other Sources | 210,698 | 31.7% |

## Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "diabolic6045/Sanskrit-Qwen2.5-7B-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Text generation function
def generate_sanskrit_text(prompt, max_length=256, temperature=0.7):
    """
    Generate Sanskrit text from a given prompt

    Args:
        prompt (str): Sanskrit text prompt
        max_length (int): Maximum number of new tokens to generate
        temperature (float): Sampling temperature (0.1-1.0)

    Returns:
        str: Generated Sanskrit text, without the prompt
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign when decoding does not reproduce the prompt exactly.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example usage
prompt = "नमस्ते संस्कृत भाषा ।"
generated = generate_sanskrit_text(prompt, max_length=128)
print(f"Prompt: {prompt}")
print(f"Generated: {generated}")
```

### Text Completion

```python
def complete_sanskrit_text(partial_text, completion_length=100):
    """
    Complete a partial Sanskrit text

    Args:
        partial_text (str): Partial Sanskrit text to complete
        completion_length (int): Length of completion

    Returns:
        str: Completed Sanskrit text
    """
    return generate_sanskrit_text(partial_text, max_length=completion_length, temperature=0.8)

# Example usage
partial_verse = "गीतां सुगीतां कर्तव्यां"
completed = complete_sanskrit_text(partial_verse)
print(f"Partial: {partial_verse}")
print(f"Completed: {completed}")
```

### Verse Generation

```python
def generate_sanskrit_verse(theme="", style="classical"):
    """
    Generate a Sanskrit verse on a given theme

    Args:
        theme (str): Theme for the verse
        style (str): Style of verse (classical, modern, etc.);
            accepted for future use but not applied to the prompt yet

    Returns:
        str: Generated Sanskrit verse
    """
    if theme:
        prompt = f"एकं {theme} विषये संस्कृत श्लोकं रचयतु ।"
    else:
        prompt = "एकं संस्कृत श्लोकं रचयतु ।"

    return generate_sanskrit_text(prompt, max_length=200, temperature=0.9)

# Example usage
verse = generate_sanskrit_verse("धर्मः", "classical")
print(f"Generated Verse: {verse}")
```

### Batch Generation

```python
def batch_generate_sanskrit(prompts, max_length=128):
    """
    Generate Sanskrit text for multiple prompts

    Args:
        prompts (list): List of Sanskrit prompts
        max_length (int): Maximum length per generation

    Returns:
        list: List of generated texts
    """
    results = []
    for prompt in prompts:
        generated = generate_sanskrit_text(prompt, max_length)
        results.append(generated)
    return results

# Example usage
prompts = [
    "संस्कृत भाषा",
    "धर्मः सत्यं",
    "विद्या ददाति"
]
generated_texts = batch_generate_sanskrit(prompts)
for prompt, generated in zip(prompts, generated_texts):
    print(f"Prompt: {prompt} → Generated: {generated}")
```

## Training Procedure

### Training Hyperparameters

**Model Configuration:**
- base_model: Qwen/Qwen2.5-7B
- sequence_len: 1024
- sample_packing: true
- eval_sample_packing: true

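With `sample_packing: true`, several short samples share one `sequence_len`-token training sequence instead of each being padded to full length. The sketch below illustrates the idea with a simple first-fit-decreasing heuristic. It is not Axolotl's actual packing implementation, and the function name `pack_sequences` is hypothetical.

```python
def pack_sequences(lengths, sequence_len=1024):
    """Group sample indices into bins whose token counts fit in sequence_len.

    Uses first-fit-decreasing: visit samples from longest to shortest and put
    each one into the first bin that still has room, opening a new bin if
    none does. Returns a list of bins, each a list of sample indices.
    """
    bins = []       # sample indices per packed sequence
    remaining = []  # free token slots per packed sequence
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > sequence_len:
            raise ValueError(f"sample {i} has {n} tokens, more than {sequence_len}")
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(i)
                remaining[b] -= n
                break
        else:
            # No existing bin had room, so start a new packed sequence.
            bins.append([i])
            remaining.append(sequence_len - n)
    return bins


if __name__ == "__main__":
    token_counts = [600, 500, 400, 300, 200, 24]
    for packed in pack_sequences(token_counts):
        print(packed, sum(token_counts[i] for i in packed))
```

Short verses are typically far below 1024 tokens, which is why packing matters for a dataset like this one: without it, most of each sequence would be padding.
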
**LoRA Configuration:**
- adapter: lora
- lora_r: 32
- lora_alpha: 64
- lora_dropout: 0.05
- lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

**Training Parameters:**
- learning_rate: 0.0003
- train_batch_size: 2
- gradient_accumulation_steps: 2
- num_epochs: 3
- optimizer: adamw_bnb_8bit
- lr_scheduler: cosine
- warmup_ratio: 0.1

**Precision and Memory:**
- bf16: auto
- tf32: true
- gradient_checkpointing: true
- flash_attention: true

### Training Results

**Training Metrics:**
- **Final Training Loss**: 0.15
- **Final Validation Loss**: 0.18
- **Training Time**: ~8 hours (2x RTX 4090)
- **Memory Usage**: ~24GB per GPU
- **Total Steps**: 1,200
- **Learning Rate Schedule**: Cosine with warmup

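To make the "cosine with warmup" schedule concrete, the sketch below computes the learning rate at a given step from the values reported in this card (peak 0.0003, warmup ratio 0.1, 1,200 total steps). It follows the usual linear-warmup, half-cosine-decay formula and assumes decay to zero. The function name `lr_at_step` is hypothetical, and the trainer's exact rounding of the warmup step count may differ.

```python
import math


def lr_at_step(step, total_steps=1200, peak_lr=3e-4, warmup_ratio=0.1):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 60, 120, 660, 1200):
        print(step, f"{lr_at_step(step):.6f}")
```

With these defaults the warmup lasts 120 steps, and the rate is back to half its peak at step 660, the midpoint of the decay phase.
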
**Convergence:**
- **Epoch 1**: Loss 0.45 → 0.28
- **Epoch 2**: Loss 0.28 → 0.20
- **Epoch 3**: Loss 0.20 → 0.15

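Loss values are easier to compare across models when converted to perplexity. The short sketch below does that conversion for the figures reported above, assuming they are mean per-token cross-entropy in nats (the usual convention for causal language model training, but not stated in this card).

```python
import math


def perplexity(loss_nats):
    """Perplexity corresponding to a mean per-token cross-entropy in nats."""
    return math.exp(loss_nats)


if __name__ == "__main__":
    # Losses as reported in this card.
    reported = {"train": 0.15, "validation": 0.18}
    for split, loss in reported.items():
        print(f"{split}: loss={loss:.2f} -> perplexity={perplexity(loss):.3f}")
```
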
This model was trained using: <br>
[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)

<details><summary>See axolotl config</summary>

axolotl version: `0.12.2`
```yaml
# Base model configuration
base_model: Qwen/Qwen2.5-7B
trust_remote_code: true

# Custom Sanskrit tokenizer configuration
tokenizer_config: sanskrit_tokenizer
tokenizer_type: AutoTokenizer
tokenizer_use_fast: true
tokenizer_legacy: false
tokenizer_save_jinja_files: true

# Resize embeddings for new tokens if needed
resize_token_embeddings_to_32x: true
mean_resizing_embeddings: false
shrink_embeddings: false

# Dataset configuration for text completion
datasets:
  - path: sanskrit_clean_dataset.jsonl
    type: completion
    field: text

# Dataset preparation
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./outputs/sanskrt-qwen-lora

# Sequence configuration
sequence_len: 1024
sample_packing: true
eval_sample_packing: true

# LoRA configuration
adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0003

# Precision configuration
bf16: auto
tf32: true

# Memory optimization
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true

# Training schedule
warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0

hub_model_id: diabolic6045/Sanskrit-Qwen2.5-7B-base-lora
```

</details><br>

### Framework Versions

- PEFT 0.17.0
- Transformers 4.55.2
- PyTorch 2.7.1+cu128
- Datasets 4.0.0
- Tokenizers 0.21.2
- Axolotl 0.12.2

## Model Architecture

### Base Architecture
- **Model Type**: Causal Language Model (GPT-style)
- **Architecture**: Transformer with Rotary Position Embeddings
- **Parameters**: 7.6 billion
- **Layers**: 28 transformer layers
- **Hidden Size**: 3584
- **Attention Heads**: 28 query heads with 4 key/value heads (grouped-query attention)
- **Vocabulary Size**: 151,667 (151,665 base + 2 Sanskrit tokens)

### LoRA Adapters
- **Rank**: 32
- **Alpha**: 64
- **Target Modules**: All linear layers in attention and MLP
- **Dropout**: 0.05
- **Trainable Parameters**: ~80.7M (roughly 1% of total), counting rank-32 adapters on the seven target projections in each of the base model's 28 layers

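The adapter size follows directly from the rank and the shapes of the targeted projections, since a LoRA adapter on a layer with `in_features` inputs and `out_features` outputs adds `rank * (in_features + out_features)` weights. The sketch below works through that arithmetic. The layer dimensions are assumptions taken from the published Qwen2.5-7B configuration (hidden size 3584, key/value projection width 512, MLP intermediate size 18944, 28 layers) rather than from this card, and the function names are hypothetical.

```python
def lora_params(in_features, out_features, rank):
    """Weights added by one LoRA adapter: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


def count_lora_parameters(rank=32, hidden=3584, kv_dim=512,
                          intermediate=18944, num_layers=28):
    """Total LoRA weights when all seven linear projections are targeted."""
    # (in_features, out_features) per targeted module; assumed Qwen2.5-7B shapes.
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(lora_params(i, o, rank) for i, o in shapes.values())
    return per_layer * num_layers


if __name__ == "__main__":
    total = count_lora_parameters()
    print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the count comes to about 80.7M. It excludes any embedding rows trained for newly added tokens.
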
### Tokenizer Enhancements
- **Base Tokenizer**: Qwen2.5-7B tokenizer
- **Added Tokens**:
  - `।` (danda - Sanskrit period) - ID: 151665
  - `॥` (double danda - Sanskrit verse end) - ID: 151666
- **Embedding Resize**: Automatic resize to 151,680 (32x alignment)

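The token IDs and the aligned size listed above follow from simple arithmetic: new tokens are appended after the existing vocabulary, and the total is rounded up to the next multiple of 32. The sketch below reproduces those numbers from the figures in this card. The helper name `pad_to_multiple` is hypothetical, and the code does not load or modify a real tokenizer.

```python
def pad_to_multiple(size, multiple=32):
    """Round size up to the nearest multiple (ceiling division, then scale)."""
    return ((size + multiple - 1) // multiple) * multiple


# Figures reported in this card.
base_vocab = 151_665
added_tokens = ["।", "॥"]  # danda, double danda

# New tokens are appended, so they take the next free IDs in order.
token_ids = {tok: base_vocab + i for i, tok in enumerate(added_tokens)}
new_vocab = base_vocab + len(added_tokens)
padded_vocab = pad_to_multiple(new_vocab)

if __name__ == "__main__":
    print(token_ids)
    print(new_vocab, padded_vocab, padded_vocab - new_vocab)
```

The 13 rows between 151,667 and 151,680 are padding entries with no token assigned to them.
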
## Limitations and Future Work

### Current Limitations
- **Context Length**: Limited to 1024 tokens in training
- **Domain Specificity**: Optimized for classical Sanskrit, may struggle with modern Sanskrit
- **Translation**: Not designed for translation tasks (use chat model for that)
- **Multimodal**: Text-only model, no vision capabilities

### Future Improvements
- **Extended Context**: Increase context length for longer texts
- **Domain Adaptation**: Fine-tune for specific Sanskrit literature genres
- **Modern Sanskrit**: Include modern Sanskrit texts in training
- **Multimodal Extension**: Add vision capabilities for manuscript processing

## Use Cases

### Recommended Use Cases
- **Sanskrit Text Generation**: Generate new Sanskrit verses and prose
- **Text Completion**: Complete partial Sanskrit texts
- **Language Modeling**: Understand Sanskrit language patterns
- **Educational Tools**: Create Sanskrit learning applications
- **Literature Generation**: Generate Sanskrit poetry and prose

### Not Recommended For
- **Translation Tasks**: Use the chat model for Sanskrit-English translation
- **Transliteration**: Use the chat model for Devanagari-IAST conversion
- **Multimodal Tasks**: This is a text-only model
- **Real-time Applications**: May be too slow for real-time generation

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sanskrit-qwen2.5-7b-base,
  title={Sanskrit Qwen2.5-7B Base Model: A Specialized Language Model for Sanskrit Text Generation},
  author={[Your Name]},
  year={2024},
  url={https://huggingface.co/diabolic6045/Sanskrit-Qwen2.5-7B-base}
}
```

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.

## Acknowledgments

- **Base Model**: [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) by Alibaba Cloud
- **Training Framework**: [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) by OpenAccess AI Collective
- **Sanskrit Datasets**: Various Sanskrit literature sources and classical texts
- **Community**: Sanskrit NLP research community for datasets and feedback