Initialize project; model provided by the ModelHub XC community
Model: sanganaka/phi4-hindi2sanskrit-anustubh-lora-merged-step3400 Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
253
README.md
Normal file
@@ -0,0 +1,253 @@
---
license: apache-2.0
base_model: unsloth/Phi-4
tags:
- text-generation
- poetry
- hindi
- sanskrit
- anuṣṭubh
- lora
- unsloth
datasets:
- custom
metrics:
- custom
language:
- hi
- sa
pipeline_tag: text-generation
---

# Hindi Prose → Sanskrit Anuṣṭubh Poetry (Phi-4 LoRA)

## Model Overview

This model is fine-tuned to generate Sanskrit Anuṣṭubh poetry from Hindi prose input. It is based on Phi-4 (unsloth) and trained with LoRA fine-tuning.

## About the Anuṣṭubh Meter

Anuṣṭubh is the most widely used Sanskrit poetic meter, forming the basis of the classical śloka. It consists of:

- 32 syllables (akṣaras) in total
- 4 pādas of 8 syllables each

Beyond the syllable count, the meter enforces positional constraints:

- The 5th syllable of each pāda is typically laghu (short)
- The 6th syllable is guru (long)
- The 7th syllable follows stricter rules depending on the pāda's position

A valid Anuṣṭubh verse must satisfy both:

- The exact 32-syllable structure
- The metrical pattern constraints

This makes generation significantly more constrained than standard text-generation tasks.

## Model Details

- Base Model: unsloth/Phi-4
- Fine-tuning: LoRA
- Precision: 4-bit (QLoRA)
- Max Sequence Length: 1024
- Best Checkpoint: checkpoint-3400

## Training Configuration

- Batch size: 8
- Gradient accumulation: 2
- Epochs: 10
- Learning rate: 2e-4
- Weight decay: 0.01
- Warmup ratio: 0.03
- LoRA r: 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Early stop patience: 5
- Early stop threshold: 0.001
- Early stopping triggered at ~2.5 epochs (about 25% of the allowed 10 epochs)

## Requirements

- torch==2.10.0
- transformers==5.3.0
- peft==0.18.1
- unsloth==2026.3.5
- unsloth_zoo==2026.3.4
- xformers==0.0.35
- bitsandbytes==0.49.2
- accelerate==1.13.0

## Dataset

- Train: 28,163 examples
- Validation: 3,130 examples
- Test: 1,648 examples

### Description

**Input:** Hindi prose
**Output:** Sanskrit Anuṣṭubh verse

Example:

---

**PROSE:** राक्षस कुल के आनंद, तुम्हारे कारण लंका की अवस्था और हम सब अब निराश्रित हो गए हैं। अपने कर्मों से तुमने अपने शरीर को गिद्धों द्वारा खाए जाने योग्य और अपनी आत्मा को नरक जाने योग्य बना लिया है।

**POETRY:** अस्माकं च निराशानां लङ्का च तव कारणात् गिद्धभक्ष्यशरीरस्य नरकाय च कर्मभिः ॥

---

## Evaluation

| Setup (Training → Input/Output)               | Full (%)  | Partial (%) | Invalid 32 (%) | Semantic (%) |
| --------------------------------------------- | --------- | ----------- | -------------- | ------------ |
| Ground Truth (DEV Sanskrit + Hindi Prose)     | 99.51     | 99.51       | 0.00           | 74.04        |
| Phi-4 (SLP1 → SLP1+0-shot/SLP1) `[*]` greedy  | 24.76     | 62.44       | 37.68          | 73.23        |
| Phi-4 (DEV → DEV+0-shot/DEV) greedy           | 43.08     | 66.99       | 23.91          | 73.76        |
| Phi-4 (DEV → DEV+3-shot/DEV) greedy           | *50.97*   | **72.81**   | 21.84          | 73.21        |
| Phi-4 (DEV → DEV+6-shot/DEV) greedy           | 46.72     | 69.05       | 22.33          | 72.93        |
| Phi-4 (DEV → DEV+0-shot/DEV) sampling         | 42.90     | 64.68       | 21.78          | 72.71        |
| Phi-4 (DEV → DEV+3-shot/DEV) sampling         | **51.70** | *72.51*     | 20.81          | 72.05        |

The ground-truth row shows the metrics computed directly on the dataset (no model involved).
**Bold** marks the highest value; *italics* mark the second highest.

### Notes

- **Full (%)**: Percentage of outputs with exactly 32 syllables and a valid Anuṣṭubh pattern.
- **Partial (%)**: Percentage of outputs with exactly 32 syllables, regardless of whether the Anuṣṭubh pattern is satisfied.
- **Invalid 32 (%)**: 32-syllable outputs that violate the metrical constraints _(Partial − Full)_.
- **Semantic (%)**: Semantic similarity computed with **'sanganaka/bge-m3-sanskritFT'**.
- `[*]` Evaluation performed in Devanagari (SLP1 outputs converted to DEV).
- All verses are obtained via greedy decoding unless stated otherwise.
- Metrics are computed following [Chandomitra](https://arxiv.org/abs/2506.00815).

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from unsloth.chat_templates import get_chat_template

MAX_NEW_TOKENS = 110
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "sanganaka/phi4-hindi2sanskrit-anustubh-lora-merged-step3400"

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer = get_chat_template(tokenizer, chat_template="phi-4")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)
model.to(device)
model.eval()

print("MODEL LOADED")

ANUSHTUP_INSTRUCTION = """The goal is to generate Sanskrit verse that follows the anushtup meter rules for the given input text.
RULES:
Verse Rules:
The verse contains 32 syllables/akshara and 4 padas in total.
The verse is divided into 2 lines, each containing 16 syllables.
Each line is divided into 2 padas (quartets), each containing exactly 8 syllables.
The fifth syllable of every pada must be LAGHU or short.
The sixth syllable of every pada must be GURU or long.
The seventh syllable of the second and fourth pada must be HRASVA.
The seventh syllable of the first and third pada must be DEERGHA.

Syllable Rules:
LAGHU vowels: अ, इ, उ, ऋ, ऌ
GURU vowels: आ, ई, ऊ, ॠ, ॡ, ए, ऐ, ओ, औ
HRASVA vowels: अ, इ, उ, ऋ, ऌ
DEERGHA vowels: आ, ई, ऊ, ॠ, ॡ, ए, ऐ, ओ, औ

Syllable classification rules:
- A syllable is marked Laghu/Guru and Hrasva/Deergha based on the vowel it contains.
- Any syllable containing anusvāra (ं) or visarga (ः) is always Guru.
- Any syllable followed by a conjunct consonant (saṁyuktākṣara) is always Guru.

Now convert the given Hindi text into a Sanskrit Anushtup verse in Devanagari:"""

def generate(hi_text):
    messages = [
        {"role": "system", "content": ANUSHTUP_INSTRUCTION},
        {"role": "user", "content": "यह समय वाणीकी पहुँचके परे था उसका वर्णन करना कठिन था उस समय कोई भूपाल वहाँ इस विषयमें कुछ भी न बोल सके मौन रह गये वे बारचार केवल श्रीकृष्णके मुखकी ओर देखते रहे ॥"},
        {"role": "assistant", "content": "ततः केचिन्महीपाला नानुवंस्तत्र किंचन अतीतवाक्पथे काले प्रेक्षमाणा जनार्दनम् ॥"},
        {"role": "user", "content": "फिर तो उसने एक दूसरे भयंकर शत्रुको वहाँ आया हुआ देखा, जो सरकण्डेके फूलके समान भूरे रंगका था वह धरतीमें विवर बनाकर उसके भीतर सोया करता था"},
        {"role": "assistant", "content": "अपश्यदपरं घोरमात्मनः शत्रुमागतम् शरप्रसूनसङ्काशं महीविवरशायिनम्॥"},
        {"role": "user", "content": "जो मनुष्य पाण्डुनन्दन अर्जुनके इस चरित्रको प्रतिदिन सुनता है, उसके मनमै पापपूर्ण विषयभोगोंकी इच्छा नहीं होती ॥"},
        {"role": "assistant", "content": "इदं यः शृणुयाद् वृत्तं नित्यं पाण्डुसुतस्य थे न तस्य कामः कामेषु पापकेषु प्रवर्तते ॥"},
        {"role": "user", "content": hi_text},
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(device)

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()


text = "राक्षस कुल के आनंद, तुम्हारे कारण लंका की अवस्था और हम सब अब निराश्रित हो गए हैं। अपने कर्मों से तुमने अपने शरीर को गिद्धों द्वारा खाए जाने योग्य और अपनी आत्मा को नरक जाने योग्य बना लिया है।"
out = generate(text)

print(f"PROSE : {text}\nPOETRY: {out}")
```

**NOTE:** The example above uses the 3-shot setting. For the 0-shot setting, replace the `messages` list with:

```python
messages = [
    {"role": "system", "content": ANUSHTUP_INSTRUCTION},
    {"role": "user", "content": hi_text},
]
```

**NOTE:** The example above uses greedy decoding. For non-deterministic decoding, add these arguments to the `model.generate` call:

```python
do_sample=True,
temperature=0.6,
top_p=0.9,
top_k=50,
```

## Limitations

- Batched inference was observed to degrade output quality, so it is best avoided.
- Greedy and non-greedy decoding scores are broadly similar.
- Meter compliance is not guaranteed.
- Sanskrit grammar may vary.
- Outputs are sensitive to input phrasing.

## Citation

```bibtex
@misc{jagadeeshan2026chandomitrageneratingstructuredsanskrit,
  title={Chandomitra: Towards Generating Structured Sanskrit Poetry from Natural Language Inputs},
  author={Manoj Balaji Jagadeeshan and Samarth Bhatia and Pretam Ray and Harshul Raj Surana and Akhil Rajeev P and Priya Mishra and Annarao Kulkarni and Ganesh Ramakrishnan and Prathosh AP and Pawan Goyal},
  year={2026},
  eprint={2506.00815},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.00815},
}
```
1
chat_template.jinja
Normal file
@@ -0,0 +1 @@
{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
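The template above wraps each message as `<|im_start|>{role}<|im_sep|>{content}<|im_end|>` and optionally appends an open assistant turn. A pure-Python rendering equivalent, for illustration only (the model itself uses the Jinja template via the tokenizer):

```python
# Illustrative pure-Python equivalent of the Jinja chat template above.
def render_phi4(messages, add_generation_prompt=False):
    out = ""
    for m in messages:
        if m["role"] in ("system", "user", "assistant"):
            out += f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
    if add_generation_prompt:
        out += "<|im_start|>assistant<|im_sep|>"
    return out

prompt = render_phi4(
    [{"role": "system", "content": "S"}, {"role": "user", "content": "U"}],
    add_generation_prompt=True,
)
print(prompt)
# <|im_start|>system<|im_sep|>S<|im_end|><|im_start|>user<|im_sep|>U<|im_end|><|im_start|>assistant<|im_sep|>
```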
33
config.json
Normal file
@@ -0,0 +1,33 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 100257,
  "dtype": "bfloat16",
  "eos_token_id": 100265,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 17920,
  "max_position_embeddings": 16384,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 10,
  "original_max_position_embeddings": 16384,
  "pad_token_id": 100351,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 250000,
    "rope_type": "default"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.3.0",
  "use_cache": true,
  "vocab_size": 100352
}
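The attention shapes declared in this config are internally consistent: `hidden_size / num_attention_heads` equals `head_dim`, and the 40 query heads share 10 key/value heads (grouped-query attention). A quick check:

```python
# Sanity-check the attention geometry declared in config.json.
hidden_size, n_heads, n_kv_heads, head_dim = 5120, 40, 10, 128

assert hidden_size // n_heads == head_dim   # 5120 / 40 = 128
assert n_heads % n_kv_heads == 0            # GQA: query heads group evenly
print(n_heads // n_kv_heads)                # 4 query heads per KV head
```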
7
generation_config.json
Normal file
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 100257,
  "eos_token_id": 100265,
  "pad_token_id": 100351,
  "transformers_version": "5.3.0"
}
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:040ff54851e02b5573d38a58d14a20c431cff2e0c0c8e6af626b06771259e2ef
size 29319057192
501285
tokenizer.json
Normal file
File diff suppressed because it is too large
14
tokenizer_config.json
Normal file
@@ -0,0 +1,14 @@
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "is_local": false,
  "model_max_length": 16384,
  "pad_token": "<|dummy_87|>",
  "padding_side": "left",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "�"
}