---
library_name: transformers
tags:
- automatic-speech-recognition
- audio
- darija
- moroccan-arabic
- whisper
- fine-tuned
---

# Model Card for Whisper Darija (Fine-Tuned)

This is an [OpenAI Whisper small](https://huggingface.co/openai/whisper-small)-based model fine-tuned for Moroccan Darija speech transcription. It transcribes Moroccan dialectal Arabic from audio.

## Model Details

### Model Description

This model is a fine-tuned version of `giannitto/whisper-morocco-model`, trained on a dataset of Moroccan Darija audio and transcriptions. Fine-tuning aimed to reduce the model's Word Error Rate (WER) on spoken Darija, which is underrepresented in many multilingual speech models.

- **Developed by:** Bentaleb Ali
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s):** Moroccan Darija (Arabic dialect)
- **License:** Apache 2.0
- **Finetuned from model:** giannitto/whisper-morocco-model

### Model Sources

- **Repository:** https://huggingface.co/TaloCreations/whisper-darija-finetuned

## Uses

### Direct Use

This model is intended for transcribing Moroccan Darija audio into text. It can be used in:
- Voice assistants
- Media subtitling
- Dialectal speech processing
- Linguistic research

### Out-of-Scope Use

- Translation tasks (this model is for transcription, not translation)
- Transcription of Arabic dialects other than Moroccan Darija

## Bias, Risks, and Limitations

- The model may perform poorly on noisy or low-quality recordings.
- The model may not generalize well to other dialects of Arabic.
- Biases in the training data (e.g., gender, age, region) may affect transcription accuracy.

### Recommendations

Carefully evaluate outputs when using the model in sensitive applications. Avoid using it in high-risk domains without human verification.

## How to Get Started with the Model

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import torchaudio

# Load model and processor
processor = AutoProcessor.from_pretrained("TaloCreations/whisper-darija-finetuned")
model = AutoModelForSpeechSeq2Seq.from_pretrained("TaloCreations/whisper-darija-finetuned")
model.eval()

# Load the recording; torchaudio returns a (channels, samples) tensor
speech, sr = torchaudio.load("path_to_record.wav")

# The model expects 16 kHz audio, so resample if needed
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    speech = resampler(speech)

# Preprocess the first channel and generate
inputs = processor(speech[0].numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("📢 Transcription:", transcription)
```

## Training Details

### Training Data

The model was trained on:
- [atlasia/DODa-audio-dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset)
- [adiren7/darija_speech_to_text](https://huggingface.co/datasets/adiren7/darija_speech_to_text)

These datasets contain manually transcribed audio samples of Moroccan Darija.

### Training Procedure

#### Preprocessing
- All audio was resampled to 16 kHz
- Mel spectrograms were padded to 3000 frames (30 s maximum)
- Transcripts were tokenized and clipped to at most 448 tokens
- Decoder prompts were injected to ensure language/task alignment

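The 3000-frame figure follows from Whisper's standard feature-extraction settings: 16 kHz audio, a 30-second window, and a hop of 160 samples between mel frames. The sketch below works through that arithmetic and the pad-or-trim and clipping steps, using plain Python lists in place of real audio. The names `pad_or_trim` and `clip_labels` are illustrative and are not taken from this model's training code.

```python
SAMPLE_RATE = 16_000   # Hz, the rate all audio is resampled to
CHUNK_SECONDS = 30     # Whisper's fixed input window
HOP_LENGTH = 160       # samples between mel frames (standard Whisper setting)

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples per window
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3000 mel frames per window
MAX_LABEL_TOKENS = 448                   # transcript limit from the list above


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad or truncate a mono waveform to exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def clip_labels(token_ids, max_len=MAX_LABEL_TOKENS):
    """Keep at most `max_len` label tokens."""
    return token_ids[:max_len]


# A 5-second clip is padded up to the full 30-second window
five_seconds = [0.1] * (5 * SAMPLE_RATE)
padded = pad_or_trim(five_seconds)
print(N_FRAMES, len(padded))
```

In practice the Whisper feature extractor in Transformers performs the padding and mel computation itself; the sketch only makes the numbers in the list concrete.
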
#### Training Hyperparameters
- Batch size: 8 (gradient accumulation = 2)
- Epochs: 10
- Learning rate: 2e-6
- Mixed precision: fp16
- Weight decay: 0.01
- Warmup steps: 500

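For readers reproducing the run with the Transformers `Seq2SeqTrainer`, the settings above map roughly onto `Seq2SeqTrainingArguments` keyword arguments as sketched below. This is not the author's training script: the mapping is an assumption, and it treats the batch size of 8 as a per-device value.

```python
# Keyword arguments mirroring the hyperparameter list above. With transformers
# installed they could be passed as
# Seq2SeqTrainingArguments(output_dir=..., **training_kwargs).
training_kwargs = {
    "per_device_train_batch_size": 8,  # assumption: 8 is the per-device size
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 10,
    "learning_rate": 2e-6,
    "fp16": True,                      # mixed precision; requires a GPU
    "weight_decay": 0.01,
    "warmup_steps": 500,
}

# Samples contributing to each optimizer step on a single GPU
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 16
```

Under that assumption, each optimizer step sees 16 samples on the single A100.
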
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
A held-out 10% subset of the training datasets.

#### Metrics
- Word Error Rate (WER)

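WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The card does not say which implementation was used for the reported numbers; the function below is a dependency-free sketch of the standard definition, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("two" -> "too") and one deletion ("four") over 4 words
print(word_error_rate("one two three four", "one too three"))  # 0.5
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.
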
### Results

#### 📊 Training Progress

| Epoch | Training Loss | Validation Loss | Word Error Rate (WER) |
|-------|---------------|-----------------|-----------------------|
| 1     | 0.905000      | 0.831409        | 0.825147              |
| 2     | 0.773200      | 0.712022        | 0.732625              |
| 3     | 0.658900      | 0.652096        | 0.631158              |
| 4     | 0.609100      | 0.608619        | 0.578152              |
| 5     | 0.548400      | 0.579711        | 0.546444              |
| 6     | 0.509700      | 0.561768        | 0.524927              |
| 7     | 0.482000      | 0.551717        | 0.522067              |
| 8     | 0.459400      | 0.545695        | 0.526979              |
| 9     | 0.446500      | 0.543017        | 0.497141              |
| 10    | 0.443200      | 0.542152        | 0.504545              |

#### Summary
After 10 epochs, the model reached a WER of about 50% (0.505 at epoch 10, with the lowest value, 0.497, at epoch 9), down from 0.825 after the first epoch. This is a significant improvement over baseline multilingual Whisper models on Moroccan Darija.

## Environmental Impact

The figures below are based on training on a single A100 GPU for about 6.5 hours.

- **Hardware Type:** A100
- **Hours used:** ~6.5
- **Cloud Provider:** Google Cloud (Colab)
- **Compute Region:** Morocco

## Technical Specifications

### Model Architecture and Objective
- Whisper (small) encoder-decoder architecture
- Objective: sequence-to-sequence transcription

### Compute Infrastructure
- Google Colab Pro
- 1x A100 GPU
- PyTorch + Transformers 4.39

## Citation

```bibtex
@misc{bentaleb2025whisperdarija,
  title={Whisper Darija: Fine-tuned Whisper Model for Moroccan Arabic Speech},
  author={Bentaleb, Ali},
  year={2025},
}
```

## Model Card Authors
- Ali Bentaleb [@TaloCreations](https://huggingface.co/TaloCreations)

## Model Card Contact
- 📧 alitennis131800@gmail.com