---
library_name: transformers
tags:
- automatic-speech-recognition
- audio
- darija
- moroccan-arabic
- whisper
- fine-tuned
---

# Model Card for Whisper Darija (Fine-Tuned)

This is a fine-tuned Whisper model for Moroccan Darija speech transcription, built on the [OpenAI Whisper small](https://huggingface.co/openai/whisper-small) architecture. It transcribes Moroccan dialectal Arabic audio into text.

## Model Details

### Model Description

This model is a fine-tuned version of `giannitto/whisper-morocco-model`, trained on Moroccan Darija audio with matching transcriptions. Fine-tuning aimed to reduce the Word Error Rate (WER) on spoken Darija, which is underrepresented in many multilingual speech models.

- **Developed by:** Bentaleb Ali
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s):** Moroccan Darija (Arabic dialect)
- **License:** Apache 2.0
- **Finetuned from model:** giannitto/whisper-morocco-model

### Model Sources

- **Repository:** https://huggingface.co/TaloCreations/whisper-darija-finetuned

## Uses

### Direct Use

This model is intended for transcription of Moroccan Darija audio into text. It can be used in:
- Voice assistants
- Media subtitling
- Dialectal speech processing
- Linguistic research

### Out-of-Scope Use

- Translation tasks (this model is for transcription, not translation)
- Arabic dialects other than Moroccan Darija

## Bias, Risks, and Limitations

- The model may perform poorly on noisy or low-quality recordings.
- The model may not generalize well to other dialects of Arabic.
- Biases in the training data (e.g., gender, age, region) may affect transcription accuracy.

### Recommendations

Carefully evaluate outputs when using the model in sensitive applications. Avoid using it in high-risk domains without human verification.

## How to Get Started with the Model

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

MODEL_ID = "TaloCreations/whisper-darija-finetuned"

# Load model and processor
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID)
model.eval()

# Load the audio file; torchaudio returns a (channels, samples) tensor
speech, sr = torchaudio.load("path_to_record.wav")

# Whisper expects 16 kHz input, so resample if needed
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    speech = resampler(speech)

# Preprocess the first channel (as a NumPy array) and generate
inputs = processor(speech[0].numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("📢 Transcription:", transcription)
```

## Training Details

### Training Data

The model was trained on:
- [atlasia/DODa-audio-dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset)
- [adiren7/darija_speech_to_text](https://huggingface.co/datasets/adiren7/darija_speech_to_text)

These datasets contain manually transcribed audio samples of Moroccan Darija.

### Training Procedure

#### Preprocessing
- All audio was resampled to 16 kHz
- Mel spectrograms were padded to 3000 frames (30 s maximum)
- Transcripts were tokenized and clipped to at most 448 tokens
- Decoder prompts were injected to ensure language/task alignment
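
The length limits above can be illustrated in plain Python. This is a minimal sketch under assumptions, not the training script: the helper names `pad_or_trim_audio` and `clip_labels` are hypothetical, and the hop length of 160 samples is the standard Whisper feature-extractor setting, which is why 30 s of 16 kHz audio corresponds to 3000 mel frames. The sketch pads raw audio rather than the spectrogram itself; the resulting frame count is the same.

```python
SAMPLE_RATE = 16_000      # Hz, from the preprocessing list above
MAX_SECONDS = 30          # Whisper's fixed input window
HOP_LENGTH = 160          # samples per mel frame (standard Whisper setting)
MAX_LABEL_TOKENS = 448    # transcript length limit from the list above

MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS      # 480000 samples
MAX_FRAMES = MAX_SAMPLES // HOP_LENGTH       # 3000 mel frames


def pad_or_trim_audio(samples, length=MAX_SAMPLES):
    """Zero-pad or truncate a list of audio samples to a fixed length."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def clip_labels(token_ids, max_tokens=MAX_LABEL_TOKENS):
    """Keep at most max_tokens label token ids."""
    return token_ids[:max_tokens]


# Hypothetical 2-second clip and an over-long token sequence
audio = [0.1] * (2 * SAMPLE_RATE)
padded = pad_or_trim_audio(audio)
labels = clip_labels(list(range(500)))
print(MAX_FRAMES, len(padded), len(labels))  # 3000 480000 448
```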

#### Training Hyperparameters
- Batch size: 8 (gradient accumulation = 2)
- Epochs: 10
- Learning rate: 2e-6
- Mixed precision: fp16
- Weight decay: 0.01
- Warmup steps: 500
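
As a rough illustration, the hyperparameters above can be written as a configuration dictionary. The keys mirror parameter names of `Seq2SeqTrainingArguments` in Transformers, but whether training used that class is an assumption, as is reading "Batch size: 8" as the per-device batch size. With gradient accumulation of 2 on a single GPU, each optimizer step would then see 16 examples.

```python
# Hyperparameters from the list above, keyed by the matching
# Seq2SeqTrainingArguments parameter names (an assumption about the setup).
training_config = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 10,
    "learning_rate": 2e-6,
    "fp16": True,
    "weight_decay": 0.01,
    "warmup_steps": 500,
}

NUM_GPUS = 1  # single A100, per the Compute Infrastructure section

# Number of examples contributing to each optimizer step
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
    * NUM_GPUS
)
print(effective_batch_size)  # 16
```

With Transformers installed and a CUDA GPU available, the dictionary could be unpacked as `Seq2SeqTrainingArguments(output_dir="out", **training_config)`; the `output_dir` value is a placeholder.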

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
A held-out subset (10%) of the training datasets.

#### Metrics
- Word Error Rate (WER)
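
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the predicted transcript, divided by the number of reference words. The function below is a minimal, dependency-free sketch of that definition; it is not necessarily the implementation used to produce the numbers in this card, and the example strings are placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder example: one substituted word out of four
print(word_error_rate("a b c d", "a b x d"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.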

### Results

#### 📊 Training Progress

| Epoch | Training Loss | Validation Loss | Word Error Rate (WER) |
|-------|----------------|------------------|------------------------|
| 1     | 0.905000       | 0.831409         | 0.825147               |
| 2     | 0.773200       | 0.712022         | 0.732625               |
| 3     | 0.658900       | 0.652096         | 0.631158               |
| 4     | 0.609100       | 0.608619         | 0.578152               |
| 5     | 0.548400       | 0.579711         | 0.546444               |
| 6     | 0.509700       | 0.561768         | 0.524927               |
| 7     | 0.482000       | 0.551717         | 0.522067               |
| 8     | 0.459400       | 0.545695         | 0.526979               |
| 9     | 0.446500       | 0.543017         | 0.497141               |
| 10    | 0.443200       | 0.542152         | 0.504545               |


#### Summary
After 10 epochs, the validation WER dropped from 82.5% (epoch 1) to 50.5%, with the lowest value, 49.7%, reached at epoch 9. This is a significant improvement over baseline multilingual Whisper models on Moroccan Darija.
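
The figures in the training progress table can be checked with a few lines of Python. This sketch only re-reads the WER column of that table; it finds the epoch with the lowest WER and the relative reduction between the first and last epoch.

```python
# (epoch, validation WER) pairs copied from the training progress table
wer_by_epoch = [
    (1, 0.825147), (2, 0.732625), (3, 0.631158), (4, 0.578152),
    (5, 0.546444), (6, 0.524927), (7, 0.522067), (8, 0.526979),
    (9, 0.497141), (10, 0.504545),
]

best_epoch, best_wer = min(wer_by_epoch, key=lambda pair: pair[1])
first_wer = wer_by_epoch[0][1]
final_wer = wer_by_epoch[-1][1]

# Relative WER reduction between the first and last epoch
relative_reduction = (first_wer - final_wer) / first_wer

print(f"best epoch: {best_epoch} (WER {best_wer:.1%})")
print(f"relative reduction, epoch 1 to 10: {relative_reduction:.1%}")
```

Since the lowest WER occurs at epoch 9 rather than epoch 10, selecting the checkpoint by validation WER instead of taking the last one may be worth considering.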

## Environmental Impact

Estimated based on training on a single A100 GPU for ~6.5 hours.

- **Hardware Type:** A100
- **Hours used:** ~6.5
- **Cloud Provider:** Google Cloud (Colab)
- **Compute Region:** Morocco

## Technical Specifications

### Model Architecture and Objective
- Whisper (small) encoder-decoder architecture
- Objective: sequence-to-sequence transcription

### Compute Infrastructure
- Google Colab Pro
- 1x A100 GPU
- PyTorch + Transformers 4.39

## Citation

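```bibtex
@misc{bentaleb2025whisperdarija,
  title={Whisper Darija: Fine-tuned Whisper Model for Moroccan Arabic Speech},
  author={Bentaleb, Ali},
  year={2025},
  howpublished={\url{https://huggingface.co/TaloCreations/whisper-darija-finetuned}}
}
```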

## Model Card Authors
- Ali Bentaleb [@TaloCreations](https://huggingface.co/TaloCreations)


## Model Card Contact
- 📧 alitennis131800@gmail.com