---
library_name: transformers
tags:
- automatic-speech-recognition
- audio
- darija
- moroccan-arabic
- whisper
- fine-tuned
---

# Model Card for Whisper Darija (Fine-Tuned)

This is an [OpenAI Whisper small model](https://huggingface.co/openai/whisper-small) fine-tuned for Moroccan Darija speech transcription. It is trained to transcribe Moroccan dialectal Arabic from audio.

## Model Details

### Model Description

This model is a fine-tuned version of `giannitto/whisper-morocco-model`, trained on a dataset of Moroccan Darija audio and transcriptions. Fine-tuning aimed to lower the Word Error Rate (WER) on spoken Darija, which is underrepresented in many multilingual speech models.

- **Developed by:** Bentaleb Ali
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s):** Moroccan Darija (Arabic dialect)
- **License:** Apache 2.0
- **Finetuned from model:** giannitto/whisper-morocco-model

### Model Sources

- **Repository:** https://huggingface.co/TaloCreations/whisper-darija-finetuned

## Uses

### Direct Use

This model is intended for transcribing Moroccan Darija audio into text. It can be used in:

- Voice assistants
- Media subtitling
- Dialectal speech processing
- Linguistic research

### Out-of-Scope Use

- Translation tasks (this model is for transcription, not translation)
- Arabic dialects other than Moroccan Darija

## Bias, Risks, and Limitations

- The model may perform poorly on noisy or low-quality recordings.
- The model may not generalize well to other dialects of Arabic.
- Biases in the training data (e.g., gender, age, region) may affect transcription accuracy.

### Recommendations

Carefully evaluate outputs when using the model in sensitive applications. Avoid using it in high-risk domains without human verification.

## How to Get Started with the Model

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import torchaudio

# Load model and processor
processor = AutoProcessor.from_pretrained("TaloCreations/whisper-darija-finetuned")
model = AutoModelForSpeechSeq2Seq.from_pretrained("TaloCreations/whisper-darija-finetuned")
model.eval()

# Load the recording and resample it to the 16 kHz rate the processor expects
speech, sr = torchaudio.load("path_to_record.wav")
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    speech = resampler(speech)

# Preprocess the first channel (as a NumPy array) and generate token IDs
inputs = processor(speech[0].numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)

# Decode the token IDs back to text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("📢 Transcription:", transcription)
```

## Training Details

### Training Data

The model was trained on:

- [atlasia/DODa-audio-dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset)
- [adiren7/darija_speech_to_text](https://huggingface.co/datasets/adiren7/darija_speech_to_text)

These datasets contain manually transcribed audio samples of Moroccan Darija.

### Training Procedure

#### Preprocessing

- All audio was resampled to 16 kHz
- Mel spectrograms were padded to 3000 frames (30 s max)
- Transcripts were tokenized and truncated to at most 448 tokens
- Decoder prompts were injected to ensure language/task alignment

#### Training Hyperparameters

- Batch size: 8 (gradient accumulation = 2)
- Epochs: 10
- Learning rate: 2e-6
- Mixed precision: fp16
- Weight decay: 0.01
- Warmup steps: 500

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

A held-out subset (10%) of the training datasets.

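
The card does not say how the 10% subset was drawn or which WER implementation was used, so the following is only a minimal sketch under stated assumptions: a seeded random shuffle for the split, and WER computed as word-level edit distance divided by the number of reference words. The helper names `split_held_out` and `word_error_rate`, the seed, and the placeholder examples are hypothetical and are not taken from the training code.

```python
import random


def split_held_out(examples, held_out_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and return (train, held_out) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Placeholder (audio path, transcript) pairs standing in for the real datasets.
examples = [(f"clip_{i}.wav", f"transcript {i}") for i in range(100)]
train, held_out = split_held_out(examples)
print(len(train), len(held_out))               # 90 10
print(word_error_rate("a b c d", "a x c"))     # 0.5 (one substitution, one deletion)
```

In practice the score would be accumulated over every held-out clip, comparing each reference transcript with the model's output. Text normalization choices (diacritics, punctuation, spelling variants) can change the WER noticeably, and the card does not state which were applied.
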
#### Metrics

- Word Error Rate (WER)

### Results

### 📊 Training Progress

| Epoch | Training Loss | Validation Loss | Word Error Rate (WER) |
|-------|---------------|-----------------|-----------------------|
| 1     | 0.905000      | 0.831409        | 0.825147              |
| 2     | 0.773200      | 0.712022        | 0.732625              |
| 3     | 0.658900      | 0.652096        | 0.631158              |
| 4     | 0.609100      | 0.608619        | 0.578152              |
| 5     | 0.548400      | 0.579711        | 0.546444              |
| 6     | 0.509700      | 0.561768        | 0.524927              |
| 7     | 0.482000      | 0.551717        | 0.522067              |
| 8     | 0.459400      | 0.545695        | 0.526979              |
| 9     | 0.446500      | 0.543017        | 0.497141              |
| 10    | 0.443200      | 0.542152        | 0.504545              |

#### Summary

After 10 epochs, the model achieved a WER of about 50% (0.497 at epoch 9, its lowest, and 0.505 at epoch 10), down from 0.825 after the first epoch. This is a significant improvement over baseline multilingual Whisper models on Moroccan Darija.

## Environmental Impact

Estimated based on training on a single A100 GPU for ~6.5 hours.

- **Hardware Type:** A100
- **Hours used:** ~6.5
- **Cloud Provider:** Google Cloud (Colab)
- **Compute Region:** Morocco

## Technical Specifications

### Model Architecture and Objective

- Whisper (small) encoder-decoder architecture
- Objective: sequence-to-sequence transcription

### Compute Infrastructure

- Google Colab Pro
- 1x A100 GPU
- PyTorch + Transformers 4.39

## Citation

```bibtex
@misc{bentaleb2025whisperdarija,
  title={Whisper Darija: Fine-tuned Whisper Model for Moroccan Arabic Speech},
  author={Bentaleb, Ali},
  year={2025},
}
```

## Model Card Authors

- Ali Bentaleb [@TaloCreations](https://huggingface.co/TaloCreations)

## Model Card Contact

- 📧 alitennis131800@gmail.com