---
license: mit
language: et
tags:
- audio
- automatic-speech-recognition
#widget:
#- example_title: Librispeech sample 1
#  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
#- example_title: Librispeech sample 2
#  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
base_model:
- openai/whisper-large-v3
library_name: transformers
---

## Introduction

This model is OpenAI Whisper large-v3, finetuned on ~770 hours of manually created subtitles from Estonian TV (ETV).
As a result, it does not always produce verbatim (word-by-word) subtitles: it often rephrases sentences and
compresses the text, especially for spontaneous speech with hesitations, repetitions, etc. However, the length
of the generated text chunks almost always conforms to the ETV subtitle requirements (48 characters per line).

## Usage


It's a finetuned version of Whisper large-v3 and can therefore be used via Hugging Face 🤗 Transformers. To run the model, first install the Transformers
library. For this example, we'll also install 🤗 Accelerate to reduce the model loading time:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate
```

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "TalTechNLP/whisper-large-v3-et-subs"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

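# Path to a local audio file to transcribe
audio = "sample.mp3"

# Force Estonian transcription instead of relying on language detection
result = pipe(audio, generate_kwargs={"task": "transcribe", "language": "et"})
print(result)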
```
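
Since the model is meant for subtitling, it may be useful to turn the output into a subtitle file. The following is a minimal sketch, not part of the original model card: it assumes the pipeline is called with `return_timestamps=True`, so that the result contains a `chunks` list of `{"timestamp": (start, end), "text": ...}` entries, and that this finetuned model still emits timestamps the way the base Whisper model does. The helper names `format_timestamp` and `chunks_to_srt` are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks: list) -> str:
    """Convert pipeline chunks into the text of an SRT subtitle file.

    Each chunk is assumed to look like
    {"timestamp": (start_seconds, end_seconds), "text": "..."}.
    """
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # The final chunk may have no end time (None) if no closing
        # timestamp was predicted; fall back to the start time.
        if end is None:
            end = start
        text = chunk["text"].strip()
        entries.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    return "\n".join(entries)
```

With the `pipe` and `audio` objects from the example above, this could be used as `result = pipe(audio, return_timestamps=True, generate_kwargs={"task": "transcribe", "language": "et"})` followed by `print(chunks_to_srt(result["chunks"]))`.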

## Citation

```
@inproceedings{fedorchenko-2025-optimizing,
    title = "Optimizing Estonian {TV} Subtitles with Semi-supervised Learning and {LLMs}",
    author = {Fedorchenko, Artem and Alum{\"a}e, Tanel},
    booktitle = "Proceedings of the 25th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    year = "2025"
}
```