---
datasets:
- bond005/taiga_speech_v2
- bond005/podlodka_speech
- bond005/rulibrispeech
language:
- ru
license: apache-2.0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
- example_title: Нейронные сети - это хорошо!
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_ru.flac
- example_title: К сожалению, система распознавания речи не всегда стабильна, особенно
    в шумных условиях.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_with_noise.wav
- example_title: Мимо театра мальчик ходил довольно часто — белое, со взбитыми сливками,
    здание-торт.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/anna_matveeva_test.wav
model-index:
- name: Whisper Large V3 Russian Podlodka by Ivan Bondarenko
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Podlodka.io
      type: bond005/podlodka_speech
      args: ru
    metrics:
    - type: wer
      value: 20.91
      name: WER (with punctuation and capital letters)
    - type: wer
      value: 10.987
      name: WER (without punctuation)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - type: wer
      value: 9.795
      name: WER (without punctuation)
---

# Whisper Large V3 Russian Podlodka

This repository contains a fine-tuned Whisper Large V3 model for Russian speech recognition. It serves as the core transcription component of the **Pisets** system, specifically optimized for long audio recordings such as lectures and interviews.

The model was presented in the paper [Pisets: A Robust Speech Recognition System for Lectures and Interviews](https://huggingface.co/papers/2601.18415).
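
A minimal usage sketch, assuming the checkpoint is published on the Hub as `bond005/whisper-large-v3-ru-podlodka` (the repository name that appears in the widget URLs of this card) and that the standard `transformers` speech recognition pipeline applies to it. The 30-second chunk length and the generation options are illustrative choices rather than settings taken from the paper, and the helper names `build_recognizer` and `transcribe` are hypothetical.

```python
from typing import Any, Callable, Optional

# Assumed Hub id, inferred from the widget URLs in this model card.
MODEL_ID = "bond005/whisper-large-v3-ru-podlodka"


def build_recognizer(device: str = "cpu") -> Callable[..., Any]:
    """Create a transformers ASR pipeline for the assumed checkpoint."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        device=device,
        chunk_length_s=30,  # illustrative: process long audio in 30 s windows
    )


def transcribe(audio_path: str, recognizer: Optional[Callable[..., Any]] = None) -> str:
    """Transcribe one audio file and return the recognized text."""
    if recognizer is None:
        recognizer = build_recognizer()
    result = recognizer(
        audio_path,
        # Pin the language and task so Whisper does not try to guess them.
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
    return result["text"].strip()
```

Calling `transcribe("lecture.wav")` (a placeholder path) would download the checkpoint on first use. This runs Whisper on its own; it does not reproduce the full Pisets system.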

## System Architecture

The Pisets system implements a three-component architecture to improve recognition accuracy while minimizing hallucinations:
1. **Wav2Vec2**: For primary recognition and segmentation.
2. **Audio Spectrogram Transformer (AST)**: For filtering non-speech segments.
3. **Whisper (this model)**: For the final high-quality transcription.
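
The ordering of the three components can be sketched as a small function that takes each stage as a callable. This is a schematic under stated assumptions, not the Pisets implementation: the `Segment` type, the function `run_three_stage`, and the three callable parameters are hypothetical names, and how the real system passes data between stages is not described in this card.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str = ""


def run_three_stage(
    audio: object,
    segment_audio: Callable[[object], List[Segment]],
    is_speech: Callable[[object, Segment], bool],
    transcribe_segment: Callable[[object, Segment], str],
) -> List[Segment]:
    """Chain segmentation, non-speech filtering, and final transcription."""
    # Stage 1 (Wav2Vec2 role): draft recognition that yields candidate segments.
    candidates = segment_audio(audio)
    # Stage 2 (AST role): drop segments classified as non-speech.
    speech_only = [seg for seg in candidates if is_speech(audio, seg)]
    # Stage 3 (Whisper role): produce the final text for each kept segment.
    return [
        Segment(seg.start, seg.end, transcribe_segment(audio, seg).strip())
        for seg in speech_only
    ]
```

Because each stage is a plain callable, any of the three can be replaced by a stub for testing.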

## Implementation

The complete source code and instructions for using the system (including generation of SRT and DocX files) can be found in the GitHub repository:

**GitHub:** [https://github.com/bond005/pisets](https://github.com/bond005/pisets)
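
To illustrate what SRT output involves, the sketch below turns timestamped segments into SubRip text. It is not the exporter from the repository: the functions `format_timestamp` and `to_srt` and the `(start, end, text)` tuple layout are assumptions made for this example.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: running number, time range, then the subtitle text.
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

For example, `to_srt([(0.0, 1.25, "Привет")])` yields a single cue numbered 1 that spans `00:00:00,000 --> 00:00:01,250`.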

## Citation

If you use this model or the Pisets system in your research, please cite:

```bibtex
@article{bondarenko2026pisets,
  title={Pisets: A Robust Speech Recognition System for Lectures and Interviews},
  author={Ivan Bondarenko},
  journal={arXiv preprint arXiv:2601.18415},
  year={2026}
}
```