---
datasets:
- bond005/taiga_speech_v2
- bond005/podlodka_speech
- bond005/rulibrispeech
language:
- ru
license: apache-2.0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
- example_title: Нейронные сети - это хорошо!
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_ru.flac
- example_title: К сожалению, система распознавания речи не всегда стабильна, особенно
    в шумных условиях.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/test_sound_with_noise.wav
- example_title: Мимо театра мальчик ходил довольно часто — белое, со взбитыми сливками,
    здание-торт.
  src: https://huggingface.co/bond005/whisper-large-v3-ru-podlodka/resolve/main/anna_matveeva_test.wav
model-index:
- name: Whisper Large V3 Russian Podlodka by Ivan Bondarenko
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Podlodka.io
      type: bond005/podlodka_speech
      args: ru
    metrics:
    - type: wer
      value: 20.91
      name: WER (with punctuation and capital letters)
    - type: wer
      value: 10.987
      name: WER (without punctuation)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
    - type: wer
      value: 9.795
      name: WER (without punctuation)
---
# Whisper Large V3 Russian Podlodka
This repository contains a Whisper Large V3 model fine-tuned for Russian speech recognition. It is the core transcription component of the **Pisets** system, which is optimized for long audio recordings such as lectures and interviews.
The model was presented in the paper [Pisets: A Robust Speech Recognition System for Lectures and Interviews](https://huggingface.co/papers/2601.18415).
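
A minimal usage sketch, assuming the checkpoint loads through the standard `transformers` automatic-speech-recognition pipeline (the `library_name` and `pipeline_tag` metadata suggest this, but the card does not give an official recipe). The helper names `pipeline_kwargs` and `transcribe`, the 30-second chunk length, and the generation settings are illustrative assumptions, not the author's recommended configuration.

```python
MODEL_ID = "bond005/whisper-large-v3-ru-podlodka"


def pipeline_kwargs(device="cpu", chunk_length_s=30.0):
    """Build keyword arguments for transformers.pipeline (illustrative defaults)."""
    return {
        "task": "automatic-speech-recognition",
        "model": MODEL_ID,
        "device": device,
        # Chunked inference for long recordings; 30 s is an assumed value.
        "chunk_length_s": chunk_length_s,
    }


def transcribe(audio_path, device="cpu"):
    """Transcribe one audio file and return the pipeline's result dict."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(**pipeline_kwargs(device=device))
    return asr(
        audio_path,
        return_timestamps=True,
        # Force Russian transcription instead of relying on language detection.
        generate_kwargs={"language": "russian", "task": "transcribe"},
    )
```

Calling `transcribe("lecture.wav")["text"]` (a hypothetical file name) should return the recognized text. Passing a file path relies on `ffmpeg` being available for decoding, and the first call downloads the model weights.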
## System Architecture
The Pisets system implements a three-component architecture to improve recognition accuracy while minimizing hallucinations:
1. **Wav2Vec2**: For primary recognition and segmentation.
2. **Audio Spectrogram Transformer (AST)**: For filtering non-speech segments.
3. **Whisper (this model)**: For the final high-quality transcription.
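
The data flow between the three components can be sketched as follows. This is a hypothetical outline of the control flow only: the `Segment` type and the `segment_fn`, `is_speech_fn`, and `transcribe_fn` callables are stand-ins introduced for illustration and are not the Pisets API. The actual implementation lives in the project's GitHub repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str = ""


def run_pipeline(
    audio: Sequence[float],
    segment_fn: Callable[[Sequence[float]], List[Segment]],
    is_speech_fn: Callable[[Sequence[float], Segment], bool],
    transcribe_fn: Callable[[Sequence[float], Segment], str],
) -> List[Segment]:
    """Segment the audio, drop non-speech segments, transcribe the rest."""
    results = []
    for seg in segment_fn(audio):                 # stage 1: Wav2Vec2 role
        if not is_speech_fn(audio, seg):          # stage 2: AST role
            continue                              # non-speech never reaches stage 3
        text = transcribe_fn(audio, seg).strip()  # stage 3: Whisper role
        if text:
            results.append(Segment(seg.start, seg.end, text))
    return results


def demo() -> List[Segment]:
    """Run the sketch with toy stand-ins, so no models are needed."""
    audio = [0.0] * 10
    segments = [Segment(0.0, 2.0), Segment(2.0, 3.0), Segment(3.0, 6.5)]
    return run_pipeline(
        audio,
        segment_fn=lambda a: segments,
        # Toy rule: treat segments shorter than 1.5 s as non-speech.
        is_speech_fn=lambda a, s: (s.end - s.start) >= 1.5,
        transcribe_fn=lambda a, s: f"text {s.start:.1f}-{s.end:.1f}",
    )
```

Placing the filter before the transcription stage means that segments classified as non-speech are never passed to Whisper at all.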
## Implementation
The complete source code and instructions for using the system (including generation of SRT and DocX files) can be found in the GitHub repository:
**GitHub:** [https://github.com/bond005/pisets](https://github.com/bond005/pisets)
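
For orientation, SRT is a plain-text subtitle format made of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range followed by the cue text. The sketch below shows how timestamped segments could be serialized in that format. It is an independent illustration, not code from the Pisets repository, and the `(start, end, text)` tuple layout and function names are assumptions.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Serialize (start, end, text) tuples as SRT cues separated by blank lines."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        cues.append(f"{index}\n{time_range}\n{text.strip()}\n")
    return "\n".join(cues)
```

For example, `to_srt([(0.0, 1.25, "Привет")])` yields a single cue numbered 1 that spans `00:00:00,000 --> 00:00:01,250`.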
## Citation
If you use this model or the Pisets system in your research, please cite:
```bibtex
@article{bondarenko2026pisets,
title={Pisets: A Robust Speech Recognition System for Lectures and Interviews},
author={Ivan Bondarenko},
journal={arXiv preprint arXiv:2601.18415},
year={2026}
}
```