---
base_model:
- antony66/whisper-large-v3-russian
- bond005/whisper-large-v3-ru-podlodka
language:
- ru
library_name: transformers
tags:
- asr
- whisper
- russian
- mergekit
- merge
datasets:
- mozilla-foundation/common_voice_17_0
- bond005/taiga_speech_v2
- bond005/podlodka_speech
- bond005/rulibrispeech
metrics:
- wer
---

# Model Details

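This model is a merge of `antony66/whisper-large-v3-russian` and `bond005/whisper-large-v3-ru-podlodka`, produced with the TIES merge method using the following configuration: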

```yaml
method: ties
parameters:
  ties_density: 0.9
  encoder_weights:
    - 0.8
    - 0.2
  decoder_weights:
    - 0.2
    - 0.8
models:
  model_a: "/mnt/cloud/llm/whisper/whisper-large-v3-russian"
  model_b: "/mnt/cloud/llm/whisper/whisper-large-v3-ru-podlodka"
output_dir: "/mnt/cloud/llm/whisper/whisper-large-v3-russian-ties-podlodka"
```
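The merge script itself is not part of this README, so the following is only a minimal sketch of the general TIES idea (trim small deltas, elect a sign per parameter, average the deltas that agree with it) on toy NumPy arrays. It assumes that `ties_density` is the fraction of delta entries kept, that the weight lists are ordered `[model_a, model_b]`, and that encoder parameters use `encoder_weights` while all other parameters use `decoder_weights`. The base checkpoint that deltas are taken against is not stated in the config, and the helper names `trim`, `ties_merge`, and `weights_for` are hypothetical.

```python
import numpy as np

# Values copied from the config above.
TIES_DENSITY = 0.9
ENCODER_WEIGHTS = [0.8, 0.2]  # assumed order: [model_a, model_b]
DECODER_WEIGHTS = [0.2, 0.8]  # assumed order: [model_a, model_b]


def trim(delta, density):
    """Keep the largest-magnitude `density` fraction of entries, zero the rest."""
    k = int(round(density * delta.size))
    if k <= 0:
        return np.zeros_like(delta)
    if k >= delta.size:
        return delta.copy()
    # Threshold is the k-th largest absolute value (ties may keep a few extra entries).
    idx = delta.size - k
    threshold = np.partition(np.abs(delta).ravel(), idx)[idx]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)


def ties_merge(base, finetuned, weights, density):
    """Merge fine-tuned tensors into `base` with a weighted TIES-style rule."""
    # 1. Task vectors (deltas from the base), trimmed to the requested density.
    deltas = np.stack([trim(ft - base, density) for ft in finetuned])
    w = np.asarray(weights, dtype=deltas.dtype).reshape(-1, *([1] * base.ndim))
    weighted = w * deltas
    # 2. Elect one sign per parameter from the weighted sum of deltas.
    elected = np.sign(weighted.sum(axis=0))
    # 3. Average only the non-zero deltas whose sign matches the elected sign.
    agree = (np.sign(deltas) == elected) & (deltas != 0)
    num = (weighted * agree).sum(axis=0)
    den = (w * agree).sum(axis=0)
    merged_delta = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return base + merged_delta


def weights_for(param_name):
    """Assumed rule: encoder parameters use encoder_weights, the rest decoder_weights."""
    return ENCODER_WEIGHTS if ".encoder." in param_name else DECODER_WEIGHTS


# Toy example: one 4-element "parameter" per model instead of real checkpoints.
base = np.zeros(4)
model_a = np.array([1.0, -1.0, 0.5, 0.0])
model_b = np.array([1.0, 1.0, -0.5, 2.0])
merged = ties_merge(
    base, [model_a, model_b], weights_for("model.encoder.conv1.weight"), density=1.0
)
print(merged)
```

Where the two models disagree in sign, the one with the larger weighted delta wins and the other is dropped for that entry. With the weights above, that favors `model_a` in the encoder and `model_b` in the decoder.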

## Simple API server

It can be used with a simple OpenAI-compatible API server: https://github.com/kreolsky/whisper-api-server/

## Usage

To process phone calls, it is highly recommended that you preprocess your recordings and normalize the volume before running ASR. For example:

```bash
sox record.wav -r 8000 record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-50,-40,-15,0,0 -7 0 0.15
```
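If you have many recordings, the same command can be driven from Python. This is a minimal sketch that assumes `sox` is installed and on your `PATH`. The helper names `build_sox_command` and `normalize_directory` and the `-normalized` output suffix are illustrative choices, not part of any library.

```python
import subprocess
from pathlib import Path


def build_sox_command(src, dst):
    """Return the sox invocation shown above as an argument list."""
    return [
        "sox", str(src),
        "-r", "8000", str(dst),               # write the output at 8 kHz
        "norm", "-0.5",                       # normalize peak level to -0.5 dB
        "compand", "0.3,1",                   # compressor attack,decay in seconds
        "-90,-90,-70,-50,-40,-15,0,0",        # transfer function (in,out dB pairs)
        "-7", "0", "0.15",                    # gain, initial volume, delay
    ]


def normalize_directory(in_dir, out_dir):
    """Run sox on every .wav file in in_dir and return the output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(Path(in_dir).glob("*.wav")):
        dst = out_dir / f"{src.stem}-normalized.wav"
        # check=True raises CalledProcessError if sox fails on a file
        subprocess.run(build_sox_command(src, dst), check=True)
        outputs.append(dst)
    return outputs
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.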

Your ASR code could then look something like this:

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16 # set your preferred type here 

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    setattr(torch.distributed, "is_initialized", lambda : False) # monkey patching
device = torch.device(device)

whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True,
    # add attn_implementation="flash_attention_2" if your GPU supports it
)

processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# read your wav file as raw bytes; the pipeline decodes them with ffmpeg
with open('record-normalized.wav', 'rb') as f:
    wav = f.read()

# get the transcription
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256}, return_timestamps=False)

print(asr['text'])

```
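The call above passes `return_timestamps=False`, so only `asr['text']` is returned. If you call the pipeline with `return_timestamps=True` instead, the result also contains a `chunks` list whose items have a `timestamp` pair and a `text` field. The sketch below shows one way to print those segments. The helper name `format_chunks` is hypothetical, and the sample dictionary is hand-written for illustration, not real model output.

```python
def format_chunks(result):
    """Turn a pipeline result with timestamps into printable lines."""
    def fmt(t):
        # The end time of the last chunk can be None if the audio was cut off.
        return "?" if t is None else f"{t:.2f}"

    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)} -> {fmt(end)}] {chunk['text'].strip()}")
    return lines


# Hand-written sample in the shape returned with return_timestamps=True.
sample = {
    "text": " Привет мир",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Привет"},
        {"timestamp": (1.5, None), "text": " мир"},
    ],
}
for line in format_chunks(sample):
    print(line)
```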

## Work in progress

This model is a work in progress. The goal is to fine-tune it for phone-call speech recognition as far as possible. If you would like to contribute, or you know of or have a good dataset, please let me know. Your help will be much appreciated.