Initialize project; model provided by the ModelHub XC community
Model: yuriyvnv/whisper-large-v3-high-mixed-nl Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
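As a rough illustration of what these rules do, the sketch below checks a few file names against a subset of the patterns using Python's `fnmatch`. This only approximates Git's own pattern matching (directory patterns such as `saved_model/**/*` are left out), and `LFS_PATTERNS` and `tracked_by_lfs` are hypothetical names introduced here.

```python
from fnmatch import fnmatchcase

# Subset of the patterns above; fnmatch only approximates gitattributes matching.
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.pt", "*.onnx", "*tfevents*"]


def tracked_by_lfs(filename: str) -> bool:
    """Return True if the file name matches any of the sampled LFS patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in LFS_PATTERNS)


for name in ["model-00001-of-00002.safetensors", "training_args.bin", "config.json"]:
    print(name, tracked_by_lfs(name))
```

For an authoritative answer inside a checkout, `git check-attr filter -- <path>` reports the attribute Git actually applies.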
273
README.md
Normal file
@@ -0,0 +1,273 @@
---
license: apache-2.0
language:
- nl
base_model: openai/whisper-large-v3
tags:
- automatic-speech-recognition
- whisper
- dutch
- speech
- audio
- synthetic-data
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
- yuriyvnv/synthetic_transcript_nl
model-index:
- name: whisper-large-v3-high-mixed-nl
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Dutch)
      type: mozilla-foundation/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 4.43
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Dutch)
      type: facebook/multilingual_librispeech
      config: dutch
      split: test
    metrics:
    - type: wer
      value: 20.29
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Whisper-Large-v3 Dutch - High-Quality Filtered Synthetic Data

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with synthetic speech, keeping **only the high-quality samples that pass WAVe filtering** at a strict threshold (q ≥ 0.8).

## Introduction

### How the Data Was Created

The training data combines real speech from Common Voice 17.0 with synthetic speech generated and filtered through a three-stage pipeline:

1. **Transcript Generation**: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.

2. **Speech Synthesis**: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.

3. **Quality Filtering with WAVe**: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied **WAVe (Word-Aligned Verification)**, a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. For this model, only samples scoring at or above the strict threshold (q ≥ 0.8) were retained, resulting in 10,555 high-quality synthetic samples.

### How the Model Was Created

The model was fine-tuned from `openai/whisper-large-v3` using the Hugging Face Transformers library with the following approach:

1. **Mixed Training**: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 10,555 strictly WAVe-filtered high-quality synthetic samples (45,507 total).

2. **Optimization**: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

3. **Checkpoint Selection**: The best checkpoint was selected based on validation loss, occurring at step 350 with a validation loss of 0.0552.

This high-quality filtering approach achieves a **35% reduction in training steps** compared to using all synthetic data (890 vs. 1,365 max steps), while maintaining excellent ASR performance.
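
The step counts quoted here can be reproduced with simple arithmetic. The sketch below assumes one optimizer step per global batch of 256 and that the final partial batch of each epoch is kept; neither detail is stated in this card, and `max_steps` is a helper name introduced for illustration.

```python
import math


def max_steps(num_samples: int, global_batch: int = 256, epochs: int = 5) -> int:
    """Optimizer steps for a run, assuming the last partial batch is kept."""
    return math.ceil(num_samples / global_batch) * epochs


filtered = max_steps(34_952 + 10_555)    # Common Voice + high-quality synthetic
unfiltered = max_steps(34_952 + 34_898)  # Common Voice + all synthetic samples
reduction = (unfiltered - filtered) / unfiltered

print(filtered, unfiltered, f"{reduction:.1%}")
```

Under these assumptions the two totals match the max-step figures reported for the filtered and unfiltered configurations, and the ratio rounds to the 35% quoted above.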

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-large-v3 |
| **Language** | Dutch (nl) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 1550M |
| **Training Data** | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
| **Total Training Samples** | 45,507 |
| **Sampling Rate** | 16 kHz |

## Evaluation Results

### This Model (whisper-large-v3-high-mixed-nl)

| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.0520 |
| **Validation WER** | 3.57% |
| **Test WER (Common Voice)** | 4.43% |
| **Test WER (MLS)** | 20.29% |
| **Best Checkpoint** | Step 350 |
| **Max Training Steps** | 890 |
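
For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below is a minimal pure-Python version; it is not the evaluation script behind the numbers above, it skips any text normalization, and the Dutch sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("zit" -> "zat") and one deletion ("de") over six reference words.
print(word_error_rate("de kat zit op de mat", "de kat zat op mat"))
```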

### Comparison with Other Training Configurations (Whisper-Large-v3 Dutch)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| Common Voice Only | 680 | 0.0549 | 3.56% | 4.39% | 22.43% |
| **High-Quality Filtered + CV** | **890** | **0.0520** | **3.57%** | **4.43%** | **20.29%** |
| Mid-High Quality Filtered + CV | 1,270 | 0.0570 | 3.63% | 4.48% | 17.25% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.0560 | 3.61% | 4.44% | 17.02% |

### Key Performance Highlights

- **Most efficient synthetic-augmented training**: 890 max steps, 35% fewer than the unfiltered configuration (1,365)
- **Best validation loss** (0.0520) among all Whisper-Large-v3 Dutch configurations
- **Competitive in-domain performance**: 4.43% Test WER on Common Voice
- **9.5% relative improvement** on MLS benchmark vs baseline (20.29% vs 22.43%)
- **Best quality-to-compute ratio**: Strong results using only the top 30.2% of synthetic samples
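
The relative figures in this list follow directly from the comparison table. A small sketch of the arithmetic, using a helper name (`relative_reduction`) introduced here for illustration:

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline` (negative means worse)."""
    return (baseline - value) / baseline


mls_gain = relative_reduction(22.43, 20.29)  # MLS test WER: CV-only vs this model
cv_change = relative_reduction(4.39, 4.43)   # Common Voice test WER: CV-only vs this model

print(f"MLS: {mls_gain:.1%}, Common Voice: {cv_change:.1%}")
```

The same table shows the trade-off: the cross-domain gain on MLS comes with a marginally higher in-domain WER on Common Voice than the Common Voice-only baseline.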

## Training Data

### Dataset Composition

| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Dutch](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| [Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) (q ≥ 0.8) | 10,555 | Strictly WAVe-filtered TTS audio (high quality only) |
| **Total** | **45,507** | |

### Synthetic Data Generation Pipeline

The synthetic dataset ([yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)) was generated using:

1. **Transcript Generation**: GPT-4o-mini, matching Common Voice word count distribution
2. **Speech Synthesis**: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
3. **Quality Filtering**: WAVe model with strict threshold q ≥ 0.8 (high quality only)

### WAVe Quality Distribution (Dutch Synthetic Data)

| Quality Level | Samples | Percentage | Used in This Model |
|--------------|---------|------------|-------------------|
| High (q ≥ 0.8) | 10,555 | 30.2% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✗ |
| Low (q < 0.5) | 4,716 | 13.5% | ✗ |

This strict threshold retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
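
The binning in the table can be sketched as follows. This assumes each sample already carries a single score `q`; how WAVe aggregates its per-word confidences into that score is not described here, and `quality_level` is a hypothetical helper name.

```python
def quality_level(q: float) -> str:
    """Map a sample-level WAVe score to the bins used in the table above."""
    if q >= 0.8:
        return "high"
    if q >= 0.5:
        return "medium"
    return "low"


# Sample counts per bin, copied from the table.
counts = {"high": 10_555, "medium": 19_627, "low": 4_716}
total = sum(counts.values())
shares = {level: n / total for level, n in counts.items()}

print(total, {level: f"{share:.1%}" for level, share in shares.items()})
```

The three bins add up to the 34,898 synthetic samples produced by the speech synthesis stage.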

## Training Procedure

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
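
One way these settings might map onto keyword arguments for `Seq2SeqTrainingArguments` in Transformers is sketched below. The split of the global batch into per-device batch and gradient accumulation, the save settings, and `load_best_model_at_end` are assumptions made for illustration, not values taken from the training run.

```python
# Hypothetical split of the global batch; only the product (256) comes from the table.
per_device_batch = 32
grad_accum_steps = 8

training_kwargs = dict(
    learning_rate=5e-6,
    warmup_steps=200,
    num_train_epochs=5,
    per_device_train_batch_size=per_device_batch,
    gradient_accumulation_steps=grad_accum_steps,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,                # assumed; kept equal to eval_steps
    load_best_model_at_end=True,  # assumed from the best-checkpoint selection
    metric_for_best_model="eval_loss",
    greater_is_better=False,      # lower validation loss is better
)

# With transformers installed, these could be passed as
# Seq2SeqTrainingArguments(output_dir="out", **training_kwargs),
# where "out" is a placeholder directory.
print(per_device_batch * grad_accum_steps)
```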

### Training Infrastructure

- **GPU**: NVIDIA H200 (141 GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers

### Training Curve

```
Step 100: val_loss = 0.0588
Step 200: val_loss = 0.0562
Step 250: val_loss = 0.0561
Step 350: val_loss = 0.0552 ← Best checkpoint
Step 500: val_loss = 0.0601
Step 650: val_loss = 0.0627
Step 850: val_loss = 0.0680
```
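
Selecting the best checkpoint from such a curve amounts to taking the step with the lowest validation loss. A minimal sketch using the values listed above (the variable names are illustrative):

```python
# Validation loss per evaluation step, copied from the curve above.
val_loss_by_step = {
    100: 0.0588, 200: 0.0562, 250: 0.0561, 350: 0.0552,
    500: 0.0601, 650: 0.0627, 850: 0.0680,
}

best_step = min(val_loss_by_step, key=val_loss_by_step.get)

# Steps after the minimum where the loss is higher again, a sign of overfitting.
later_worse = [step for step, loss in val_loss_by_step.items()
               if step > best_step and loss > val_loss_by_step[best_step]]

print(best_step, val_loss_by_step[best_step], later_worse)
```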

## Usage

### Transcription Pipeline

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-high-mixed-nl",
    device="cuda",  # use "cpu" if no GPU is available
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
```

### Direct Model Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-high-mixed-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-high-mixed-nl")
model.to("cuda")

# Load the audio and resample it to the 16 kHz rate the model expects.
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs and decode them to text.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Specifying Language

```python
# Force Dutch transcription instead of relying on language detection.
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"
```

## Methodology

This model leverages **WAVe (Word-Aligned Verification)**, a word-level quality assessment method for filtering synthetic speech data. Unlike sentence-level filtering approaches, WAVe:

- Aligns each word to its corresponding audio frames using multi-head attention
- Assigns per-word confidence scores via a GLU-based scorer
- Detects localized synthesis errors (mispronunciations, omitted words, prosodic anomalies)
- Achieves **6.5% improvement** over sentence-level filtering methods

The strict threshold (q ≥ 0.8) retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.

## When to Use This Model

This model is ideal when:
- **Compute resources are limited**: 35% fewer training steps than the unfiltered approach
- **Quick fine-tuning is needed**: Smaller dataset (45,507 samples) enables faster iteration
- **Best validation performance is required**: Achieves the lowest validation loss (0.0520)
- **Quality matters more than quantity**: Only top-tier synthetic data (30.2%) for a clean training signal

Consider other variants based on your needs:
- [whisper-large-v3-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-mixed-cv-nl): Better cross-domain performance with more data
- [whisper-large-v3-cv-fully-synthetic-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl): Best cross-domain generalization (17.02% MLS)

## Limitations

- **Domain specificity**: Optimized for general Dutch; may underperform on technical domains
- **Acoustic conditions**: Trained on clean speech; noise robustness not guaranteed
- **Dialect coverage**: Performance may vary across Dutch regional variants

## Citation

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

## References

- **Base Model**: [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
- **Training Data (Real)**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Training Data (Synthetic)**: [yuriyvnv/synthetic_transcript_nl](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **IEEE Access Paper**: [Enhancing ASR with Semantic Audio Filtering](https://ieeexplore.ieee.org/document/10720758)

## License

Apache 2.0
1611
added_tokens.json
Normal file
File diff suppressed because it is too large
46
config.json
Normal file
@@ -0,0 +1,46 @@
{
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "apply_spec_augment": false,
  "architectures": [
    "WhisperForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "begin_suppress_tokens": null,
  "bos_token_id": 50257,
  "classifier_proj_size": 256,
  "d_model": 1280,
  "decoder_attention_heads": 20,
  "decoder_ffn_dim": 5120,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 32,
  "decoder_start_token_id": 50258,
  "dropout": 0.0,
  "encoder_attention_heads": 20,
  "encoder_ffn_dim": 5120,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 32,
  "eos_token_id": 50257,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
  "max_length": null,
  "max_source_positions": 1500,
  "max_target_positions": 448,
  "median_filter_width": 7,
  "model_type": "whisper",
  "num_hidden_layers": 32,
  "num_mel_bins": 128,
  "pad_token_id": 50256,
  "scale_embedding": false,
  "torch_dtype": "float32",
  "transformers_version": "4.50.2",
  "use_cache": false,
  "use_weighted_layer_sum": false,
  "vocab_size": 51866
}
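The dimensions in this file are enough to estimate the parameter count. The sketch below assumes the usual Whisper layer layout as implemented in Transformers (a two-layer convolutional stem with kernel size 3, no bias on the key projection, pre-norm blocks, and an output projection tied to the token embedding); those structural details are assumptions, not something this file states.

```python
# Values copied from config.json above.
d_model, ffn_dim, n_layers = 1280, 5120, 32
n_mels, src_positions, tgt_positions, vocab = 128, 1500, 448, 51866

attention = 4 * d_model * d_model + 3 * d_model  # q, k, v, out projections; k has no bias
mlp = 2 * d_model * ffn_dim + ffn_dim + d_model  # two linear layers with biases
layer_norm = 2 * d_model                         # scale and shift

encoder_layer = attention + mlp + 2 * layer_norm
decoder_layer = 2 * attention + mlp + 3 * layer_norm  # self- plus cross-attention

# Two 1-D convolutions with kernel size 3 (mel bins -> d_model -> d_model).
conv_stem = (n_mels * d_model * 3 + d_model) + (d_model * d_model * 3 + d_model)
encoder = conv_stem + src_positions * d_model + n_layers * encoder_layer + layer_norm
# The output projection is assumed to share weights with the token embedding.
decoder = vocab * d_model + tgt_positions * d_model + n_layers * decoder_layer + layer_norm

total = encoder + decoder
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate lands at roughly 1.54 billion parameters, in line with the 1550M figure quoted in the model card.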
258
generation_config.json
Normal file
@@ -0,0 +1,258 @@
{
  "alignment_heads": [
    [
      7,
      0
    ],
    [
      10,
      17
    ],
    [
      12,
      18
    ],
    [
      13,
      12
    ],
    [
      16,
      1
    ],
    [
      17,
      14
    ],
    [
      19,
      11
    ],
    [
      21,
      4
    ],
    [
      24,
      1
    ],
    [
      25,
      6
    ]
  ],
  "attn_implementation": "sdpa",
  "begin_suppress_tokens": [
    220,
    50257
  ],
  "bos_token_id": 50257,
  "decoder_start_token_id": 50258,
  "eos_token_id": 50257,
  "is_multilingual": true,
  "lang_to_id": {
    "<|af|>": 50327,
    "<|am|>": 50334,
    "<|ar|>": 50272,
    "<|as|>": 50350,
    "<|az|>": 50304,
    "<|ba|>": 50355,
    "<|be|>": 50330,
    "<|bg|>": 50292,
    "<|bn|>": 50302,
    "<|bo|>": 50347,
    "<|br|>": 50309,
    "<|bs|>": 50315,
    "<|ca|>": 50270,
    "<|cs|>": 50283,
    "<|cy|>": 50297,
    "<|da|>": 50285,
    "<|de|>": 50261,
    "<|el|>": 50281,
    "<|en|>": 50259,
    "<|es|>": 50262,
    "<|et|>": 50307,
    "<|eu|>": 50310,
    "<|fa|>": 50300,
    "<|fi|>": 50277,
    "<|fo|>": 50338,
    "<|fr|>": 50265,
    "<|gl|>": 50319,
    "<|gu|>": 50333,
    "<|haw|>": 50352,
    "<|ha|>": 50354,
    "<|he|>": 50279,
    "<|hi|>": 50276,
    "<|hr|>": 50291,
    "<|ht|>": 50339,
    "<|hu|>": 50286,
    "<|hy|>": 50312,
    "<|id|>": 50275,
    "<|is|>": 50311,
    "<|it|>": 50274,
    "<|ja|>": 50266,
    "<|jw|>": 50356,
    "<|ka|>": 50329,
    "<|kk|>": 50316,
    "<|km|>": 50323,
    "<|kn|>": 50306,
    "<|ko|>": 50264,
    "<|la|>": 50294,
    "<|lb|>": 50345,
    "<|ln|>": 50353,
    "<|lo|>": 50336,
    "<|lt|>": 50293,
    "<|lv|>": 50301,
    "<|mg|>": 50349,
    "<|mi|>": 50295,
    "<|mk|>": 50308,
    "<|ml|>": 50296,
    "<|mn|>": 50314,
    "<|mr|>": 50320,
    "<|ms|>": 50282,
    "<|mt|>": 50343,
    "<|my|>": 50346,
    "<|ne|>": 50313,
    "<|nl|>": 50271,
    "<|nn|>": 50342,
    "<|no|>": 50288,
    "<|oc|>": 50328,
    "<|pa|>": 50321,
    "<|pl|>": 50269,
    "<|ps|>": 50340,
    "<|pt|>": 50267,
    "<|ro|>": 50284,
    "<|ru|>": 50263,
    "<|sa|>": 50344,
    "<|sd|>": 50332,
    "<|si|>": 50322,
    "<|sk|>": 50298,
    "<|sl|>": 50305,
    "<|sn|>": 50324,
    "<|so|>": 50326,
    "<|sq|>": 50317,
    "<|sr|>": 50303,
    "<|su|>": 50357,
    "<|sv|>": 50273,
    "<|sw|>": 50318,
    "<|ta|>": 50287,
    "<|te|>": 50299,
    "<|tg|>": 50331,
    "<|th|>": 50289,
    "<|tk|>": 50341,
    "<|tl|>": 50348,
    "<|tr|>": 50268,
    "<|tt|>": 50351,
    "<|uk|>": 50280,
    "<|ur|>": 50290,
    "<|uz|>": 50337,
    "<|vi|>": 50278,
    "<|yi|>": 50335,
    "<|yo|>": 50325,
    "<|yue|>": 50358,
    "<|zh|>": 50260
  },
  "language": "nl",
  "max_initial_timestamp_index": 50,
  "max_length": 448,
  "no_timestamps_token_id": 50364,
  "pad_token_id": 50257,
  "prev_sot_token_id": 50362,
  "return_timestamps": false,
  "suppress_tokens": [
    1,
    2,
    7,
    8,
    9,
    10,
    14,
    25,
    26,
    27,
    28,
    29,
    31,
    58,
    59,
    60,
    61,
    62,
    63,
    90,
    91,
    92,
    93,
    359,
    503,
    522,
    542,
    873,
    893,
    902,
    918,
    922,
    931,
    1350,
    1853,
    1982,
    2460,
    2627,
    3246,
    3253,
    3268,
    3536,
    3846,
    3961,
    4183,
    4667,
    6585,
    6647,
    7273,
    9061,
    9383,
    10428,
    10929,
    11938,
    12033,
    12331,
    12562,
    13793,
    14157,
    14635,
    15265,
    15618,
    16553,
    16604,
    18362,
    18956,
    20075,
    21675,
    22520,
    26130,
    26161,
    26435,
    28279,
    29464,
    31650,
    32302,
    32470,
    36865,
    42863,
    47425,
    49870,
    50254,
    50258,
    50359,
    50360,
    50361,
    50362,
    50363
  ],
  "task": "transcribe",
  "task_to_id": {
    "transcribe": 50360,
    "translate": 50359
  },
  "transformers_version": "4.50.2"
}
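To show how the `language`, `task`, and token-ID fields fit together, the sketch below assembles the decoder prompt that Whisper conditions on: start token, language token, task token, and a no-timestamps token when timestamps are disabled. It follows the prompt layout described for Whisper rather than reproducing the Transformers internals; `decoder_prompt` is a hypothetical helper, and only the entries needed for the example are copied from the file.

```python
# Relevant entries copied from generation_config.json above.
generation_config = {
    "decoder_start_token_id": 50258,  # <|startoftranscript|>
    "lang_to_id": {"<|nl|>": 50271, "<|en|>": 50259},
    "task_to_id": {"transcribe": 50360, "translate": 50359},
    "no_timestamps_token_id": 50364,
    "language": "nl",
    "task": "transcribe",
    "return_timestamps": False,
}


def decoder_prompt(cfg: dict) -> list:
    """Build the start / language / task / no-timestamps token sequence."""
    ids = [
        cfg["decoder_start_token_id"],
        cfg["lang_to_id"]["<|" + cfg["language"] + "|>"],
        cfg["task_to_id"][cfg["task"]],
    ]
    if not cfg["return_timestamps"]:
        ids.append(cfg["no_timestamps_token_id"])
    return ids


print(decoder_prompt(generation_config))
```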
50001
merges.txt
Normal file
File diff suppressed because it is too large
3
model-00001-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:31fd38dd55ea2c1e1e50484f7350660252e1290c6005ec7a3dffbc253f40c6c5
size 4993448880
3
model-00002-of-00002.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1689abf06c07cdae2b7e8b34e9ac763e6f06f8b4039ff41b39de9f4261323420
size 1180663192
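The two pointer files record the shard sizes, which gives a quick cross-check on model size. The sketch below parses the pointer text and assumes 4 bytes per weight, since `config.json` lists `torch_dtype` as `float32`; it ignores the small safetensors headers, and `parse_lfs_pointer` is a helper name introduced for illustration.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer file into a dict entry."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())


shard_1 = parse_lfs_pointer("""version https://git-lfs.github.com/spec/v1
oid sha256:31fd38dd55ea2c1e1e50484f7350660252e1290c6005ec7a3dffbc253f40c6c5
size 4993448880""")
shard_2 = parse_lfs_pointer("""version https://git-lfs.github.com/spec/v1
oid sha256:1689abf06c07cdae2b7e8b34e9ac763e6f06f8b4039ff41b39de9f4261323420
size 1180663192""")

total_bytes = int(shard_1["size"]) + int(shard_2["size"])
approx_params = total_bytes / 4  # float32 weights; headers make this a slight overestimate

print(total_bytes, f"~{approx_params / 1e9:.2f}B parameters")
```

The result is consistent with the roughly 1.5 billion parameters expected for a Whisper large checkpoint stored in full precision.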
1266
model.safetensors.index.json
Normal file
File diff suppressed because it is too large
1742
normalizer.json
Normal file
File diff suppressed because it is too large
15
preprocessor_config.json
Normal file
@@ -0,0 +1,15 @@
{
  "chunk_length": 30,
  "dither": 0.0,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}
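Several of these values are derived from one another, which the short sketch below makes explicit. The final halving step assumes the stride-2 convolution in Whisper's encoder front end, which is not stated in this file; it is included because the result matches `max_source_positions` in `config.json`.

```python
# Values copied from preprocessor_config.json above.
preprocessor = {
    "chunk_length": 30, "sampling_rate": 16000, "hop_length": 160,
    "n_fft": 400, "n_samples": 480000, "nb_max_frames": 3000,
}

n_samples = preprocessor["chunk_length"] * preprocessor["sampling_rate"]  # samples per 30 s chunk
n_frames = n_samples // preprocessor["hop_length"]                        # mel frames per chunk
hop_ms = 1000 * preprocessor["hop_length"] / preprocessor["sampling_rate"]
window_ms = 1000 * preprocessor["n_fft"] / preprocessor["sampling_rate"]
encoder_positions = n_frames // 2  # assumes a stride-2 convolution in the encoder

print(n_samples, n_frames, hop_ms, window_ms, encoder_positions)
```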
139
special_tokens_map.json
Normal file
@@ -0,0 +1,139 @@
{
  "additional_special_tokens": [
    "<|startoftranscript|>",
    "<|en|>",
    "<|zh|>",
    "<|de|>",
    "<|es|>",
    "<|ru|>",
    "<|ko|>",
    "<|fr|>",
    "<|ja|>",
    "<|pt|>",
    "<|tr|>",
    "<|pl|>",
    "<|ca|>",
    "<|nl|>",
    "<|ar|>",
    "<|sv|>",
    "<|it|>",
    "<|id|>",
    "<|hi|>",
    "<|fi|>",
    "<|vi|>",
    "<|he|>",
    "<|uk|>",
    "<|el|>",
    "<|ms|>",
    "<|cs|>",
    "<|ro|>",
    "<|da|>",
    "<|hu|>",
    "<|ta|>",
    "<|no|>",
    "<|th|>",
    "<|ur|>",
    "<|hr|>",
    "<|bg|>",
    "<|lt|>",
    "<|la|>",
    "<|mi|>",
    "<|ml|>",
    "<|cy|>",
    "<|sk|>",
    "<|te|>",
    "<|fa|>",
    "<|lv|>",
    "<|bn|>",
    "<|sr|>",
    "<|az|>",
    "<|sl|>",
    "<|kn|>",
    "<|et|>",
    "<|mk|>",
    "<|br|>",
    "<|eu|>",
    "<|is|>",
    "<|hy|>",
    "<|ne|>",
    "<|mn|>",
    "<|bs|>",
    "<|kk|>",
    "<|sq|>",
    "<|sw|>",
    "<|gl|>",
    "<|mr|>",
    "<|pa|>",
    "<|si|>",
    "<|km|>",
    "<|sn|>",
    "<|yo|>",
    "<|so|>",
    "<|af|>",
    "<|oc|>",
    "<|ka|>",
    "<|be|>",
    "<|tg|>",
    "<|sd|>",
    "<|gu|>",
    "<|am|>",
    "<|yi|>",
    "<|lo|>",
    "<|uz|>",
    "<|fo|>",
    "<|ht|>",
    "<|ps|>",
    "<|tk|>",
    "<|nn|>",
    "<|mt|>",
    "<|sa|>",
    "<|lb|>",
    "<|my|>",
    "<|bo|>",
    "<|tl|>",
    "<|mg|>",
    "<|as|>",
    "<|tt|>",
    "<|haw|>",
    "<|ln|>",
    "<|ha|>",
    "<|ba|>",
    "<|jw|>",
    "<|su|>",
    "<|yue|>",
    "<|translate|>",
    "<|transcribe|>",
    "<|startoflm|>",
    "<|startofprev|>",
    "<|nospeech|>",
    "<|notimestamps|>"
  ],
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
12997
tokenizer_config.json
Normal file
File diff suppressed because it is too large
3
training_args.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ca2e464e9fb3767d5c7057046e5b4184137baf9b56ee2525daca45c28a7da428
size 5688
50259
vocab.json
Normal file
File diff suppressed because it is too large