Initialize project; model provided by the ModelHub XC community
Model: biodatlab/whisper-th-medium-timestamp Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
126
.ipynb_checkpoints/README-checkpoint.md
Normal file
@@ -0,0 +1,126 @@
---
license: mit
---
---
language:
- th
license: mit
library_name: transformers
tags:
- whisper-event
- generated_from_trainer
datasets:
- CMKL/Porjai-Thai-voice-dataset-central
metrics:
- wer
base_model: biodatlab/whisper-th-medium-combined
model-index:
- name: Whisper Medium Thai Timestamp - biodatlab
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_13_0 th
      type: mozilla-foundation/common_voice_13_0
      config: th
      split: test
      args: th
    metrics:
    - type: wer
      value: 15.57
      name: Wer
---

# Whisper Medium (Thai) Timestamp

This model is a fine-tuned version of [biodatlab/whisper-th-medium-combined](https://huggingface.co/biodatlab/whisper-th-medium-combined), trained on a custom long-form dataset derived from the CMKL/Porjai-Thai-voice-dataset-central dataset. It achieves the following results on the Common Voice 13 test set:
- WER: 15.57 (with the Deepcut tokenizer)

## Model description

This model performs automatic speech recognition (ASR) for the Thai language and can also generate timestamps for the transcribed text. It is based on the Whisper medium architecture and has been fine-tuned on a purpose-built dataset to enable timestamp generation.

Use the model with Hugging Face's `transformers` as follows:

```py
from transformers import pipeline
import torch

MODEL_NAME = "biodatlab/whisper-th-medium-timestamp"  # specify the model name
lang = "th"  # Thai language

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
    return_timestamps=True,
)
# Force Thai transcription instead of language auto-detection or translation.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe"
)
result = pipe("audio.mp3", return_timestamps=True)
text = result["text"]  # full transcript
timestamps = result["chunks"]  # per-segment text with (start, end) timestamps
```

## Intended uses & limitations
This model is intended for Thai automatic speech recognition tasks, particularly where timestamp information is required. It can be used for transcribing Thai audio content, creating subtitles, or any application that needs to align text with specific time points in audio.
The model's speech recognition performance may be lower than that of non-timestamped versions because of the additional complexity of the task and the pseudo-timestamp generation method used in training.
## Training and evaluation data
The model was trained on a custom long-form dataset derived from the CMKL/Porjai-Thai-voice-dataset-central dataset. The dataset creation process involved the following steps:

- Combining multiple short audio clips from the original dataset into longer audio segments (up to 30 seconds).
- Adding environmental noises and silences between clips to simulate more realistic speech scenarios.
- Generating pseudo-timestamps for the combined audio using a Voice Activity Detection (VAD) model (Silero VAD).

This approach allowed us to create a dataset with longer, more diverse audio samples and approximate timestamp information, which is crucial for training a model capable of generating timestamps.
## Training procedure
The model was fine-tuned using a custom training script that incorporates the following:

- Mixed precision training (FP16)
- Gradient accumulation
- SpecAugment for data augmentation during training

## Training hyperparameters
The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 8
gradient_accumulation_steps: 1
num_train_iters: 50000
warmup_steps: 50
fp16: True
optimizer: AdamW
lr_scheduler_type: linear

## Framework versions

Transformers 4.44.2
PyTorch 2.4.1
Datasets 3.0.0
Tokenizers 0.20.0

## Performance and Limitations
The WER (Word Error Rate) of 15.57 on the Common Voice 13 test set indicates good performance for Thai ASR. However, the timestamp-generating model is less accurate than the non-timestamped version of the model. This is due to several factors:

- The use of pseudo-timestamps in training data, which are approximations based on VAD rather than precise human annotations.
- The additional complexity of the timestamp prediction task, which requires the model to learn both transcription and temporal alignment.
- Potential discrepancies between the VAD-generated timestamps and actual word boundaries in continuous speech.

Users should be aware that while the timestamps provide a general indication of when words or phrases occur in the audio, they may not be as precise as manually annotated timestamps. The model's performance may also vary depending on the acoustic conditions, speaker variability, and the presence of background noise in the input audio.
## Citation
If you use this model in your research or applications, please cite it as follows:

@misc{biodatlab_whisper_th_medium_timestamp,
  author = {Atirut Boribalburephan and Zaw Htet Aung and Knot Pipatsrisawat and Titipat Achakulvisut},
  title = {Whisper Medium Thai Timestamp: A fine-tuned Whisper model for Thai automatic speech recognition with timestamp generation},
  year = 2024,
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/biodatlab/whisper-th-medium-timestamp}}
}
124
README.md
Normal file
@@ -0,0 +1,124 @@
---
language:
- th
license: mit
library_name: transformers
tags:
- whisper-event
- generated_from_trainer
datasets:
- CMKL/Porjai-Thai-voice-dataset-central
metrics:
- wer
base_model: biodatlab/whisper-th-medium-combined
model-index:
- name: Whisper Medium Thai Timestamp - biodatlab
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_13_0 th
      type: mozilla-foundation/common_voice_13_0
      config: th
      split: test
      args: th
    metrics:
    - type: wer
      value: 15.57
      name: Wer
---

# Whisper Medium (Thai) Timestamp

This model is a fine-tuned version of [biodatlab/whisper-th-medium-combined](https://huggingface.co/biodatlab/whisper-th-medium-combined), trained on a custom long-form dataset derived from the CMKL/Porjai-Thai-voice-dataset-central dataset. It achieves the following results on the Common Voice 13 test set:
- WER: 15.57 (with the Deepcut tokenizer)

## Model description

This model performs automatic speech recognition (ASR) for the Thai language and can also generate timestamps for the transcribed text. It is based on the Whisper medium architecture and has been fine-tuned on a purpose-built dataset to enable timestamp generation.

Use the model with Hugging Face's `transformers` as follows:

```py
from transformers import pipeline
import torch

MODEL_NAME = "biodatlab/whisper-th-medium-timestamp"  # specify the model name
lang = "th"  # Thai language

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
    return_timestamps=True,
)
# Force Thai transcription instead of language auto-detection or translation.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe"
)
result = pipe("audio.mp3", return_timestamps=True)
text = result["text"]  # full transcript
timestamps = result["chunks"]  # per-segment text with (start, end) timestamps
```

## Intended uses & limitations
This model is intended for Thai automatic speech recognition tasks, particularly where timestamp information is required. It can be used for transcribing Thai audio content, creating subtitles, or any application that needs to align text with specific time points in audio.
The model's speech recognition performance may be lower than that of non-timestamped versions because of the additional complexity of the task and the pseudo-timestamp generation method used in training.
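
For the subtitle use case, the `chunks` list returned by the pipeline can be converted into SubRip (SRT) text. The following is a minimal sketch rather than part of the released code: the helper names `format_srt_time` and `chunks_to_srt` are hypothetical, and it assumes each chunk is a dict with a `"timestamp"` pair in seconds and a `"text"` string. A missing end time, which can occur on the final chunk, is replaced by the start time plus an assumed fallback duration.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks, fallback_duration=2.0):
    """Convert pipeline-style chunks into SRT text (hypothetical helper)."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: give an open-ended chunk a fixed display time.
            end = start + fallback_duration
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hypothetical chunks in the shape assumed above.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " สวัสดีครับ"},
    {"timestamp": (2.5, None), "text": " ขอบคุณครับ"},
]
srt_text = chunks_to_srt(example_chunks)
```

Because the timestamps are approximate, subtitles produced this way may need manual adjustment.
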
## Training and evaluation data
The model was trained on a custom long-form dataset derived from the CMKL/Porjai-Thai-voice-dataset-central dataset. The dataset creation process involved the following steps:

- Combining multiple short audio clips from the original dataset into longer audio segments (up to 30 seconds).
- Adding environmental noises and silences between clips to simulate more realistic speech scenarios.
- Generating pseudo-timestamps for the combined audio using a Voice Activity Detection (VAD) model (Silero VAD).

This approach allowed us to create a dataset with longer, more diverse audio samples and approximate timestamp information, which is crucial for training a model capable of generating timestamps.
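
The clip-combining step can be sketched as follows. This is an illustrative assumption, not the actual data pipeline: the function name `build_longform`, the fixed 0.5-second silence gap, and the greedy packing are hypothetical, noise mixing is omitted, and segment boundaries are taken from the known clip offsets rather than from Silero VAD as in the steps listed above.

```python
def build_longform(clips, sample_rate=16000, max_seconds=30.0, gap_seconds=0.5):
    """Greedily pack clips (lists of float samples) into one long segment.

    Returns (audio, segments, used): the combined samples, a list of
    (start_s, end_s) pairs for each packed clip, and the number of clips used.
    """
    max_samples = int(max_seconds * sample_rate)
    gap = [0.0] * int(gap_seconds * sample_rate)  # silence between clips
    audio, segments, used = [], [], 0
    for clip in clips:
        # A gap is only needed once the segment already holds a clip.
        extra = len(clip) + (len(gap) if audio else 0)
        if len(audio) + extra > max_samples:
            break  # the next clip would exceed the length limit
        if audio:
            audio.extend(gap)
        start = len(audio) / sample_rate
        audio.extend(clip)
        segments.append((start, len(audio) / sample_rate))
        used += 1
    return audio, segments, used


# Three synthetic "clips" of 10 s, 12 s and 9 s (all zeros, for illustration).
clips = [[0.0] * (16000 * seconds) for seconds in (10, 12, 9)]
audio, segments, used = build_longform(clips)
```

In this example the third clip is left for the next segment because adding it would push the total past 30 seconds.
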
## Training procedure
The model was fine-tuned using a custom training script that incorporates the following:

- Mixed precision training (FP16)
- Gradient accumulation
- SpecAugment for data augmentation during training

## Training hyperparameters
The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- gradient_accumulation_steps: 1
- num_train_iters: ~50000
- warmup_steps: 500
- fp16: True
- optimizer: AdamW
- lr_scheduler_type: linear

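
To make the schedule concrete, the sketch below computes the learning rate at a given step. It assumes that `lr_scheduler_type: linear` means the common linear warmup followed by linear decay to zero; the custom training script is not shown here, so the function `linear_lr` is hypothetical and the total of 50,000 steps is only approximate.

```python
def linear_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=50_000):
    """Learning rate under linear warmup, then linear decay to zero (assumed)."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp from base_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Samples contributing to each optimizer update on one device:
# train_batch_size * gradient_accumulation_steps.
effective_batch_size = 8 * 1

for step in (0, 250, 500, 25_250, 50_000):
    print(step, linear_lr(step))
```

With `gradient_accumulation_steps: 1`, each optimizer update uses a single batch of 8 samples per device.
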
## Framework versions

- Transformers 4.44.2
- PyTorch 2.4.1
- Datasets 3.0.0
- Tokenizers 0.20.0

## Performance and Limitations
The WER (Word Error Rate) of 15.57 on the Common Voice 13 test set indicates good performance for Thai ASR. However, the timestamp-generating model is less accurate than the non-timestamped version of the model. This is due to several factors:

- The use of pseudo-timestamps in training data, which are approximations based on VAD rather than precise human annotations.
- The additional complexity of the timestamp prediction task, which requires the model to learn both transcription and temporal alignment.
- Potential discrepancies between the VAD-generated timestamps and actual word boundaries in continuous speech.

Users should be aware that while the timestamps provide a general indication of when words or phrases occur in the audio, they may not be as precise as manually annotated timestamps. The model's performance may also vary depending on the acoustic conditions, speaker variability, and the presence of background noise in the input audio.
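
Thai is written without spaces between words, so a WER figure depends on how the text is segmented, which is why the result above names the tokenizer. The sketch below shows one way WER can be computed from word lists that are already tokenized. It is an illustration, not the evaluation script: `word_error_rate` is a hypothetical helper and the example tokens are segmented by hand rather than by Deepcut.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # previous[j]: edit distance between the reference prefix processed so far
    # and the first j hypothesis tokens.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(reference)


# Hand-segmented example: the hypothesis drops one of four reference words.
reference = ["ฉัน", "ชอบ", "กิน", "ข้าว"]
hypothesis = ["ฉัน", "ชอบ", "ข้าว"]
wer = word_error_rate(reference, hypothesis)
```

The function returns a fraction, whereas the 15.57 reported above appears to be a percentage.
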
## Citation
If you use this model in your research or applications, please cite it as follows:
```
@misc{biodatlab_whisper_th_medium_timestamp,
  author = {Atirut Boribalburephan and Zaw Htet Aung and Knot Pipatsrisawat and Titipat Achakulvisut},
  title = {Whisper Medium Thai Timestamp: A fine-tuned Whisper model for Thai automatic speech recognition with timestamp generation},
  year = 2024,
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/biodatlab/whisper-th-medium-timestamp}}
}
```
1609
added_tokens.json
Normal file
File diff suppressed because it is too large
52
config.json
Normal file
@@ -0,0 +1,52 @@
{
  "_name_or_path": "biodatlab/whisper-th-medium-combined",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "apply_spec_augment": true,
  "architectures": [
    "WhisperForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "begin_suppress_tokens": [
    220,
    50257
  ],
  "bos_token_id": 50257,
  "classifier_proj_size": 256,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 24,
  "decoder_start_token_id": 50258,
  "dropout": 0.0,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 24,
  "eos_token_id": 50257,
  "forced_decoder_ids": null,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "mask_feature_length": 64,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.2,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.1,
  "max_length": 448,
  "max_source_positions": 1500,
  "max_target_positions": 448,
  "median_filter_width": 7,
  "model_type": "whisper",
  "num_hidden_layers": 24,
  "num_mel_bins": 80,
  "pad_token_id": 50257,
  "scale_embedding": false,
  "suppress_tokens": [],
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "use_cache": true,
  "use_weighted_layer_sum": false,
  "vocab_size": 51865
}
248
generation_config.json
Normal file
@@ -0,0 +1,248 @@
{
  "alignment_heads": [
    [
      13,
      15
    ],
    [
      15,
      4
    ],
    [
      15,
      15
    ],
    [
      16,
      1
    ],
    [
      20,
      0
    ],
    [
      23,
      4
    ]
  ],
  "begin_suppress_tokens": [
    220,
    50257
  ],
  "bos_token_id": 50257,
  "decoder_start_token_id": 50258,
  "eos_token_id": 50257,
  "forced_decoder_ids": [
    [
      1,
      null
    ],
    [
      2,
      50359
    ]
  ],
  "is_multilingual": true,
  "lang_to_id": {
    "<|af|>": 50327,
    "<|am|>": 50334,
    "<|ar|>": 50272,
    "<|as|>": 50350,
    "<|az|>": 50304,
    "<|ba|>": 50355,
    "<|be|>": 50330,
    "<|bg|>": 50292,
    "<|bn|>": 50302,
    "<|bo|>": 50347,
    "<|br|>": 50309,
    "<|bs|>": 50315,
    "<|ca|>": 50270,
    "<|cs|>": 50283,
    "<|cy|>": 50297,
    "<|da|>": 50285,
    "<|de|>": 50261,
    "<|el|>": 50281,
    "<|en|>": 50259,
    "<|es|>": 50262,
    "<|et|>": 50307,
    "<|eu|>": 50310,
    "<|fa|>": 50300,
    "<|fi|>": 50277,
    "<|fo|>": 50338,
    "<|fr|>": 50265,
    "<|gl|>": 50319,
    "<|gu|>": 50333,
    "<|haw|>": 50352,
    "<|ha|>": 50354,
    "<|he|>": 50279,
    "<|hi|>": 50276,
    "<|hr|>": 50291,
    "<|ht|>": 50339,
    "<|hu|>": 50286,
    "<|hy|>": 50312,
    "<|id|>": 50275,
    "<|is|>": 50311,
    "<|it|>": 50274,
    "<|ja|>": 50266,
    "<|jw|>": 50356,
    "<|ka|>": 50329,
    "<|kk|>": 50316,
    "<|km|>": 50323,
    "<|kn|>": 50306,
    "<|ko|>": 50264,
    "<|la|>": 50294,
    "<|lb|>": 50345,
    "<|ln|>": 50353,
    "<|lo|>": 50336,
    "<|lt|>": 50293,
    "<|lv|>": 50301,
    "<|mg|>": 50349,
    "<|mi|>": 50295,
    "<|mk|>": 50308,
    "<|ml|>": 50296,
    "<|mn|>": 50314,
    "<|mr|>": 50320,
    "<|ms|>": 50282,
    "<|mt|>": 50343,
    "<|my|>": 50346,
    "<|ne|>": 50313,
    "<|nl|>": 50271,
    "<|nn|>": 50342,
    "<|no|>": 50288,
    "<|oc|>": 50328,
    "<|pa|>": 50321,
    "<|pl|>": 50269,
    "<|ps|>": 50340,
    "<|pt|>": 50267,
    "<|ro|>": 50284,
    "<|ru|>": 50263,
    "<|sa|>": 50344,
    "<|sd|>": 50332,
    "<|si|>": 50322,
    "<|sk|>": 50298,
    "<|sl|>": 50305,
    "<|sn|>": 50324,
    "<|so|>": 50326,
    "<|sq|>": 50317,
    "<|sr|>": 50303,
    "<|su|>": 50357,
    "<|sv|>": 50273,
    "<|sw|>": 50318,
    "<|ta|>": 50287,
    "<|te|>": 50299,
    "<|tg|>": 50331,
    "<|th|>": 50289,
    "<|tk|>": 50341,
    "<|tl|>": 50348,
    "<|tr|>": 50268,
    "<|tt|>": 50351,
    "<|uk|>": 50280,
    "<|ur|>": 50290,
    "<|uz|>": 50337,
    "<|vi|>": 50278,
    "<|yi|>": 50335,
    "<|yo|>": 50325,
    "<|zh|>": 50260
  },
  "max_initial_timestamp_index": 50,
  "max_length": 448,
  "no_timestamps_token_id": 50363,
  "pad_token_id": 50257,
  "prev_sot_token_id": 50361,
  "return_timestamps": true,
  "suppress_tokens": [
    1,
    2,
    7,
    8,
    9,
    10,
    14,
    25,
    26,
    27,
    28,
    29,
    31,
    58,
    59,
    60,
    61,
    62,
    63,
    90,
    91,
    92,
    93,
    359,
    503,
    522,
    542,
    873,
    893,
    902,
    918,
    922,
    931,
    1350,
    1853,
    1982,
    2460,
    2627,
    3246,
    3253,
    3268,
    3536,
    3846,
    3961,
    4183,
    4667,
    6585,
    6647,
    7273,
    9061,
    9383,
    10428,
    10929,
    11938,
    12033,
    12331,
    12562,
    13793,
    14157,
    14635,
    15265,
    15618,
    16553,
    16604,
    18362,
    18956,
    20075,
    21675,
    22520,
    26130,
    26161,
    26435,
    28279,
    29464,
    31650,
    32302,
    32470,
    36865,
    42863,
    47425,
    49870,
    50254,
    50258,
    50358,
    50359,
    50360,
    50361,
    50362
  ],
  "task_to_id": {
    "transcribe": 50359,
    "translate": 50358
  },
  "transformers_version": "4.44.2"
}
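
The token IDs in this file combine into the prompt that the Whisper decoder starts from. The sketch below assembles that prompt for Thai transcription with timestamps, using values copied from the configuration above. It is only an illustration: `build_decoder_prompt` is a hypothetical helper, the language table is truncated to two entries, and in normal use `transformers` builds the prompt internally.

```python
GENERATION_CONFIG = {
    "decoder_start_token_id": 50258,  # <|startoftranscript|>
    "lang_to_id": {"<|th|>": 50289, "<|en|>": 50259},  # subset of the full table
    "task_to_id": {"transcribe": 50359, "translate": 50358},
    "no_timestamps_token_id": 50363,
}


def build_decoder_prompt(config, language, task, return_timestamps=True):
    """Return the decoder prompt: start token, language token, task token."""
    prompt = [
        config["decoder_start_token_id"],
        config["lang_to_id"][f"<|{language}|>"],
        config["task_to_id"][task],
    ]
    if not return_timestamps:
        # The "no timestamps" token tells the model to skip timestamp tokens.
        prompt.append(config["no_timestamps_token_id"])
    return prompt


thai_prompt = build_decoder_prompt(GENERATION_CONFIG, "th", "transcribe")
plain_prompt = build_decoder_prompt(
    GENERATION_CONFIG, "th", "transcribe", return_timestamps=False
)
```
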
50000
merges.txt
Normal file
File diff suppressed because it is too large
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6195b83a3ff23fa6f47fb6d2d9d1dd4aacd6998f3bf3234ba8315c978e6559e4
size 3055544304
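
The three lines above are a Git LFS pointer, not the weights themselves. The sketch below parses such a pointer and derives a rough parameter count from the file size. It is illustrative only: `parse_lfs_pointer` is a hypothetical helper, and the estimate assumes 4 bytes per value (matching `"torch_dtype": "float32"` in `config.json`) while ignoring the safetensors header.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its fields (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:6195b83a3ff23fa6f47fb6d2d9d1dd4aacd6998f3bf3234ba8315c978e6559e4
size 3055544304
"""

pointer = parse_lfs_pointer(POINTER)
# Rough estimate: 4 bytes per float32 value, header overhead ignored.
approx_params = pointer["size_bytes"] / 4
print(f"{pointer['size_bytes'] / 1e9:.2f} GB, ~{approx_params / 1e6:.0f}M parameters")
```
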
1742
normalizer.json
Normal file
File diff suppressed because it is too large
14
preprocessor_config.json
Normal file
@@ -0,0 +1,14 @@
{
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 80,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}
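
Several of these values are related by simple arithmetic. The sketch below recomputes the derived ones from the base settings. The variable names are illustrative, and the final step relies on the Whisper encoder halving the frame count through a stride-2 convolution, which is consistent with `"max_source_positions": 1500` in `config.json`.

```python
PREPROCESSOR = {
    "chunk_length": 30,      # seconds of audio per input window
    "sampling_rate": 16000,  # samples per second
    "hop_length": 160,       # samples between successive spectrogram frames
    "n_fft": 400,            # samples per analysis window
    "n_samples": 480000,
    "nb_max_frames": 3000,
}

# Samples in one 30-second window.
n_samples = PREPROCESSOR["chunk_length"] * PREPROCESSOR["sampling_rate"]

# Spectrogram frames in one window.
nb_max_frames = n_samples // PREPROCESSOR["hop_length"]

# Frame step and analysis window length in milliseconds.
frame_step_ms = 1000 * PREPROCESSOR["hop_length"] / PREPROCESSOR["sampling_rate"]
window_ms = 1000 * PREPROCESSOR["n_fft"] / PREPROCESSOR["sampling_rate"]

# Encoder positions after the stride-2 convolution.
encoder_positions = nb_max_frames // 2

print(n_samples, nb_max_frames, frame_step_ms, window_ms, encoder_positions)
```
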
139
special_tokens_map.json
Normal file
@@ -0,0 +1,139 @@
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|startoftranscript|>",
    "<|en|>",
    "<|zh|>",
    "<|de|>",
    "<|es|>",
    "<|ru|>",
    "<|ko|>",
    "<|fr|>",
    "<|ja|>",
    "<|pt|>",
    "<|tr|>",
    "<|pl|>",
    "<|ca|>",
    "<|nl|>",
    "<|ar|>",
    "<|sv|>",
    "<|it|>",
    "<|id|>",
    "<|hi|>",
    "<|fi|>",
    "<|vi|>",
    "<|he|>",
    "<|uk|>",
    "<|el|>",
    "<|ms|>",
    "<|cs|>",
    "<|ro|>",
    "<|da|>",
    "<|hu|>",
    "<|ta|>",
    "<|no|>",
    "<|th|>",
    "<|ur|>",
    "<|hr|>",
    "<|bg|>",
    "<|lt|>",
    "<|la|>",
    "<|mi|>",
    "<|ml|>",
    "<|cy|>",
    "<|sk|>",
    "<|te|>",
    "<|fa|>",
    "<|lv|>",
    "<|bn|>",
    "<|sr|>",
    "<|az|>",
    "<|sl|>",
    "<|kn|>",
    "<|et|>",
    "<|mk|>",
    "<|br|>",
    "<|eu|>",
    "<|is|>",
    "<|hy|>",
    "<|ne|>",
    "<|mn|>",
    "<|bs|>",
    "<|kk|>",
    "<|sq|>",
    "<|sw|>",
    "<|gl|>",
    "<|mr|>",
    "<|pa|>",
    "<|si|>",
    "<|km|>",
    "<|sn|>",
    "<|yo|>",
    "<|so|>",
    "<|af|>",
    "<|oc|>",
    "<|ka|>",
    "<|be|>",
    "<|tg|>",
    "<|sd|>",
    "<|gu|>",
    "<|am|>",
    "<|yi|>",
    "<|lo|>",
    "<|uz|>",
    "<|fo|>",
    "<|ht|>",
    "<|ps|>",
    "<|tk|>",
    "<|nn|>",
    "<|mt|>",
    "<|sa|>",
    "<|lb|>",
    "<|my|>",
    "<|bo|>",
    "<|tl|>",
    "<|mg|>",
    "<|as|>",
    "<|tt|>",
    "<|haw|>",
    "<|ln|>",
    "<|ha|>",
    "<|ba|>",
    "<|jw|>",
    "<|su|>",
    "<|translate|>",
    "<|transcribe|>",
    "<|startoflm|>",
    "<|startofprev|>",
    "<|nocaptions|>",
    "<|notimestamps|>"
  ],
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
114895
tokenizer.json
Normal file
File diff suppressed because it is too large
12989
tokenizer_config.json
Normal file
File diff suppressed because it is too large
1
vocab.json
Normal file
File diff suppressed because one or more lines are too long