Initialize project; model provided by the ModelHub XC community

Model: mjwong/whisper-large-v3-turbo-singlish Source: Original Platform
2026-05-15 07:11:53 +08:00
commit e204116dca
13 changed files with 117270 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,35 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,167 @@
 ---
 base_model:
 - openai/whisper-large-v3-turbo
 language:
 - en
 metrics:
 - wer
 pipeline_tag: automatic-speech-recognition
 license: mit
 library_name: transformers
 model-index:
 - name: whisper-large-v3-turbo-singlish
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SASRBench-v1
      type: mjwong/SASRBench-v1
      split: test
    metrics:
      - name: WER
        type: WER
        value: 13.35
 - name: whisper-large-v3-turbo-singlish
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: AMI
      type: edinburghcstr/ami
      subset: ihm
      split: test
    metrics:
      - name: WER
        type: WER
        value: 16.99
 - name: whisper-large-v3-turbo-singlish
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      subset: test
      split: test
    metrics:
      - name: WER
        type: WER
        value: 11.54
 tags:
 - whisper
 ---
 # Whisper large-v3-turbo-singlish
 **Whisper large-v3-turbo-singlish** is a fine-tuned automatic speech recognition (ASR) model optimized for Singlish. Built on OpenAI's Whisper model, it has been adapted using Singlish-specific data to accurately capture the unique phonetic and lexical nuances of Singlish speech.
 ## Model Details
 - **Developed by:** Ming Jie Wong
 - **Base Model:** [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
 - **Model Type:** Encoder-decoder
 - **Metrics:** Word Error Rate (WER)
 - **Languages Supported:** English (with a focus on Singlish)
 - **License:** MIT
 ### Description
 Whisper large-v3-turbo-singlish is developed using an internal dataset of 66.9k audio-transcript pairs. The dataset is derived exclusively from the Part 3 Same Room Environment Close-talk Mic recordings of [IMDA's NSC Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus). 
 The original Part 3 of the National Speech Corpus comprises approximately 1,000 hours of conversational speech from around 1,000 local English speakers, recorded in pairs. These conversations cover everyday topics and include interactive game-based dialogues. Recordings were conducted in two environments:
 - Same Room, where speakers shared a room and were recorded using a close-talk mic and a boundary mic.
 - Separate Room, where each speaker was recorded individually using a standing mic and a telephone (IVR).
 Audio segments for the internal dataset were extracted using these criteria:
 - **Minimum Word Count:** 10 words
  _This threshold was chosen to ensure that each audio segment contains sufficient linguistic context for the model to better understand instructions in Singlish. Shorter segments may bias the model towards specific utterances or phrases, limiting its overall comprehension._
 - **Maximum Duration:** 20 seconds
  _This threshold was chosen to provide enough context for accurate transcription while minimizing noise and computational complexity for longer audio segments._
 - **Sampling Rate**: All audio segments are down-sampled to 16kHz.
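The selection criteria above can be sketched as a simple filter. This is a minimal illustration, not the actual preprocessing script: the `Segment` container, the example transcripts, the 44.1 kHz source rate, and the choice to treat both thresholds as inclusive are all assumptions made for the example.

```python
from dataclasses import dataclass

MIN_WORDS = 10          # minimum word count from the criteria above
MAX_DURATION_S = 20.0   # maximum segment duration in seconds
TARGET_SR = 16_000      # target sampling rate in Hz


@dataclass
class Segment:
    """Hypothetical container for one audio-transcript pair."""
    transcript: str
    num_samples: int     # waveform length in samples
    sampling_rate: int   # waveform sampling rate in Hz

    @property
    def duration_s(self) -> float:
        return self.num_samples / self.sampling_rate


def keep_segment(seg: Segment) -> bool:
    """Return True if the segment meets both selection criteria."""
    enough_words = len(seg.transcript.split()) >= MIN_WORDS
    short_enough = seg.duration_s <= MAX_DURATION_S
    return enough_words and short_enough


def resampled_length(seg: Segment, target_sr: int = TARGET_SR) -> int:
    """Number of samples the segment would have after resampling."""
    return round(seg.duration_s * target_sr)


# Illustrative segments (the 44.1 kHz source rate is an assumption).
segments = [
    Segment("okay lah then we go makan first before the movie start can or not",
            44_100 * 6, 44_100),
    Segment("can lah", 44_100 * 2, 44_100),                  # too few words
    Segment(" ".join(["word"] * 40), 44_100 * 25, 44_100),   # longer than 20 s
]

kept = [s for s in segments if keep_segment(s)]
print(len(kept), resampled_length(kept[0]))
```

Only the first segment passes both checks; the other two fail on word count and duration respectively.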
 Full experimental details will be added soon.
 ### Fine-Tuning Details
 Fine-tuning was performed on a single A100-80GB GPU.
 #### Training Hyperparameters
 The following hyperparameters were used:
 - **batch_size**: 16
 - **gradient_accumulation_steps**: 1
 - **learning_rate**: 1e-6
 - **warmup_steps**: 300
 - **max_steps**: 5000
 - **fp16**: true
 - **eval_batch_size**: 16
 - **eval_step**: 300
 - **max_grad_norm**: 1.0
 - **generation_max_length**: 225
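As a rough sketch, the hyperparameters listed above can be written as keyword arguments in the style of `transformers.Seq2SeqTrainingArguments`. The mapping from the names used here (for example `batch_size` and `eval_step`) to those keyword names is an assumption, since the training script is not part of this repository. The snippet keeps them in a plain dictionary and derives two quantities: the effective batch size and the steps at which evaluation would run.

```python
# Keyword names follow transformers.Seq2SeqTrainingArguments; the mapping from
# the README's names to these keywords is an assumption.
training_kwargs = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-6,
    "warmup_steps": 300,
    "max_steps": 5000,
    "fp16": True,
    "per_device_eval_batch_size": 16,
    "eval_steps": 300,
    "max_grad_norm": 1.0,
    "generation_max_length": 225,
}

num_gpus = 1  # single A100-80GB, as stated above

# Samples contributing to each optimizer update.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * num_gpus
)

# Steps at which an evaluation would be triggered within max_steps.
eval_points = list(
    range(
        training_kwargs["eval_steps"],
        training_kwargs["max_steps"] + 1,
        training_kwargs["eval_steps"],
    )
)

print(effective_batch_size, len(eval_points), eval_points[-1])
```

With an evaluation every 300 steps and a 5000-step budget, the last evaluation falls at step 4800, giving 16 evaluation points in total.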
 #### Training Results
 The table below summarizes the model’s progress across various training steps, showing the training loss, evaluation loss, and Word Error Rate (WER).
 | Steps | Train Loss | Eval Loss | WER                |
 |:-----:|:----------:|:---------:|:------------------:|
 | 300   | 0.8992     | 0.3501    | 13.376788          |
 | 600   | 0.4157     | 0.3241    | 12.769994          |
 | 900   | 0.3520     | 0.3124    | 12.168367          |
 | 1200  | 0.3415     | 0.3079    | 12.517532          |
 | 1500  | 0.3620     | 0.3077    | 12.344057          |
 | 1800  | 0.3609     | 0.2996    | 12.315267          |
 | 2100  | 0.3348     | 0.2963    | 12.231113          |
 | 2400  | 0.3715     | 0.2927    | 12.005226          |
 | 2700  | 0.3445     | 0.2923    | 11.829537          |
 | 3000  | 0.3753     | 0.2884    | 11.954291          |
 | 3300  | 0.3469     | 0.2881    | 11.951338          |
 | 3600  | 0.3325     | 0.2857    | 12.145483          |
 | 3900  | 0.3168     | 0.2846    | 11.549023          |
 | 4200  | 0.3250     | 0.2837    | 11.740215          |
 | 4500  | 0.2855     | 0.2834    | 11.634654          |
 | 4800  | 0.2936     | 0.2836    | 11.651632          |
 The final checkpoint is the one that achieved the lowest WER among the evaluations run over the 4,800 steps shown above.
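The checkpoint selection described above amounts to taking the row with the minimum WER. A small sketch using the values from the table (the variable names are illustrative, and this is not the author's selection code):

```python
# (step, WER) pairs copied from the training results table.
results = [
    (300, 13.376788), (600, 12.769994), (900, 12.168367), (1200, 12.517532),
    (1500, 12.344057), (1800, 12.315267), (2100, 12.231113), (2400, 12.005226),
    (2700, 11.829537), (3000, 11.954291), (3300, 11.951338), (3600, 12.145483),
    (3900, 11.549023), (4200, 11.740215), (4500, 11.634654), (4800, 11.651632),
]

# Pick the evaluation point with the lowest WER.
best_step, best_wer = min(results, key=lambda row: row[1])
print(best_step, best_wer)
```

On these numbers the minimum falls at step 3900, not at the last evaluated step.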
 ### Benchmark Performance
 We evaluated Whisper large-v3-turbo-singlish on [SASRBench-v1](https://huggingface.co/datasets/mjwong/SASRBench-v1), a benchmark dataset for evaluating ASR performance on Singlish:
 | Model                                                                                                  | WER     |
 |:------------------------------------------------------------------------------------------------------:|:-------:|
 | [openai/whisper-small](https://huggingface.co/openai/whisper-small)                                    | 147.80% |
 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                              | 103.41% |
 | [jensenlwt/fine-tuned-122k-whisper-small](https://huggingface.co/jensenlwt/whisper-small-singlish-122k)| 68.79%  |
 | [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)                  | 27.58%  |
 | [mjwong/whisper-small-singlish](https://huggingface.co/mjwong/whisper-small-singlish)                  | 18.49%  |
 | [mjwong/whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish)            | 16.41%  |
 | [mjwong/whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish)| 13.35%  |
 Additional performance evaluations of the model on other datasets are available [here](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT#model-performance).
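For readers unfamiliar with the metric, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a plain illustration of the definition and does not reproduce whatever text normalization or aggregation the benchmark used; the example sentences are made up. It also shows why WER can exceed 100%, as in the first two rows of the table: every inserted word counts as an error, so a hypothesis much longer than the reference can accumulate more errors than there are reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


print(wer("can or not", "can or not"))               # perfect match
print(wer("can or not", "can not"))                  # one deletion
print(wer("okay lah", "okay okay okay okay lah"))    # three insertions
```

The last call yields a WER above 100% because the hypothesis contains three extra words against a two-word reference.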
 ## Disclaimer
 While this model has been fine-tuned to better recognize Singlish, users may experience inaccuracies, biases, or unexpected outputs, particularly in challenging audio conditions or with speakers using non-standard variations. Use of this model is at your own risk; the developers and distributors are not liable for any consequences arising from its use. Please validate results before deploying in any sensitive or production environment.
 ## How to use the model
 The model can be loaded with the `automatic-speech-recognition` pipeline like so:
 ```python
 from transformers import pipeline

 # Load the fine-tuned checkpoint into an ASR pipeline.
 model = "mjwong/whisper-large-v3-turbo-singlish"
 pipe = pipeline("automatic-speech-recognition", model=model)
 ```
 You can then use this pipeline to transcribe audio of arbitrary length.
 ```python
 from datasets import load_dataset

 # Transcribe the first test sample of SASRBench-v1.
 dataset = load_dataset("mjwong/SASRBench-v1", split="test")
 sample = dataset[0]["audio"]
 result = pipe(sample)
 print(result["text"])
 ```
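For recordings longer than Whisper's 30-second window, one option is the pipeline's chunked inference. The sketch below is an assumption about how one might call it, not a tested recipe for this checkpoint: `build_call_kwargs` and `transcribe_file` are hypothetical helper names, and the audio path is supplied by the caller. The language and task are pinned to English transcription so the model does not have to detect them.

```python
MODEL_ID = "mjwong/whisper-large-v3-turbo-singlish"


def build_call_kwargs(language: str = "en", task: str = "transcribe",
                      return_timestamps: bool = False) -> dict:
    """Keyword arguments passed to the pipeline call (hypothetical helper)."""
    return {
        "generate_kwargs": {"language": language, "task": task},
        "return_timestamps": return_timestamps,
    }


def transcribe_file(path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe an audio file, splitting long audio into 30 s chunks."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # matches chunk_length in preprocessor_config.json
    )
    return pipe(path, **build_call_kwargs())["text"]


print(build_call_kwargs())
```

Calling `transcribe_file` with the path to a local recording would download the checkpoint on first use and return the transcript as a string.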
 ## Contact
 For more information, please reach out to mingjwong@hotmail.com.
 ## Acknowledgements 
 1. https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english
 2. https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/README.md
 3. https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
--- a/added_tokens.json
+++ b/added_tokens.json
--- a/config.json
+++ b/config.json
@@ -0,0 +1,50 @@
 {
  "_name_or_path": "openai/whisper-large-v3-turbo",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "apply_spec_augment": false,
  "architectures": [
    "WhisperForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "begin_suppress_tokens": [
    220,
    50256
  ],
  "bos_token_id": 50257,
  "classifier_proj_size": 256,
  "d_model": 1280,
  "decoder_attention_heads": 20,
  "decoder_ffn_dim": 5120,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 4,
  "decoder_start_token_id": 50258,
  "dropout": 0.0,
  "encoder_attention_heads": 20,
  "encoder_ffn_dim": 5120,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 32,
  "eos_token_id": 50257,
  "forced_decoder_ids": null,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
  "max_source_positions": 1500,
  "max_target_positions": 448,
  "median_filter_width": 7,
  "model_type": "whisper",
  "num_hidden_layers": 32,
  "num_mel_bins": 128,
  "pad_token_id": 50257,
  "scale_embedding": false,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "use_weighted_layer_sum": false,
  "vocab_size": 51866
 }
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,249 @@
 {
  "alignment_heads": [
    [
      2,
      4
    ],
    [
      2,
      11
    ],
    [
      3,
      3
    ],
    [
      3,
      6
    ],
    [
      3,
      11
    ],
    [
      3,
      14
    ]
  ],
  "begin_suppress_tokens": [
    220,
    50257
  ],
  "bos_token_id": 50257,
  "decoder_start_token_id": 50258,
  "eos_token_id": 50257,
  "forced_decoder_ids": [
    [
      1,
      null
    ],
    [
      2,
      50360
    ]
  ],
  "is_multilingual": true,
  "lang_to_id": {
    "<|af|>": 50327,
    "<|am|>": 50334,
    "<|ar|>": 50272,
    "<|as|>": 50350,
    "<|az|>": 50304,
    "<|ba|>": 50355,
    "<|be|>": 50330,
    "<|bg|>": 50292,
    "<|bn|>": 50302,
    "<|bo|>": 50347,
    "<|br|>": 50309,
    "<|bs|>": 50315,
    "<|ca|>": 50270,
    "<|cs|>": 50283,
    "<|cy|>": 50297,
    "<|da|>": 50285,
    "<|de|>": 50261,
    "<|el|>": 50281,
    "<|en|>": 50259,
    "<|es|>": 50262,
    "<|et|>": 50307,
    "<|eu|>": 50310,
    "<|fa|>": 50300,
    "<|fi|>": 50277,
    "<|fo|>": 50338,
    "<|fr|>": 50265,
    "<|gl|>": 50319,
    "<|gu|>": 50333,
    "<|haw|>": 50352,
    "<|ha|>": 50354,
    "<|he|>": 50279,
    "<|hi|>": 50276,
    "<|hr|>": 50291,
    "<|ht|>": 50339,
    "<|hu|>": 50286,
    "<|hy|>": 50312,
    "<|id|>": 50275,
    "<|is|>": 50311,
    "<|it|>": 50274,
    "<|ja|>": 50266,
    "<|jw|>": 50356,
    "<|ka|>": 50329,
    "<|kk|>": 50316,
    "<|km|>": 50323,
    "<|kn|>": 50306,
    "<|ko|>": 50264,
    "<|la|>": 50294,
    "<|lb|>": 50345,
    "<|ln|>": 50353,
    "<|lo|>": 50336,
    "<|lt|>": 50293,
    "<|lv|>": 50301,
    "<|mg|>": 50349,
    "<|mi|>": 50295,
    "<|mk|>": 50308,
    "<|ml|>": 50296,
    "<|mn|>": 50314,
    "<|mr|>": 50320,
    "<|ms|>": 50282,
    "<|mt|>": 50343,
    "<|my|>": 50346,
    "<|ne|>": 50313,
    "<|nl|>": 50271,
    "<|nn|>": 50342,
    "<|no|>": 50288,
    "<|oc|>": 50328,
    "<|pa|>": 50321,
    "<|pl|>": 50269,
    "<|ps|>": 50340,
    "<|pt|>": 50267,
    "<|ro|>": 50284,
    "<|ru|>": 50263,
    "<|sa|>": 50344,
    "<|sd|>": 50332,
    "<|si|>": 50322,
    "<|sk|>": 50298,
    "<|sl|>": 50305,
    "<|sn|>": 50324,
    "<|so|>": 50326,
    "<|sq|>": 50317,
    "<|sr|>": 50303,
    "<|su|>": 50357,
    "<|sv|>": 50273,
    "<|sw|>": 50318,
    "<|ta|>": 50287,
    "<|te|>": 50299,
    "<|tg|>": 50331,
    "<|th|>": 50289,
    "<|tk|>": 50341,
    "<|tl|>": 50348,
    "<|tr|>": 50268,
    "<|tt|>": 50351,
    "<|uk|>": 50280,
    "<|ur|>": 50290,
    "<|uz|>": 50337,
    "<|vi|>": 50278,
    "<|yi|>": 50335,
    "<|yo|>": 50325,
    "<|yue|>": 50358,
    "<|zh|>": 50260
  },
  "max_initial_timestamp_index": 50,
  "max_length": 448,
  "no_timestamps_token_id": 50364,
  "pad_token_id": 50257,
  "prev_sot_token_id": 50362,
  "return_timestamps": false,
  "suppress_tokens": [
    1,
    2,
    7,
    8,
    9,
    10,
    14,
    25,
    26,
    27,
    28,
    29,
    31,
    58,
    59,
    60,
    61,
    62,
    63,
    90,
    91,
    92,
    93,
    359,
    503,
    522,
    542,
    873,
    893,
    902,
    918,
    922,
    931,
    1350,
    1853,
    1982,
    2460,
    2627,
    3246,
    3253,
    3268,
    3536,
    3846,
    3961,
    4183,
    4667,
    6585,
    6647,
    7273,
    9061,
    9383,
    10428,
    10929,
    11938,
    12033,
    12331,
    12562,
    13793,
    14157,
    14635,
    15265,
    15618,
    16553,
    16604,
    18362,
    18956,
    20075,
    21675,
    22520,
    26130,
    26161,
    26435,
    28279,
    29464,
    31650,
    32302,
    32470,
    36865,
    42863,
    47425,
    49870,
    50254,
    50258,
    50359,
    50360,
    50361,
    50362,
    50363
  ],
  "task_to_id": {
    "transcribe": 50360,
    "translate": 50359
  },
  "transformers_version": "4.49.0"
 }
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:65d4ded8bd8552fc9315766b4c95e90e3b762153e82fd630d3d22a00b26c2236
 size 3235581408
--- a/normalizer.json
+++ b/normalizer.json
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@@ -0,0 +1,14 @@
 {
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
 }
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,139 @@
 {
  "additional_special_tokens": [
    "<|startoftranscript|>",
    "<|en|>",
    "<|zh|>",
    "<|de|>",
    "<|es|>",
    "<|ru|>",
    "<|ko|>",
    "<|fr|>",
    "<|ja|>",
    "<|pt|>",
    "<|tr|>",
    "<|pl|>",
    "<|ca|>",
    "<|nl|>",
    "<|ar|>",
    "<|sv|>",
    "<|it|>",
    "<|id|>",
    "<|hi|>",
    "<|fi|>",
    "<|vi|>",
    "<|he|>",
    "<|uk|>",
    "<|el|>",
    "<|ms|>",
    "<|cs|>",
    "<|ro|>",
    "<|da|>",
    "<|hu|>",
    "<|ta|>",
    "<|no|>",
    "<|th|>",
    "<|ur|>",
    "<|hr|>",
    "<|bg|>",
    "<|lt|>",
    "<|la|>",
    "<|mi|>",
    "<|ml|>",
    "<|cy|>",
    "<|sk|>",
    "<|te|>",
    "<|fa|>",
    "<|lv|>",
    "<|bn|>",
    "<|sr|>",
    "<|az|>",
    "<|sl|>",
    "<|kn|>",
    "<|et|>",
    "<|mk|>",
    "<|br|>",
    "<|eu|>",
    "<|is|>",
    "<|hy|>",
    "<|ne|>",
    "<|mn|>",
    "<|bs|>",
    "<|kk|>",
    "<|sq|>",
    "<|sw|>",
    "<|gl|>",
    "<|mr|>",
    "<|pa|>",
    "<|si|>",
    "<|km|>",
    "<|sn|>",
    "<|yo|>",
    "<|so|>",
    "<|af|>",
    "<|oc|>",
    "<|ka|>",
    "<|be|>",
    "<|tg|>",
    "<|sd|>",
    "<|gu|>",
    "<|am|>",
    "<|yi|>",
    "<|lo|>",
    "<|uz|>",
    "<|fo|>",
    "<|ht|>",
    "<|ps|>",
    "<|tk|>",
    "<|nn|>",
    "<|mt|>",
    "<|sa|>",
    "<|lb|>",
    "<|my|>",
    "<|bo|>",
    "<|tl|>",
    "<|mg|>",
    "<|as|>",
    "<|tt|>",
    "<|haw|>",
    "<|ln|>",
    "<|ha|>",
    "<|ba|>",
    "<|jw|>",
    "<|su|>",
    "<|yue|>",
    "<|translate|>",
    "<|transcribe|>",
    "<|startoflm|>",
    "<|startofprev|>",
    "<|nospeech|>",
    "<|notimestamps|>"
  ],
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
 }
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
--- a/training_args.bin
+++ b/training_args.bin
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:a4044fa5880648f9b6e09b65c8d82bc5ed2a92fbafd98d249d2479b861c79813
 size 5432
--- a/vocab.json
+++ b/vocab.json