Initialize project; model provided by the ModelHub XC community

Model: airesearch/wav2vec2-large-xlsr-53-th Source: Original Platform
2026-05-28 09:56:16 +08:00
commit 4586df8665
32 changed files with 2811 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,199 @@
+---
+language: th
+datasets:
+- common_voice
+tags:
+- audio
+- automatic-speech-recognition
+- hf-asr-leaderboard
+- robust-speech-event
+- speech
+- xlsr-fine-tuning
+license: cc-by-sa-4.0
+model-index:
+- name: XLS-R-53 - Thai
+  results:
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice 7
+      type: mozilla-foundation/common_voice_7_0
+      args: th
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 0.9524
+    - name: Test SER
+      type: ser
+      value: 1.2346
+    - name: Test CER
+      type: cer
+      value: 0.1623
+  - task:
+      name: Automatic Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Robust Speech Event - Dev Data
+      type: speech-recognition-community-v2/dev_data
+      args: sv
+    metrics:
+    - name: Test WER
+      type: wer
+      value: null
+    - name: Test SER
+      type: ser
+      value: null
+    - name: Test CER
+      type: cer
+      value: null
+---
+
+# `wav2vec2-large-xlsr-53-th`
+Finetuning `wav2vec2-large-xlsr-53` on Thai [Common Voice 7.0](https://commonvoice.mozilla.org/en/datasets)
+
+[Read more on our blog](https://medium.com/airesearch-in-th/airesearch-in-th-3c1019a99cd)
+
+We finetune [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) based on [Fine-tuning Wav2Vec2 for English ASR](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tuning_Wav2Vec2_for_English_ASR.ipynb) using Thai examples of [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets). The notebooks and scripts can be found in [vistec-ai/wav2vec2-large-xlsr-53-th](https://github.com/vistec-ai/wav2vec2-large-xlsr-53-th). The pretrained model and processor can be found at [airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th).
+
+## `robust-speech-event`
+
+We added the `syllable_tokenize` and `word_tokenize` tokenizers from [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) and the [deepcut](https://github.com/rkcosmos/deepcut) tokenizer to `eval.py` from [robust-speech-event](https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event#evaluation). Set `--thai_tokenizer` to one of `newmm`, `syllable`, `deepcut` or `cer`:
+
+```
+> python eval.py --model_id ./ --dataset mozilla-foundation/common_voice_7_0 --config th --split test --log_outputs --thai_tokenizer newmm/syllable/deepcut/cer
+```
+
+### Eval results on Common Voice 7 "test":
+
+|                                 | WER PyThaiNLP 2.3.1 | WER deepcut | SER     | CER     |
+|---------------------------------|---------------------|-------------|---------|---------|
+| Only Tokenization               | 0.9524%             | 2.5316%     | 1.2346% | 0.1623% |
+| Cleaning rules and Tokenization | TBD                 | TBD         | TBD     | TBD     |
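+
+The error rates above all divide an edit distance by the reference length; only the tokenization changes. The sketch below illustrates that idea in plain Python and is not the code in `eval.py`. The helper names (`edit_distance`, `error_rate`, `word_tokens`, `char_tokens`) are hypothetical, and the space-based word split is an assumption standing in for a Thai tokenizer such as `pythainlp.tokenize.word_tokenize`.
+
+```python
+def edit_distance(ref, hyp):
+    """Levenshtein distance between two sequences (lists or strings)."""
+    prev = list(range(len(hyp) + 1))
+    for i, r in enumerate(ref, start=1):
+        curr = [i]
+        for j, h in enumerate(hyp, start=1):
+            cost = 0 if r == h else 1
+            curr.append(min(prev[j] + 1,          # deletion
+                            curr[j - 1] + 1,      # insertion
+                            prev[j - 1] + cost))  # substitution or match
+        prev = curr
+    return prev[-1]
+
+
+def error_rate(refs, hyps, tokenize):
+    """Total edit distance divided by total reference length over a corpus."""
+    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
+    total = sum(len(tokenize(r)) for r in refs)
+    return errors / total
+
+
+# Hypothetical tokenizers: split on spaces for WER,
+# drop spaces and compare characters for CER.
+def word_tokens(text):
+    return text.split()
+
+
+def char_tokens(text):
+    return list(text.replace(" ", ""))
+
+
+refs = ["the cat sat"]
+hyps = ["the cat sit"]
+wer = error_rate(refs, hyps, word_tokens)  # 1 wrong word out of 3
+cer = error_rate(refs, hyps, char_tokens)  # 1 wrong character out of 9
+```
+
+Because Thai is written without spaces between words, the WER depends on which tokenizer defines the word boundaries, which is why the table reports it separately per tokenizer.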
+
+
+## Usage
+
+```
+import torch
+import torchaudio
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+# load pretrained processor and model
+processor = Wav2Vec2Processor.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")
+model = Wav2Vec2ForCTC.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")
+
+# load an audio file and resample it to 16 kHz
+def speech_file_to_array_fn(batch,
+                            text_col="sentence",
+                            fname_col="path",
+                            resampling_to=16000):
+    speech_array, sampling_rate = torchaudio.load(batch[fname_col])
+    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)
+    batch["speech"] = resampler(speech_array)[0].numpy()
+    batch["sampling_rate"] = resampling_to
+    batch["target_text"] = batch[text_col]
+    return batch
+
+# get 2 examples as sample input
+# (`test_dataset` is assumed to be a `datasets.Dataset` with "path" and "sentence" columns)
+test_dataset = test_dataset.map(speech_file_to_array_fn)
+inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+# infer
+with torch.no_grad():
+    logits = model(inputs.input_values).logits
+
+# greedy CTC decoding: take the most likely token at each frame
+predicted_ids = torch.argmax(logits, dim=-1)
+
+print("Prediction:", processor.batch_decode(predicted_ids))
+print("Reference:", test_dataset["sentence"][:2])
+
+# Prediction: ['และ เขา ก็ สัมผัส ดีบุก', 'คุณ สามารถ รับทราบ เมื่อ ข้อความ นี้ ถูก อ่าน แล้ว']
+# Reference: ['และเขาก็สัมผัสดีบุก', 'คุณสามารถรับทราบเมื่อข้อความนี้ถูกอ่านแล้ว']
+```
+
+## Datasets
+
+[Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize` and preprocess the dataset using the cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. You can use this script together with `train_cleand.tsv`, `validation_cleaned.tsv` and `test_cleaned.tsv` to reproduce our splits. The resulting dataset is as follows:
+
+```
+DatasetDict({
+    train: Dataset({
+        features: ['path', 'sentence'],
+        num_rows: 86586
+    })
+    test: Dataset({
+        features: ['path', 'sentence'],
+        num_rows: 2502
+    })
+    validation: Dataset({
+        features: ['path', 'sentence'],
+        num_rows: 3027
+    })
+})
+```
+
+## Training
+
+We finetuned using the following configuration on a single V100 GPU and chose the checkpoint with the lowest validation loss. The finetuning script is `scripts/wav2vec2_finetune.py`.
+
+```
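+from transformers import TrainingArguments, Wav2Vec2ForCTC
+
+# create model (`processor` is the Wav2Vec2Processor holding the Thai tokenizer)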
+model = Wav2Vec2ForCTC.from_pretrained(
+    "facebook/wav2vec2-large-xlsr-53",
+    attention_dropout=0.1,
+    hidden_dropout=0.1,
+    feat_proj_dropout=0.0,
+    mask_time_prob=0.05,
+    layerdrop=0.1,
+    gradient_checkpointing=True,
+    ctc_loss_reduction="mean",
+    pad_token_id=processor.tokenizer.pad_token_id,
+    vocab_size=len(processor.tokenizer)
+)
+model.freeze_feature_extractor()
+training_args = TrainingArguments(
+    output_dir="../data/wav2vec2-large-xlsr-53-thai",
+    group_by_length=True,
+    per_device_train_batch_size=32,
+    gradient_accumulation_steps=1,
+    per_device_eval_batch_size=16,
+    metric_for_best_model='wer',
+    evaluation_strategy="steps",
+    eval_steps=1000,
+    logging_strategy="steps",
+    logging_steps=1000,
+    save_strategy="steps",
+    save_steps=1000,
+    num_train_epochs=100,
+    fp16=True,
+    learning_rate=1e-4,
+    warmup_steps=1000,
+    save_total_limit=3,
+    report_to="tensorboard"
+)
+```
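+
+To put `eval_steps=1000` and `num_train_epochs=100` in context, the arithmetic below estimates how many optimizer steps one epoch takes with this configuration. It is a rough sketch using the 86,586 training rows reported in the Datasets section. It assumes the last partial batch is kept and that training runs for all 100 epochs, which the README does not state.
+
+```python
+import math
+
+train_rows = 86586   # training set size from the Datasets section
+batch_size = 32      # per_device_train_batch_size
+grad_accum = 1       # gradient_accumulation_steps
+num_gpus = 1         # single V100
+epochs = 100         # num_train_epochs (assumed to run to completion)
+eval_every = 1000    # eval_steps, also used for logging and saving
+
+# One optimizer step consumes batch_size * grad_accum * num_gpus examples.
+steps_per_epoch = math.ceil(train_rows / (batch_size * grad_accum * num_gpus))
+total_steps = steps_per_epoch * epochs
+num_evals = total_steps // eval_every
+
+print(steps_per_epoch, total_steps, num_evals)
+```
+
+Under these assumptions an epoch is 2,706 steps, so the model is evaluated and checkpointed a little more than twice per epoch, and `save_total_limit=3` keeps only a few of those checkpoints on disk.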
+
+## Evaluation
+
+We benchmark on the test set using WER with words tokenized by [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) 2.3.1 and [deepcut](https://github.com/rkcosmos/deepcut), and CER. We also measure performance when spell correction using [TNC](http://www.arts.chula.ac.th/ling/tnc/) ngrams is applied. The evaluation code can be found in `notebooks/wav2vec2_finetuning_tutorial.ipynb`. The benchmark is performed on the `test-unique` split.
+
+|                                                                   | WER PyThaiNLP 2.3.1 | WER deepcut  | CER          |
+|-------------------------------------------------------------------|---------------------|--------------|--------------|
+| [Kaldi from scratch](https://github.com/vistec-AI/commonvoice-th) | 23.04               |              | 7.57         |
+| Ours without spell correction                                     | 13.634024           | **8.152052** | **2.813019** |
+| Ours with spell correction                                        | 17.996397           | 14.167975    | 5.225761     |
+| Google Web Speech API※                                            | 13.711234           | 10.860058    | 7.357340     |
+| Microsoft Bing Speech API※                                        | **12.578819**       | 9.620991     | 5.016620     |
+| Amazon Transcribe※                                                | 21.86334            | 14.487553    | 7.077562     |
+| NECTEC AI for Thai Partii API※                                    | 20.105887           | 15.515631    | 9.551027     |
+
+※ APIs are not finetuned with Common Voice 7.0 data
+
+## LICENSE
+
+[cc-by-sa 4.0](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th/blob/main/LICENSE)
+
+## Acknowledgements
+* model training and validation notebooks/scripts [@cstorm125](https://github.com/cstorm125/)
+* dataset cleaning scripts [@tann9949](https://github.com/tann9949)
+* dataset splits [@ekapolc](https://github.com/ekapolc/) and [@14mss](https://github.com/14mss)
+* running the training [@mrpeerat](https://github.com/mrpeerat)
+* spell correction [@wannaphong](https://github.com/wannaphong)
+