Initialize project; model provided by the ModelHub XC community

Model: wannaphong/wav2vec2-large-xlsr-53-th-cv8-newmm
Source: Original Platform
ModelHub XC
2026-05-12 12:07:24 +08:00
commit efb138bf1c
19 changed files with 22841 additions and 0 deletions

29
.gitattributes vendored Normal file

@@ -0,0 +1,29 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
language_model/3gram_correct.arpa filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text

76
README.md Normal file

@@ -0,0 +1,76 @@
---
language:
- th
tags:
- automatic-speech-recognition
license: apache-2.0
datasets:
- common_voice
metrics:
- wer
- cer
---
# Thai Wav2Vec2 with CommonVoice V8 (newmm tokenizer) + language model
This model was trained on the CommonVoice V8 dataset, which adds new data to the CommonVoice V7 dataset used in [airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th). It was fine-tuned from [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53).
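A minimal inference sketch, assuming the checkpoint can be loaded under the model name given for this repository and that `transformers`, `torch`, `pyctcdecode` and `kenlm` are installed. The `transcribe` helper and its `waveform` argument are illustrative names, not part of this repository; the processor class and the 16 kHz sampling rate follow `preprocessor_config.json`.

```python
# Assumption: the repository name resolves as a model id for from_pretrained.
MODEL_ID = "wannaphong/wav2vec2-large-xlsr-53-th-cv8-newmm"
SAMPLING_RATE = 16_000  # "sampling_rate" in preprocessor_config.json


def transcribe(waveform):
    """Transcribe a 1-D float array of 16 kHz mono audio samples (hypothetical helper)."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

    processor = Wav2Vec2ProcessorWithLM.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    inputs = processor(waveform, sampling_rate=SAMPLING_RATE, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # batch_decode runs beam search with the bundled n-gram language model.
    return processor.batch_decode(logits.numpy()).text[0]
```

Audio recorded at another rate would need to be resampled to 16 kHz before calling the helper.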
## Model description
- Technical report: [Thai Wav2Vec2.0 with CommonVoice V8](https://arxiv.org/abs/2208.04799)
## Datasets
The training data extends the Common Voice V7 dataset with the new data in Common Voice V8: all data already present in Common Voice V7 is removed from Common Voice V8 before it is split, and the Common Voice V7 dataset is then added back.
The [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) script is used to split the Common Voice dataset.
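The procedure above can be sketched roughly as follows. This is not the ekapolc/Thai_commonvoice_split script: the `build_splits` function, the clip-id inputs, the random shuffle and the 80/10/10 ratios are all assumptions made for illustration, as is the choice to keep every V7 item in its original split.

```python
import random


def build_splits(cv7_splits, cv8_ids, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Combine existing V7 splits with the data that is new in V8.

    cv7_splits: dict mapping "train"/"dev"/"test" to lists of clip ids.
    cv8_ids: iterable of every clip id in V8 (a superset of V7).
    """
    cv7_ids = {cid for ids in cv7_splits.values() for cid in ids}

    # Step 1: remove everything already in V7, leaving only the new V8 data.
    new_ids = sorted(set(cv8_ids) - cv7_ids)

    # Step 2: split the new data (a random split stands in for the real script).
    rng = random.Random(seed)
    rng.shuffle(new_ids)
    n_train = int(len(new_ids) * ratios[0])
    n_dev = int(len(new_ids) * ratios[1])
    new_splits = {
        "train": new_ids[:n_train],
        "dev": new_ids[n_train:n_train + n_dev],
        "test": new_ids[n_train + n_dev:],
    }

    # Step 3: add the V7 data back, each item staying in its original split.
    return {
        name: list(cv7_splits.get(name, [])) + new_splits[name]
        for name in new_splits
    }
```

Removing the V7 data before splitting means no V7 item can move to a different split, so a model evaluated on the V7 test set has not seen those items during training.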
## Models
This model was fine-tuned from the [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model on the Thai Common Voice V8 dataset, with the transcripts pre-tokenized using `pythainlp.tokenize.word_tokenize`.
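As a rough illustration of what pre-tokenization feeds into the character vocabulary, the sketch below joins already segmented words with the `|` word delimiter from `tokenizer_config.json` and maps each character to its id from `vocab.json`. The word list is a hand-written segmentation standing in for `word_tokenize` output, and `encode_pretokenized` is a hypothetical, simplified stand-in for the real tokenizer.

```python
# Subset of vocab.json from this repository (character -> id).
VOCAB = {
    "\u0e2a": 58,  # ส
    "\u0e27": 11,  # ว
    "\u0e31": 4,   # ั
    "\u0e14": 18,  # ด
    "\u0e35": 53,  # ี
    "\u0e04": 7,   # ค
    "\u0e23": 16,  # ร
    "\u0e1a": 30,  # บ
    "|": 46,
    "[UNK]": 68,
}
WORD_DELIMITER = "|"  # "word_delimiter_token" in tokenizer_config.json


def encode_pretokenized(words):
    """Join segmented words with the delimiter and map characters to ids."""
    text = WORD_DELIMITER.join(words)
    return [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in text]


# Assumed segmentation of "สวัสดีครับ" into two words: สวัสดี and ครับ.
words = ["\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35", "\u0e04\u0e23\u0e31\u0e1a"]
ids = encode_pretokenized(words)
print(ids)
```

Because Thai is written without spaces between words, the segmentation step is what places the `|` boundaries the model learns to predict.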
## Training
I used much of the code from [vistec-AI/wav2vec2-large-xlsr-53-th](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th) and fixed a bug in the training code in [vistec-AI/wav2vec2-large-xlsr-53-th#2](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th/pull/2).
## Evaluation
**Test with the CommonVoice V8 test set**
| Model | WER by newmm (%) | WER by deepcut (%) | CER |
|-----------------------|------------------|--------------------|----------|
| AIResearch.in.th and PyThaiNLP | 17.414503 | 11.923089 | 3.854153 |
| wav2vec2 with deepcut | 16.354521 | 11.424476 | 3.684060 |
| wav2vec2 with newmm | 16.698299 | 11.436941 | 3.737407 |
| wav2vec2 with deepcut + language model | 12.630260 | 9.613886 | 3.292073 |
| **wav2vec2 with newmm + language model** | 12.583706 | 9.598305 | 3.276610 |
**Test with the CommonVoice V7 test set (same test set as CV V7)**
| Model | WER by newmm (%) | WER by deepcut (%) | CER |
|-----------------------|------------------|--------------------|----------|
| AIResearch.in.th and PyThaiNLP | 13.936698 | 9.347462 | 2.804787 |
| wav2vec2 with deepcut | 12.776381 | 8.773006 | 2.628882 |
| wav2vec2 with newmm | 12.750596 | 8.672616 | 2.623341 |
| wav2vec2 with deepcut + language model | 9.940050 | 7.423313 | 2.344940 |
| **wav2vec2 with newmm + language model** | 9.559724 | 7.339654 | 2.277071 |
This uses the same test set as [https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th).
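For readers unfamiliar with the metrics, here is a minimal sketch of how WER and CER are commonly computed from edit distance, plus the relative WER change between two rows of the tables. It is not the evaluation script behind the reported numbers. The word lists are assumed to come from a tokenizer such as newmm or deepcut, which is why the tables report WER per tokenizer. Returning CER as a percentage is an assumption, since the tables state a unit only for WER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_words, hyp_words):
    """Word error rate in percent, given pre-tokenized word lists."""
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref_text, hyp_text):
    """Character error rate in percent, given raw strings."""
    return 100.0 * edit_distance(ref_text, hyp_text) / len(ref_text)


def relative_reduction(before, after):
    """Relative reduction in percent when an error rate drops from before to after."""
    return 100.0 * (before - after) / before


# CV8 test set, WER by newmm: "wav2vec2 with newmm" without and with the language model.
lm_gain = relative_reduction(16.698299, 12.583706)
print(f"relative WER reduction from the language model: {lm_gain:.1f}%")
```

Since the word segmentation changes both the number of reference words and where errors fall, WER values computed with different tokenizers are not directly comparable to each other.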
**Links:**
- GitHub Dataset: [https://github.com/wannaphong/thai_commonvoice_dataset](https://github.com/wannaphong/thai_commonvoice_dataset)
- Technical report: [Thai Wav2Vec2.0 with CommonVoice V8](https://arxiv.org/abs/2208.04799)
## BibTeX entry and citation info
```
@misc{phatthiyaphaibun2022thai,
title={Thai Wav2Vec2.0 with CommonVoice V8},
author={Wannaphong Phatthiyaphaibun and Chompakorn Chaksangchaichot and Peerat Limkonchotiwat and Ekapol Chuangsuwanich and Sarana Nutanong},
year={2022},
eprint={2208.04799},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

1
added_tokens.json Normal file

@@ -0,0 +1 @@
{"<s>": 70, "</s>": 71}

1
alphabet.json Normal file

@@ -0,0 +1 @@
{"labels": ["\u0e37", "\u0e0a", "\u0e2e", "\u0e08", "\u0e31", "\u0e2b", "\u0e09", "\u0e04", "\u0e02", "\u0e06", "\u0e49", "\u0e27", "\u0e22", "\u0e2f", "\u0e43", "\u0e20", "\u0e23", "\u0e48", "\u0e14", "\u0e33", "\u0e42", "\u0e4b", "\u0e24", "\u0e0e", "\u0e45", "\u0e40", "\u0e16", "\u0e1c", "\u0e34", "\u0e13", "\u0e1a", "\u0e1d", "\u0e0c", "\u0e17", "\u0e15", "\u0e10", "\u0e11", "\u0e12", "\u0e28", "\u0e4a", "\u0e2c", "\u0e30", "\u0e41", "\u0e01", "\u0e47", "\u0e39", " ", "\u0e32", "\u0e0d", "\u0e18", "\u0e36", "\u0e07", "\u0e29", "\u0e35", "\u0e19", "\u0e25", "\u0e0b", "\u0e0f", "\u0e2a", "\u0e44", "\u0e38", "\u0e2d", "\u0e1e", "\u0e1f", "\u0e03", "\u0e1b", "\u0e21", "\u0e4c", "\u2047", "", "<s>", "</s>"], "is_bpe": false}

115
config.json Normal file

@@ -0,0 +1,115 @@
{
"_name_or_path": "facebook/wav2vec2-large-xlsr-53",
"activation_dropout": 0.0,
"adapter_kernel_size": 3,
"adapter_stride": 2,
"add_adapter": false,
"apply_spec_augment": true,
"architectures": [
"Wav2Vec2ForCTC"
],
"attention_dropout": 0.1,
"bos_token_id": 1,
"classifier_proj_size": 256,
"codevector_dim": 768,
"contrastive_logits_temperature": 0.1,
"conv_bias": true,
"conv_dim": [
512,
512,
512,
512,
512,
512,
512
],
"conv_kernel": [
10,
3,
3,
3,
3,
2,
2
],
"conv_stride": [
5,
2,
2,
2,
2,
2,
2
],
"ctc_loss_reduction": "mean",
"ctc_zero_infinity": false,
"diversity_loss_weight": 0.1,
"do_stable_layer_norm": true,
"eos_token_id": 2,
"feat_extract_activation": "gelu",
"feat_extract_dropout": 0.0,
"feat_extract_norm": "layer",
"feat_proj_dropout": 0.0,
"feat_quantizer_dropout": 0.0,
"final_dropout": 0.0,
"hidden_act": "gelu",
"hidden_dropout": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"layerdrop": 0.1,
"mask_channel_length": 10,
"mask_channel_min_space": 1,
"mask_channel_other": 0.0,
"mask_channel_prob": 0.0,
"mask_channel_selection": "static",
"mask_feature_length": 10,
"mask_feature_min_masks": 0,
"mask_feature_prob": 0.0,
"mask_time_length": 10,
"mask_time_min_masks": 2,
"mask_time_min_space": 1,
"mask_time_other": 0.0,
"mask_time_prob": 0.05,
"mask_time_selection": "static",
"model_type": "wav2vec2",
"num_adapter_layers": 3,
"num_attention_heads": 16,
"num_codevector_groups": 2,
"num_codevectors_per_group": 320,
"num_conv_pos_embedding_groups": 16,
"num_conv_pos_embeddings": 128,
"num_feat_extract_layers": 7,
"num_hidden_layers": 24,
"num_negatives": 100,
"output_hidden_size": 1024,
"pad_token_id": 69,
"proj_codevector_dim": 768,
"tdnn_dilation": [
1,
2,
3,
1,
1
],
"tdnn_dim": [
512,
512,
512,
512,
1500
],
"tdnn_kernel": [
5,
3,
3,
1,
1
],
"torch_dtype": "float32",
"transformers_version": "4.17.0",
"use_weighted_layer_sum": false,
"vocab_size": 72,
"xvector_output_dim": 512
}

3
language_model/3gram.bin Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:608af0f75d801457b16ba96d6068c59aa2318d9374bc14722ab06708de56f324
size 9431499

View File

@@ -0,0 +1 @@
{"alpha": 0.5, "beta": 1.5, "unk_score_offset": -10.0, "score_boundary": true}

21098
language_model/unigrams.txt Normal file

File diff suppressed because it is too large

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f0135130a25f0f16a59cc376f886f9e54e0c1736f1b85c4c01e93b9f4cc4b090
size 1262102632

10
preprocessor_config.json Normal file

@@ -0,0 +1,10 @@
{
"do_normalize": true,
"feature_extractor_type": "Wav2Vec2FeatureExtractor",
"feature_size": 1,
"padding_side": "right",
"padding_value": 0.0,
"processor_class": "Wav2Vec2ProcessorWithLM",
"return_attention_mask": false,
"sampling_rate": 16000
}

3
pytorch_model.bin Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cd00de33e508bc6094dc81e59069b3434ef1771ada269a6728f69970439c3ef9
size 1262221489

3
rng_state.pth Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3cb82389f05c149e395fb250fadffa54638bc114b5b165f33377e2c38c7acbc8
size 18647

3
scaler.pt Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:17ed9e914777e8e0e26e4ea887f50f9fbb4dd6b7a33099c61ed027e612089d8d
size 559

3
scheduler.pt Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5e2b7b4d5e5abc7ad90139ee4e4b708feb908bc6a03baa728d2231450a53d462
size 623

1
special_tokens_map.json Normal file

@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "pad_token": "[PAD]", "additional_special_tokens": [{"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}]}

1
tokenizer_config.json Normal file

@@ -0,0 +1 @@
{"unk_token": "[UNK]", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "[PAD]", "do_lower_case": false, "word_delimiter_token": "|", "replace_word_delimiter_char": " ", "special_tokens_map_file": null, "name_or_path": "./thai-asr-cv8", "processor_class": "Wav2Vec2ProcessorWithLM"}

1486
trainer_state.json Normal file

File diff suppressed because it is too large

3
training_args.bin Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0cf4ef9abbfd40cb8d9509784468dc10e7742b233f8b0a9653b55deb0f8a6d30
size 3055

1
vocab.json Normal file

@@ -0,0 +1 @@
{"ื": 0, "ช": 1, "ฮ": 2, "จ": 3, "ั": 4, "ห": 5, "ฉ": 6, "ค": 7, "ข": 8, "ฆ": 9, "้": 10, "ว": 11, "ย": 12, "ฯ": 13, "ใ": 14, "ภ": 15, "ร": 16, "่": 17, "ด": 18, "ำ": 19, "โ": 20, "๋": 21, "ฤ": 22, "ฎ": 23, "ๅ": 24, "เ": 25, "ถ": 26, "ผ": 27, "ิ": 28, "ณ": 29, "บ": 30, "ฝ": 31, "ฌ": 32, "ท": 33, "ต": 34, "ฐ": 35, "ฑ": 36, "ฒ": 37, "ศ": 38, "๊": 39, "ฬ": 40, "ะ": 41, "แ": 42, "ก": 43, "็": 44, "ู": 45, "า": 47, "ญ": 48, "ธ": 49, "ึ": 50, "ง": 51, "ษ": 52, "ี": 53, "น": 54, "ล": 55, "ซ": 56, "ฏ": 57, "ส": 58, "ไ": 59, "ุ": 60, "อ": 61, "พ": 62, "ฟ": 63, "ฃ": 64, "ป": 65, "ม": 66, "์": 67, "|": 46, "[UNK]": 68, "[PAD]": 69}