---
library_name: transformers
tags: []
---

# Malaysian-TTS-1.7B-v1

Continued pretraining of [mesolitica/Malaysian-TTS-1.7B-v0.1](https://huggingface.co/mesolitica/Malaysian-TTS-1.7B-v0.1) on a much more consistent dataset:

1. Uses [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) as the speech detokenizer, with output at a 24 kHz sample rate.
2. Supports context switching between Malay and English.
3. Better pronunciation of letters.
4. Better tolerance of repetitive text.

## Speakers

1. [husein](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2)
2. [idayu](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2)
3. [singaporean](https://huggingface.co/datasets/mesolitica/IMDA-TTS)
4. [DisfluencySpeech](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)
5. [singlish-speaker2050](https://huggingface.co/datasets/thucdangvan020999/singlish-speaker2050)
6. [singlish-speaker2202](https://huggingface.co/datasets/thucdangvan020999/singlish-speaker2202)
7. [haqkiem](https://www.linkedin.com/in/haqkiem-daim/), private dataset.

## How we train

1. Multipacking with proper document masking at a 4096 context length.
2. FP32-BF16 mixed precision training.
3. Full parameter finetuning.
4. WandB logs at https://wandb.ai/huseinzol05/Malaysian-TTS-1.7B-v1

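To make the first item concrete, here is a minimal pure-Python sketch of packing several tokenized documents into one fixed-length sequence while keeping attention confined to each document. It is not the project's training code: the function names, the greedy packing strategy, and the toy `max_len=6` are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def pack_documents(docs: List[List[int]], max_len: int = 4096) -> List[List[List[int]]]:
    """Greedily group tokenized documents into bins of at most max_len tokens."""
    bins, current, used = [], [], 0
    for doc in docs:
        doc = doc[:max_len]  # simplification: truncate documents longer than one bin
        if current and used + len(doc) > max_len:
            bins.append(current)
            current, used = [], 0
        current.append(doc)
        used += len(doc)
    if current:
        bins.append(current)
    return bins


def build_inputs(packed: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Flatten one bin into token ids, per-document position ids, and document ids."""
    input_ids, position_ids, doc_ids = [], [], []
    for i, doc in enumerate(packed):
        input_ids.extend(doc)
        position_ids.extend(range(len(doc)))  # positions restart for every document
        doc_ids.extend([i] * len(doc))
    return input_ids, position_ids, doc_ids


def document_causal_mask(doc_ids: List[int]) -> List[List[bool]]:
    """True where token i may attend to token j: same document and j not after i."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]


# Toy example with a context length of 6 instead of 4096.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
bins = pack_documents(docs, max_len=6)
input_ids, position_ids, doc_ids = build_inputs(bins[0])
mask = document_causal_mask(doc_ids)
```

In the toy example the first two documents share one bin, and the mask row for the first token of the second document allows attention only to itself, so packed documents cannot see each other.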
## How to use

1. First, install DistilCodec:

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```

2. Load the models:

```python
# Download the DistilCodec config and checkpoint first:
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000

from distilcodec import DistilCodec
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'

# Speech detokenizer that turns generated codes back into audio.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

# The TTS language model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1', torch_dtype='auto').cuda()
```

3. Generate:

```python
import re

import soundfile as sf
from tqdm import tqdm

speakers = [
    'husein',
    'idayu',
    'singaporean',
    'DisfluencySpeech',
    'singlish-speaker2050',
    'singlish-speaker2202',
    'haqkiem',
]

string = 'IC saya adalah, sembilan enam, kosong tiga, satu empat, one, one, one, one, A, B, C, D, D, yes, Husein is very cute, cute, cute.'

for s in tqdm(speakers):
    # Prompt format: speaker name, the text, then the speech start marker.
    left = s + ': ' + string
    prompt = f'<|im_start|>{left}<|speech_start|>'

    generate_kwargs = dict(
        **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
    )
    generation_output = model.generate(**generate_kwargs)

    # Keep only the generated speech tokens and extract their integer codes.
    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    numbers = re.findall(r'speech_(\d+)', speech_token)
    d = list(map(int, numbers))

    # Decode the codes into a waveform and save it at 24 kHz.
    y_gen = codec.decode_from_codes(
        d,
        minus_token_offset=False
    )
    sf.write(f'{s}.mp3', y_gen[0, 0].cpu().numpy(), 24000)
```

Output:

1. [husein-v1.mp3](husein-v1.mp3)
2. [idayu-v1.mp3](idayu-v1.mp3)
3. [singaporean-v1.mp3](singaporean-v1.mp3)
4. [DisfluencySpeech-v1.mp3](DisfluencySpeech-v1.mp3)
5. [singlish-speaker2050-v1.mp3](singlish-speaker2050-v1.mp3)
6. [singlish-speaker2202-v1.mp3](singlish-speaker2202-v1.mp3)
7. [haqkiem-v1.mp3](haqkiem-v1.mp3)

**Only `singlish-speaker2202` and `haqkiem` had to be generated twice to get output that follows the input text exactly**.

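The prompt format and the code extraction used in step 3 can be checked without a GPU by isolating them as two small functions. This is only a sketch: `build_prompt` and `extract_codes` are hypothetical helper names, and the `<|speech_N|>` token spelling in the sample string is an assumption inferred from the `speech_(\d+)` regular expression, not confirmed tokenizer output.

```python
import re
from typing import List


def build_prompt(speaker: str, text: str) -> str:
    """Build the prompt string in the same format as the generation loop."""
    return f'<|im_start|>{speaker}: {text}<|speech_start|>'


def extract_codes(decoded: str) -> List[int]:
    """Pull the integer speech codes out of a decoded generation."""
    speech = decoded.split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    return [int(n) for n in re.findall(r'speech_(\d+)', speech)]


prompt = build_prompt('husein', 'satu dua tiga')
# Hypothetical decoded output; the real token spelling comes from the tokenizer.
decoded = prompt + '<|speech_12|><|speech_7|><|speech_301|><|endoftext|>'
codes = extract_codes(decoded)
```

The resulting list of integers is what gets passed to `codec.decode_from_codes` in step 3.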
## Limitations

1. This model was trained on normalized text, so text such as `123` must first be normalized to `one two three`, `one hundred twenty three`, `satu dua tiga`, or `seratus dua puluh tiga`. Feel free to use Malaya for normalization; it supports both Malay and English. Read more at https://github.com/mesolitica/malaya/issues/247#issuecomment-3030313021
2. The repetitive pronunciation dataset does not consistently use commas for pauses. For example, `A, A, A, A, B, B` is spoken as `A A A A B B` in our recordings. We do not intend to improve this because of cost, but continued finetuning on a proper dataset should be able to solve it.

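As a rough illustration of the first limitation, the sketch below spells out digits one by one before the text is sent to the model. It is not Malaya's API: `spell_digits` and the `DIGITS` table are hypothetical, and the sketch covers only the digit-by-digit reading (`one two three`, `satu dua tiga`), not the cardinal reading (`one hundred twenty three`).

```python
import re

# Digit names for English ('en') and Malay ('ms').
DIGITS = {
    'en': ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'],
    'ms': ['kosong', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh', 'lapan', 'sembilan'],
}


def spell_digits(text: str, lang: str = 'en') -> str:
    """Replace every run of ASCII digits with its digit-by-digit spelling."""
    words = DIGITS[lang]

    def repl(match: re.Match) -> str:
        return ' '.join(words[int(ch)] for ch in match.group())

    return re.sub(r'[0-9]+', repl, text)


english = spell_digits('My number is 123')
malay = spell_digits('Nombor saya 123', lang='ms')
```

For real inputs with dates, currency, or mixed-language text, a full normalizer such as Malaya is the better choice.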
## Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for 1x H100!