| library_name | tags |
|---|---|
| transformers | |
# Malaysian-TTS-1.7B-v1
Continued pretraining of mesolitica/Malaysian-TTS-1.7B-v0.1 on a more consistent dataset:
- Uses DistilCodec as the speech detokenizer; output at a 24 kHz sample rate.
- Supports context switching between Malay and English.
- Better pronunciation of letters.
- Better tolerance of repetition.
## Speakers
- husein
- idayu
- singaporean
- DisfluencySpeech
- singlish-speaker2050
- singlish-speaker2202
- haqkiem (private dataset)
## How do we train
- Multipacking with proper document masking at 4096 context length.
- FP32-BF16 mixed precision training.
- Full parameter finetuning.
- WandB at https://wandb.ai/huseinzol05/Malaysian-TTS-1.7B-v1
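The "multipacking with proper document masking" bullet means that several documents are packed into one 4096-token sequence, but each token may only attend to earlier tokens of its *own* document. A minimal sketch of such a block-diagonal causal mask; `document_mask` is our own illustrative helper, not part of the actual training code:

```python
def document_mask(doc_lengths, context_length):
    """Block-diagonal causal mask for multipacking: position i may attend to
    position j only if both belong to the same packed document and j <= i."""
    doc_ids = []
    for d, n in enumerate(doc_lengths):
        doc_ids += [d] * n
    doc_ids += [-1] * (context_length - len(doc_ids))  # -1 marks padding
    return [[doc_ids[i] == doc_ids[j] and doc_ids[i] >= 0 and j <= i
             for j in range(context_length)]
            for i in range(context_length)]

# Two packed documents of length 3 and 2 in an 8-token context.
mask = document_mask([3, 2], context_length=8)
```

Tokens of the second document (positions 3-4) cannot attend back into the first document, and padding positions attend to nothing.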
## How to use
- First install DistilCodec,

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```
- Load the models,

```python
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000

from distilcodec import DistilCodec, demo_for_generate_audio_codes
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1', torch_dtype='auto').cuda()
```
- Generate,

```python
import soundfile as sf
import re
from tqdm import tqdm

speakers = [
    'husein',
    'idayu',
    'singaporean',
    'DisfluencySpeech',
    'singlish-speaker2050',
    'singlish-speaker2202',
    'haqkiem',
]

string = 'IC saya adalah, sembilan enam, kosong tiga, satu empat, one, one, one, one, A, B, C, D, D, yes, Husein is very cute, cute, cute.'

for s in tqdm(speakers):
    left = s + ': ' + string
    prompt = f'<|im_start|>{left}<|speech_start|>'
    generate_kwargs = dict(
        **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
    )
    generation_output = model.generate(**generate_kwargs)
    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    numbers = re.findall(r'speech_(\d+)', speech_token)
    d = list(map(int, numbers))
    y_gen = codec.decode_from_codes(
        d,
        minus_token_offset=False,
    )
    sf.write(f'{s}.mp3', y_gen[0, 0].cpu().numpy(), 24000)
```
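The loop above recovers codec indices from the model's generated speech tokens with a regex. A standalone illustration of that parsing step, on a made-up decoded string in the `speech_NNN` format the regex targets:

```python
import re

# Hypothetical detokenized output, everything after '<|speech_start|>'.
decoded = '<|speech_12|><|speech_345|><|speech_6|>'

# Pull out the integer codec indices, in order of appearance.
codes = [int(n) for n in re.findall(r'speech_(\d+)', decoded)]
print(codes)  # [12, 345, 6]
```

These integers are exactly the `d` list passed to `codec.decode_from_codes` above.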
Output,
- husein-v1.mp3
- idayu-v1.mp3
- singaporean-v1.mp3
- DisfluencySpeech-v1.mp3
- singlish-speaker2050-v1.mp3
- singlish-speaker2202-v1.mp3
- haqkiem-v1.mp3
Only singlish-speaker2202 and haqkiem needed two generations to produce output that follows the exact text input.
## Limitation
- This model was trained on normalized text, so text such as `123` must be normalized first to become `one two three`, `one hundred twenty three`, `satu dua tiga`, or `seratus dua puluh tiga`. Feel free to use Malaya for normalization; Malaya supports both Malay and English normalization, read more at https://github.com/mesolitica/malaya/issues/247#issuecomment-3030313021
- The repetitive pronunciation dataset does not consistently use commas for pauses. For example, `A, A, A, A, B, B` in our recordings is spoken as `A A A A B B`. We have no intention to improve this due to cost, but continued finetuning on a proper dataset should be able to solve it.
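As a minimal sketch of the digit-spelling normalization described above (Malaya, linked above, is the full solution; `spell_digits` is our own illustrative helper):

```python
import re

# Standard Malay digit names.
MALAY_DIGITS = {'0': 'kosong', '1': 'satu', '2': 'dua', '3': 'tiga',
                '4': 'empat', '5': 'lima', '6': 'enam', '7': 'tujuh',
                '8': 'lapan', '9': 'sembilan'}

def spell_digits(text):
    """Spell out each digit run digit-by-digit, e.g. '123' -> 'satu dua tiga'."""
    return re.sub(r'\d+',
                  lambda m: ' '.join(MALAY_DIGITS[d] for d in m.group()),
                  text)

print(spell_digits('IC saya 96'))  # IC saya sembilan enam
```

This only handles digit-by-digit spelling; full number words ("seratus dua puluh tiga") and English output need a proper normalizer such as Malaya.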
## Source code
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts
## Acknowledgement
Special thanks to https://www.sns.com.my and Nvidia for 1x H100!