| library_name | tags |
|---|---|
| transformers | |
# Malaysian-TTS-1.7B-v1
Continued pretraining of mesolitica/Malaysian-TTS-1.7B-v0.1 on a more consistent dataset:
- Uses DistilCodec as the speech detokenizer; output at a 24 kHz sample rate.
- Supports context switching between Malay and English.
- Better pronunciation of letters.
- Better tolerance of repetition.
## Speakers
- husein
- idayu
- singaporean
- DisfluencySpeech
- singlish-speaker2050
- singlish-speaker2202
- haqkiem (private dataset)
## How do we train
- Multipacking with proper document masking at 4096 context length.
- FP32-BF16 mixed precision training.
- Full parameter finetuning.
- WandB at https://wandb.ai/huseinzol05/Malaysian-TTS-1.7B-v1
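The "multipacking with proper document masking" bullet means that several documents are packed into one 4096-token sequence, but each token may only attend to earlier tokens of its *own* document. A minimal sketch of such a block-diagonal causal mask; `document_mask` is our own illustrative helper, not part of the actual training code:

```python
def document_mask(doc_lengths, context_length):
    """Block-diagonal causal mask for multipacking: position i may attend to
    position j only if both belong to the same packed document and j <= i."""
    doc_ids = []
    for d, n in enumerate(doc_lengths):
        doc_ids += [d] * n
    doc_ids += [-1] * (context_length - len(doc_ids))  # -1 marks padding
    return [[doc_ids[i] == doc_ids[j] and doc_ids[i] >= 0 and j <= i
             for j in range(context_length)]
            for i in range(context_length)]

# Two packed documents of length 3 and 2 in an 8-token context.
mask = document_mask([3, 2], context_length=8)
```

Tokens of the second document (positions 3-4) cannot attend back into the first document, and padding positions attend to nothing.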
## How to use
- First install DistilCodec,

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```
- Load the models,

```python
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000

from distilcodec import DistilCodec, demo_for_generate_audio_codes
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-1.7B-v1', torch_dtype='auto').cuda()
```
- Generate,

```python
import soundfile as sf
import re
from tqdm import tqdm

speakers = [
    'husein',
    'idayu',
    'singaporean',
    'DisfluencySpeech',
    'singlish-speaker2050',
    'singlish-speaker2202',
    'haqkiem',
]

string = 'IC saya adalah, sembilan enam, kosong tiga, satu empat, one, one, one, one, A, B, C, D, D, yes, Husein is very cute, cute, cute.'

for s in tqdm(speakers):
    left = s + ': ' + string
    prompt = f'<|im_start|>{left}<|speech_start|>'
    generate_kwargs = dict(
        **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
    )
    generation_output = model.generate(**generate_kwargs)
    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    numbers = re.findall(r'speech_(\d+)', speech_token)
    d = list(map(int, numbers))
    y_gen = codec.decode_from_codes(
        d,
        minus_token_offset=False,
    )
    sf.write(f'{s}.mp3', y_gen[0, 0].cpu().numpy(), 24000)
```
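The loop above recovers codec indices from the model's generated speech tokens with a regex. A standalone illustration of that parsing step, on a made-up decoded string in the `speech_NNN` format the regex targets:

```python
import re

# Hypothetical detokenized output, everything after '<|speech_start|>'.
decoded = '<|speech_12|><|speech_345|><|speech_6|>'

# Pull out the integer codec indices, in order of appearance.
codes = [int(n) for n in re.findall(r'speech_(\d+)', decoded)]
print(codes)  # [12, 345, 6]
```

These integers are exactly the `d` list passed to `codec.decode_from_codes` above.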
Output,
- husein-v1.mp3
- idayu-v1.mp3
- singaporean-v1.mp3
- DisfluencySpeech-v1.mp3
- singlish-speaker2050-v1.mp3
- singlish-speaker2202-v1.mp3
- haqkiem-v1.mp3
Only singlish-speaker2202 and haqkiem needed two generations to produce output that follows the exact text input.
## Limitation
- This model was trained on normalized text, so text such as `123` must be normalized first to become `one two three`, `one hundred twenty three`, `satu dua tiga`, or `seratus dua puluh tiga`. Feel free to use Malaya for normalization; Malaya supports both Malay and English normalization, read more at https://github.com/mesolitica/malaya/issues/247#issuecomment-3030313021
- The repetitive pronunciation dataset does not consistently use commas for pauses. For example, `A, A, A, A, B, B` in our recordings is spoken as `A A A A B B`. We have no intention to improve this due to cost, but continued finetuning on a proper dataset should be able to solve it.
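As a minimal sketch of the digit-spelling normalization described above (Malaya, linked above, is the full solution; `spell_digits` is our own illustrative helper):

```python
import re

# Standard Malay digit names.
MALAY_DIGITS = {'0': 'kosong', '1': 'satu', '2': 'dua', '3': 'tiga',
                '4': 'empat', '5': 'lima', '6': 'enam', '7': 'tujuh',
                '8': 'lapan', '9': 'sembilan'}

def spell_digits(text):
    """Spell out each digit run digit-by-digit, e.g. '123' -> 'satu dua tiga'."""
    return re.sub(r'\d+',
                  lambda m: ' '.join(MALAY_DIGITS[d] for d in m.group()),
                  text)

print(spell_digits('IC saya 96'))  # IC saya sembilan enam
```

This only handles digit-by-digit spelling; full number words ("seratus dua puluh tiga") and English output need a proper normalizer such as Malaya.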
## Source code
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts
## Acknowledgement
Special thanks to https://www.sns.com.my and Nvidia for 1x H100!