---
language:
- id
license: other
license_name: tongyi-qianwen
---

# Bahasa-4b Model Report

## Model Name
**Bahasa-4b**

## Model Detail
Bahasa-4b is qwen-4b with continued training on a 10-billion subset of high-quality Indonesian text. On Indonesian tasks it outperforms some 4b models and, on several benchmarks, even 7b models.

## Model Developers
Bahasa AI

## Intended Use
This model is intended for NLP tasks that require understanding and generating Indonesian text. It is suitable for applications such as question answering, sentiment analysis, and document summarization.
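
As a rough illustration of how the tasks named above might be phrased for a chat model, the sketch below builds message lists in the same `role`/`content` format used by the usage example in this README. The `build_messages` helper, the task keys, and the Indonesian instruction wording are all hypothetical and are not part of the model or its documentation.

```python
from typing import Dict, List

# System prompt taken from the usage example in this README.
SYSTEM_PROMPT = "Kamu adalah asisten yang membantu"

# Hypothetical instruction templates, one per task named in "Intended Use".
TASK_TEMPLATES: Dict[str, str] = {
    "question_answering": "Jawab pertanyaan berikut: {text}",
    "sentiment_analysis": (
        "Tentukan sentimen (positif, negatif, atau netral) dari teks berikut: {text}"
    ),
    "summarization": "Ringkas dokumen berikut dalam dua kalimat: {text}",
}


def build_messages(task: str, text: str) -> List[Dict[str, str]]:
    """Return a chat-style message list for the given task and input text."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": TASK_TEMPLATES[task].format(text=text)},
    ]


messages = build_messages(
    "sentiment_analysis", "Pelayanan restoran ini sangat memuaskan."
)
print(messages[1]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in place of a hand-written `messages` list.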

## Training Data
Bahasa-4b was trained on a 10-billion subset of Indonesian data, selected from a collected pool of 100 billion.

## Benchmarks
The following table shows the performance of Bahasa-4b compared to the models Sailor_4b and Mistral-7B-v0.1 across several benchmarks:

| Dataset        | Version | Metric | Mode | Sailor_4b | Bahasa-4b-hf | Mistral-7B-v0.1 |
|----------------|---------|--------|------|-----------|--------------|-----------------|
| tydiqa-id      | 0e9309  | EM     | gen  | 53.98     | 55.04        | 63.54           |
| tydiqa-id      | 0e9309  | F1     | gen  | 73.48     | 75.39        | 78.73           |
| xcopa-id       | 36c11c  | EM     | ppl  | 69.2      | 73.2         | 62.40           |
| xcopa-id       | 36c11c  | F1     | ppl  | 69.2      | 73.2         | -               |
| m3exam-id-ppl  | ede415  | EM     | ppl  | 31.27     | 44.47        | 26.68           |
| belebele-id-ppl| 7fe030  | EM     | ppl  | 41.33     | 42.33        | 41.33           |
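
The table reports EM (Exact Match) and F1 without spelling out how they are computed. The sketch below assumes the common SQuAD-style definitions: EM is 1 when the normalized prediction equals the normalized reference, and F1 is the harmonic mean of token-level precision and recall. The scorer actually used to produce the table is not stated here and may normalize text differently, so treat the function names and the normalization as assumptions.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (a simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a single reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Jakarta Pusat", "Jakarta"))
print(round(token_f1("Jakarta Pusat", "Jakarta"), 4))
```

A partially correct answer such as the one above scores zero on EM but receives partial credit on F1, which is why F1 is higher than EM on the tydiqa-id rows.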


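The following example loads the chat model with `transformers` and generates a reply to a single Indonesian prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; device_map="auto" lets transformers place the weights
# on the available device(s) and requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    "Bahasalab/Bahasa-4b-chat-v2",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Bahasalab/Bahasa-4b-chat")

# Conversation in chat format: a system prompt followed by the user turn.
messages = [
    {"role": "system", "content": "Kamu adalah asisten yang membantu"},
    {"role": "user", "content": "kamu siapa"}
]

# Render the conversation into a single prompt string that ends with
# the marker for the assistant's turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move the inputs to the device the model was loaded on.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    input_ids=model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id
)

# Drop the prompt tokens so that only the newly generated reply remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```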

In the benchmark table above, Bahasa-4b scores higher than Sailor_4b on every listed dataset, in both EM (Exact Match) and F1. Against Mistral-7B-v0.1 it trails on tydiqa-id but scores higher on xcopa-id, m3exam-id-ppl, and belebele-id-ppl.
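
The differences behind this comparison can be computed directly from the benchmark table. The sketch below copies the reported scores into a dictionary and prints Bahasa-4b's margin over each baseline. The `SCORES` layout and the `deltas` helper are illustrative only, and the missing Mistral-7B-v0.1 F1 value for xcopa-id is kept as `None` rather than filled in.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[float, float, Optional[float]]

# (dataset, metric) -> (Sailor_4b, Bahasa-4b-hf, Mistral-7B-v0.1),
# copied from the benchmark table in this README.
SCORES: Dict[Tuple[str, str], Row] = {
    ("tydiqa-id", "EM"): (53.98, 55.04, 63.54),
    ("tydiqa-id", "F1"): (73.48, 75.39, 78.73),
    ("xcopa-id", "EM"): (69.2, 73.2, 62.40),
    ("xcopa-id", "F1"): (69.2, 73.2, None),  # no Mistral score reported
    ("m3exam-id-ppl", "EM"): (31.27, 44.47, 26.68),
    ("belebele-id-ppl", "EM"): (41.33, 42.33, 41.33),
}


def deltas(
    scores: Dict[Tuple[str, str], Row]
) -> List[Tuple[str, str, float, Optional[float]]]:
    """Return Bahasa-4b's margin over each baseline, rounded to 2 decimals."""
    rows = []
    for (dataset, metric), (sailor, bahasa, mistral) in scores.items():
        vs_sailor = round(bahasa - sailor, 2)
        vs_mistral = None if mistral is None else round(bahasa - mistral, 2)
        rows.append((dataset, metric, vs_sailor, vs_mistral))
    return rows


for dataset, metric, vs_sailor, vs_mistral in deltas(SCORES):
    mistral_text = "n/a" if vs_mistral is None else f"{vs_mistral:+.2f}"
    print(
        f"{dataset:16s} {metric:2s} "
        f"vs Sailor_4b: {vs_sailor:+.2f}  vs Mistral-7B-v0.1: {mistral_text}"
    )
```

A positive number means Bahasa-4b scored higher than the baseline on that row.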