---
library_name: transformers
license: llama3.1
language:
- ko
- vi
- id
- km
- th
metrics:
- bleu
- rouge
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

[QuantFactory](https://hf.co/QuantFactory)

# QuantFactory/llama-3.1-Asian-Bllossom-8B-Translator-GGUF

This is a quantized version of [MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator](https://huggingface.co/MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator), created using llama.cpp.

# Original Model Card

# Model Card for Model ID

This model is a multilingual translation model fine-tuned from the Llama 3.1 8B Instruct base model. It translates in both directions between any pair of the following Asian languages:

- Korean
- Vietnamese
- Indonesian
- Cambodian (Khmer)
- Thai

## Acknowledgements

AICA <img src="https://aica-gj.kr/images/logo.png" width="20%" height="20%">

## Model Details

The model is designed for translating short text segments between any pair of the supported languages.

Supported language pairs:

- Korean ↔ Vietnamese
- Korean ↔ Indonesian
- Korean ↔ Cambodian
- Korean ↔ Thai
- Vietnamese ↔ Indonesian
- Vietnamese ↔ Cambodian
- Vietnamese ↔ Thai
- Indonesian ↔ Cambodian
- Indonesian ↔ Thai
- Cambodian ↔ Thai

### Model Description

This model is optimized for translation among Korean and four Southeast Asian languages, with the goal of enabling communication between these language communities.

It was trained on 20M examples, 1M for each of the 20 translation directions, which gives it a solid foundation for common expressions and basic conversation in these languages.
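
The 20M figure follows from the language list: five languages give 5 × 4 = 20 ordered source/target directions, at 1M examples each. A minimal sketch of that count, where `LANGUAGES`, `directions`, and `EXAMPLES_PER_DIRECTION` are illustrative names and not part of the model's code:

```python
from itertools import permutations

# Language names as used in this model card.
LANGUAGES = ["Korean", "Vietnamese", "Indonesian", "Cambodian", "Thai"]

# Every ordered (source, target) pair with source != target.
directions = list(permutations(LANGUAGES, 2))

EXAMPLES_PER_DIRECTION = 1_000_000  # 1M per direction, as stated in this card
total_examples = len(directions) * EXAMPLES_PER_DIRECTION

print(len(directions))  # 20
print(total_examples)   # 20000000
```

Each of the ten pairs listed under "Supported language pairs" therefore accounts for two directions.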

### Model Architecture

Base Model: meta-llama/Llama-3.1-8B-Instruct

## Bias, Risks, and Limitations

- Performance is limited to short sentences and phrases
- May not handle complex or lengthy text effectively
- Translation quality may vary depending on language pair and content complexity

## Evaluation results

| Source Language | Target Language | BLEU Score | ROUGE-1 | ROUGE-L |
|-----------------|-----------------|------------|---------|---------|
| Korean | Vietnamese | 56.70 | 81.64 | 76.66 |
| Korean | Cambodian | 71.69 | 89.26 | 88.20 |
| Korean | Indonesian | 58.32 | 80.39 | 76.63 |
| Korean | Thai | 63.26 | 78.88 | 72.29 |
| Vietnamese | Korean | 49.01 | 75.57 | 72.74 |
| Vietnamese | Cambodian | 78.26 | 90.74 | 90.32 |
| Vietnamese | Indonesian | 65.96 | 83.08 | 81.46 |
| Vietnamese | Thai | 65.93 | 81.09 | 76.57 |
| Cambodian | Korean | 49.10 | 72.67 | 69.75 |
| Cambodian | Vietnamese | 63.42 | 81.56 | 79.09 |
| Cambodian | Indonesian | 61.41 | 79.67 | 77.75 |
| Cambodian | Thai | 70.91 | 81.85 | 77.66 |
| Indonesian | Korean | 53.61 | 77.14 | 74.29 |
| Indonesian | Vietnamese | 68.21 | 85.41 | 83.10 |
| Indonesian | Cambodian | 78.84 | 90.81 | 90.35 |
| Indonesian | Thai | 67.12 | 81.54 | 77.19 |
| Thai | Korean | 45.59 | 72.48 | 69.46 |
| Thai | Vietnamese | 61.55 | 81.01 | 78.24 |
| Thai | Cambodian | 78.52 | 91.47 | 91.16 |
| Thai | Indonesian | 58.99 | 78.56 | 76.40 |
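
To see how the scores vary with the target language, the BLEU column can be averaged per target. The sketch below is only a reading aid: the `BLEU` dictionary copies the table's values, and `mean_bleu_by_target` is an illustrative helper, not part of the evaluation code.

```python
from collections import defaultdict

# (source, target) -> BLEU, copied from the evaluation table.
BLEU = {
    ("Korean", "Vietnamese"): 56.70, ("Korean", "Cambodian"): 71.69,
    ("Korean", "Indonesian"): 58.32, ("Korean", "Thai"): 63.26,
    ("Vietnamese", "Korean"): 49.01, ("Vietnamese", "Cambodian"): 78.26,
    ("Vietnamese", "Indonesian"): 65.96, ("Vietnamese", "Thai"): 65.93,
    ("Cambodian", "Korean"): 49.10, ("Cambodian", "Vietnamese"): 63.42,
    ("Cambodian", "Indonesian"): 61.41, ("Cambodian", "Thai"): 70.91,
    ("Indonesian", "Korean"): 53.61, ("Indonesian", "Vietnamese"): 68.21,
    ("Indonesian", "Cambodian"): 78.84, ("Indonesian", "Thai"): 67.12,
    ("Thai", "Korean"): 45.59, ("Thai", "Vietnamese"): 61.55,
    ("Thai", "Cambodian"): 78.52, ("Thai", "Indonesian"): 58.99,
}

def mean_bleu_by_target(scores):
    """Average BLEU over all source languages for each target language."""
    grouped = defaultdict(list)
    for (_source, target), bleu in scores.items():
        grouped[target].append(bleu)
    return {target: sum(vals) / len(vals) for target, vals in grouped.items()}

# Print targets from highest to lowest mean BLEU.
means = mean_bleu_by_target(BLEU)
for target, mean in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{target:<11}{mean:.2f}")
```

On these numbers, Cambodian as a target averages highest (about 76.8 BLEU) and Korean lowest (about 49.3). BLEU depends on how each language is tokenized, so comparisons across target languages should be read with caution.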

## Example

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    "MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "MLP-KTLim/llama-3.1-Asian-Bllossom-8B-Translator",
)

# Korean source sentence: "Hello? This is an Asian language translation model."
input_text = "안녕하세요? 아시아 언어 번역 모델 입니다."


def get_input_ids(source_lang, target_lang, message):
    """Build the chat prompt for one translation request and tokenize it."""
    assert source_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]
    assert target_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]

    input_ids = tokenizer.apply_chat_template(
        conversation=[
            {"role": "system", "content": f"You are a useful translation AI. Please translate the sentence given in {source_lang} into {target_lang}."},
            {"role": "user", "content": message},
        ],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True,
    )
    return input_ids


input_ids = get_input_ids(
    source_lang="Korean",
    target_lang="Vietnamese",
    message=input_text,
)

output = model.generate(
    input_ids.to(model.device),
    max_new_tokens=128,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True))
```
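
The prompt used in the example can also be built without loading the tokenizer, which may help when the messages are handed to a different chat-capable runtime. The following is a sketch under that assumption: `build_messages` and `SUPPORTED_LANGUAGES` are hypothetical names introduced here, and the system prompt is copied verbatim from the example.

```python
# Language names accepted by the example's assertions.
SUPPORTED_LANGUAGES = ("Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian")


def build_messages(source_lang, target_lang, message):
    """Return the chat messages for one translation request."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {lang!r}")
    system_prompt = (
        "You are a useful translation AI. Please translate the sentence "
        f"given in {source_lang} into {target_lang}."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": message},
    ]


messages = build_messages("Korean", "Vietnamese", "안녕하세요?")
print(messages[0]["content"])
```

The returned list can be passed as the `conversation` argument of `tokenizer.apply_chat_template`, exactly as in the example.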

## Contributors

- 원인호 (wih1226@seoultech.ac.kr)
- 김민준 (mjkmain@seoultech.ac.kr)