Model: Bohanlu/Taigi-Llama-2-7B
Source: Original Platform
2026-05-14 06:02:54 +08:00

---
license: cc-by-nc-sa-4.0
---
<p align="center">
<img src="https://github.com/lbh0830/TW-Hokkien-LLM/blob/main/pics/logo.jpg?raw=true" alt="Taigi-llama-logo" width="350">
</p>
# Model Card for Taigi-Llama-2-7B
The Taigi-Llama-2 series is built on the Traditional Chinese version of the LLaMA-2 model. We continued pre-training on around 78 MB of web-scraped Taiwanese Hokkien text, including Hanzi, POJ, and Hanlo.
For more details, please refer to our [GitHub repository](https://github.com/lbh0830/TW-Hokkien-LLM/tree/main) and the paper [Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems](https://arxiv.org/abs/2403.12024).
Explore other models and datasets in the [Taiwanese Hokkien LLM collection](https://huggingface.co/collections/Bohanlu/taiwanese-hokkien-llm-6614ba7456e6789bc2f10ca0).
## Model description
- **Usage:** This model can be used for causal language modeling tasks in Taiwanese Hokkien. It is also suitable for further fine-tuning on specific datasets for downstream tasks.
- **Language(s) (NLP):** The primary language is Taiwanese Hokkien (Hanzi and POJ). The model also retains capabilities in English and Mandarin Chinese due to prior pre-training.
- **Input:** Text
- **Output:** Text
- **Model Size:** 7B parameters
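As a rough guide to hardware requirements, the memory needed for the weights alone can be estimated from the parameter count and the data type. The sketch below assumes the nominal figure of 7B parameters and the `float16` precision used in the usage example; the helper `weight_memory_gib` is hypothetical, and the estimate ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Assumption: nominal 7B parameters; the exact count may differ.
fp16_gib = weight_memory_gib(7e9, 2)  # float16 stores 2 bytes per parameter
fp32_gib = weight_memory_gib(7e9, 4)  # float32 stores 4 bytes per parameter

print(f"float16: ~{fp16_gib:.1f} GiB, float32: ~{fp32_gib:.1f} GiB")
```

Actual usage during generation will be higher than this weights-only figure.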
## Usage Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
import torch
import accelerate
def get_pipeline(path: str, tokenizer: AutoTokenizer, accelerator: accelerate.Accelerator) -> TextGenerationPipeline:
    # Load the weights in half precision and let device_map='auto' place them on the available devices.
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True)
    # Stop generation at either the EOS or the PAD token.
    terminators = [tokenizer.eos_token_id, tokenizer.pad_token_id]
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, num_workers=accelerator.state.num_processes*4, pad_token_id=tokenizer.pad_token_id, eos_token_id=terminators)
    return pipeline
model_dir = "Bohanlu/Taigi-Llama-2-7B" # or Bohanlu/Taigi-Llama-2-13B for the 13B model
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
accelerator = accelerate.Accelerator()
pipe = get_pipeline(model_dir, tokenizer, accelerator)
# Few-shot question-answering example (the prompt is written in Taiwanese Hokkien Hanzi)
qa_prompt = """Example 1:
問題:台北101有偌懸?
答案:台北101的高度是五百空八公尺。
Example 2:
問題:台灣上長的溪仔是佗一條?
答案:台灣上長的溪仔是濁水溪,規个長度有百八公里遐爾長。
Example 3:
問題:臺灣上懸的山是啥物?
答案:"""
print(pipe(qa_prompt, return_full_text=False))
# Output: [{'generated_text': '臺灣上懸的山是玉山,海拔三千九百五十二公尺。'}]
```
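The few-shot prompt above follows a regular layout: numbered `Example N:` blocks, each with a `問題:` (question) line and a `答案:` (answer) line, ending with an open answer for the model to complete. A minimal sketch of a helper that assembles such a prompt is shown below. The function `build_few_shot_prompt` is hypothetical and not part of the model or of `transformers`; it only mirrors the format used in this README.

```python
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format solved (question, answer) pairs followed by one open question."""
    lines = []
    for i, (q, a) in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(f"問題:{q}")
        lines.append(f"答案:{a}")
    # The final block leaves the answer empty so the model continues from it.
    lines.append(f"Example {len(examples) + 1}:")
    lines.append(f"問題:{question}")
    lines.append("答案:")
    return "\n".join(lines)

examples = [
    ("台北101有偌懸?", "台北101的高度是五百空八公尺。"),
    ("台灣上長的溪仔是佗一條?", "台灣上長的溪仔是濁水溪,規个長度有百八公里遐爾長。"),
]
qa_prompt = build_few_shot_prompt(examples, "臺灣上懸的山是啥物?")
print(qa_prompt)
```

The resulting string can be passed to `pipe(qa_prompt, return_full_text=False)` in the same way as the hand-written prompt.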
## Citation
If you find the resources in the Taiwanese Hokkien LLM collection useful in your work, please cite it using the following reference:
```
@misc{lu2024enhancing,
      title={Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems},
      author={Bo-Han Lu and Yi-Hsuan Lin and En-Shiun Annie Lee and Richard Tzong-Han Tsai},
      year={2024},
      eprint={2403.12024},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```