hon9kon9ize/CantoneseLLM-6B-preview202402


ModelHub XC f41478c47c Initialize project; model provided by the ModelHub XC community

Model: hon9kon9ize/CantoneseLLM-6B-preview202402
Source: Original Platform

2026-04-11 03:58:56 +08:00


---
language:
- yue
license: other
license_name: yi-license
license_link: https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE
pipeline_tag: text-generation
model-index:
- name: CantoneseLLM-6B-preview202402
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 55.63
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 75.8
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 63.07
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 42.26
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 74.11
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 30.71
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
---

CantoneseLLM

This model is a further pre-trained version of 01-ai/Yi-6B, trained on 800M tokens of Cantonese text compiled from various sources, including translated zh-yue Wikipedia, translated RTHK news (datasets/jed351/rthk_news), Cantonese-filtered CC100, and Cantonese textbooks generated by Gemini Pro.

This is a preview version for experimental use only. We will fine-tune it on downstream tasks and evaluate its performance.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	56.93
ARC (25-shot)	55.63
HellaSwag (10-shot)	75.8
MMLU (5-shot)	63.07
TruthfulQA (0-shot)	42.26
Winogrande (5-shot)	74.11
GSM8K (5-shot)	30.71
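
As a quick sanity check on the table above, the reported average appears to be the plain arithmetic mean of the six benchmark scores. The sketch below assumes that is how the leaderboard computes it; the `scores` dictionary is just the table re-typed, not an API of any library.

```python
# Scores copied from the results table above.
scores = {
    "ARC (25-shot)": 55.63,
    "HellaSwag (10-shot)": 75.8,
    "MMLU (5-shot)": 63.07,
    "TruthfulQA (0-shot)": 42.26,
    "Winogrande (5-shot)": 74.11,
    "GSM8K (5-shot)": 30.71,
}

# Unweighted mean across benchmarks, rounded to two decimals like the table.
average = round(sum(scores.values()) / len(scores), 2)
print(average)
```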

Usage

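from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "hon9kon9ize/CantoneseLLM-6B-preview202402"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The model is a decoder-only (causal) LM, so load it with AutoModelForCausalLM
# and move it to the same device as the inputs.
model = AutoModelForCausalLM.from_pretrained(model_id).to('cuda:0')

prompt = "歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港"

# max_length and temperature were not defined in the original snippet;
# the values below are illustrative assumptions, adjust as needed.
max_length = 200
temperature = 0.7

input_ids = tokenizer.encode(prompt, return_tensors="pt").to('cuda:0')
output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, repetition_penalty=1.1, do_sample=True, temperature=temperature, top_k=50, top_p=0.95)
output = tokenizer.decode(output[0], skip_special_tokens=True)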

# output: 歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港旅遊業可謂「起死回生」。
# 不過，旅遊業嘅復蘇之路並唔順利，香港遊客數量仍然遠低於疫前水平，而海外旅客亦只係恢復到疫情前約一半。有業界人士認為，當局需要進一步放寬入境檢疫措施，吸引更多國際旅客來港，令旅遊業得以真正復甦。
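
The `top_k=50` and `top_p=0.95` arguments in the call above restrict which tokens can be sampled at each step. The following is a minimal, standard-library illustration of that idea, not the transformers implementation: `top_k_top_p_filter` is a hypothetical helper, it works on a toy probability dictionary, and it ignores temperature and repetition penalty.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    # Step 1: keep only the top_k highest-probability tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Step 2: nucleus (top-p) cut over the renormalized distribution.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Step 3: renormalize the surviving tokens so they sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made-up values for illustration).
toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)
```

Lower `top_p` or `top_k` values make generation more conservative, while higher values allow rarer tokens to be sampled.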

Limitation and Bias

The model is intended for Cantonese language understanding and generation tasks; it may not be suitable for other Chinese languages. Although the model was trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, it may not be suitable for informal or dialectal Cantonese. It may also contain bias and misinformation, so please use it with caution.

We found that the model has not learned up-to-date Hong Kong knowledge well, possibly because the corpus is not large enough to override what the original model already learned. We will continue to improve the model and corpus in the future.
