ModelHub XC dfc5039340 Initialize project; model provided by the ModelHub XC community
Model: BAAI/Artificial-llama3_1_8B_instruct
Source: Original Platform
2026-05-18 11:38:45 +08:00

---
license: apache-2.0
datasets:
- BAAI/IndustryInstruction_Artificial-Intelligence
- BAAI/IndustryInstruction
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- artificial intelligence domain language model
- machine learning domain language model
---
This model is fine-tuned from llama3.1-8b-instruct on the [BAAI/IndustryInstruction_Artificial-Intelligence](https://huggingface.co/datasets/BAAI/IndustryInstruction_Artificial-Intelligence) dataset. For details on the dataset, see the [BAAI/IndustryInstruction](https://huggingface.co/datasets/BAAI/IndustryInstruction) repo.
## Training params
The model was trained with the llama-factory framework, using `template=llama3` and the following parameters:
```
learning_rate=1e-5
lr_scheduler_type=cosine
max_length=2048
warmup_ratio=0.05
batch_size=64
epoch=10
```
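
To make the interaction between `learning_rate`, `lr_scheduler_type` and `warmup_ratio` concrete, here is a minimal sketch of the resulting learning-rate curve. It assumes the common shape of linear warmup followed by cosine decay to zero; the function name `lr_at_step` and the value of `total_steps` are hypothetical and chosen only for illustration, since the real step count depends on the dataset size.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.05):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    Assumption: this mirrors the usual cosine-with-warmup shape; it is not
    taken from the training code of this model.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; depends on dataset size, batch_size and epoch
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at_step(step, total_steps))
```

With these settings the rate peaks at `learning_rate` once 5% of the steps have run and falls back towards zero by the end of training.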
The best checkpoint was selected by evaluation loss.
## Evaluation
Because no suitable evaluation benchmark is available, the model has not been evaluated.
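
Selecting a checkpoint by evaluation loss can be sketched as follows. The sketch assumes the training log is a list of dicts with `step` and `eval_loss` keys, in the style of the `log_history` list that the Hugging Face `Trainer` writes to `trainer_state.json`. The helper name `best_checkpoint_step` is hypothetical, and the log entries are made-up values for illustration, not results from this model.

```python
def best_checkpoint_step(log_history):
    """Return the (step, eval_loss) pair with the lowest evaluation loss."""
    # Keep only evaluation entries; training entries carry "loss" instead
    evals = [
        (entry["step"], entry["eval_loss"])
        for entry in log_history
        if "eval_loss" in entry
    ]
    if not evals:
        raise ValueError("no evaluation entries found")
    return min(evals, key=lambda pair: pair[1])


# Hypothetical log entries, for illustration only
log_history = [
    {"step": 100, "loss": 1.20},
    {"step": 100, "eval_loss": 1.10},
    {"step": 200, "eval_loss": 0.95},
    {"step": 300, "eval_loss": 1.02},
]
print(best_checkpoint_step(log_history))
```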
## How to use
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# ==================================================================
# [Author] : xiaofeng
# [Descriptions] :
# ==================================================================
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Llama 3 chat template; it is assigned to the tokenizer below.
llama3_jinja = """{% if messages[0]['role'] == 'system' %}
{% set offset = 1 %}
{% else %}
{% set offset = 0 %}
{% endif %}
{{ bos_token }}
{% for message in messages %}
{% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{% endif %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' }}
{% endfor %}
{% if add_generation_prompt %}
{{ '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\n\n' }}
{% endif %}"""
dtype = torch.bfloat16
model_dir = "MonteXiaofeng/Artificial-llama3_1_8B_instruct"
# Load the model weights in bfloat16 onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    torch_dtype=dtype,
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
tokenizer.chat_template = llama3_jinja  # update template
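message = [
    {"role": "system", "content": "You are a helpful assistant"},
    {
        "role": "user",
        # English: "Please explain why the logic and accuracy of ChatGPT's
        # English output are far better than those of its Chinese output."
        "content": "请解释为什么ChatGPT的英文输出结果的逻辑性和准确性远大于中文结果",
    },
]
# Render the conversation into a single prompt string
prompt = tokenizer.apply_chat_template(
    message, tokenize=False, add_generation_prompt=True
)
print(prompt)
# The template already emits bos_token, so special tokens are not added again
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
prompt_length = len(inputs[0])
print(f"prompt_length:{prompt_length}")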
# Sampling settings passed to model.generate
generating_args = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.5,
    "top_k": 15,
    "max_new_tokens": 512,
}
generate_output = model.generate(input_ids=inputs.to(model.device), **generating_args)
# Drop the prompt tokens and decode only the newly generated ones
response_ids = generate_output[:, prompt_length:]
response = tokenizer.batch_decode(
    response_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0]
print(f"response:{response}")
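"""
Example response (translated from the model's Chinese output):
The logic and accuracy of ChatGPT's English output may be far better than those of its Chinese output for the following reasons. First, ChatGPT's training data comes mainly from the large volume of English text on the internet, so it is more capable when handling English. Second, the grammar and vocabulary structure of English are relatively simple, which makes them easier for ChatGPT to understand and generate. In addition, ChatGPT's algorithmic optimizations may be better suited to English. For Chinese, however, because of its complex grammatical structure and large number of non-byte characters, ChatGPT may struggle to fully understand and generate accurate Chinese results.
"""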
```
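
The chat template above raises an exception when roles do not alternate user/assistant, with an optional leading system message. For multi-turn use it can help to check a conversation before rendering it. The following is a minimal plain-Python sketch that mirrors the template's condition; the helper name `check_roles` and the sample conversation are hypothetical and not part of the original script.

```python
def check_roles(messages):
    """Raise ValueError if roles do not alternate the way the template expects."""
    # A leading system message shifts the expected position of user turns by one
    offset = 1 if messages[0]["role"] == "system" else 0
    for index, message in enumerate(messages):
        # Same condition as the template: user turns must sit at indices
        # where index % 2 == offset, and non-user turns must not
        if (message["role"] == "user") != (index % 2 == offset):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )


# Hypothetical multi-turn conversation, for illustration only
conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is overfitting?"},
    {"role": "assistant", "content": "It is when a model memorizes its training data."},
    {"role": "user", "content": "How can it be reduced?"},
]
check_roles(conversation)  # returns None when the order is valid
```

A conversation that passes this check can be handed to `tokenizer.apply_chat_template` in the same way as `message` in the script above.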