ModelHub XC 00eda0164f Initialize project; model provided by the ModelHub XC community
Model: shibing624/llama-3-8b-instruct-262k-chinese
Source: Original Platform
2026-06-01 11:35:22 +08:00

---
language:
- zh
- en
license: other
tags:
- llama3
- chinese
- meta
pipeline_tag: text-generation
license_name: llama3
license_link: LICENSE
---
# llama-3-8b-instruct-262k-chinese
llama-3-8b-instruct-262k-chinese is a chat model fine-tuned from [Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k) with the ORPO method on the Chinese-English preference dataset [shibing624/DPO-En-Zh-20k-Preference](https://huggingface.co/datasets/shibing624/DPO-En-Zh-20k-Preference).
For deployment and training instructions, see the MedicalGPT GitHub repository: [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT)
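As a rough illustration of the ORPO objective used for fine-tuning, the sketch below computes the loss for one preference pair from length-averaged log-probabilities. It follows the general form of ORPO (a supervised loss on the chosen response plus a weighted odds-ratio term). The function names, the example log-probabilities, and the weight `lam=0.1` are assumptions for illustration and are not taken from the MedicalGPT training code.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp), where avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio penalty.

    `lam` is a hypothetical weight chosen only for this example.
    """
    sft = -chosen_avg_logp  # negative log-likelihood of the chosen response
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return sft + lam * odds_ratio_term


# Made-up log-probabilities: the loss is lower when the chosen response
# is the one the model already finds more likely.
preferred = orpo_loss(chosen_avg_logp=-0.5, rejected_avg_logp=-2.0)
swapped = orpo_loss(chosen_avg_logp=-2.0, rejected_avg_logp=-0.5)
print(f"preferred={preferred:.3f} swapped={swapped:.3f}")
```

Unlike DPO, this objective needs no separate reference model, because the penalty depends only on the policy's own odds for the chosen and rejected responses.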
## Related models
- Full model weights: https://huggingface.co/shibing624/llama-3-8b-instruct-262k-chinese
- LoRA weights: https://huggingface.co/shibing624/llama-3-8b-instruct-262k-chinese-lora
## Features
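Strengths:
1. Supports a very long context length of 262k tokens, which suits RAG.
2. Supports both Chinese and English.
3. Supports multi-turn dialogue, with strong coding and reasoning ability and solid English knowledge.
4. GPU memory required for inference: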

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|--------------|-------------------------------------|---------------------------------------|
| FP16/BF16    | 18.66GB                             | 24.58GB                               |
| Int4         | 9.21GB                              | 14.62GB                               |

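Weaknesses:
1. The model has only 8B parameters, so hallucination is noticeable on knowledge-based question answering.
2. Chinese knowledge is limited and the model hallucinates easily, especially on classical Chinese; this is a common problem of Llama-family models.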
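To put the memory figures in the Features table in context, here is a back-of-the-envelope estimate of weight and key/value-cache memory. The architecture constants (about 8.03B parameters, 32 layers, 8 key/value heads, head dimension 128) are assumptions based on the published Llama-3-8B configuration and are not stated in this README. The estimate ignores activations and framework overhead, so it is not expected to match the measured peaks.

```python
GIB = 1024 ** 3

# Assumed Llama-3-8B architecture values (not stated in this README).
N_PARAMS = 8.03e9
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128


def weight_gib(bits_per_param: float) -> float:
    """Memory for the model weights alone."""
    return N_PARAMS * bits_per_param / 8 / GIB


def kv_cache_gib(n_tokens: int, bytes_per_value: int = 2) -> float:
    """Memory for one sequence's key/value cache (keys and values, all layers)."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return n_tokens * per_token / GIB


print(f"FP16 weights: {weight_gib(16):.1f} GiB")
print(f"Int4 weights: {weight_gib(4):.1f} GiB")
for n in (8192, 65536, 262144):
    print(f"KV cache at {n} tokens (FP16): {kv_cache_gib(n):.1f} GiB")
```

Under these assumptions the FP16 cache costs 128 KiB per token, so a full 262144-token context needs about 32 GiB for the cache alone, which is more than the weights themselves.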
## Usage
```python
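import torch
import transformers

model_id = "shibing624/llama-3-8b-instruct-262k-chinese"

# Load the model in half precision on the GPU.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.float16},
    device="cuda",
)

# Build the conversation; the system prompt is left empty.
messages = [{"role": "system", "content": ""}]
# The user message asks, in Chinese, for an introduction to machine learning.
messages.append({"role": "user", "content": "介绍一下机器学习"})

# Render the messages into one prompt string with the tokenizer's chat template.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop at either the end-of-sequence token or the end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# The pipeline returns prompt + completion; keep only the completion.
content = outputs[0]["generated_text"][len(prompt):]
print(content)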
```
result:
```shell
机器学习Machine Learning是一种基于计算机算法的自动数据分析技术用于从数据中学习并预测未来的结果。它是人工智能AI和数据挖掘Data Mining的子领域旨在通过训练和调整算法来发现数据中的模式、关系和规律。
机器学习算法可以分为监督学习、无监督学习和半监督学习三类:
1. 监督学习Supervised Learning在这种类型的学习中算法被提供带有标签的数据集用于训练。算法学习如何将输入数据映射到输出数据并在新数据上进行预测。常见的监督学习算法包括逻辑回归、决策树、支持向量机SVM、随机森林和神经网络。
2. 无监督学习Unsupervised Learning在这种类型的学习中算法没有标签数据。算法学习数据中的模式、结构和关系并可能发现新的数据集群或特征。常见的无监督学习算法包括聚类、主成分分析PCA、独立成分分析ICA和高维度数据降维。
3. 半监督学习Semi-supervised Learning在这种类型的学习中算法被提供部分带有标签的数据集。算法学习如何将输入数据映射到输出数据并在新数据上进行预测。半监督学习算法结合了监督学习和无监督学习的优点常见的半监督学习算法包括自我标注Self-Labeling和基于图的半监督学习Graph-based Semi-supervised Learning
机器学习的应用广泛,包括自然语言处理、计算机视觉、推荐系统、人工智能和自动驾驶等领域。它的优势包括:
1. 自动化:机器学习算法可以自动从数据中发现模式和关系,无需人为干预。
2. 高效性:机器学习算法可以处理大量数据,并且可以在不需要人为干预的情况下进行预测。
3. 适应性:机器学习算法可以根据数据集的变化和更新进行调整。
4. 精准性:机器学习算法可以通过训练和测试来提高预测的准确性。
```
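The `apply_chat_template` call in the usage example turns the message list into a single prompt string. The sketch below shows roughly what that string looks like, assuming this model keeps the standard Llama 3 Instruct layout. `render_llama3_prompt` is a hypothetical helper written for illustration; the tokenizer's own template is authoritative and may differ in detail.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the Llama 3 prompt layout (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Multi-turn use: append the previous reply, then the next user message.
messages = [
    {"role": "user", "content": "介绍一下机器学习"},
    {"role": "assistant", "content": "(previous model reply goes here)"},
    {"role": "user", "content": "Give one example of supervised learning."},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Each turn ends with `<|eot_id|>`, which is why the usage example adds that token to `terminators` alongside the regular end-of-sequence token.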
## Training details
train loss:
<img src="https://huggingface.co/shibing624/llama-3-8b-instruct-262k-chinese/raw/main/train_lossv2.svg" width="600">
eval loss:
<img src="https://huggingface.co/shibing624/llama-3-8b-instruct-262k-chinese/raw/main/eval_lossv2.svg" width="600">
# About Llama-3-8B-Instruct-262k
Gradient incorporates your data to deploy autonomous assistants that power critical operations across your business. Contact Gradient to learn more or to collaborate on a custom model.
This model extends Llama-3 8B's context length from 8k to more than 160K tokens. It was developed by Gradient, with compute sponsored by [Crusoe Energy](https://huggingface.co/crusoeai). It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6585dc9be92bc5f258156bd6/hiHWva3CbsrnPvZTp5-lu.png" width="600">
**Approach:**
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
- NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
- Progressive training on increasing context lengths similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2] (See details below)
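The NTK-aware starting point mentioned in the approach can be sketched with the commonly used scaling rule, which raises RoPE theta so that the longest positional wavelength grows in proportion to the context length. The constants below (head dimension 128, base theta 500000) are assumptions based on the published Llama-3-8B configuration. This is only the initialization step, not Gradient's data-driven optimization, so it is not expected to reproduce the RoPE theta values in the progressive training table.

```python
import math

HEAD_DIM = 128           # assumed Llama-3-8B attention head dimension
BASE_THETA = 500_000.0   # assumed Llama-3 default rope_theta
BASE_CONTEXT = 8192      # original context length ("8k" above)


def ntk_aware_theta(target_context: int) -> float:
    """Scale RoPE theta so the longest wavelength grows by the context ratio."""
    scale = target_context / BASE_CONTEXT
    return BASE_THETA * scale ** (HEAD_DIM / (HEAD_DIM - 2))


def longest_wavelength(theta: float) -> float:
    """Wavelength, in tokens, of the slowest-rotating RoPE dimension pair."""
    return 2 * math.pi * theta ** ((HEAD_DIM - 2) / HEAD_DIM)


for target in (2 ** 16, 2 ** 18):
    theta = ntk_aware_theta(target)
    ratio = longest_wavelength(theta) / longest_wavelength(BASE_THETA)
    print(f"context {target}: theta = {theta / 1e6:.1f}M, wavelength ratio = {ratio:.1f}")
```

The exponent `d / (d - 2)` is what makes the slowest dimension stretch by exactly the context ratio while the fastest dimensions, which encode local order, are left almost unchanged.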
**Infra:**
We build on top of the EasyContext Blockwise RingAttention library [3] to train scalably and efficiently on contexts of up to 262144 tokens on [Crusoe Energy](https://huggingface.co/crusoeai)'s high-performance L40S cluster.
**Data:**
For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
**Progressive Training Details:**
| Parameter | 65K | 262K |
|-----------------------------|----------------|------------|
| Initialize From | LLaMA-3-8B-Inst| 65K |
| Sequence Length | 2^16 | 2^18 |
| RoPE theta | 15.3 M | 207.1 M |
| Batch Size (Tokens / Step) | 2.097 M | 4.192 M |
| Steps | 30 | 24 |
| Total Tokens | 63 M | 101 M |
| Learning Rate | 2.00E-05 | 2.00E-05 |
| # GPUs | 32 | 32 |
| GPU Type | NVIDIA L40S | NVIDIA L40S|
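
The table rows can be cross-checked with a little arithmetic: total tokens should equal batch size times steps, and the batch size divided by the sequence length gives the number of sequences per step. The snippet below is a sketch that uses only the values from the table; the dictionary layout is an illustrative choice.

```python
# Values copied from the progressive training table above.
stages = {
    "65K": {"seq_len": 2 ** 16, "batch_tokens": 2.097e6, "steps": 30},
    "262K": {"seq_len": 2 ** 18, "batch_tokens": 4.192e6, "steps": 24},
}

total_tokens = 0.0
for name, stage in stages.items():
    stage_tokens = stage["batch_tokens"] * stage["steps"]
    seqs_per_step = stage["batch_tokens"] / stage["seq_len"]
    total_tokens += stage_tokens
    print(f"{name}: {stage_tokens / 1e6:.0f}M tokens, about {seqs_per_step:.0f} sequences per step")

print(f"Both stages: {total_tokens / 1e6:.0f}M tokens")
```

The two stages add up to roughly 164M tokens, which is consistent with the "< 200M tokens" figure quoted earlier.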