Initialize project; model provided by the ModelHub XC community
Model: cmz1024/OLMo3-190M-zh-v3.1 Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
157
README.md
Normal file
@@ -0,0 +1,157 @@
---
license: apache-2.0
language:
- zh
tags:
- pretrained
- olmo3
- chinese
- continue-pretrain
- mid-training
library_name: transformers
pipeline_tag: text-generation
base_model: allenai/Olmo-3-1025-7B
datasets:
- openbmb/Ultra-FineWeb
- opencsg/chinese-cosmopedia
- wikimedia/wikipedia
---

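# OLMo3-190M-zh-v3.1

**A Chinese continue-pretrain teaching model** built on the OLMo3-190M architecture. It starts from v3 (pretrained from scratch) and adds **fact-dense Chinese corpora** (Wikipedia-zh + Cosmopedia-Chinese).

> **Produced by 活水 42ailab.** Companion to Lecture 04 (Pretraining) of the course 《零基础 AI 大模型研发训练营》 (a bootcamp on building and training large AI models, aimed at complete beginners).
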
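## Background and motivation

The goal of v3.1 is to **validate continue pretraining as an industrial-grade technique**: keep training an existing base model for another 1-2B tokens to improve specific capabilities at low cost (**~$14 / 3h on an H100**).

**7-prompt spot check, v3 (original base) → v3.1**:

| Prompt | v3 | v3.1 | Change |
|---|---|---|---|
| 人工智能是 ("Artificial intelligence is") | 🟢 | 🟢 | Unchanged |
| 北京大学位于 ("Peking University is located in") | 🔴 "江苏省" (Jiangsu Province) | 🟢 **"北京市"** (Beijing) | ✅ Key qualitative change |
| Python 是一种 ("Python is a kind of") | 🔴 "开源库" (open-source library) | 🟡 "开源的可扩展代码" (open-source extensible code) + can write code | ✅ Improved |
| 中国古代四大发明是 ("The Four Great Inventions of ancient China are") | 🔴 | 🔴 | Unchanged (physical ceiling of a 190M model) |
| 3 other prompts | 🟡×3 | 🟡×3 + 1 regression | - |
| **Total** | **1🟢 3🟡 3🔴** | **2🟢 3🟡 2🔴** | **+1 green** |

Key takeaways:
- ✅ **The fact-density strategy works**: wiki + Cosmopedia pushed "北京大学" (Peking University) past the 190M model's memorization threshold.
- ⚠️ **190M parameters impose a memorization ceiling**: even though "四大发明" (the Four Great Inventions) appears in the corpus, the model does not learn it without enough exposures.
- ⚠️ **Uncleaned Wiki text introduces format contamination**: the 红楼梦 (Dream of the Red Chamber) prompt produces the repetition "林黛玉、林黛玉、林黛玉的妹妹" (roughly "Lin Daiyu, Lin Daiyu, Lin Daiyu's younger sister"). Fixing this is a required item for v4.
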
## Architecture

Based on the **OLMo3-190M canonical** architecture (same family as allenai/Olmo-3, but trained from scratch and then continue-pretrained):

| Field | Value |
|---|---|
| `architectures` | `["Olmo3ForCausalLM"]` |
| `hidden_size` | 768 |
| `num_hidden_layers` | 12 |
| `num_attention_heads` | 12 |
| `intermediate_size` | 3072 |
| `sliding_window` | 4096 (full attention every 4th layer) |
| `vocab_size` | 48000 (self-trained 48k Chinese BPE) |
| `max_position_embeddings` | 4096 |
| `total params` | ~187M |

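The ~187M figure can be roughly reproduced from the fields above. The sketch below is a back-of-the-envelope count, not the library's own accounting: it assumes untied input/output embeddings, bias-free attention with 12 KV heads, a gated (SwiGLU-style) MLP, and four RMSNorm weight vectors per layer (q_norm, k_norm, post-attention, post-feedforward). The norm layout in particular is an assumption about the Olmo3 block.

```python
# Rough parameter count from the config values listed in the table above.
hidden_size = 768
num_layers = 12
intermediate_size = 3072
vocab_size = 48000

# Input embeddings plus a separate output head (assumes untied embeddings).
embeddings = 2 * vocab_size * hidden_size

# Attention: q, k, v and o projections, no bias, full multi-head attention.
attention = 4 * hidden_size * hidden_size

# Gated MLP (assumed SwiGLU-style): gate, up and down projections.
mlp = 3 * hidden_size * intermediate_size

# Assumed per-layer RMSNorm weights: q_norm, k_norm, post-attention, post-feedforward.
norms = 4 * hidden_size

per_layer = attention + mlp + norms
total_params = embeddings + num_layers * per_layer + hidden_size  # + final norm

bf16_bytes = 2 * total_params  # 2 bytes per parameter in bfloat16
print(f"{total_params:,} parameters, about {bf16_bytes / 1e6:.1f} MB in bfloat16")
```

Under these assumptions the count is 187,011,840 parameters, in line with the ~187M in the table; at 2 bytes per parameter that is about 374.0 MB, close to the 374,038,864-byte `model.safetensors` in this repository.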
## Usage

### Simplest usage (path A)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("42ailab/OLMo3-190M-zh-v3.1")
model = AutoModelForCausalLM.from_pretrained("42ailab/OLMo3-190M-zh-v3.1")

inputs = tok("北京大学位于", return_tensors="pt")
# do_sample=True is needed for temperature/top_p to take effect;
# without it generate() decodes greedily and ignores both.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8, top_p=0.9)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

### For SFT (recommended usage, the L05 course path)
v3.1 is a **base model** with no conversational ability. To start SFT, load it as the base:
```python
from transformers import AutoModelForCausalLM
# Use as the SFT base model
model = AutoModelForCausalLM.from_pretrained("42ailab/OLMo3-190M-zh-v3.1")
# ... train with TRL SFTTrainer
```

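The repository ships a `chat_template.jinja` in a ChatML-like layout, which is the string format SFT conversations would be rendered into. The sketch below mirrors that layout in plain Python to make the format visible. `render_chatml` is a hypothetical helper written for illustration, not part of transformers or TRL; in practice `tok.apply_chat_template` does the rendering, and its exact whitespace handling may differ.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate the repository's ChatML-style template in plain Python."""
    parts = []
    for message in messages:
        role = message["role"]
        if role in ("system", "user", "assistant"):
            parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
        # Any other role is skipped, as in the template's if/elif chain.
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "北京大学位于哪里?"},
]
print(render_chatml(example, add_generation_prompt=True))
```

Because v3.1 is a base model, it has not been trained on this turn structure; the format only becomes meaningful once SFT has been run on data rendered this way.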
### Faster downloads in mainland China
```bash
export HF_ENDPOINT=https://hf-mirror.com
# Or use ModelScope: https://modelscope.cn/models/42ailab/OLMo3-190M-zh-v3.1
```

## Training recipe

### Data mix (1.19B effective tokens)

| Source | HF path | Tokens | % |
|---|---|---|---|
| v3 replay | 30% sample from merged_raw | 300M | 27% |
| Wikipedia-zh | `wikimedia/wikipedia`, config `20231101.zh` | 394M | 35% |
| Cosmopedia-Chinese | `opencsg/chinese-cosmopedia` | 500M | 45% |

(The original plan also included `bigcode/the-stack-v2` for code and `liwu/MNBVC` as a supplementary source of facts, but both fail to load under HF datasets 3.0+ because they rely on script-based loading. v4 will use Parquet-native replacements.)

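As a quick cross-check of the mix, the shares implied by the token counts can be computed directly. This is a minimal sketch that uses only the counts from the table; the dictionary and variable names are illustrative.

```python
# Token counts from the data-mix table, in millions of tokens.
token_counts_m = {
    "v3 replay": 300,
    "Wikipedia-zh": 394,
    "Cosmopedia-Chinese": 500,
}

total_m = sum(token_counts_m.values())
shares = {name: count / total_m for name, count in token_counts_m.items()}

print(f"total: {total_m}M tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The counts sum to 1194M, matching the 1.19B in the heading. The shares computed this way (about 25%, 33% and 42%) come out somewhat lower than the listed percentages, which add up to 107%.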
### Hyperparameters
- Base: v3 final checkpoint
- Peak lr: **2e-4** (2/5 of the v3 peak of 5e-4)
- Warmup: **500 steps**, explicit (linear ramp from ~0)
- Scheduler: cosine decay to 2e-5
- Batch: 16 × 2048 × grad_accum 8 = **262K tokens/step**
- Max steps: 5700
- Duration: **3h 00min** on H100
- Attention: **SDPA** (OLMo3 SWA + flash-attn-2 hits an `s_aux` None bug, so SDPA is required for continue pretraining)

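The schedule and batch arithmetic above can be sketched as follows. This is an illustration under stated assumptions rather than the actual training code: it assumes the warmup starts at exactly 0, that the cosine reaches its 2e-5 floor precisely at step 5700, and it reads "16 × 2048" as micro-batch size times sequence length. `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 2e-4
MIN_LR = 2e-5
WARMUP_STEPS = 500
MAX_STEPS = 5700


def lr_at(step):
    """Learning rate at a given optimizer step (illustrative schedule)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to the floor at MAX_STEPS.
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Assumed reading: micro-batch size x sequence length x gradient accumulation.
tokens_per_step = 16 * 2048 * 8
total_tokens = tokens_per_step * MAX_STEPS

for step in (0, 250, 500, 3100, 5700):
    print(step, f"{lr_at(step):.2e}")
print(tokens_per_step, total_tokens)
```

With these numbers each step processes 262,144 tokens, so 5700 steps cover roughly 1.49B tokens, inside the 1-2B token range that v3.1 targets.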
### Loss trajectory
| Step | Train loss |
|---|---|
| 0 | 3.33 |
| 500 (warmup done) | 2.97 |
| 1500 | 2.87 |
| 3000 | 2.81 |
| 4500 | 2.80 |
| 5700 (final) | **2.85 mean** |

Compared with the v3 final mean loss of **3.95**, v3.1 is **28% lower**.

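The relative change can be checked with a couple of lines. The sketch also converts both losses to perplexity, which assumes they are mean per-token cross-entropy values in nats (the usual convention, but not stated here). The two runs saw different data mixes, so the comparison is indicative only.

```python
import math

v3_loss = 3.95   # v3 final mean train loss
v31_loss = 2.85  # v3.1 final mean train loss

relative_change = (v31_loss - v3_loss) / v3_loss
print(f"relative change: {relative_change:.1%}")

# Assuming per-token cross-entropy in nats, exp(loss) is the perplexity.
v3_ppl = math.exp(v3_loss)
v31_ppl = math.exp(v31_loss)
print(f"perplexity: {v3_ppl:.1f} -> {v31_ppl:.1f}")
```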
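## Limitations

1. **Fact-memorization ceiling**: a 190M-parameter model can reliably memorize only about 1000-2000 facts. Low-exposure facts such as "四大发明" (the Four Great Inventions) still come out wrong.
2. **No code data**: the code sources planned for Stage-2 were not added because of HF gated-access and script-based loading issues. Completions for "Python 是编程语言" ("Python is a programming language") still drift occasionally.
3. **No SFT**: v3.1 is a pure base model and cannot hold a conversation. Dialogue tasks require additional SFT.
4. **Eval numbers are optimistic**: the training-time eval/loss = 2.00 was computed on in-domain data (the tail of wiki + Cosmopedia), so it cannot be compared directly with v3's held-out eval of 3.61.
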
## License

- **Model weights**: Apache-2.0 (commercial use allowed)
- **Training data**: mixed licenses
  - FineWeb-Edu-Chinese V2.2, Ultra-FineWeb, Cosmopedia-Chinese: Apache-2.0
  - **Wikipedia-zh**: **CC-BY-SA-3.0 + GFDL** (share-alike terms are viral; consult legal counsel before commercial use)

Following Allen AI OLMo community practice, the weights themselves are declared Apache-2.0, with disclosure that the training data includes CC-BY-SA content.

## Citation

```bibtex
@misc{huoshui-olmo3-190m-zh-v3.1,
  title={OLMo3-190M-zh-v3.1: Chinese Continue-Pretrained Teaching Model},
  author={活水 AI 实验室 (42ailab) and 阳志平},
  year={2026},
  howpublished={\url{https://huggingface.co/42ailab/OLMo3-190M-zh-v3.1}},
  note={LLM001 Course, Lecture 04}
}
```

## Companion resources

- **Dataset**: [`42ailab/llm101-v3.1-data`](https://huggingface.co/datasets/42ailab/llm101-v3.1-data): tokenizer + tokenized.bin + nano subset + 7-prompt evaluation comparison
- **Course repository**: https://cnb.cool/42edu/LLM001/llm001
- **Instructor notes**: `demo/ch04/OLMo3-190M-zh-v3.1/teaching_notes.md`
8
chat_template.jinja
Normal file
@@ -0,0 +1,8 @@
{% for message in messages %}{% if message['role'] == 'system' %}<|im_start|>system
{{ message['content'] }}<|im_end|>
{% elif message['role'] == 'user' %}<|im_start|>user
{{ message['content'] }}<|im_end|>
{% elif message['role'] == 'assistant' %}{% generation %}<|im_start|>assistant
{{ message['content'] }}<|im_end|>
{% endgeneration %}{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}
45
config.json
Normal file
@@ -0,0 +1,45 @@
{
  "architectures": [
    "Olmo3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_implementation": "sdpa",
  "bos_token_id": 2,
  "dtype": "bfloat16",
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_types": [
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention"
  ],
  "max_position_embeddings": 4096,
  "model_type": "olmo3",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_key_value_heads": 12,
  "pad_token_id": 1,
  "rms_norm_eps": 1e-06,
  "rope_parameters": {
    "rope_theta": 500000,
    "rope_type": "default"
  },
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "transformers_version": "5.6.0",
  "use_cache": false,
  "vocab_size": 48000
}
12
generation_config.json
Normal file
@@ -0,0 +1,12 @@
{
  "_from_model_config": true,
  "bos_token_id": 2,
  "eos_token_id": [
    0
  ],
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 1,
  "transformers_version": "5.6.0",
  "use_cache": true
}
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7e96b1881b0b58b1e856b471047606161e1e4470c8bc04ff4aad86c71f416b8f
size 374038864
6
special_tokens_map.json
Normal file
@@ -0,0 +1,6 @@
{
  "bos_token": "<|bos|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|pad|>",
  "unk_token": "<|unk|>"
}
239058
tokenizer.json
Normal file
File diff suppressed because it is too large
12
tokenizer_config.json
Normal file
@@ -0,0 +1,12 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "bos_token": "<|bos|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|pad|>",
  "unk_token": "<|unk|>",
  "add_bos_token": false,
  "add_eos_token": false,
  "clean_up_tokenization_spaces": false,
  "model_max_length": 4096,
  "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}<|im_start|>system\n{{ message['content'] }}<|im_end|>\n{% elif message['role'] == 'user' %}<|im_start|>user\n{{ message['content'] }}<|im_end|>\n{% elif message['role'] == 'assistant' %}{% generation %}<|im_start|>assistant\n{{ message['content'] }}<|im_end|>\n{% endgeneration %}{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}\n"
}
3
training_args.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:584bf9f0fc01e155aec4504ea76ea0dc1844eecb1f292e5801a2d14efd484d63
size 4856