Initialize project; model provided by the ModelHub XC community

Model: GODELEV/Archaea-74M Source: Original Platform
2026-06-09 06:18:26 +08:00
commit 1098bfb11f
10 changed files with 250734 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,38 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+Archaea74M_Learning_Rate_Schedule.png filter=lfs diff=lfs merge=lfs -text
+Archaea74M_Loss_Curve.png filter=lfs diff=lfs merge=lfs -text
+Archaea74M_Training_Loss_Curve.png filter=lfs diff=lfs merge=lfs -text
--- a/Archaea74M_Learning_Rate_Schedule.png
+++ b/Archaea74M_Learning_Rate_Schedule.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:db80a2c4277516d8d047c8324bdd6cb7c1b8cfddee7f39004afdb323e967d0dd
+size 219216
--- a/Archaea74M_Loss_Curve.png
+++ b/Archaea74M_Loss_Curve.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fd075d547356f8521df640919d4e3988e376591d6a5ec43afe521942f27f959e
+size 166161
--- a/Archaea74M_Training_Loss_Curve.png
+++ b/Archaea74M_Training_Loss_Curve.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e76f2f43111c49609573d4750730c0eb47dec904f5e483986866270e8c969ecb
+size 283149
--- a/README.md
+++ b/README.md
@@ -0,0 +1,324 @@
+---
+license: mit
+datasets:
+- GODELEV/BetterDataset-2M
+language:
+- en
+pipeline_tag: text-generation
+---
+
+# Archaea-74M
+
+Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. The model uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained using BF16 mixed precision.
+
+This release is an intermediate checkpoint: it was trained on approximately **1.23 billion tokens** of a planned **1.6 billion token** pretraining run (18,800 of 25,000 planned steps).
+
+---
+
+# Model Card
+
+| Attribute | Value |
+|------------|------------|
+| Model ID | GODELEV/Archaea-74M |
+| Parameters | ~74 Million |
+| Architecture | Decoder-only Transformer (LLaMA-style) |
+| Attention | Grouped Query Attention (GQA) |
+| Context Length | 1024 |
+| Tokenizer | GPT-2 |
+| Training Precision | BF16 |
+| Framework | PyTorch + Transformers |
+| License | MIT |
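
The ~74 million parameter figure can be reproduced from the values in `config.json`. The sketch below is a back-of-the-envelope count, not the author's tooling. It assumes the standard LLaMA layer layout (bias-free projections, a gated three-matrix MLP, two RMSNorm weight vectors per layer) and untied input/output embeddings, as `tie_word_embeddings: false` suggests. The function name is illustrative.

```python
def llama_param_count(vocab_size=50257, hidden=512, intermediate=1408,
                      layers=8, heads=8, kv_heads=2, head_dim=64,
                      tie_embeddings=False):
    """Count weights for a bias-free LLaMA-style decoder (assumed layout)."""
    embed = vocab_size * hidden
    lm_head = 0 if tie_embeddings else vocab_size * hidden

    q_and_o = 2 * hidden * heads * head_dim      # query and output projections
    k_and_v = 2 * hidden * kv_heads * head_dim   # smaller because of GQA
    mlp = 3 * hidden * intermediate              # gate, up, and down projections
    norms = 2 * hidden                           # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms

    final_norm = hidden
    return embed + lm_head + layers * per_layer + final_norm


total = llama_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions, about 51.5M of the parameters sit in the input embedding and output projection, and about 22.6M in the eight transformer layers.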
+
+---
+
+# Architecture
+
+## Transformer Configuration
+
+| Parameter | Value |
+|------------|------------|
+| Hidden Size | 512 |
+| Intermediate Size | 1408 |
+| Layers | 8 |
+| Attention Heads | 8 |
+| KV Heads | 2 |
+| GQA Ratio | 4:1 |
+| Activation | SiLU |
+| Normalization | RMSNorm |
+| Context Length | 1024 |
+
+The model uses Grouped Query Attention: each of the 2 key/value heads is shared by 4 query heads, so the KV cache is one quarter the size it would be with 8 independent key/value heads.
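
To make the KV-cache saving concrete, here is a rough size estimate at the full 1024-token context. It assumes a batch size of 1 and 2-byte cache entries (BF16 or FP16); both are illustrative assumptions, and the helper name is hypothetical.

```python
def kv_cache_bytes(seq_len=1024, layers=8, kv_heads=2, head_dim=64,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one tensor for keys, one for values.
    values = 2 * batch_size * seq_len * layers * kv_heads * head_dim
    return values * bytes_per_value


gqa = kv_cache_bytes(kv_heads=2)   # this model: 2 KV heads
mha = kv_cache_bytes(kv_heads=8)   # hypothetical variant with 8 KV heads
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB")
```

With these numbers the cache is 4 MiB instead of 16 MiB per sequence.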
+
+---
+
+# Training
+
+## Dataset
+
+Archaea-74M was pretrained on **GODELEV/BetterDataset-2M**, a multi-source corpus composed of:
+
+- General web text
+- Conversational content
+- Knowledge-focused material
+- Educational content
+- Instruction-like examples
+- Technical and programming text
+
+The complete corpus contains approximately **1.6 billion tokens**.
+
+### Training Progress
+
+| Metric | Value |
+|----------|----------|
+| Planned Tokens | ~1.6B |
+| Tokens Trained | ~1.23B |
+| Completion | ~75% of planned steps (~77% of the rounded token target) |
+| Planned Steps | 25,000 |
+| Completed Steps | 18,800 |
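
The token counts follow from the step counts if every optimizer step consumes a full batch of full-length sequences. The sketch below assumes that the effective batch size of 64 is measured in sequences and that each sequence holds 1024 tokens with no padding; neither assumption is confirmed by the training logs.

```python
SEQ_LEN = 1024      # tokens per sequence
BATCH_SIZE = 64     # sequences per optimizer step (assumed unit)

tokens_per_step = SEQ_LEN * BATCH_SIZE


def tokens_after(steps):
    """Total tokens seen after a given number of optimizer steps."""
    return steps * tokens_per_step


trained = tokens_after(18_800)
planned = tokens_after(25_000)
print(f"trained: {trained / 1e9:.2f}B, planned: {planned / 1e9:.2f}B, "
      f"completion: {trained / planned:.1%}")
```

Measured this way the planned run is about 1.64B tokens and the checkpoint is 75.2% complete; the ~77% figure results from dividing 1.23B by the rounded 1.6B target.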
+
+## Optimization
+
+| Parameter | Value |
+|------------|------------|
+| Optimizer | AdamW |
+| Scheduler | OneCycleLR |
+| Peak Learning Rate | 6e-4 |
+| Weight Decay | 0.1 |
+| Gradient Clipping | 1.0 |
+| Sequence Length | 1024 |
+| Effective Batch Size | 64 |
+| Precision | BF16 |
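
For readers unfamiliar with one-cycle scheduling, the following is a simplified, dependency-free sketch of a two-phase cosine schedule. Only the peak learning rate (6e-4) and the planned step count (25,000) come from this card. The warmup fraction and the two divisors are assumptions that mirror the documented defaults of PyTorch's `OneCycleLR` (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`), so the actual training schedule may differ.

```python
import math


def one_cycle_lr(step, total_steps=25_000, max_lr=6e-4,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor          # starting learning rate
    min_lr = initial_lr / final_div_factor    # learning rate at the last step
    warmup_end = pct_start * total_steps - 1  # last step of the warmup phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


for s in (0, 7_499, 18_800, 24_999):
    print(s, f"{one_cycle_lr(s):.3e}")
```

Because this checkpoint stops at step 18,800, it was saved partway through the decay phase under these assumptions, before the learning rate reached its minimum.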
+
+## Training Statistics
+
+| Metric | Value |
+|------------|------------|
+| Initial Loss | 10.9223 |
+| Final Loss | 2.9488 |
+| Best Loss | 2.8071 |
+| Final Perplexity | 19.08 |
+| Best Perplexity | 16.56 |
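
The perplexity rows are the exponential of the corresponding loss rows, assuming the loss is the mean per-token cross-entropy in nats. The short check below also compares the initial loss with the loss of a uniform guess over the 50,257-token vocabulary.

```python
import math


def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss)


final_ppl = perplexity(2.9488)
best_ppl = perplexity(2.8071)
uniform_loss = math.log(50257)  # loss of a uniform guess over the vocabulary

print(f"final: {final_ppl:.2f}, best: {best_ppl:.2f}, "
      f"uniform-guess loss: {uniform_loss:.2f}")
```

The initial loss of 10.92 is close to the uniform-guess value of about 10.82, which is what one would expect from a randomly initialized model.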
+
+## Training Loss Curve
+
+<img src="Archaea74M_Training_Loss_Curve.png" width="700"/>
+
+## Learning Rate Schedule
+
+<img src="Archaea74M_Learning_Rate_Schedule.png" width="700"/>
+
+---
+
+# Evaluation
+
+Evaluated with the EleutherAI LM Evaluation Harness (lm-eval).
+
+## Benchmark Results
+
+All results are zero-shot.
+
+| Benchmark | Metric | Score |
+|------------|------------|------------|
+| HellaSwag | acc_norm | 27.31% |
+| PIQA | acc_norm | 58.54% |
+| WinoGrande | acc | 51.54% |
+| BoolQ | acc | 56.33% |
+| ARC-Easy | acc_norm | 39.06% |
+| ARC-Challenge | acc_norm | 22.70% |
+| OpenBookQA | acc_norm | 26.00% |
+| CommonsenseQA | acc | 19.66% |
+| LAMBADA | acc | 18.01% |
+| BLiMP | acc | 74.91% |
+| MMLU | acc | 25.07% |
+| SciQ | acc_norm | 57.70% |
+| COPA | acc | 61.00% |
+| RACE | acc | 24.78% |
+| SWAG | acc_norm | 41.98% |
+| TruthfulQA MC2 | acc | 46.46% |
+| WikiText-2 | Word Perplexity | 68.06 |
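
Multiple-choice scores are easier to read against a random-guess baseline. The sketch below subtracts the uniform-guess accuracy from each score, assuming the nominal number of answer options per task (for example 4 for HellaSwag, ARC, and MMLU, 5 for CommonsenseQA, 2 for PIQA and COPA). The option counts are not stated in this card and are supplied here as an assumption. LAMBADA, TruthfulQA MC2, and WikiText-2 are omitted because they do not have a simple uniform baseline.

```python
# task: (reported score in %, nominal number of answer options)
results = {
    "HellaSwag": (27.31, 4),
    "PIQA": (58.54, 2),
    "WinoGrande": (51.54, 2),
    "BoolQ": (56.33, 2),
    "ARC-Easy": (39.06, 4),
    "ARC-Challenge": (22.70, 4),
    "OpenBookQA": (26.00, 4),
    "CommonsenseQA": (19.66, 5),
    "BLiMP": (74.91, 2),
    "MMLU": (25.07, 4),
    "SciQ": (57.70, 4),
    "COPA": (61.00, 2),
    "RACE": (24.78, 4),
    "SWAG": (41.98, 4),
}

# Percentage points above (or below) uniform random guessing.
margins = {task: score - 100.0 / n for task, (score, n) in results.items()}

for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<14} {margin:+6.2f} pts vs. chance")
```

On this reading, SciQ, BLiMP, SWAG, and ARC-Easy are clearly above chance, while MMLU, RACE, CommonsenseQA, and ARC-Challenge are at or slightly below it.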
+
+## Evaluation Notes
+
+The following tasks failed to run and are not included in the results above:
+
+- SocialIQA: dataset loading incompatibility.
+- Arithmetic: the task identifiers changed in recent lm-eval versions.
+
+---
+
+# Sample Generations
+
+The following examples were generated directly from the pretrained base model.
+
+## Example 1
+
+**Input**
+
+```text
+Once upon a time
+```
+
+**Output**
+
+```text
+Once upon a time
+
+In the days after, I was told that I could not make a living. So I said to myself,
+“I want to go back to my family.” I said, “This is a good one. You can take me to the doctor
+or something and do it with you.”
+
+The doctor asked me how I felt about this. He said he did. But I said,
+“You can't have a baby.”
+```
+
+## Example 2
+
+**Input**
+
+```text
+The horizon looked
+```
+
+**Output**
+
+```text
+the horizon looked like a big black hole. The sun was shining and the wind was blowing.
+
+The sun was shining with the sun, but it was still there. The stars were on top of each other.
+
+The sun was falling as a result of the weather. The sun was going up.
+```
+
+## Example 3
+
+**Input**
+
+```text
+what is the future of ai
+```
+
+**Output**
+
+```text
+what is the future of Ai?
+
+I don't know what you're talking about, but I'm sure that's a good reason for you to be optimistic.
+
+It doesn't matter if it doesn't take away your wandering in a positive direction.
+
+The answer is simple: I'll leave you with a feeling of uncertainty and optimism.
+```
+
+These outputs illustrate the model's ability to continue prompts coherently while also demonstrating typical limitations of small-scale pretrained language models, including repetition, topic drift, and inconsistent factual reasoning.
+
+---
+
+## Usage
+
+### Installation
+
+```bash
+pip install torch transformers accelerate  # accelerate is needed for device_map="auto"
+```
+
+### Loading the Model
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "GODELEV/Archaea-74M"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+    device_map="auto"
+)
+```
+
+### Text Generation
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "GODELEV/Archaea-74M"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+prompt = "The future of artificial intelligence"
+
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
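+with torch.no_grad():
+    output = model.generate(
+        **inputs,
+        max_new_tokens=200,
+        do_sample=True,          # sample instead of greedy decoding
+        temperature=0.8,
+        repetition_penalty=1.2,  # discourage the loops seen in small models
+        # config.json sets eos_token_id=2, while the tokenizer's EOS token is
+        # <|endoftext|>; the two may not match, so pass the tokenizer's ID explicitly.
+        eos_token_id=tokenizer.eos_token_id,
+        pad_token_id=tokenizer.eos_token_id,
+    )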
+
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+---
+
+# Repository Structure
+
+```text
+Archaea-74M/
+├── .gitattributes
+├── config.json
+├── generation_config.json
+├── model.safetensors
+├── tokenizer.json
+├── tokenizer_config.json
+├── Archaea74M_Loss_Curve.png
+├── Archaea74M_Training_Loss_Curve.png
+├── Archaea74M_Learning_Rate_Schedule.png
+└── README.md
+```
+
+---
+
+# Limitations
+
+Archaea-74M is a base pretrained model and has not undergone:
+
+- Instruction tuning
+- RLHF
+- Preference optimization
+- Safety alignment
+
+Known limitations:
+
+- Hallucinations and factual inaccuracies
+- Limited reasoning due to model scale
+- Sensitivity to prompt phrasing
+- Fixed 1024-token context window
+- Not suitable for high-stakes applications
+
+---
+
+# Future Work
+
+- Instruction tuning
+- Expanded benchmark coverage
+- Longer context lengths
+- Improved data quality and curriculum design
+
+---
+
+# Citation
+
+```bibtex
+@misc{archaea74m,
+  title={Archaea-74M},
+  author={Akshit Kumar},
+  year={2026},
+  publisher={Hugging Face},
+  url={https://huggingface.co/GODELEV/Archaea-74M}
+}
+```
--- a/config.json
+++ b/config.json
@@ -0,0 +1,32 @@
+{
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 512,
+  "initializer_range": 0.02,
+  "intermediate_size": 1408,
+  "max_position_embeddings": 1024,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 8,
+  "num_hidden_layers": 8,
+  "num_key_value_heads": 2,
+  "pad_token_id": null,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "rope_theta": 10000.0,
+    "rope_type": "default"
+  },
+  "tie_word_embeddings": false,
+  "transformers_version": "5.9.0",
+  "use_cache": true,
+  "vocab_size": 50257
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,9 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "output_attentions": false,
+  "output_hidden_states": false,
+  "transformers_version": "5.9.0",
+  "use_cache": false
+}
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:48f265103ec709e8ea38cdf5f03dc36990d89b71703df63810b233c4d562d42f
+size 296073328
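
The three lines above are a Git LFS pointer rather than the weights themselves. As a small illustration, the sketch below parses the pointer and uses its `size` field to cross-check the parameter count. It assumes the weights are stored as 4-byte float32 values, as `"dtype": "float32"` in `config.json` suggests, and it ignores the small safetensors header; the function name is illustrative.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:48f265103ec709e8ea38cdf5f03dc36990d89b71703df63810b233c4d562d42f
size 296073328
""")

# Upper bound on the parameter count: the file also contains a header.
approx_params = pointer["size"] / 4
print(f"{pointer['size']:,} bytes ~ {approx_params / 1e6:.1f}M float32 values")
```

The result, roughly 74.0M values, agrees with the parameter count stated in the README.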
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,13 @@
+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "is_local": false,
+  "local_files_only": false,
+  "model_max_length": 1024,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}