Initialize project; model provided by the ModelHub XC community

Model: GODELEV/Archaea-74M Source: Original Platform
2026-06-09 06:18:26 +08:00
commit 1098bfb11f
10 changed files with 250734 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,324 @@
+---
+license: mit
+datasets:
+- GODELEV/BetterDataset-2M
+language:
+- en
+pipeline_tag: text-generation
+---
+
+# Archaea-74M
+
+Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. The model uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained using BF16 mixed precision.
+
+This release is an intermediate checkpoint taken after approximately **1.23 billion tokens** of a planned **1.6 billion token** pretraining run (18,800 of 25,000 steps).
+
+---
+
+# Model Card
+
+| Attribute | Value |
+|------------|------------|
+| Model ID | GODELEV/Archaea-74M |
+| Parameters | ~74 Million |
+| Architecture | Decoder-only Transformer (LLaMA-style) |
+| Attention | Grouped Query Attention (GQA) |
+| Context Length | 1024 |
+| Tokenizer | GPT-2 |
+| Training Precision | BF16 |
+| Framework | PyTorch + Transformers |
+| License | MIT |
+
+---
+
+# Architecture
+
+## Transformer Configuration
+
+| Parameter | Value |
+|------------|------------|
+| Hidden Size | 512 |
+| Intermediate Size | 1408 |
+| Layers | 8 |
+| Attention Heads | 8 |
+| KV Heads | 2 |
+| GQA Ratio | 4:1 |
+| Activation | SiLU |
+| Normalization | RMSNorm |
+| Context Length | 1024 |
+
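As a rough cross-check of the ~74M figure, the sketch below counts weights for a LLaMA-style decoder using the values in the table above. The vocabulary size (50,257, the standard GPT-2 vocabulary) and the bias-free projections are assumptions, since the card states neither, and the function name is purely illustrative.

```python
def llama_param_count(vocab, hidden, inter, layers, heads, kv_heads, tied):
    """Count weights in a bias-free LLaMA-style decoder (assumed layout)."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim
    # Gated MLP: gate_proj, up_proj, down_proj.
    mlp = 3 * hidden * inter
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden
    per_layer = attn + mlp + norms
    embed = vocab * hidden
    # A separate output projection exists only when embeddings are untied.
    lm_head = 0 if tied else vocab * hidden
    final_norm = hidden
    return embed + layers * per_layer + final_norm + lm_head


# vocab=50257 is an assumption based on the GPT-2 tokenizer named in the card.
cfg = dict(vocab=50257, hidden=512, inter=1408, layers=8, heads=8, kv_heads=2)
untied = llama_param_count(**cfg, tied=False)
tied = llama_param_count(**cfg, tied=True)

print(f"untied embeddings: {untied / 1e6:.2f}M")
print(f"tied embeddings:   {tied / 1e6:.2f}M")
```

Under these assumptions the untied count is about 74.0M and the tied count about 48.3M, which suggests the published figure includes a separate output projection.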
+The model uses Grouped Query Attention: 8 query heads share 2 key/value heads (4 query heads per KV head), so the KV cache is a quarter of the size it would be with standard multi-head attention at the same head count.
+
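To make the KV-cache saving concrete, here is a minimal estimate for one sequence at the full 1024-token context. It assumes 2-byte (BF16 or FP16) cache entries and a batch size of 1; the helper name is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key and value caches for a single sequence."""
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


head_dim = 512 // 8  # hidden size / attention heads

gqa = kv_cache_bytes(layers=8, kv_heads=2, head_dim=head_dim, seq_len=1024)
mha = kv_cache_bytes(layers=8, kv_heads=8, head_dim=head_dim, seq_len=1024)

print(f"GQA (2 KV heads): {gqa / 2**20:.0f} MiB")
print(f"MHA (8 KV heads): {mha / 2**20:.0f} MiB")
```

With these numbers the cache is 4 MiB per sequence instead of 16 MiB, matching the 4:1 ratio in the table.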
+---
+
+# Training
+
+## Dataset
+
+Archaea-74M was pretrained on **GODELEV/BetterDataset-2M**, a multi-source corpus composed of:
+
+- General web text
+- Conversational content
+- Knowledge-focused material
+- Educational content
+- Instruction-like examples
+- Technical and programming text
+
+The complete corpus contains approximately **1.6 billion tokens**.
+
+### Training Progress
+
+| Metric | Value |
+|----------|----------|
+| Planned Tokens | ~1.6B |
+| Tokens Trained | ~1.23B |
+| Completion | ~77% of planned tokens (~75% of planned steps) |
+| Planned Steps | 25,000 |
+| Completed Steps | 18,800 |
+
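The token counts can be related to the step counts with a short calculation. This sketch assumes the effective batch size of 64 listed under Optimization is measured in sequences per optimizer step and that every sequence is packed to the full 1024 tokens; neither is stated explicitly in the card.

```python
SEQ_LEN = 1024
EFFECTIVE_BATCH = 64  # assumed: sequences per optimizer step

tokens_per_step = SEQ_LEN * EFFECTIVE_BATCH

planned_tokens = 25_000 * tokens_per_step
trained_tokens = 18_800 * tokens_per_step
step_fraction = 18_800 / 25_000

print(f"tokens per step: {tokens_per_step:,}")
print(f"planned tokens:  {planned_tokens:,}")
print(f"trained tokens:  {trained_tokens:,}")
print(f"steps completed: {step_fraction:.1%}")
```

Under these assumptions the run processes 65,536 tokens per step, giving roughly 1.64B planned and 1.23B trained tokens, in line with the table.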
+## Optimization
+
+| Parameter | Value |
+|------------|------------|
+| Optimizer | AdamW |
+| Scheduler | OneCycleLR |
+| Peak Learning Rate | 6e-4 |
+| Weight Decay | 0.1 |
+| Gradient Clipping | 1.0 |
+| Sequence Length | 1024 |
+| Effective Batch Size | 64 |
+| Precision | BF16 |
+
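The card names the scheduler and peak learning rate but not the remaining OneCycleLR settings. The sketch below reproduces a two-phase cosine one-cycle schedule in plain Python, assuming PyTorch's default values (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) and 25,000 total steps. Whether the actual run used these defaults is not confirmed by the source.

```python
import math


def one_cycle_lr(step, total_steps=25_000, max_lr=6e-4,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Cosine one-cycle learning rate; all defaults except max_lr are assumed."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # last warm-up step
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Moves from start (pct=0) to end (pct=1) along a half cosine.
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


for s in (0, 7_499, 18_800, 24_999):
    print(f"step {s:>6}: lr = {one_cycle_lr(s):.3e}")
```

If the defaults do apply, the checkpoint at step 18,800 falls in the decay phase, well past the peak learning rate.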
+## Training Statistics
+
+| Metric | Value |
+|------------|------------|
+| Initial Loss | 10.9223 |
+| Final Loss | 2.9488 |
+| Best Loss | 2.8071 |
+| Final Perplexity | 19.08 |
+| Best Perplexity | 16.56 |
+
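The perplexity values follow from the losses if the loss is the mean per-token cross-entropy in nats, which is the usual convention and is assumed here. The comparison against a uniform guess additionally assumes the 50,257-token GPT-2 vocabulary.

```python
import math

losses = {"initial": 10.9223, "final": 2.9488, "best": 2.8071}

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = {name: math.exp(loss) for name, loss in losses.items()}

# Loss of a model that assigns equal probability to every token
# (vocabulary size is an assumption, not stated in the card).
uniform_loss = math.log(50257)

for name, ppl in perplexity.items():
    print(f"{name:>7} loss {losses[name]:.4f} -> perplexity {ppl:.2f}")
print(f"uniform-guess loss: {uniform_loss:.4f}")
```

The initial loss sits close to the uniform-guess value, as expected for a freshly initialized model.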
+## Training Loss Curve
+
+<img src="Archaea74M_Training_Loss_Curve.png" width="700"/>
+
+## Learning Rate Schedule
+
+<img src="Archaea74M_Learning_Rate_Schedule.png" width="700"/>
+
+---
+
+# Evaluation
+
+All benchmarks were run with the EleutherAI LM Evaluation Harness (`lm-eval`).
+
+## Benchmark Results
+
+All results are zero-shot.
+
+| Benchmark | Metric | Score |
+|------------|------------|------------|
+| HellaSwag | acc_norm | 27.31% |
+| PIQA | acc_norm | 58.54% |
+| WinoGrande | acc | 51.54% |
+| BoolQ | acc | 56.33% |
+| ARC-Easy | acc_norm | 39.06% |
+| ARC-Challenge | acc_norm | 22.70% |
+| OpenBookQA | acc_norm | 26.00% |
+| CommonsenseQA | acc | 19.66% |
+| LAMBADA | acc | 18.01% |
+| BLiMP | acc | 74.91% |
+| MMLU | acc | 25.07% |
+| SciQ | acc_norm | 57.70% |
+| COPA | acc | 61.00% |
+| RACE | acc | 24.78% |
+| SWAG | acc_norm | 41.98% |
+| TruthfulQA MC2 | acc | 46.46% |
+| WikiText-2 | Word Perplexity | 68.06 |
+
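Accuracy scores are easier to read next to a random-guess baseline. The sketch below compares a subset of the results above with chance accuracy, assuming the usual number of answer options for each task (for example 4 for HellaSwag and 2 for PIQA). The option counts are not taken from the card, and tasks with a variable number of options are left out.

```python
# (score in percent, assumed number of answer options)
results = {
    "HellaSwag": (27.31, 4),
    "PIQA": (58.54, 2),
    "WinoGrande": (51.54, 2),
    "BoolQ": (56.33, 2),
    "OpenBookQA": (26.00, 4),
    "CommonsenseQA": (19.66, 5),
    "MMLU": (25.07, 4),
    "COPA": (61.00, 2),
    "RACE": (24.78, 4),
    "SWAG": (41.98, 4),
}


def margin_over_chance(score, options):
    """Percentage points above (or below) uniform random guessing."""
    return round(score - 100.0 / options, 2)


margins = {task: margin_over_chance(*vals) for task, vals in results.items()}

for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<14} {margin:+.2f} pts vs chance")
```

Under these assumptions SWAG, COPA, and PIQA sit clearly above chance, while MMLU, RACE, and CommonsenseQA are within a point of it.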
+## Evaluation Notes
+
+The following tasks failed in earlier evaluation runs and are not reported above:
+
+- SocialIQA: dataset loading incompatibility.
+- Arithmetic benchmark: the task identifiers changed in recent lm-eval versions.
+
+---
+
+# Sample Generations
+
+The following examples were generated directly from the pretrained base model.
+
+## Example 1
+
+**Input**
+
+```text
+Once upon a time
+```
+
+**Output**
+
+```text
+Once upon a time
+
+In the days after, I was told that I could not make a living. So I said to myself,
+“I want to go back to my family.” I said, “This is a good one. You can take me to the doctor
+or something and do it with you.”
+
+The doctor asked me how I felt about this. He said he did. But I said,
+“You can't have a baby.”
+```
+
+## Example 2
+
+**Input**
+
+```text
+The horizon looked
+```
+
+**Output**
+
+```text
+the horizon looked like a big black hole. The sun was shining and the wind was blowing.
+
+The sun was shining with the sun, but it was still there. The stars were on top of each other.
+
+The sun was falling as a result of the weather. The sun was going up.
+```
+
+## Example 3
+
+**Input**
+
+```text
+what is the future of ai
+```
+
+**Output**
+
+```text
+what is the future of Ai?
+
+I don't know what you're talking about, but I'm sure that's a good reason for you to be optimistic.
+
+It doesn't matter if it doesn't take away your wandering in a positive direction.
+
+The answer is simple: I'll leave you with a feeling of uncertainty and optimism.
+```
+
+These outputs illustrate the model's ability to continue prompts coherently while also demonstrating typical limitations of small-scale pretrained language models, including repetition, topic drift, and inconsistent factual reasoning.
+
+---
+
+# Usage
+
+## Installation
+
+`device_map="auto"` in the examples below requires the `accelerate` package.
+
+```bash
+pip install torch transformers accelerate
+```
+
+## Loading the Model
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "GODELEV/Archaea-74M"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# Use half precision on GPU and full precision on CPU;
+# device_map="auto" places the weights on the available device.
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+    device_map="auto"
+)
+```
+
+## Text Generation
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+model_id = "GODELEV/Archaea-74M"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
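+prompt = "The future of artificial intelligence"
+
+# Tokenize the prompt and move the tensors to the model's device.
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+with torch.no_grad():
+    output = model.generate(
+        **inputs,
+        max_new_tokens=200,      # prompt plus new tokens must fit in the 1024-token context
+        temperature=0.8,         # only takes effect because do_sample=True
+        do_sample=True,
+        repetition_penalty=1.2,  # discourages repeated phrases
+        pad_token_id=tokenizer.eos_token_id  # reuse EOS as the padding token
+    )
+
+# Decode the full sequence (prompt plus continuation).
+print(tokenizer.decode(output[0], skip_special_tokens=True))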
+```
+
+---
+
+# Repository Structure
+
+```text
+Archaea-74M/
+├── config.json
+├── generation_config.json
+├── model.safetensors
+├── tokenizer.json
+├── tokenizer_config.json
+├── Archaea74M_Training_Loss_Curve.png
+├── Archaea74M_Learning_Rate_Schedule.png
+└── README.md
+```
+
+---
+
+# Limitations
+
+Archaea-74M is a base pretrained model and has not undergone:
+
+- Instruction tuning
+- RLHF
+- Preference optimization
+- Safety alignment
+
+Known limitations:
+
+- Hallucinations and factual inaccuracies
+- Limited reasoning due to model scale
+- Sensitivity to prompt phrasing
+- Fixed 1024-token context window
+- Not suitable for high-stakes applications
+
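Because the context window is fixed at 1024 tokens, the prompt and the generated continuation have to share that budget. A minimal helper for capping the generation length might look like the following; the function name and error message are illustrative, not part of any library.

```python
def clamp_new_tokens(prompt_len, requested, context_len=1024):
    """Return how many new tokens fit after a prompt of prompt_len tokens."""
    if prompt_len >= context_len:
        raise ValueError("prompt fills the context window; truncate it first")
    return min(requested, context_len - prompt_len)


print(clamp_new_tokens(prompt_len=10, requested=200))
print(clamp_new_tokens(prompt_len=900, requested=200))
```

The result can be passed as `max_new_tokens` to `model.generate`, with `prompt_len` taken from `inputs["input_ids"].shape[1]` in the generation example.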
+---
+
+# Future Work
+
+- Instruction tuning
+- Expanded benchmark coverage
+- Longer context lengths
+- Improved data quality and curriculum design
+
+---
+
+# Citation
+
+```bibtex
+@misc{archaea74m,
+  title={Archaea-74M},
+  author={Akshit Kumar},
+  year={2026},
+  publisher={Hugging Face},
+  url={https://huggingface.co/GODELEV/Archaea-74M}
+}
+```