Initialize project; model provided by the ModelHub XC community

Model: GODELEV/Archaea-74M Source: Original Platform
2026-06-09 06:18:26 +08:00
commit 1098bfb11f
10 changed files with 250734 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,38 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 Archaea74M_Learning_Rate_Schedule.png filter=lfs diff=lfs merge=lfs -text
 Archaea74M_Loss_Curve.png filter=lfs diff=lfs merge=lfs -text
 Archaea74M_Training_Loss_Curve.png filter=lfs diff=lfs merge=lfs -text
--- a/Archaea74M_Learning_Rate_Schedule.png
+++ b/Archaea74M_Learning_Rate_Schedule.png
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:db80a2c4277516d8d047c8324bdd6cb7c1b8cfddee7f39004afdb323e967d0dd
 size 219216
--- a/Archaea74M_Loss_Curve.png
+++ b/Archaea74M_Loss_Curve.png
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:fd075d547356f8521df640919d4e3988e376591d6a5ec43afe521942f27f959e
 size 166161
--- a/Archaea74M_Training_Loss_Curve.png
+++ b/Archaea74M_Training_Loss_Curve.png
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:e76f2f43111c49609573d4750730c0eb47dec904f5e483986866270e8c969ecb
 size 283149
--- a/README.md
+++ b/README.md
@@ -0,0 +1,324 @@
 ---
 license: mit
 datasets:
 - GODELEV/BetterDataset-2M
 language:
 - en
 pipeline_tag: text-generation
 ---
 # Archaea-74M
 Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. The model uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained using BF16 mixed precision.
 This release is an intermediate checkpoint. It has been trained on approximately **1.23 billion tokens** of a planned **1.6 billion token pretraining run** (18,800 of 25,000 steps), so the remaining part of the run has not yet been seen by the model.
 ---
 # Model Card
 | Attribute | Value |
 |------------|------------|
 | Model ID | GODELEV/Archaea-74M |
 | Parameters | ~74 Million |
 | Architecture | Decoder-only Transformer (LLaMA-style) |
 | Attention | Grouped Query Attention (GQA) |
 | Context Length | 1024 |
 | Tokenizer | GPT-2 |
 | Training Precision | BF16 |
 | Framework | PyTorch + Transformers |
 | License | MIT |
 ---
 # Architecture
 ## Transformer Configuration
 | Parameter | Value |
 |------------|------------|
 | Hidden Size | 512 |
 | Intermediate Size | 1408 |
 | Layers | 8 |
 | Attention Heads | 8 |
 | KV Heads | 2 |
 | GQA Ratio | 4:1 |
 | Activation | SiLU |
 | Normalization | RMSNorm |
 | Context Length | 1024 |
 The model uses Grouped Query Attention: its 8 query heads share 2 key/value heads, so the KV cache is one quarter of the size it would be with a separate key/value head for every query head.
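As a rough cross-check of the ~74M figure, the sketch below counts the weights implied by the configuration table. It assumes a standard LLaMA layout (bias-free linear layers, a gated three-matrix MLP, two RMSNorm weights per layer plus a final norm), a vocabulary of 50,257 entries, and untied input and output embeddings. The last two values are taken from the `config.json` in this commit rather than from the table, and the function names are illustrative only.

```python
# Back-of-the-envelope parameter and KV-cache count for a LLaMA-style decoder.
# Assumptions (not stated in the table): vocab_size=50257, untied embeddings,
# no bias terms, head_dim = hidden_size / num_heads.

def count_parameters(vocab=50257, hidden=512, intermediate=1408,
                     layers=8, heads=8, kv_heads=2, tied=False):
    head_dim = hidden // heads
    attn = (hidden * heads * head_dim            # q_proj
            + 2 * hidden * kv_heads * head_dim   # k_proj and v_proj
            + heads * head_dim * hidden)         # o_proj
    mlp = 3 * hidden * intermediate              # gate, up, and down projections
    norms = 2 * hidden                           # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embeddings = vocab * hidden * (1 if tied else 2)
    return embeddings + layers * per_layer + hidden  # final RMSNorm weight


def kv_cache_elements_per_token(layers=8, kv_heads=2, head_dim=64):
    # One key vector and one value vector per KV head in every layer.
    return 2 * layers * kv_heads * head_dim


total = count_parameters()
gqa = kv_cache_elements_per_token(kv_heads=2)
mha = kv_cache_elements_per_token(kv_heads=8)
print(f"parameters: {total:,}")
print(f"KV cache per token: {gqa} elements with GQA vs {mha} with full MHA")
```

Under these assumptions the count is 74,016,256, of which roughly 51.5M sit in the two embedding matrices. With 2 KV heads the cache holds 2,048 values per token instead of 8,192, which is the 4:1 ratio listed in the table.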
 ---
 # Training
 ## Dataset
 Archaea-74M was pretrained on **GODELEV/BetterDataset-2M**, a multi-source corpus composed of:
 - General web text
 - Conversational content
 - Knowledge-focused material
 - Educational content
 - Instruction-like examples
 - Technical and programming text
 The complete corpus contains approximately **1.6 billion tokens**.
 ### Training Progress
 | Metric | Value |
 |----------|----------|
 | Planned Tokens | ~1.6B |
 | Tokens Trained | ~1.23B |
 | Completion | ~77% |
 | Planned Steps | 25,000 |
 | Completed Steps | 18,800 |
 ## Optimization
 | Parameter | Value |
 |------------|------------|
 | Optimizer | AdamW |
 | Scheduler | OneCycleLR |
 | Peak Learning Rate | 6e-4 |
 | Weight Decay | 0.1 |
 | Gradient Clipping | 1.0 |
 | Sequence Length | 1024 |
 | Effective Batch Size | 64 |
 | Precision | BF16 |
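The token and step counts in the tables can be related with simple arithmetic. The sketch below assumes the effective batch size is measured in sequences and that every sequence is packed to the full 1,024 tokens. Neither is stated explicitly, so treat the result as an estimate.

```python
# Tokens seen per optimizer step, assuming fully packed sequences.
SEQ_LEN = 1024
BATCH_SEQUENCES = 64          # assumed unit: sequences per optimizer step
COMPLETED_STEPS = 18_800
PLANNED_STEPS = 25_000

tokens_per_step = SEQ_LEN * BATCH_SEQUENCES
tokens_trained = COMPLETED_STEPS * tokens_per_step
tokens_planned = PLANNED_STEPS * tokens_per_step
step_completion = COMPLETED_STEPS / PLANNED_STEPS

print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens trained:  {tokens_trained / 1e9:.2f}B")
print(f"tokens planned:  {tokens_planned / 1e9:.2f}B")
print(f"step completion: {step_completion:.1%}")
```

This reproduces the ~1.23B and ~1.6B figures. The step ratio comes to about 75%, slightly below the ~77% obtained by dividing the rounded token counts (1.23B / 1.6B).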
 ## Training Statistics
 | Metric | Value |
 |------------|------------|
 | Initial Loss | 10.9223 |
 | Final Loss | 2.9488 |
 | Best Loss | 2.8071 |
 | Final Perplexity | 19.08 |
 | Best Perplexity | 16.56 |
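The perplexity rows follow from the loss rows if the loss is the mean per-token cross-entropy in nats, which is the usual convention but is an assumption here. In that case perplexity is the exponential of the loss, and a uniform guess over a 50,257-token vocabulary (the `vocab_size` in `config.json`) gives a reference point for the initial loss.

```python
import math


def perplexity(loss_nats):
    # Perplexity is exp(cross-entropy) when the loss is measured in nats.
    return math.exp(loss_nats)


final_ppl = perplexity(2.9488)
best_ppl = perplexity(2.8071)
uniform_loss = math.log(50257)   # loss of a uniform guess over the vocabulary

print(f"final perplexity:   {final_ppl:.2f}")
print(f"best perplexity:    {best_ppl:.2f}")
print(f"uniform-guess loss: {uniform_loss:.4f}")
```

The first two values match the table (19.08 and 16.56). The uniform-guess loss is about 10.82, so the initial loss of 10.9223 is close to what an untrained model is expected to produce.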
 ## Training Loss Curve
 <img src="Archaea74M_Training_Loss_Curve.png" width="700"/>
 ## Learning Rate Schedule
 <img src="Archaea74M_Learning_Rate_Schedule.png" width="700"/>
 ---
 # Evaluation
 The model was evaluated with the EleutherAI LM Evaluation Harness (`lm-eval`).
 ## Benchmark Results
 All results are zero-shot.
 | Benchmark | Metric | Score |
 |------------|------------|------------|
 | HellaSwag | acc_norm | 27.31% |
 | PIQA | acc_norm | 58.54% |
 | WinoGrande | acc | 51.54% |
 | BoolQ | acc | 56.33% |
 | ARC-Easy | acc_norm | 39.06% |
 | ARC-Challenge | acc_norm | 22.70% |
 | OpenBookQA | acc_norm | 26.00% |
 | CommonsenseQA | acc | 19.66% |
 | LAMBADA | acc | 18.01% |
 | BLiMP | acc | 74.91% |
 | MMLU | acc | 25.07% |
 | SciQ | acc_norm | 57.70% |
 | COPA | acc | 61.00% |
 | RACE | acc | 24.78% |
 | SWAG | acc_norm | 41.98% |
 | TruthfulQA MC2 | acc | 46.46% |
 | WikiText-2 | Word Perplexity | 68.06 |
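To put the multiple-choice scores in context, it helps to compare them with random-guess accuracy. The sketch below does this for a subset of the benchmarks. The chance levels are derived from each task's usual number of answer options, which is an assumption about the task variants used, since the exact lm-eval task names are not listed.

```python
# Reported score vs. random-guess accuracy for a subset of benchmarks.
# Mapping: benchmark -> (reported accuracy in %, assumed number of options).
RESULTS = {
    "HellaSwag": (27.31, 4),
    "PIQA": (58.54, 2),
    "WinoGrande": (51.54, 2),
    "ARC-Easy": (39.06, 4),
    "ARC-Challenge": (22.70, 4),
    "OpenBookQA": (26.00, 4),
    "CommonsenseQA": (19.66, 5),
    "MMLU": (25.07, 4),
    "COPA": (61.00, 2),
}


def margin_over_chance(score, options):
    # Percentage points above (or below) uniform random guessing.
    return round(score - 100.0 / options, 2)


margins = {name: margin_over_chance(*v) for name, v in RESULTS.items()}
for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {margin:+.2f} points vs. chance")
```

On this view ARC-Easy, COPA, and PIQA show the clearest signal, while MMLU, CommonsenseQA, and ARC-Challenge sit at or below chance, which is common for models at this scale.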
 ## Evaluation Notes
 The following tasks failed to run and are not included in the results:
 - SocialIQA: dataset loading incompatibility.
 - Arithmetic benchmark: the task identifier changed in recent lm-eval versions.
 ---
 # Sample Generations
 The following examples were generated directly from the pretrained base model.
 ## Example 1
 **Input**
 ```text
 Once upon a time
 ```
 **Output**
 ```text
 Once upon a time
 In the days after, I was told that I could not make a living. So I said to myself,
 “I want to go back to my family.” I said, “This is a good one. You can take me to the doctor
 or something and do it with you.”
 The doctor asked me how I felt about this. He said he did. But I said,
 “You can't have a baby.”
 ```
 ## Example 2
 **Input**
 ```text
 The horizon looked
 ```
 **Output**
 ```text
 the horizon looked like a big black hole. The sun was shining and the wind was blowing.
 The sun was shining with the sun, but it was still there. The stars were on top of each other.
 The sun was falling as a result of the weather. The sun was going up.
 ```
 ## Example 3
 **Input**
 ```text
 what is the future of ai
 ```
 **Output**
 ```text
 what is the future of Ai?
 I don't know what you're talking about, but I'm sure that's a good reason for you to be optimistic.
 It doesn't matter if it doesn't take away your wandering in a positive direction.
 The answer is simple: I'll leave you with a feeling of uncertainty and optimism.
 ```
 These outputs illustrate the model's ability to continue prompts coherently while also demonstrating typical limitations of small-scale pretrained language models, including repetition, topic drift, and inconsistent factual reasoning.
 ---
 # Usage
 ### Installation
 ```bash
 pip install torch transformers accelerate
 ```
 ### Loading the Model
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
 model_id = "GODELEV/Archaea-74M"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 # Load in half precision on GPU and full precision on CPU.
 # device_map="auto" requires the `accelerate` package to be installed.
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
     device_map="auto"
 )
 ```
 ### Text Generation
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
 model_id = "GODELEV/Archaea-74M"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
 prompt = "The future of artificial intelligence"
 # Tokenize the prompt and move the tensors to the model's device.
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 with torch.no_grad():
     output = model.generate(
         **inputs,
         max_new_tokens=200,
         do_sample=True,           # sample instead of decoding greedily
         temperature=0.8,
         repetition_penalty=1.2,   # discourage repeated phrases
         pad_token_id=tokenizer.eos_token_id
     )
 # The decoded text contains the prompt followed by the continuation.
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 ---
 # Repository Structure
 ```text
 Archaea-74M/
 ├── config.json
 ├── generation_config.json
 ├── model.safetensors
 ├── tokenizer.json
 ├── tokenizer_config.json
 ├── Archaea74M_Loss_Curve.png
 ├── Archaea74M_Training_Loss_Curve.png
 ├── Archaea74M_Learning_Rate_Schedule.png
 └── README.md
 ```
 ---
 # Limitations
 Archaea-74M is a base pretrained model and has not undergone:
 - Instruction tuning
 - RLHF
 - Preference optimization
 - Safety alignment
 Known limitations:
 - Hallucinations and factual inaccuracies
 - Limited reasoning due to model scale
 - Sensitivity to prompt phrasing
 - Fixed 1024-token context window
 - Not suitable for high-stakes applications
 ---
 # Future Work
 - Instruction tuning
 - Expanded benchmark coverage
 - Longer context lengths
 - Improved data quality and curriculum design
 ---
 # Citation
 ```bibtex
@misc{archaea74m,
  title={Archaea-74M},
  author={Akshit Kumar},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/GODELEV/Archaea-74M}
 }
 ```
--- a/config.json
+++ b/config.json
@@ -0,0 +1,32 @@
 {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "float32",
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 1408,
  "max_position_embeddings": 1024,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "num_key_value_heads": 2,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 10000.0,
    "rope_type": "default"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.9.0",
  "use_cache": true,
  "vocab_size": 50257
 }
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,9 @@
 {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "transformers_version": "5.9.0",
  "use_cache": false
 }
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:48f265103ec709e8ea38cdf5f03dc36990d89b71703df63810b233c4d562d42f
 size 296073328
--- a/tokenizer.json
+++ b/tokenizer.json
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,13 @@
 {
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "is_local": false,
  "local_files_only": false,
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
 }