Initialize project; model provided by the ModelHub XC community
Model: adityawakharkar/AstraGPTCoder-7B Source: Original Platform

---
base_model: adityawakharkar/AstraGPTCoder-7B
language:
- en
license: apache-2.0
tags:
- from-scratch
- custom-architecture
- custom-tokenizer
- reasoning
- chain-of-thought
- think-tags
- coding
- fine-tuned
- lora
- peft
- unsloth
- astragpt
- tantra-ai-labs
- rtx-4090
pipeline_tag: text-generation
library_name: transformers
model_creator: Tantra AI Labs
---

# AstraGPT-7B 🚀

<div align="center">

**A 7-Billion Parameter Language Model — Built From Scratch**

*Custom Architecture · Custom BPE Tokenizer · Reasoning Fine-Tuned on Dual RTX 4090*


Built by **Aditya Wakharkar** | [Tantra AI Labs](https://github.com/codewith-aditya)

</div>

---

## 🧠 What is AstraGPT-7B?

AstraGPT-7B is a **7-billion parameter decoder-only language model** designed for coding and chain-of-thought reasoning.

Unlike most open-source fine-tunes, **every core component of AstraGPT was designed and implemented from scratch in PyTorch** — including the transformer architecture, the BPE tokenizer, and the supervised fine-tuning pipeline.

The model was then **fine-tuned on a reasoning dataset** using LoRA on a **private VPS equipped with dual NVIDIA RTX 4090 GPUs**, giving it native support for `<think>...</think>` style reasoning output.

> *"Most people fine-tune models. We built one."*

---

## 🏗️ Built From Scratch — Architecture Overview

Every layer of AstraGPT-7B was implemented from first principles in PyTorch. No `AutoModel`, no copy-paste — pure custom code.

```
Input Token IDs
      │
      ▼
Token Embedding  [64,000 → 4,096]
      │
      ▼  ×32 Transformer Blocks
┌─────────────────────────────────────┐
│  AstraGPT Block                     │
│                                     │
│  RMSNorm (Pre-norm)                 │
│    → Grouped Query Attention (GQA)  │
│      · 32 Query Heads               │
│      · 8 Key-Value Heads            │
│      · RoPE (θ = 1,000,000)         │
│      · KV Cache for inference       │
│    → Residual Add                   │
│                                     │
│  RMSNorm (Pre-norm)                 │
│    → SwiGLU Feed-Forward Network    │
│      · gate_proj, up_proj, down_proj│
│      · intermediate_size = 11,008   │
│    → Residual Add                   │
└─────────────────────────────────────┘
      │
      ▼
Final RMSNorm
      │
      ▼
LM Head  [4,096 → 64,000]
      │
      ▼
Logits → Next Token
```
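
The RoPE step inside each block can be illustrated in plain Python. This is a minimal sketch and not the repository's `rotary_embedding.py`: the function name `rope_rotate` and the toy 4-dimensional vectors are hypothetical, and it assumes the adjacent-pair rotation convention (some implementations rotate split halves instead). Only θ = 1,000,000 is taken from the diagram.

```python
import math

def rope_rotate(vec, position, theta=1_000_000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        freq = theta ** (-i / dim)      # pair k = i/2 uses theta^(-2k/dim)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Both query/key pairs are 2 positions apart, so their scores should match.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 103))
```

Because each pair is rotated by an angle proportional to its position, the dot product of a rotated query and key depends only on their relative offset, which is the property attention relies on.
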

### Architecture Highlights

| Component | Implementation | Why |
|-----------|---------------|-----|
| **Grouped Query Attention (GQA)** | 32Q / 8KV heads — built from scratch | 4× less KV-cache memory than MHA. The same scheme is used in LLaMA-3 and Mistral |
| **Rotary Position Embeddings (RoPE)** | Full RoPE math from scratch, θ=1M | Better long-context behavior than learned position embeddings |
| **SwiGLU FFN** | `down_proj(SiLU(gate_proj(x)) × up_proj(x))` | Outperforms GELU/ReLU on LM benchmarks |
| **RMSNorm** | Pre-norm, no bias, no mean subtraction | ~30% faster than LayerNorm |
| **Flash Attention** | PyTorch 2.0 `scaled_dot_product_attention` | Memory-efficient attention with O(n) space |
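
The 4× KV-memory figure in the GQA row can be checked with a short calculation. This is a rough sketch under stated assumptions: a head dimension of 128 (4,096 hidden size / 32 query heads), 2 bytes per cached value (FP16/BF16), and a 2,048-token sequence. The helper `kv_cache_bytes` is hypothetical, not part of the codebase.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes for the cached keys and values of one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

HEAD_DIM = 4096 // 32   # hidden size / query heads = 128 (assumed head layout)
SEQ_LEN = 2048          # assumed sequence length

mha = kv_cache_bytes(32, 32, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(32, 8, HEAD_DIM, SEQ_LEN)   # 8 shared KV heads

print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB, ratio: {mha // gqa}x")
# prints "MHA: 1024 MiB, GQA: 256 MiB, ratio: 4x"
```

The ratio is simply 32 / 8, so it holds for any sequence length or precision; only the absolute sizes change.
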

### Parameter Count (~7B)

| Component | Parameters |
|-----------|-----------|
| Token Embedding (64K × 4096) | ~262M |
| Attention × 32 layers | ~2.15B |
| SwiGLU FFN × 32 layers | ~4.32B |
| RMSNorm × 65 | ~267K |
| LM Head | ~262M |
| **Total** | **~7.0B** |
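
The table's arithmetic can be reproduced from the dimensions in the architecture diagram. The sketch below is an illustration, not the repository's code, and assumes bias-free projections, an untied LM head, and a head dimension of 128. Under those assumptions the ~2.15B attention figure corresponds to full-width key/value projections; with 8 KV heads the same formula gives about 1.34B and a total near 6.2B.

```python
VOCAB, HIDDEN, LAYERS, FFN = 64_000, 4096, 32, 11_008
Q_HEADS, KV_HEADS = 32, 8
HEAD_DIM = HIDDEN // Q_HEADS          # 128 (assumed)

embedding = VOCAB * HIDDEN
lm_head = HIDDEN * VOCAB              # untied, since the table lists it separately
ffn = LAYERS * 3 * HIDDEN * FFN       # gate_proj, up_proj, down_proj
norms = (2 * LAYERS + 1) * HIDDEN     # two per block plus the final norm = 65

def attention_params(kv_heads):
    """q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj shrink with kv_heads."""
    q_and_o = 2 * HIDDEN * HIDDEN
    k_and_v = 2 * HIDDEN * kv_heads * HEAD_DIM
    return LAYERS * (q_and_o + k_and_v)

attn_mha = attention_params(Q_HEADS)   # full-width K/V projections
attn_gqa = attention_params(KV_HEADS)  # 8 KV heads, as in the diagram

for name, attn in [("full-width K/V", attn_mha), ("GQA 32Q/8KV", attn_gqa)]:
    total = embedding + attn + ffn + norms + lm_head
    print(f"{name}: attention {attn / 1e9:.2f}B, total {total / 1e9:.2f}B")
# prints "full-width K/V: attention 2.15B, total 7.00B"
# prints "GQA 32Q/8KV: attention 1.34B, total 6.20B"
```
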

---

## 🔤 Custom BPE Tokenizer — From Scratch

AstraGPT uses a **custom Byte Pair Encoding tokenizer** built entirely from scratch — no SentencePiece, no HuggingFace tokenizers library.

```python
# Built from scratch: BPETokenizer is this project's own class, not a library import
from tokenizer import BPETokenizer

tok = BPETokenizer(vocab_size=64_000)

# Open the corpus in a context manager so the file is closed after training
with open("corpus.txt", encoding="utf-8") as corpus:
    tok.train(corpus, num_merges=60_000)
```

**Tokenizer features:**
- **Byte-level base vocabulary** — 256 raw bytes, handles any Unicode
- **GPT-4 style pre-tokenization regex** — smart word boundary splitting
- **64,000 vocab size** — 60K BPE merges on top of byte base
- **Built-in special tokens:** `<think>`, `</think>`, `<|im_start|>`, `<|im_end|>`, BOS, EOS, PAD
- **`apply_chat_template()`** — custom chat format support
- **Save/load** — JSON-serializable merge rules
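
The byte-level merge procedure behind these features can be sketched in a few lines. This is a simplified illustration, not the project's `bpe_tokenizer.py`: the names `train_bpe`, `merge_pair`, and `decode` are hypothetical, and the sketch omits the pre-tokenization regex and the special tokens listed above.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules over the UTF-8 bytes of text."""
    ids = list(text.encode("utf-8"))    # base vocabulary: byte values 0..255
    merges = {}                         # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                       # nothing repeats, so merging gains nothing
        new_id = 256 + step             # new ids continue after the byte range
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges, ids

def decode(ids, merges):
    """Expand merged ids back into bytes and decode them as UTF-8."""
    expand = {new_id: pair for pair, new_id in merges.items()}
    stack, raw = list(reversed(ids)), bytearray()
    while stack:
        tok = stack.pop()
        if tok < 256:
            raw.append(tok)
        else:
            left, right = expand[tok]
            stack.extend([right, left])  # left is popped first
    return raw.decode("utf-8")

text = "low lower lowest low low"
merges, ids = train_bpe(text, num_merges=10)
restored = decode(ids, merges)
```
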

---

## ⚡ Training — Dual RTX 4090 on Private VPS

Fine-tuning was performed on a **private Linux VPS with 2× NVIDIA RTX 4090 GPUs** (48 GB VRAM in total).

### Hardware Setup

| Spec | Value |
|------|-------|
| GPUs | **2× NVIDIA RTX 4090** (24 GB VRAM each) |
| Total VRAM | **48 GB** |
| CPU | High-core-count server CPU |
| Infrastructure | Private VPS (bare metal) |
| OS | Ubuntu 22.04 LTS |
| CUDA | 12.x |

### Training Pipeline — Also Built From Scratch

The SFT (Supervised Fine-Tuning) training loop was implemented from scratch with production-grade features:

```python
# Full custom training loop
# (model, tokenizer, and dataset are assumed to be constructed beforehand)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    # Dual GPU via DDP
    use_bf16=True,
    grad_accumulation=8,
    learning_rate=2e-4,
    use_wandb=True,
)
trainer.train()
```

**Training loop features:**
- ✅ **Gradient accumulation** — effective large batch training
- ✅ **Mixed precision (BF16)** — full RTX 4090 tensor core utilization
- ✅ **Cosine LR schedule with warmup** — smooth convergence
- ✅ **Gradient clipping** — stable training
- ✅ **W&B logging** — real-time loss/LR tracking
- ✅ **Checkpoint saving** — best model tracking by loss
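
A cosine schedule with warmup, as named in the feature list, can be sketched as follows. This is a hedged illustration rather than the trainer's actual code: the function `lr_at`, the linear warmup shape, the decay to zero, and the 1,000-step run are assumptions, while the peak learning rate of 2e-4 and the 5% warmup ratio are the values this card reports.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL = 1000  # hypothetical number of optimizer steps
schedule = [lr_at(step, TOTAL) for step in range(TOTAL)]
```

With gradient accumulation, the schedule would typically advance once per optimizer step, that is, once every 8 micro-batches for the accumulation setting shown above.
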

### Fine-Tuning Hyperparameters

| Parameter | Value |
|-----------|-------|
| Method | LoRA (PEFT) via Unsloth |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2,048 tokens |
| Effective Batch Size | 16 (2 × grad_accum 8) |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 5% |
| Epochs | 3 |
| Precision | BF16 mixed precision |
| Optimizer | AdamW 8-bit |

### Post-Training

After fine-tuning, the LoRA adapter was **merged back into the base model weights**, resulting in a single, self-contained model with no external adapter dependency.
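
Merging a LoRA adapter amounts to adding the scaled low-rank product to each target weight: W' = W + (alpha / rank) · B · A. The toy example below is a sketch of that identity in plain Python, not the merge code used for this model; `matmul`, `merge_lora`, and the tiny matrices are hypothetical. The card's rank 16 and alpha 32 give the same scale of 2.0 used here.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * B @ A without modifying the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)             # (out, r) @ (r, in) -> (out, in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy shapes: out=2, in=3, rank=1, alpha=2 (scale 2.0, like rank=16 / alpha=32)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]         # (rank, in)
B = [[2.0], [1.0]]             # (out, rank)
x = [[1.0], [2.0], [3.0]]      # one input column vector

merged = merge_lora(W, A, B, rank=1, alpha=2)
merged_out = matmul(merged, x)

# Adapter path: base output plus the scaled low-rank update
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + 2.0 * l[0]] for b, l in zip(base_out, lora_out)]
# merged_out and adapter_out are both [[17.0], [4.0]]
```
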

---

## 🤔 Thinking / Reasoning Support

AstraGPT-7B generates `<think>`-tagged reasoning when triggered. This behavior was learned from the fine-tuning dataset, which used structured chain-of-thought formatting.

**Example:**

**Input:**
```
What is 15 * 47?
```

**Output:**
```
<think>
The multiplication involves multiplying 15 by 47.
15 × 47 = 15 × 40 + 15 × 7
        = 600 + 105
        = 705
</think>
705
```

**Trigger thinking mode:**
```python
# Append this to your prompt to force reasoning
prompt = tokenizer.apply_chat_template(messages, ...) + "<think>\n"
```
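
Downstream code usually needs the final answer separated from the reasoning. The helper below is one possible way to do that and is not part of the model's codebase; `split_reasoning` and its `prompt_opened_tag` flag are hypothetical. It assumes the tags survive decoding, which may not hold if they are stripped as special tokens.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text, prompt_opened_tag=False):
    """Return (reasoning, answer) extracted from generated text."""
    if prompt_opened_tag:
        # The prompt already ended with "<think>\n", so the generated text
        # starts inside the reasoning block; restore the opening tag.
        text = "<think>\n" + text
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

generated = "15 × 47 = 15 × 40 + 15 × 7\n= 600 + 105\n= 705\n</think>\n705"
reasoning, answer = split_reasoning(generated, prompt_opened_tag=True)
```
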

---

## ⚡ Quick Start

### Install

```bash
pip install transformers torch bitsandbytes accelerate
```

### Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "adityawakharkar/AstraGPT-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": "You are AstraGPT, a helpful coding AI built by Tantra AI Labs. Think carefully using <think>...</think> tags before answering."
    },
    {
        "role": "user",
        "content": "Write a Python function to reverse a linked list."
    }
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"  # ← triggers reasoning

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.3,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```

### 4-bit Quantized (Runs on ~6GB VRAM)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 weights with double quantization; matrix multiplies run in FP16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "adityawakharkar/AstraGPT-7B",
    quantization_config=bnb,
    device_map="auto"
)
```
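
A back-of-the-envelope estimate shows where the ~6GB figure plausibly comes from. This sketch counts weight storage only and assumes the headline 7.0B parameter count; `weight_gib` is a hypothetical helper. Quantization constants, any layers kept in higher precision, activations, and the KV cache all add to the 4-bit number, so real usage will be higher than the figure printed here.

```python
def weight_gib(n_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 7.0e9  # headline parameter count from this card

fp16 = weight_gib(N_PARAMS, 16)
nf4 = weight_gib(N_PARAMS, 4)

print(f"fp16 weights: {fp16:.1f} GiB, 4-bit weights: {nf4:.1f} GiB")
# prints "fp16 weights: 13.0 GiB, 4-bit weights: 3.3 GiB"
```
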

---

## 📁 Codebase

The full from-scratch implementation is open-source:

```
AstraGPT-7B-scratch/
├── model/
│   ├── config.py            ← AstraGPTConfig (7B hyperparams, 1B/3B presets)
│   ├── rotary_embedding.py  ← RoPE from scratch (precompute + apply)
│   ├── attention.py         ← GQA from scratch (32Q / 8KV + KV cache)
│   ├── feedforward.py       ← SwiGLU + RMSNorm + TransformerBlock
│   └── transformer.py       ← Full model + generate() + save/load
├── tokenizer/
│   ├── bpe_tokenizer.py     ← Full BPE tokenizer (train, encode, decode)
│   └── train_tokenizer.py   ← Train on any text corpus
└── training/
    └── sft_trainer.py       ← Complete SFT loop (grad accum, bf16, cosine LR)
```

---

## Bias, Risks, and Limitations

- **Hallucination:** Can produce confident but incorrect answers — always verify
- **Math limits:** Complex multi-step math may fail — 7B is a small model
- **English-primary:** Best performance in English
- **Reasoning trigger:** `<think>` tags work most reliably with an explicit `<think>\n` prefix in the prompt

---

## Environmental Impact

- **Hardware:** 2× NVIDIA RTX 4090 (48 GB combined VRAM)
- **Infrastructure:** Private bare-metal VPS
- **Training Duration:** ~3–4 hours
- **Carbon Emitted:** Estimated ~2–3 kg CO2eq

---

## Citation

```bibtex
@misc{astragpt7b2026,
  author       = {Aditya Wakharkar},
  title        = {AstraGPT-7B: A 7B LLM Built From Scratch with Chain-of-Thought Reasoning},
  year         = {2026},
  publisher    = {HuggingFace},
  organization = {Tantra AI Labs},
  url          = {https://huggingface.co/adityawakharkar/AstraGPT-7B},
  note         = {Custom architecture, custom BPE tokenizer, trained on 2× RTX 4090}
}
```

---

## Model Card Authors

**Aditya Wakharkar** — [@adityawakharkar](https://huggingface.co/adityawakharkar) | [GitHub @codewith-aditya](https://github.com/codewith-aditya)

## Contact

- 🐙 GitHub: [github.com/codewith-aditya](https://github.com/codewith-aditya)
- 🤗 HuggingFace: [@adityawakharkar](https://huggingface.co/adityawakharkar)

---

<div align="center">
<em>Built from scratch with ❤️ by <strong>Tantra AI Labs</strong></em><br/>
<em>Every layer. Every weight. Every line of code.</em>
</div>