---
base_model: adityawakharkar/AstraGPTCoder-7B
language:
- en
license: apache-2.0
tags:
- from-scratch
- custom-architecture
- custom-tokenizer
- reasoning
- chain-of-thought
- think-tags
- coding
- fine-tuned
- lora
- peft
- unsloth
- astragpt
- tantra-ai-labs
- rtx-4090
pipeline_tag: text-generation
library_name: transformers
model_creator: Tantra AI Labs
---

# AstraGPT-7B 🚀
**A 7-Billion Parameter Language Model — Built From Scratch**

*Custom Architecture · Custom BPE Tokenizer · Reasoning Fine-Tuned on Dual RTX 4090*

[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/HuggingFace-AstraGPT--7B-yellow)](https://huggingface.co/adityawakharkar/AstraGPT-7B)
[![Params](https://img.shields.io/badge/Parameters-7B-blue)]()
[![GPU](https://img.shields.io/badge/Trained%20On-2×%20RTX%204090-76b900?logo=nvidia)](https://www.nvidia.com)
[![By](https://img.shields.io/badge/By-Tantra%20AI%20Labs-purple)](https://github.com/codewith-aditya)

Built by **Aditya Wakharkar** | [Tantra AI Labs](https://github.com/codewith-aditya)

---

## 🧠 What is AstraGPT-7B?

AstraGPT-7B is a **7-billion-parameter decoder-only language model** designed for coding and chain-of-thought reasoning. Unlike most open-source fine-tunes, **every core component of AstraGPT was designed and implemented from scratch in PyTorch**, including the transformer architecture, the BPE tokenizer, and the supervised fine-tuning pipeline. The model was then **fine-tuned on a reasoning dataset** using LoRA on a **private VPS equipped with dual NVIDIA RTX 4090 GPUs**, giving it native support for `<think>...</think>`-style reasoning output.

> *"Most people fine-tune models. We built one."*

---

## 🏗️ Built From Scratch — Architecture Overview

Every layer of AstraGPT-7B was implemented from first principles in PyTorch. No `AutoModel`, no copy-paste — pure custom code.

```
Input Token IDs
      │
      ▼
Token Embedding [64,000 → 4,096]
      │
      ▼
×32 Transformer Blocks
┌──────────────────────────────────────┐
│  AstraGPT Block                      │
│                                      │
│  RMSNorm (pre-norm)                  │
│   → Grouped Query Attention (GQA)    │
│      · 32 query heads                │
│      · 8 key-value heads             │
│      · RoPE (θ = 1,000,000)          │
│      · KV cache for inference        │
│   → Residual add                     │
│                                      │
│  RMSNorm (pre-norm)                  │
│   → SwiGLU feed-forward network      │
│      · gate_proj, up_proj, down_proj │
│      · intermediate_size = 11,008    │
│   → Residual add                     │
└──────────────────────────────────────┘
      │
      ▼
Final RMSNorm
      │
      ▼
LM Head [4,096 → 64,000]
      │
      ▼
Logits → Next Token
```

### Architecture Highlights

| Component | Implementation | Why |
|-----------|----------------|-----|
| **Grouped Query Attention (GQA)** | 32 query / 8 KV heads — built from scratch | 4× smaller KV cache than MHA; same scheme used in Llama 3 and Mistral |
| **Rotary Position Embeddings (RoPE)** | Full RoPE math from scratch, θ = 1M | Better long-context behaviour than learned position embeddings |
| **SwiGLU FFN** | gate × SiLU(up) through down_proj | Outperforms GELU/ReLU on LM benchmarks |
| **RMSNorm** | Pre-norm, no bias, no mean subtraction | ~30% faster than LayerNorm |
| **Flash Attention** | PyTorch 2.0 `scaled_dot_product_attention` | Memory-efficient attention with O(n) memory |
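To make "from scratch" concrete, here is a minimal, illustrative sketch of two of the building blocks named above: RMSNorm (no bias, no mean subtraction) and the SwiGLU feed-forward block (gate × SiLU(up) through down_proj). It is a re-implementation following the description in this card, not the repository's actual code; class names and defaults are assumptions.

```python
# Illustrative sketch only -- not the repository's actual implementation.
# Shapes follow this card: hidden_size = 4096, intermediate_size = 11008.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Pre-norm RMSNorm: rescale by the root-mean-square, no bias, no mean subtraction."""

    def __init__(self, hidden_size: int = 4096, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension only (no mean subtraction).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down_proj( SiLU(gate_proj(x)) * up_proj(x) )."""

    def __init__(self, hidden_size: int = 4096, intermediate_size: int = 11008):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


# Quick shape check
x = torch.randn(1, 8, 4096)
print(SwiGLUFeedForward()(RMSNorm()(x)).shape)  # torch.Size([1, 8, 4096])
```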
### Parameter Count (~7B)

| Component | Parameters |
|-----------|------------|
| Token Embedding (64K × 4,096) | ~262M |
| Attention × 32 layers | ~2.15B |
| SwiGLU FFN × 32 layers | ~4.32B |
| RMSNorm × 65 | ~267K |
| LM Head | ~262M |
| **Total** | **~7.0B** |

---

## 🔤 Custom BPE Tokenizer — From Scratch

AstraGPT uses a **custom Byte Pair Encoding tokenizer** built entirely from scratch — no SentencePiece, no Hugging Face `tokenizers` library.

```python
# Built from scratch
from tokenizer import BPETokenizer

tok = BPETokenizer(vocab_size=64_000)
tok.train(open("corpus.txt"), num_merges=60_000)
```

**Tokenizer features:**

- **Byte-level base vocabulary** — 256 raw bytes, handles any Unicode input
- **GPT-4-style pre-tokenization regex** — smart word-boundary splitting
- **64,000 vocab size** — 60K BPE merges on top of the byte base
- **Built-in special tokens:** `<think>`, `</think>`, `<|im_start|>`, `<|im_end|>`, BOS, EOS, PAD
- **`apply_chat_template()`** — custom chat-format support (see the usage sketch below)
- **Save/load** — JSON-serializable merge rules
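As a rough illustration, the sketch below continues from the `tok` trained in the snippet above. `encode`, `decode`, `apply_chat_template()`, and save/load are all named in this card, but the exact signatures shown here are assumptions, not documented API.

```python
# Usage sketch -- `tok` is the tokenizer trained above; signatures are assumptions.
text = "def reverse_list(head): ..."
ids = tok.encode(text)             # byte-level BPE: any Unicode input can be encoded
print(tok.decode(ids))             # should round-trip thanks to the byte-level base vocab

messages = [
    {"role": "system", "content": "You are AstraGPT."},
    {"role": "user", "content": "What is 15 * 47?"},
]
# apply_chat_template() presumably wraps each turn in the <|im_start|>/<|im_end|>
# special tokens listed above; appending <think> asks the model to reason first.
prompt = tok.apply_chat_template(messages) + "<think>\n"

tok.save("astragpt_tokenizer.json")  # JSON-serializable merges (filename is hypothetical)
```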
---

## ⚡ Training — Dual RTX 4090 on Private VPS

Fine-tuning was performed on a **private Linux VPS with 2× NVIDIA RTX 4090 GPUs** (48 GB VRAM total).

### Hardware Setup

| Spec | Value |
|------|-------|
| GPUs | **2× NVIDIA RTX 4090** (24 GB VRAM each) |
| Total VRAM | **48 GB** |
| CPU | High-core-count server CPU |
| Infrastructure | Private VPS (bare metal) |
| OS | Ubuntu 22.04 LTS |
| CUDA | 12.x |

### Training Pipeline — Also Built From Scratch

The SFT (supervised fine-tuning) training loop was implemented from scratch with production-grade features:

```python
# Full custom training loop -- dual GPU via DDP
from training.sft_trainer import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    use_bf16=True,
    grad_accumulation=8,
    learning_rate=2e-4,
    use_wandb=True,
)
trainer.train()
```

**Training loop features:**

- ✅ **Gradient accumulation** — effective large-batch training
- ✅ **Mixed precision (BF16)** — full RTX 4090 tensor-core utilization
- ✅ **Cosine LR schedule with warmup** — smooth convergence
- ✅ **Gradient clipping** — stable training
- ✅ **W&B logging** — real-time loss/LR tracking
- ✅ **Checkpoint saving** — best-model tracking by loss

### Fine-Tuning Hyperparameters

| Parameter | Value |
|-----------|-------|
| Method | LoRA (PEFT) via Unsloth |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2,048 tokens |
| Effective Batch Size | 16 (2 GPUs × gradient accumulation 8) |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 5% |
| Epochs | 3 |
| Precision | BF16 mixed precision |
| Optimizer | AdamW 8-bit |

### Post-Training

After fine-tuning, the LoRA adapter was **merged back into the base model weights**, resulting in a single, self-contained model with no external adapter dependency.

---

## 🤔 Thinking / Reasoning Support

AstraGPT-7B natively generates `<think>`-tagged reasoning when triggered. This behaviour was trained in via the fine-tuning dataset, which used structured chain-of-thought formatting.

**Example:**

**Input:**

```
What is 15 * 47?
```

**Output:**

```
<think>
The multiplication involves multiplying 15 by 47.
15 × 47 = 15 × 40 + 15 × 7 = 600 + 105 = 705
</think>

705
```

**Trigger thinking mode:**

```python
# Append this to your prompt to force reasoning
prompt = tokenizer.apply_chat_template(messages, ...) + "<think>\n"
```

---

## ⚡ Quick Start

### Install

```bash
pip install transformers torch bitsandbytes accelerate
```

### Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "adityawakharkar/AstraGPT-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": "You are AstraGPT, a helpful coding AI built by Tantra AI Labs. Think carefully using <think>...</think> tags before answering."
    },
    {
        "role": "user",
        "content": "Write a Python function to reverse a linked list."
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
) + "<think>\n"  # ← triggers reasoning

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.3,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```
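Because the prompt ends with an opening `<think>` tag, the generated text typically contains the chain-of-thought followed by `</think>` and then the final answer. If you want to separate the two programmatically, a small helper like the sketch below can be used. It is a hypothetical convenience function, not part of the model's API, and assumes the `</think>` marker survives decoding.

```python
# Hypothetical helper -- not part of AstraGPT's API.
def split_reasoning(text: str) -> tuple[str, str]:
    """Split '...reasoning...</think> answer' into (reasoning, answer)."""
    head, sep, tail = text.partition("</think>")
    if not sep:                       # no closing tag found: treat everything as the answer
        return "", text.strip()
    return head.strip(), tail.strip()

reasoning, answer = split_reasoning(response)
print("Reasoning:\n", reasoning)
print("\nAnswer:\n", answer)
```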
### 4-bit Quantized (Runs on ~6 GB VRAM)

```python
from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "adityawakharkar/AstraGPT-7B",
    quantization_config=bnb,
    device_map="auto",
)
```

---

## 📁 Codebase

The full from-scratch implementation is open source:

```
AstraGPT-7B-scratch/
├── model/
│   ├── config.py            ← AstraGPTConfig (7B hyperparams, 1B/3B presets)
│   ├── rotary_embedding.py  ← RoPE from scratch (precompute + apply)
│   ├── attention.py         ← GQA from scratch (32Q / 8KV + KV cache)
│   ├── feedforward.py       ← SwiGLU + RMSNorm + TransformerBlock
│   └── transformer.py       ← Full model + generate() + save/load
├── tokenizer/
│   ├── bpe_tokenizer.py     ← Full BPE tokenizer (train, encode, decode)
│   └── train_tokenizer.py   ← Train on any text corpus
└── training/
    └── sft_trainer.py       ← Complete SFT loop (grad accum, bf16, cosine LR)
```

---

## Bias, Risks, and Limitations

- **Hallucination:** Can produce confident but incorrect answers — always verify outputs
- **Math limits:** Complex multi-step math may fail — 7B is a small model
- **English-primary:** Best performance in English
- **Reasoning trigger:** `<think>` reasoning works most reliably with an explicit `<think>\n` appended to the prompt

---

## Environmental Impact

- **Hardware:** 2× NVIDIA RTX 4090 (48 GB combined VRAM)
- **Infrastructure:** Private bare-metal VPS
- **Training Duration:** ~3–4 hours
- **Carbon Emitted:** Estimated ~2–3 kg CO2eq

---

## Citation

```bibtex
@misc{astragpt7b2026,
  author       = {Aditya Wakharkar},
  title        = {AstraGPT-7B: A 7B LLM Built From Scratch with Chain-of-Thought Reasoning},
  year         = {2026},
  publisher    = {HuggingFace},
  organization = {Tantra AI Labs},
  url          = {https://huggingface.co/adityawakharkar/AstraGPT-7B},
  note         = {Custom architecture, custom BPE tokenizer, trained on 2x RTX 4090}
}
```

---

## Model Card Authors

**Aditya Wakharkar** — [@adityawakharkar](https://huggingface.co/adityawakharkar) | [GitHub @codewith-aditya](https://github.com/codewith-aditya)

## Contact

- 🐙 GitHub: [github.com/codewith-aditya](https://github.com/codewith-aditya)
- 🤗 HuggingFace: [@adityawakharkar](https://huggingface.co/adityawakharkar)

---
Built from scratch with ❤️ by Tantra AI Labs
Every layer. Every weight. Every line of code.