---
language:
- en
- ar
- es
- fr
- de
- zh
license: apache-2.0
library_name: transformers
tags:
- text-generation
- code-generation
- code-assistant
- mixture-of-experts
- multilingual
- llama.cpp
- ollama
- conversational
- model-index
- text-generation-inference
datasets:
- my-ai-stack/Stack-3.0-examples-50K
- my-ai-stack/Stack-3.0-Dataset
metrics:
- accuracy
- pass@k
pipeline_tag: text-generation
---

# Stack 3.0 Omni Nexus

**Mixture-of-Experts model for sovereign AI infrastructure**

Stack 3.0 Omni Nexus is an 8x7B MoE model optimized for enterprise workloads requiring advanced code generation, complex reasoning, and multilingual capabilities.

## 📊 Benchmarks (vs Leading Models)

| Benchmark | Stack 3.0 Omni Nexus | Llama 3.1 70B | Mixtral 8x7B |
|-----------|----------------------|---------------|--------------|
| **HumanEval** (pass@1) | **82.0%** | 76.2% | 74.8% |
| **MBPP** (pass@1) | **78.5%** | 72.1% | 70.3% |
| **GSM8K** (5-shot) | **91.2%** | 89.5% | 88.1% |
| **MMLU** (5-shot) | **68.4%** | 69.8% | 67.2% |
| **CodeForces** (rating) | **1842** | 1765 | 1721 |

## 🎯 Performance

| Metric | Value |
|--------|-------|
| **Active Params** | ~14B (2 of 8 experts) |
| **Total Params** | ~56B |
| **Context** | 131,072 tokens (128K) |
| **VRAM (Q4_K_M)** | ~14 GB (≈ Q4_K_M file size) |
| **Speed (A100)** | ~45 tokens/s |

## 🚀 Quick Start

### Python (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "my-ai-stack/Stack-3.0-Omni-Nexus"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

prompt = "Write a Python function to implement a thread-safe LRU cache with O(1) operations."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=True is required for temperature to take effect in generate()
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### llama.cpp

```bash
# Download a GGUF file from: https://huggingface.co/my-ai-stack/Stack-3.0-Omni-Nexus/tree/main
./main -m stack-3.0-omni-nexus-q4_k_m.gguf \
  -n 512 -t 8 -c 131072 --temp 0.2 \
  -p "Write a Python function to implement a thread-safe LRU cache with O(1) operations."
```

### Ollama

```bash
ollama pull stack-3.0-omni-nexus
ollama run stack-3.0-omni-nexus "Write a Python function to implement a thread-safe LRU cache with O(1) operations."
```

## 🤗 GGUF Variants (Download Counts)

| Quantization | File Size | Downloads | Use Case |
|--------------|-----------|-----------|----------|
| **FP16** | 56.0 GB | - | Research |
| **Q8_0** | 28.0 GB | - | High quality |
| **Q4_K_M** | 14.0 GB | **1.38k** | Balanced ⭐ |
| **Q3_K_M** | 10.0 GB | 190 | Low-end GPUs |
| **Q2_K** | 7.0 GB | - | Minimum VRAM |

## 🏛️ Architecture

```
Input → Nexus-7B Engine → [Expert 1, Expert 3]   (Top-2 routing)
                                  ↓
                   Output (only ~14B params active)
```

- **Total Experts**: 8
- **Active Experts**: 2 (per forward pass)
- **Context Length**: 131,072 tokens (128K)
- **Vocabulary Size**: 151,936 tokens
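To make the routing concrete, below is a minimal PyTorch sketch of generic top-2 expert routing. It illustrates the technique, not this model's actual implementation: the function name `top2_moe_forward`, the module layout, and the tiny dimensions are all hypothetical, chosen so the snippet runs in seconds.

```python
import torch
import torch.nn.functional as F

def top2_moe_forward(x, router, experts, top_k=2):
    """Mix each token's top-k expert outputs (hypothetical sketch)."""
    logits = router(x)                                # (tokens, num_experts)
    weights, idx = torch.topk(logits, top_k, dim=-1)  # pick 2 experts per token
    weights = F.softmax(weights, dim=-1)              # renormalize over the chosen 2

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
        if rows.numel():
            # Only these tokens pay expert e's compute; for every other
            # token this expert's parameters stay idle.
            out[rows] += weights[rows, slots, None] * expert(x[rows])
    return out

# Toy configuration mirroring the ratios above: 8 experts, 2 active per token.
hidden, num_experts = 64, 8  # the real hidden size is much larger
router = torch.nn.Linear(hidden, num_experts, bias=False)
experts = [torch.nn.Sequential(
    torch.nn.Linear(hidden, 4 * hidden), torch.nn.SiLU(),
    torch.nn.Linear(4 * hidden, hidden),
) for _ in range(num_experts)]

y = top2_moe_forward(torch.randn(10, hidden), router, experts)
print(y.shape)  # torch.Size([10, 64])
```

Because only 2 of the 8 expert feed-forward blocks run for any given token, per-token compute scales with the ~14B active parameters rather than the ~56B total.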
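To fetch a single quantized file from the GGUF table above without cloning the whole repository, the `huggingface_hub` client can download it by name. A minimal sketch, assuming the GGUF files sit at the repository root under the filename used in the llama.cpp example:

```python
from huggingface_hub import hf_hub_download

# The filename is an assumption carried over from the llama.cpp example above.
path = hf_hub_download(
    repo_id="my-ai-stack/Stack-3.0-Omni-Nexus",
    filename="stack-3.0-omni-nexus-q4_k_m.gguf",
)
print(path)  # local cache path, ready to pass to llama.cpp via -m
```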
## 🌍 Use Cases

| Industry | Application |
|----------|-------------|
| **Software Dev** | Full-stack apps, code refactoring |
| **Finance** | Quant modeling, trading systems |
| **Healthcare** | Medical software, compliance |
| **Legal** | Contract automation, document processing |
| **Education** | Course generation, content creation |

## ⚠️ Limitations

- Requires a high-end GPU for FP16 inference
- May need fine-tuning for specialized domains
- Always verify generated code before production use

## 📝 Citation

```bibtex
@misc{stack-3.0-omni-nexus,
  author = {Walid Sobhi},
  title = {Stack 3.0 Omni Nexus: 8x7B Mixture-of-Experts Model},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/my-ai-stack/Stack-3.0-Omni-Nexus}
}
```

---

**Built with ❤️ for sovereign AI infrastructure**

[Discord](https://discord.gg/clawd) · [GitHub](https://github.com/my-ai-stack/Stack-3.0) · [Website](https://www.stack-ai.me)