---
license: other
license_name: nvidia-open-model-license
license_link: LICENSE
language:
- en
library_name: transformers
tags:
- awq
- int4
- quantization
- orchestration
- tool-calling
- noesis
- dhcf-fno
- qwen3
base_model: nvidia/Nemotron-Orchestrator-8B
quantized_by: AMAImedia
pipeline_tag: text-generation
---

# Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4

**AWQ INT4 quantization of [nvidia/Nemotron-Orchestrator-8B](https://huggingface.co/nvidia/Nemotron-Orchestrator-8B), optimized for low-VRAM consumer hardware (RTX 3060 6 GB).**

Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform** (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).

- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.6
- **Release date:** 2026-04

---

## ⚠️ License notice

This model inherits the **NVIDIA Open Model License** from the upstream `nvidia/Nemotron-Orchestrator-8B`. The base model is designated by NVIDIA as **"for research and development only"**. This AWQ derivative is published to make the model accessible to the broader research and development community on consumer GPUs.

**Users are responsible for compliance with NVIDIA's license terms** — see the `LICENSE` file in this repository for the full text. By downloading or using this model you agree to the upstream NVIDIA license.

---

## Model summary

| Property | Value |
| --- | --- |
| Base model | nvidia/Nemotron-Orchestrator-8B |
| Underlying architecture | Qwen3-8B (decoder-only transformer, **dense, NOT MoE**) |
| Original precision | FP32 safetensors (~32 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Language | English (per base model) |
| Disk footprint | ~4.5 GB |
| Inference VRAM | ~5.0 GB (fully resident on a 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 in-house orchestration / tool-calling prompts, max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |

A companion **BF16 reference checkpoint** is also published: [amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS](https://huggingface.co/amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS).

---

## Why this quantization

The original Nemotron-Orchestrator-8B ships in **FP32 (~32 GB on disk)**, which does not fit on any consumer GPU. Community quantizations exist (mostly GGUF), but none is calibrated specifically for orchestration / tool-calling scale search and packaged for the **AutoAWQ GEMM kernel path** that integrates directly with `transformers` and `vllm` on Windows hosts.

This AWQ build:

1. Fits inside the **4.5 GB SEALED VRAM window** of the NOESIS specialist sequential-swapping protocol
2. Uses the **GEMM kernel** (compatible with `device_map={"": 0}` — no CPU offload)
3. Is provenance-tracked (`noesis_provenance.json` ships with the model)
4. Is calibrated on orchestration / tool-calling prompts matching the base model's training distribution (ToolScale + GeneralThought-430K)
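For reproducibility, the sketch below shows how a checkpoint with the configuration listed in the model summary (AutoAWQ 0.2.9, INT4, group_size=128, GEMM, zero_point=True, 128 calibration samples at max_seq_len=512, seed 1729) can be produced. It is a minimal sketch rather than the exact NOESIS export script: the in-house calibration prompts are not distributed, so `load_noesis_calibration_prompts()` is a hypothetical stand-in assumed to return them as a list of strings, and the keyword names follow the AutoAWQ 0.2.x `quantize()` API.

```python
import random

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

BASE = "nvidia/Nemotron-Orchestrator-8B"
OUT = "Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"

# NOESIS reproducibility lock (seed from the model summary table)
random.seed(1729)
torch.manual_seed(1729)

# AWQ config matching the model summary: INT4, group_size=128, GEMM, zero_point=True
quant_config = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM",
}

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoAWQForCausalLM.from_pretrained(BASE)

# Hypothetical helper: returns the 128 in-house orchestration / tool-calling
# calibration prompts as a list of plain strings (not shipped with this repo).
calib_prompts = load_noesis_calibration_prompts()

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_prompts,   # AutoAWQ also accepts a list of strings here
    max_calib_samples=128,
    max_calib_seq_len=512,
)

model.save_quantized(OUT)
tokenizer.save_pretrained(OUT)
```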
---

## How to use

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the INT4 checkpoint fully onto GPU 0 (~5 GB VRAM, no CPU offload needed)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

prompt = "Plan a multi-step task: search for recent AWQ papers, then summarize."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Greedy decoding for deterministic orchestration output
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

---

## NOESIS context

In NOESIS this model serves as the **English orchestration teacher** for Specialist **M9-ORCH-4B** during knowledge distillation. It is loaded sequentially (per the NOESIS swapping protocol) onto the RTX 3060, producing top-K=512 logits at temperature=4.0, which are then aggregated in `build_ensemble_labels.py` with a proposed weight of **w=0.22** on the orchestration data shard. A sketch of this logit-extraction step is given in the appendix at the end of this card.

NOESIS specialists overview:

| ID | Role | Size |
| --- | --- | --- |
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| **M9** | **Orchestrator + routing** | **4B/0.8B** |

---

## Acknowledgements & citation

Base model: ToolOrchestra by NVIDIA and the University of Hong Kong.

```bibtex
@misc{toolorchestra,
  title         = {ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author        = {Hongjin Su and Shizhe Diao and Ximing Lu and others},
  year          = {2025},
  eprint        = {2511.21689},
  archivePrefix = {arXiv}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```
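---

## Appendix: teacher logit extraction (illustrative)

The NOESIS distillation pipeline (`build_ensemble_labels.py`) is not part of this repository, so the following is only a minimal sketch of the logit-extraction step described in the NOESIS context section: top-K=512 logits per position, softened at temperature 4.0, later mixed with other teachers at the proposed weight w=0.22. The function name `teacher_topk_probs` and the use of transformers' built-in AWQ loading path are illustrative assumptions, not the actual NOESIS code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

TOP_K = 512            # top-K logits kept per token position
TEMPERATURE = 4.0      # distillation temperature
TEACHER_WEIGHT = 0.22  # proposed ensemble weight on the orchestration shard

model_id = "amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers loads AWQ checkpoints through autoawq when it is installed
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map={"": 0}, torch_dtype=torch.float16
)

def teacher_topk_probs(text: str):
    """Top-K temperature-softened teacher distribution at every position."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits        # [1, seq_len, vocab_size]
    topk = torch.topk(logits / TEMPERATURE, k=TOP_K, dim=-1)
    probs = F.softmax(topk.values, dim=-1)     # renormalised over the kept K
    return topk.indices, probs                 # sparse soft targets

indices, probs = teacher_topk_probs("Route this request to the cheapest capable tool.")
# Downstream, these sparse targets would be combined with the other teachers'
# targets (this teacher contributing TEACHER_WEIGHT) to train M9-ORCH-4B.
```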