---
license: other
license_name: nvidia-open-model-license
license_link: LICENSE
language:
- en
library_name: transformers
tags:
- awq
- int4
- quantization
- orchestration
- tool-calling
- noesis
- dhcf-fno
- qwen3
base_model: nvidia/Nemotron-Orchestrator-8B
quantized_by: AMAImedia
pipeline_tag: text-generation
---

# Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4

**AWQ INT4 quantization of [nvidia/Nemotron-Orchestrator-8B](https://huggingface.co/nvidia/Nemotron-Orchestrator-8B), optimized for low-VRAM consumer hardware (RTX 3060, 6 GB).**

Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform** (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).

- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.6
- **Release date:** 2026-04

---

## ⚠️ License notice

This model inherits the **NVIDIA Open Model License** from the upstream `nvidia/Nemotron-Orchestrator-8B`. The base model is designated by NVIDIA as **"for research and development only"**.

This AWQ derivative is published to make the model accessible to the broader research and development community on consumer GPUs. **Users are responsible for compliance with NVIDIA's license terms** — see the `LICENSE` file in this repository for the full text.

By downloading or using this model you agree to the upstream NVIDIA license.

---

## Model summary

| Property | Value |
| --- | --- |
| Base model | nvidia/Nemotron-Orchestrator-8B |
| Underlying architecture | Qwen3-8B (decoder-only transformer, **dense, NOT MoE**) |
| Original precision | FP32 safetensors (~32 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Language | English (per base model) |
| Disk footprint | ~4.5 GB |
| Inference VRAM | ~5.0 GB (fully resident on a 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 in-house orchestration / tool-calling prompts, max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |

A companion **BF16 reference checkpoint** is also published: [amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS](https://huggingface.co/amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS).
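
The table above doubles as a quantization recipe. A minimal sketch of how those settings map onto AutoAWQ 0.2.9's `quantize()` API; the 128 in-house calibration prompts are not published, so `calib_prompts` is a placeholder, and wiring the seed through `random.seed` is an assumption:

```python
# Sketch of the AWQ INT4 recipe implied by the model-summary table.
# Assumes AutoAWQ 0.2.9; calibration prompts and seeding strategy are
# placeholders, not the exact in-house pipeline.
import random

SEED = 1729  # NOESIS reproducibility lock

QUANT_CONFIG = {
    "w_bit": 4,           # INT4 weights
    "q_group_size": 128,  # group_size=128
    "zero_point": True,
    "version": "GEMM",    # GEMM kernel path
}

def quantize_checkpoint(base_model_path: str, calib_prompts: list[str], out_dir: str) -> None:
    """Quantize the FP32 base checkpoint to AWQ INT4 (run on a GPU host)."""
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    random.seed(SEED)  # fix any stochastic calibration sampling (assumed)
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    model = AutoAWQForCausalLM.from_pretrained(base_model_path)
    model.quantize(
        tokenizer,
        quant_config=QUANT_CONFIG,
        calib_data=calib_prompts,  # 128 orchestration / tool-calling prompts
        max_calib_samples=128,
        max_calib_seq_len=512,
    )
    model.save_quantized(out_dir)
    tokenizer.save_pretrained(out_dir)
```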

---

## Why this quantization

The original Nemotron-Orchestrator-8B ships as **FP32 safetensors (~32 GB on disk)**, which does not fit on any consumer GPU. Community quantizations exist (mostly GGUF), but none is calibrated specifically for orchestration / tool-calling scale search, and none is packaged for the **AutoAWQ GEMM kernel path** that integrates directly with `transformers` and `vllm` on Windows hosts.

This AWQ build:

1. Fits inside the **4.5 GB SEALED VRAM window** of the NOESIS specialist sequential-swapping protocol
2. Uses the **GEMM kernel** (compatible with `device_map={"": 0}` — no CPU offload)
3. Is provenance-tracked (`noesis_provenance.json` ships with the model)
4. Is calibrated on orchestration / tool-calling prompts matching the base model's training distribution (ToolScale + GeneralThought-430K)

---

## How to use

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

prompt = "Plan a multi-step task: search for recent AWQ papers, then summarize."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
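
The same checkpoint can also be served through vLLM, which consumes AWQ GEMM weights directly. A minimal sketch; `max_model_len` and `gpu_memory_utilization` are illustrative values for a 6 GB card, not settings validated by this repository:

```python
# Sketch: serving the AWQ checkpoint with vLLM. The heavy work requires
# a CUDA GPU, so it is kept inside serve() rather than run at import time.
MODEL_ID = "amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"

def serve() -> None:
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=MODEL_ID,
        quantization="awq",          # route through vLLM's AWQ kernels
        dtype="float16",
        max_model_len=4096,          # illustrative: keeps KV cache small
        gpu_memory_utilization=0.9,  # illustrative
    )
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(
        ["Plan a multi-step task: search for recent AWQ papers, then summarize."],
        params,
    )
    print(outputs[0].outputs[0].text)
```

Call `serve()` on a CUDA host with vLLM installed.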

---

## NOESIS context

In NOESIS this model serves as the **English orchestration teacher** for Specialist **M9-ORCH-4B** during knowledge distillation. It is loaded sequentially (per the NOESIS swapping protocol) onto the RTX 3060, producing top-K=512 logits at temperature=4.0, which are then aggregated in `build_ensemble_labels.py` with a proposed weight of **w=0.22** on the orchestration data shard.
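
The aggregation step above can be read as standard soft-label distillation: soften this teacher's top-K logit slice at temperature=4.0, then mix the resulting probabilities into the running ensemble with weight w=0.22. A runnable sketch with illustrative names; the linear blending rule is an assumption, not the actual `build_ensemble_labels.py` implementation:

```python
# Sketch of teacher-logit aggregation for ensemble distillation labels.
# Toy top-K=4 slice stands in for the real top-K=512.
import math

TEMPERATURE = 4.0
ORCH_TEACHER_WEIGHT = 0.22  # proposed weight on the orchestration shard

def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-softened softmax over a top-K logit slice."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def blend(teacher: list[float], ensemble: list[float], w: float) -> list[float]:
    """Mix this teacher's soft labels into the running ensemble (assumed rule)."""
    return [w * t + (1.0 - w) * e for t, e in zip(teacher, ensemble)]

teacher_probs = softmax([2.0, 1.0, 0.5, -1.0], TEMPERATURE)
labels = blend(teacher_probs, [0.25, 0.25, 0.25, 0.25], ORCH_TEACHER_WEIGHT)
assert abs(sum(labels) - 1.0) < 1e-9  # mixed soft labels stay a distribution
```

Because both inputs are probability distributions and the weights sum to 1, the blended labels remain a valid distribution for the distillation loss.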

NOESIS specialists overview:

| ID | Role | Size |
| --- | --- | --- |
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| **M9** | **Orchestrator + routing** | **4B/0.8B** |

---

## Acknowledgements & citation

Base model: ToolOrchestra by NVIDIA & the University of Hong Kong.

```bibtex
@misc{toolorchestra,
  title         = {ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author        = {Hongjin Su and Shizhe Diao and Ximing Lu and others},
  year          = {2025},
  eprint        = {2511.21689},
  archivePrefix = {arXiv}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```