---
license: other
license_name: nvidia-open-model-license
license_link: LICENSE
language:
- en
library_name: transformers
tags:
- awq
- int4
- quantization
- orchestration
- tool-calling
- noesis
- dhcf-fno
- qwen3
base_model: nvidia/Nemotron-Orchestrator-8B
quantized_by: AMAImedia
pipeline_tag: text-generation
---
# Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4
**AWQ INT4 quantization of [nvidia/Nemotron-Orchestrator-8B](https://huggingface.co/nvidia/Nemotron-Orchestrator-8B)
optimized for low-VRAM consumer hardware (RTX 3060 6 GB).**

Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform**
(framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).
- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.6
- **Release date:** 2026-04
---
## ⚠️ License notice
This model inherits the **NVIDIA Open Model License** from the upstream
`nvidia/Nemotron-Orchestrator-8B`. The base model is designated by NVIDIA as
**"for research and development only"**.

This AWQ derivative is published to make the model accessible to the broader
research and development community on consumer GPUs. **Users are responsible
for compliance with NVIDIA's license terms** — see the `LICENSE` file in
this repository for the full text.

By downloading or using this model you agree to the upstream NVIDIA license.

---
## Model summary
| Property | Value |
| --- | --- |
| Base model | nvidia/Nemotron-Orchestrator-8B |
| Underlying architecture | Qwen3-8B (decoder-only transformer, **dense, NOT MoE**) |
| Original precision | FP32 safetensors (~32 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Language | English (per base model) |
| Disk footprint | ~4.5 GB |
| Inference VRAM | ~5.0 GB (full-resident on 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 in-house orchestration / tool-calling prompts, max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |
A companion **BF16 reference checkpoint** is also published:
[amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS](https://huggingface.co/amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS).
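
For reference, the settings in the table above map onto AutoAWQ roughly as in the sketch below. This is a hedged reproduction recipe, not the exact export script: the 128 in-house calibration prompts are not published, so `calib_prompts` is a placeholder, and the `max_calib_*` argument names assume AutoAWQ 0.2.x.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

base_id = "nvidia/Nemotron-Orchestrator-8B"
out_dir = "Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"

# Settings from the model summary table: INT4, group_size=128, GEMM, zero_point=True.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Placeholder for the 128 in-house orchestration / tool-calling calibration prompts.
calib_prompts = ["Plan a multi-step task: call a search tool, then summarize the results."] * 128

torch.manual_seed(1729)  # NOESIS reproducibility lock

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoAWQForCausalLM.from_pretrained(base_id)

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_prompts,
    max_calib_samples=128,
    max_calib_seq_len=512,
)

model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)
```
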
---
## Why this quantization
The original Nemotron-Orchestrator-8B ships in **FP32 (~32 GB on disk)**, which
does not fit on any consumer GPU. Community quantizations exist (mostly GGUF),
but none is calibrated specifically for orchestration / tool-calling scale
search, and none is packaged for the **AutoAWQ GEMM kernel path** that integrates
directly with `transformers` and `vllm` on Windows hosts.

This AWQ build:

1. Fits inside the **4.5 GB SEALED VRAM window** of the NOESIS specialist
   sequential-swapping protocol (see the sketch after this list)
2. Uses the **GEMM kernel** (compatible with `device_map={"":0}` — no CPU offload)
3. Ships with provenance tracking (`noesis_provenance.json` is included with the model)
4. Is calibrated on orchestration / tool-calling prompts matching the base model's
   training distribution (ToolScale + GeneralThought-430K)
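
A minimal sketch of the sequential-swapping idea follows. It is illustrative only; the actual NOESIS protocol, its scheduling, and its VRAM accounting are not published in this repository.

```python
import gc

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


def run_specialist(model_id: str, prompts: list[str]) -> list[str]:
    """Load one AWQ specialist fully onto GPU 0, run it, then release the VRAM."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoAWQForCausalLM.from_quantized(model_id, device_map={"": 0}, fuse_layers=False)

    texts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))

    # Free the specialist so the next one fits in the same ~4.5 GB window.
    del model
    gc.collect()
    torch.cuda.empty_cache()
    return texts
```
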
---
## How to use
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"

# Load the AWQ INT4 checkpoint fully onto GPU 0 (no CPU offload).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

# Greedy decoding keeps orchestration-style plans deterministic.
prompt = "Plan a multi-step task: search for recent AWQ papers, then summarize."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
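
The checkpoint can also be served through vLLM's AWQ support. A minimal sketch, assuming a recent vLLM build; `max_model_len` is an illustrative value and may need lowering to fit the KV cache in 6 GB:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4",
    quantization="awq",   # AWQ INT4 weights
    dtype="half",
    max_model_len=4096,   # illustrative; reduce if VRAM is tight
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Plan a multi-step task: search for recent AWQ papers, then summarize."],
    params,
)
print(outputs[0].outputs[0].text)
```
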
---
## NOESIS context
In NOESIS this model serves as the **English orchestration teacher** for
Specialist **M9-ORCH-4B** during knowledge distillation. It is loaded
sequentially (per the NOESIS swapping protocol) onto the RTX 3060,
producing top-K=512 logits at temperature=4.0, which are then aggregated
in `build_ensemble_labels.py` with proposed weight **w=0.22** on the
orchestration data shard.
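
As a rough illustration of the teacher-side step (the real `build_ensemble_labels.py` is not published here, so the function name, shapes, and renormalization choice below are assumptions):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def top_k_teacher_logits(model, input_ids, k=512, temperature=4.0):
    """Return per-position top-K indices and temperature-softened log-probs.

    Illustrative sketch: logits are scaled by the distillation temperature,
    the K largest entries per position are kept, and the kept slice is
    renormalized so the student can be trained against sparse soft labels.
    """
    logits = model(input_ids=input_ids).logits          # (batch, seq_len, vocab)
    scaled = logits / temperature                        # temperature=4.0 softening
    values, indices = torch.topk(scaled, k, dim=-1)      # keep top-K=512 per position
    log_probs = F.log_softmax(values, dim=-1)            # renormalize over the kept K
    return indices, log_probs
```

These sparse teacher labels would then be combined with the other specialists' labels (proposed weight w=0.22 for this model on the orchestration shard).
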
NOESIS specialists overview:

| ID | Role | Size |
| --- | --- | --- |
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| **M9** | **Orchestrator + routing** | **4B/0.8B** |
---
## Acknowledgements & citation
Base model: ToolOrchestra by NVIDIA & the University of Hong Kong.
```bibtex
@misc{toolorchestra,
  title         = {ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author        = {Hongjin Su and Shizhe Diao and Ximing Lu and others},
  year          = {2025},
  eprint        = {2511.21689},
  archivePrefix = {arXiv}
}
```
Quantization & NOESIS integration:
```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```