---
license: other
license_name: nvidia-open-model-license
license_link: LICENSE
language:
- en
library_name: transformers
tags:
- awq
- int4
- quantization
- orchestration
- tool-calling
- noesis
- dhcf-fno
- qwen3
base_model: nvidia/Nemotron-Orchestrator-8B
quantized_by: AMAImedia
pipeline_tag: text-generation
---

# Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4

**AWQ INT4 quantization of [nvidia/Nemotron-Orchestrator-8B](https://huggingface.co/nvidia/Nemotron-Orchestrator-8B), optimized for low-VRAM consumer hardware (RTX 3060, 6 GB).**

Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform** (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).

- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.6
- **Release date:** 2026-04

---

## ⚠️ License notice

This model inherits the **NVIDIA Open Model License** from the upstream `nvidia/Nemotron-Orchestrator-8B`. The base model is designated by NVIDIA as **"for research and development only"**.

This AWQ derivative is published to make the model accessible to the broader research and development community on consumer GPUs. **Users are responsible for compliance with NVIDIA's license terms** — see the `LICENSE` file in this repository for the full text.

By downloading or using this model you agree to the upstream NVIDIA license.

---

## Model summary

| Property | Value |
| --- | --- |
| Base model | nvidia/Nemotron-Orchestrator-8B |
| Underlying architecture | Qwen3-8B (decoder-only transformer, **dense, NOT MoE**) |
| Original precision | FP32 safetensors (~32 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Language | English (per base model) |
| Disk footprint | ~4.5 GB |
| Inference VRAM | ~5.0 GB (fully resident on a 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 in-house orchestration / tool-calling prompts, max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |

A companion **BF16 reference checkpoint** is also published: [amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS](https://huggingface.co/amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS).
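
The table above doubles as a quantization recipe. A minimal sketch of how those settings map onto AutoAWQ 0.2.9's `quantize()` API; the 128 in-house calibration prompts are not published, so `calib_prompts` is a placeholder, and wiring the seed through `random.seed` is an assumption:

```python
# Sketch of the AWQ INT4 recipe implied by the model-summary table.
# Assumes AutoAWQ 0.2.9; calibration prompts and seeding strategy are
# placeholders, not the exact in-house pipeline.
import random

SEED = 1729  # NOESIS reproducibility lock

QUANT_CONFIG = {
    "w_bit": 4,           # INT4 weights
    "q_group_size": 128,  # group_size=128
    "zero_point": True,
    "version": "GEMM",    # GEMM kernel path
}

def quantize_checkpoint(base_model_path: str, calib_prompts: list[str], out_dir: str) -> None:
    """Quantize the FP32 base checkpoint to AWQ INT4 (run on a GPU host)."""
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    random.seed(SEED)  # fix any stochastic calibration sampling (assumed)
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    model = AutoAWQForCausalLM.from_pretrained(base_model_path)
    model.quantize(
        tokenizer,
        quant_config=QUANT_CONFIG,
        calib_data=calib_prompts,  # 128 orchestration / tool-calling prompts
        max_calib_samples=128,
        max_calib_seq_len=512,
    )
    model.save_quantized(out_dir)
    tokenizer.save_pretrained(out_dir)
```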

---

## Why this quantization

The original Nemotron-Orchestrator-8B ships as **FP32 safetensors (~32 GB on disk)**, which does not fit on any consumer GPU. Community quantizations exist (mostly GGUF), but none is calibrated specifically for orchestration / tool-calling scale search, and none is packaged for the **AutoAWQ GEMM kernel path** that integrates directly with `transformers` and `vllm` on Windows hosts.

This AWQ build:

1. Fits inside the **4.5 GB SEALED VRAM window** of the NOESIS specialist sequential-swapping protocol
2. Uses the **GEMM kernel** (compatible with `device_map={"": 0}` — no CPU offload)
3. Is provenance-tracked (`noesis_provenance.json` ships with the model)
4. Is calibrated on orchestration / tool-calling prompts matching the base model's training distribution (ToolScale + GeneralThought-430K)

---

## How to use

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

prompt = "Plan a multi-step task: search for recent AWQ papers, then summarize."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
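
The same checkpoint can also be served through vLLM, which consumes AWQ GEMM weights directly. A minimal sketch; `max_model_len` and `gpu_memory_utilization` are illustrative values for a 6 GB card, not settings validated by this repository:

```python
# Sketch: serving the AWQ checkpoint with vLLM. The heavy work requires
# a CUDA GPU, so it is kept inside serve() rather than run at import time.
MODEL_ID = "amaimedia/Qwen3-8B-Nemotron-Orchestrator-NOESIS-AWQ-INT4"

def serve() -> None:
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=MODEL_ID,
        quantization="awq",          # route through vLLM's AWQ kernels
        dtype="float16",
        max_model_len=4096,          # illustrative: keeps KV cache small
        gpu_memory_utilization=0.9,  # illustrative
    )
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(
        ["Plan a multi-step task: search for recent AWQ papers, then summarize."],
        params,
    )
    print(outputs[0].outputs[0].text)
```

Call `serve()` on a CUDA host with vLLM installed.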

---

## NOESIS context

In NOESIS this model serves as the **English orchestration teacher** for Specialist **M9-ORCH-4B** during knowledge distillation. It is loaded sequentially (per the NOESIS swapping protocol) onto the RTX 3060, producing top-K=512 logits at temperature=4.0, which are then aggregated in `build_ensemble_labels.py` with a proposed weight of **w=0.22** on the orchestration data shard.
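
The aggregation step above can be read as standard soft-label distillation: soften this teacher's top-K logit slice at temperature=4.0, then mix the resulting probabilities into the running ensemble with weight w=0.22. A runnable sketch with illustrative names; the linear blending rule is an assumption, not the actual `build_ensemble_labels.py` implementation:

```python
# Sketch of teacher-logit aggregation for ensemble distillation labels.
# Toy top-K=4 slice stands in for the real top-K=512.
import math

TEMPERATURE = 4.0
ORCH_TEACHER_WEIGHT = 0.22  # proposed weight on the orchestration shard

def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-softened softmax over a top-K logit slice."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def blend(teacher: list[float], ensemble: list[float], w: float) -> list[float]:
    """Mix this teacher's soft labels into the running ensemble (assumed rule)."""
    return [w * t + (1.0 - w) * e for t, e in zip(teacher, ensemble)]

teacher_probs = softmax([2.0, 1.0, 0.5, -1.0], TEMPERATURE)
labels = blend(teacher_probs, [0.25, 0.25, 0.25, 0.25], ORCH_TEACHER_WEIGHT)
assert abs(sum(labels) - 1.0) < 1e-9  # mixed soft labels stay a distribution
```

Because both inputs are probability distributions and the weights sum to 1, the blended labels remain a valid distribution for the distillation loss.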

NOESIS specialists overview:

| ID | Role | Size |
| --- | --- | --- |
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| **M9** | **Orchestrator + routing** | **4B/0.8B** |

---

## Acknowledgements & citation

Base model: ToolOrchestra by NVIDIA & the University of Hong Kong.

```bibtex
@misc{toolorchestra,
  title         = {ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author        = {Hongjin Su and Shizhe Diao and Ximing Lu and others},
  year          = {2025},
  eprint        = {2511.21689},
  archivePrefix = {arXiv}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```