
license: apache-2.0
language: ru, en, uk, be
library_name: transformers
tags: awq, int4, quantization, russian, qwen3, noesis, dhcf-fno
base_model: t-tech/T-pro-it-2.1
quantized_by: AMAImedia
pipeline_tag: text-generation

Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4

AWQ INT4 quantization of t-tech/T-pro-it-2.1 optimized for low-VRAM consumer hardware via streaming inference.

Released as part of the NOESIS Professional Multilingual Dubbing Automation Platform (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).


Architecture clarification

T-pro-it-2.1 is a dense Qwen3-32B model, NOT a Mixture-of-Experts (MoE). Upstream training used a SLERP merge of three GRPO-trained experts as a training-time technique, but the resulting checkpoint is a single set of dense weights with one forward pass and no router. This release follows that architecture exactly — there are no expert layers, no gating networks, and no conditional computation.
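This is easy to verify from the checkpoint itself: the config carries the dense Qwen3 signature and no router fields. A minimal sketch using the standard transformers API (expected values taken from the model summary below):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("amaimedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4")

print(cfg.model_type)         # "qwen3" (dense), not "qwen3_moe"
print(cfg.num_hidden_layers)  # 64
print(cfg.vocab_size)         # 151936

# MoE configs carry router fields such as num_experts / num_experts_per_tok;
# a dense checkpoint defines neither.
assert not hasattr(cfg, "num_experts")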


Model summary

  • Base model: t-tech/T-pro-it-2.1
  • Underlying architecture: Qwen3-32B (decoder-only transformer, 64 layers, dense)
  • Original precision: BF16 safetensors (~64 GB)
  • Quantized precision: AWQ INT4 (group_size=128, GEMM, zero_point=True)
  • Vocab size: 151936
  • Languages: Russian (primary), English, Ukrainian, Belarusian
  • Disk footprint: ~8.5 GB
  • Inference VRAM (full-resident): ~9 GB (does NOT fit 6 GB GPUs without streaming)
  • Inference VRAM (streaming): ~3.4 GB peak (per-layer offload; fits a 6 GB GPU)
  • Quantization library: AutoAWQ 0.2.9
  • Calibration set: 128 prompts (70% RU / 20% EN / 10% code), max_seq_len=512
  • RNG seed: 1729 (NOESIS reproducibility lock)

Key feature: 6 GB GPU compatibility via streaming

Standard AWQ-INT4 of a 32B model needs ~9 GB VRAM, which excludes RTX 3060 / 4060 class hardware. NOESIS ships a per-layer weight-streaming inference path where individual transformer layers are streamed from CPU RAM onto the GPU on demand, executed, and freed. Peak VRAM stays at ~3.4 GB, well inside the SEALED 4.5 GB NOESIS specialist window.
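The core pattern is simple to state: keep all layer weights in host RAM, and for each forward pass move one decoder layer at a time onto the GPU. The loop below is an illustrative sketch of that pattern, not the shipped NOESIS code (extract_kd_streaming.py, referenced under "How to use"); attention kwargs, KV-cache handling, and pinned-memory prefetch are omitted:

import torch

@torch.no_grad()
def run_layers_streamed(layers, hidden_states, **layer_kwargs):
    """Execute a CPU-resident stack of decoder layers one at a time on the GPU.

    Only one layer's weights occupy VRAM at any moment, so peak usage is
    dominated by a single layer plus activations rather than all 64 layers.
    """
    for layer in layers:
        layer.to("cuda")   # stream weights host -> device (~7 ms/layer here)
        out = layer(hidden_states, **layer_kwargs)
        hidden_states = out[0] if isinstance(out, tuple) else out
        layer.to("cpu")    # release VRAM before loading the next layer
    torch.cuda.empty_cache()
    return hidden_states

In practice the next layer can be prefetched from pinned host memory on a separate CUDA stream so the per-layer copy overlaps with compute.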

Throughput on RTX 3060 (i7-12700H, DDR5-4800):

  • Prefill: ~25 tok/s
  • Per-layer load overhead: ~7 ms × 64 layers = 0.45 s amortized per batch

Suitable for: KD logits extraction, batch inference, offline summarization. For low-latency interactive chat, use the same checkpoint on a 12 GB+ GPU in standard AutoAWQ inference mode.


How to use

Standard inference (12 GB+ GPU):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the INT4 weights fully resident on GPU 0 (~9 GB VRAM).
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

messages = [
    # "You are T-pro, a helpful assistant."
    {"role": "system", "content": "Ты T-pro, полезный ассистент."},
    # "Explain how a transformer works."
    {"role": "user", "content": "Объясни принцип работы трансформера."},
]

# Render the Qwen3 chat template, tokenize, and sample.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
out = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p/top_k to take effect
    temperature=0.7, top_p=0.8, top_k=20,
    repetition_penalty=1.0,  # neutral; see the sampling note below
)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Recommended generation parameters, per upstream T-Tech guidance: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.0. Set temperature and presence_penalty explicitly rather than relying on defaults. Note that presence_penalty is an OpenAI/vLLM-style sampler parameter; the transformers generate API does not expose it, which is why the snippet above passes the neutral repetition_penalty=1.0.
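When serving the checkpoint with vLLM, which does expose presence_penalty, the full recommended sampler configuration can be applied directly. A minimal sketch, assuming a recent vLLM with built-in AWQ support and a 12 GB+ GPU (prompt and token budget are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(
    model="amaimedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4",
    quantization="awq",  # use vLLM's AWQ kernels for the INT4 weights
)
params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    presence_penalty=1.0,  # upstream T-Tech recommendation
    max_tokens=512,
)
# "Explain how a transformer works."
outputs = llm.chat(
    [{"role": "user", "content": "Объясни принцип работы трансформера."}],
    params,
)
print(outputs[0].outputs[0].text)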

Streaming inference (6 GB GPU): see the NOESIS extract_kd_streaming.py reference implementation.


NOESIS context

In NOESIS this model serves as the Russian-language teacher for several specialists during knowledge distillation:

  • M2-DUB-LM-10B: Dubbing LM (Russian segments), proposed KD weight 0.18
  • M4-CHAT-10B: Chat / creative writing (Russian), proposed KD weight 0.18
  • M9-ORCH-4B: Orchestrator (Russian routing), proposed KD weight 0.15

The vocabulary (151936 tokens) matches the NOESIS base (Qwen3-8B) exactly, enabling direct logit alignment without cross-tokenizer projection, a critical property for clean KD on Russian shards.
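Because the two vocabularies match token for token, teacher and student logits can be compared index by index, and the standard temperature-scaled KL distillation loss applies without any projection layer. A minimal sketch (function name and temperature value are illustrative, not NOESIS internals):

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence between aligned logit tensors.

    Both tensors are (batch, seq_len, 151936); no vocabulary projection is
    needed because teacher and student share the Qwen3 tokenizer.
    """
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    # batchmean reduction with the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (standard Hinton-style KD scaling)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)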


Quantization details

Calibration distribution:

  • 70% Russian: chat, technical instruction, scientific exposition, creative writing
  • 20% English: technical & instructional
  • 10% Code: Python, Rust (RU and EN comments)
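For reference, the settings above and in the model summary map onto AutoAWQ 0.2.x roughly as follows. This is a sketch of the standard AutoAWQ workflow, not the exact NOESIS script; calib_texts is a hypothetical stand-in for the unpublished 128-prompt calibration set:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_id = "t-tech/T-pro-it-2.1"
out_dir = "Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoAWQForCausalLM.from_pretrained(base_id)

quant_config = {
    "zero_point": True,   # asymmetric quantization, per the model summary
    "q_group_size": 128,  # group_size=128
    "w_bit": 4,           # INT4 weights
    "version": "GEMM",    # GEMM kernel variant
}

# calib_texts: list of raw strings sampled 70% RU / 20% EN / 10% code
# with seed 1729 (the actual prompt set is not published).
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_texts,
    max_calib_samples=128,
    max_calib_seq_len=512,
)
model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)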

Quantization performed on:

  • CPU: Intel i7-12700H (14 cores)
  • RAM: 64 GB DDR5
  • GPU: RTX 3060 6 GB (per-layer scale search)
  • Disk offload: NVMe (B:\noesis_offload_tpro\, freed after quantization)

Wall time: ~3.5 hours.


Acknowledgements & citation

Base model:

@misc{stoianov2025tpro20,
  title  = {T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground},
  author = {Dmitrii Stoianov and Danil Taranets and Olga Tsymboi and others},
  year   = {2025},
  eprint = {2512.10430},
  archivePrefix = {arXiv}
}

Quantization & NOESIS integration:

@misc{noesis_v14,
  title  = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author = {Bolotnikov, Ilia},
  year   = {2026},
  publisher = {AMAImedia},
  url    = {https://amaimedia.com}
}