---
license: apache-2.0
language:
- ru
- en
- uk
- be
library_name: transformers
tags:
- awq
- int4
- quantization
- russian
- qwen3
- noesis
- dhcf-fno
base_model: t-tech/T-pro-it-2.1
quantized_by: AMAImedia
pipeline_tag: text-generation
---

# Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4

**AWQ INT4 quantization of [t-tech/T-pro-it-2.1](https://huggingface.co/t-tech/T-pro-it-2.1), optimized for low-VRAM consumer hardware via streaming inference.**

Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform** (framework: DHCF-FNO, Deterministic Hybrid Control Framework for Frozen Neural Operators).

- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.6
- **License:** Apache-2.0 (inherited from the base model; fully permissive, commercial use allowed)

---

## ℹ️ Architecture clarification

T-pro-it-2.1 is a **dense Qwen3-32B model**, NOT a Mixture-of-Experts (MoE). Upstream training used a SLERP merge of three GRPO-trained experts as a **training-time technique**, but the resulting checkpoint is a single set of dense weights with one forward pass and no router. This release follows that architecture exactly: there are no expert layers, no gating networks, and no conditional computation.

---

## Model summary

| Property | Value |
| --- | --- |
| Base model | t-tech/T-pro-it-2.1 |
| Underlying architecture | Qwen3-32B (decoder-only transformer, 64 layers, **dense**) |
| Original precision | BF16 safetensors (~64 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Languages | Russian (primary), English, Ukrainian, Belarusian |
| Disk footprint | ~8.5 GB |
| Inference VRAM (full-resident) | ~9 GB (does NOT fit 6 GB GPUs without streaming) |
| Inference VRAM (streaming) | ~3.4 GB peak (per-layer offload; fits a 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 prompts (70% RU / 20% EN / 10% code), max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |

---

## Key feature: 6 GB GPU compatibility via streaming

A standard AWQ-INT4 build of a 32B model needs ~9 GB of VRAM, which excludes RTX 3060 / 4060 class hardware. **NOESIS ships a per-layer weight-streaming inference path** in which individual transformer layers are streamed from CPU RAM onto the GPU on demand, executed, and freed. Peak VRAM stays at **~3.4 GB**, well inside the SEALED 4.5 GB NOESIS specialist window. A minimal sketch of the idea follows below.

Throughput on an RTX 3060 (i7-12700H, DDR5-4800):

- Prefill: ~25 tok/s
- Per-layer load overhead: ~7 ms × 64 layers ≈ 0.45 s amortized per batch

Suitable for: KD logits extraction, batch inference, offline summarization. For low-latency interactive chat, use the same checkpoint on a 12 GB+ GPU in standard AutoAWQ inference mode.
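For orientation, here is a minimal sketch of the streaming idea using standard PyTorch forward hooks. It is **not** the NOESIS implementation (that lives in `extract_kd_streaming.py`): the helper name `enable_layer_streaming` is ours, and it assumes a Qwen3-family causal LM loaded onto CPU via `transformers`, with decoder layers reachable at `model.model.layers`.

```python
# Minimal sketch of per-layer weight streaming; illustrative only, NOT the
# NOESIS extract_kd_streaming.py path. Assumes a Qwen3-family causal LM held
# in CPU RAM, with decoder layers at model.model.layers.
import torch

def enable_layer_streaming(model, device: str = "cuda"):
    """Keep only ~1 of the 64 decoder layers resident on the GPU at a time."""
    # Small, always-needed modules stay resident on the GPU.
    model.model.embed_tokens.to(device)
    model.model.norm.to(device)
    model.lm_head.to(device)
    # Recent transformers versions keep a shared rotary-embedding module here.
    rotary = getattr(model.model, "rotary_emb", None)
    if rotary is not None:
        rotary.to(device)

    def pre_hook(module, args):
        # Stream this layer's weights onto the GPU just before it runs.
        module.to(device)

    def post_hook(module, args, output):
        # Evict the layer back to CPU RAM immediately after its forward pass.
        module.to("cpu")
        return output

    for layer in model.model.layers:
        layer.register_forward_pre_hook(pre_hook)
        layer.register_forward_hook(post_hook)
```

After `enable_layer_streaming(model)`, `model.generate(...)` runs unchanged; the cost is the per-layer PCIe transfer (~7 ms × 64 layers on the reference machine above). A production path would additionally overlap transfers with compute (e.g. prefetching layer *i*+1 on a side CUDA stream), which this sketch omits.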
---

## How to use

**Standard inference (12 GB+ GPU):**

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

messages = [
    # "You are T-pro, a helpful assistant."
    {"role": "system", "content": "Ты T-pro, полезный ассистент."},
    # "Explain how a transformer works."
    {"role": "user", "content": "Объясни принцип работы трансформера."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")

out = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.0,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Recommended generation parameters per upstream T-Tech guidance: `temperature=0.7`, `top_p=0.8`, `top_k=20`, `presence_penalty=1.0`; set both `temperature` and `presence_penalty` explicitly. Note that `presence_penalty` is exposed by serving engines such as vLLM and SGLang; Hugging Face `generate()` has no equivalent parameter, which is why the snippet above passes the neutral `repetition_penalty=1.0` instead.

**Streaming inference (6 GB GPU):** see the NOESIS `extract_kd_streaming.py` reference implementation (and the sketch in the streaming section above).

---

## NOESIS context

In NOESIS this model serves as the **Russian-language teacher** for several specialists during knowledge distillation (a toy loss sketch is given in the appendix at the end of this card):

| Target specialist | Role | Proposed KD weight |
| --- | --- | --- |
| M2-DUB-LM-10B | Dubbing LM (Russian segments) | 0.18 |
| M4-CHAT-10B | Chat / creative writing (Russian) | 0.18 |
| M9-ORCH-4B | Orchestrator (Russian routing) | 0.15 |

The vocabulary (151936 tokens) is identical to that of the NOESIS base (Qwen3-8B), enabling **direct logit alignment** without cross-tokenizer projection, a critical property for clean KD on Russian shards.

---

## Quantization details

Calibration distribution:

- 70% Russian: chat, technical instruction, scientific exposition, creative writing
- 20% English: technical & instructional
- 10% Code: Python, Rust (RU and EN comments)

Quantization was performed on:

- CPU: Intel i7-12700H (14 cores)
- RAM: 64 GB DDR5
- GPU: RTX 3060 6 GB (per-layer scale search)
- Disk offload: NVMe (`B:\noesis_offload_tpro\`, freed after quantization)

Wall time: ~3.5 hours. A reproduction sketch is given in the appendix at the end of this card.

---

## Acknowledgements & citation

Base model:

```bibtex
@misc{stoianov2025tpro20,
  title         = {T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground},
  author        = {Dmitrii Stoianov and Danil Taranets and Olga Tsymboi and others},
  year          = {2025},
  eprint        = {2512.10430},
  archivePrefix = {arXiv}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```
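---

## Appendix: toy KD logit-alignment sketch

This is a toy illustration of what "direct logit alignment" means in practice, not NOESIS code: because teacher and student share the 151936-token Qwen3 vocabulary, a KL term can be computed directly between their logit tensors. The function name `kd_loss` and the temperature value are illustrative choices of ours.

```python
# Toy illustration of direct logit alignment; names are ours, not NOESIS code.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) on temperature-softened next-token distributions.

    Both inputs are (batch, seq_len, 151936); because teacher and student
    share the Qwen3 vocabulary, the last axes line up token-for-token and
    no cross-tokenizer projection is needed.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```

The "Proposed KD weight" column in the NOESIS context table would then, presumably, scale a term of this shape per specialist in the overall distillation objective.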
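---

## Appendix: quantization reproduction sketch

A hedged sketch of how the export described in "Quantization details" could be reproduced with the AutoAWQ 0.2.x API. The calibration list is a placeholder (the 128 NOESIS prompts are not published), and exactly where the 1729 seed is applied in the real pipeline is an assumption on our part.

```python
# Hedged reproduction sketch (AutoAWQ 0.2.x API). calib_texts is a placeholder:
# the real run used 128 unpublished prompts (70% RU / 20% EN / 10% code).
import random

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# NOESIS "reproducibility lock"; exact seeding points are an assumption here.
random.seed(1729)
torch.manual_seed(1729)

base_id = "t-tech/T-pro-it-2.1"
out_dir = "Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"

# Matches the config in the model summary table above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

tokenizer = AutoTokenizer.from_pretrained(base_id)
# The real run additionally used NVMe disk offload for the BF16 weights;
# that is omitted here for brevity.
model = AutoAWQForCausalLM.from_pretrained(base_id, device_map="cpu")

# Placeholder calibration data; replace with the real RU/EN/code prompt mix.
calib_texts = ["Explain how a transformer works."] * 128

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_texts,
    max_calib_samples=128,
    max_calib_seq_len=512,
)
model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)
```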