Initialize project; model provided by the ModelHub XC community

Model: AMAImedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4 Source: Original Platform
2026-05-01 23:24:18 +08:00
commit 9666f81410
15 changed files with 2312 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,187 @@
+---
+license: apache-2.0
+language:
+- ru
+- en
+- uk
+- be
+library_name: transformers
+tags:
+- awq
+- int4
+- quantization
+- russian
+- qwen3
+- noesis
+- dhcf-fno
+base_model: t-tech/T-pro-it-2.1
+quantized_by: AMAImedia
+pipeline_tag: text-generation
+---
+
+# Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4
+**AWQ INT4 quantization of [t-tech/T-pro-it-2.1](https://huggingface.co/t-tech/T-pro-it-2.1)
+optimized for low-VRAM consumer hardware via streaming inference.**
+
+Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform**
+(framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).
+
+- **Founder:** Ilia Bolotnikov
+- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
+- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
+- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
+- **Telegram:** [@djbionicl](https://t.me/djbionicl)
+- **NOESIS version:** v14.6
+- **License:** Apache-2.0 (inherited from base model — fully permissive, commercial use allowed)
+
+---
+
+## ℹ️ Architecture clarification
+
+T-pro-it-2.1 is a **dense Qwen3-32B model**, NOT a Mixture-of-Experts (MoE).
+Upstream training used a SLERP merge of three GRPO-trained experts as a
+**training-time technique**, but the resulting checkpoint is a single set of
+dense weights with one forward pass and no router. This release follows
+that architecture exactly — there are no expert layers, no gating networks,
+and no conditional computation.
+
+---
+
+## Model summary
+
+| Property | Value |
+| --- | --- |
+| Base model | t-tech/T-pro-it-2.1 |
+| Underlying architecture | Qwen3-32B (decoder-only transformer, 64 layers, **dense**) |
+| Original precision | BF16 safetensors (~64 GB) |
+| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
+| Vocab size | 151936 |
+| Languages | Russian (primary), English, Ukrainian, Belarusian |
+| Disk footprint | ~8.5 GB |
+| Inference VRAM (full-resident) | ~9 GB (does NOT fit 6 GB GPUs without streaming) |
+| Inference VRAM (streaming) | ~3.4 GB peak (per-layer offload — fits 6 GB GPU) |
+| Quantization library | AutoAWQ 0.2.9 |
+| Calibration set | 128 prompts (70% RU / 20% EN / 10% code), max_seq_len=512 |
+| RNG seed | 1729 (NOESIS reproducibility lock) |
+
+---
+
+## Key feature: 6 GB GPU compatibility via streaming
+
+Running this AWQ INT4 checkpoint fully resident on the GPU needs ~9 GB of VRAM,
+which rules out 6 GB and 8 GB cards in the RTX 3060 Laptop / RTX 4060 class.
+**NOESIS ships a per-layer weight-streaming inference path** in which each
+transformer layer is copied from CPU RAM to the GPU on demand, executed, and
+then freed. Peak VRAM stays at **~3.4 GB**, well inside the SEALED 4.5 GB
+NOESIS specialist window.
+
+Throughput on RTX 3060 (i7-12700H, DDR5-4800):
+- Prefill: ~25 tok/s
+- Per-layer load overhead: ~7 ms × 64 layers ≈ 0.45 s per batch (amortized over all tokens in the batch)
+
+Suited to throughput-oriented work: KD logit extraction, batch inference, and
+offline summarization. For low-latency interactive chat, use the same
+checkpoint on a 12 GB+ GPU in standard AutoAWQ inference mode.
+
+---
+
+## How to use
+
+**Standard inference (12 GB+ GPU):**
+
+```python
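+import torch
+from awq import AutoAWQForCausalLM
+from transformers import AutoTokenizer
+
+model_id = "AMAImedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Load the pre-quantized INT4 weights onto GPU 0; activations run in FP16.
+model = AutoAWQForCausalLM.from_quantized(
+    model_id,
+    device_map={"": 0},
+    torch_dtype=torch.float16,
+    fuse_layers=False,
+)
+
+# Russian prompts: "You are T-pro, a helpful assistant." / "Explain how a transformer works."
+messages = [
+    {"role": "system", "content": "Ты T-pro, полезный ассистент."},
+    {"role": "user", "content": "Объясни принцип работы трансформера."},
+]
+# Render the conversation with the model's chat template, then tokenize it.
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(text, return_tensors="pt").to("cuda")
+out = model.generate(
+    **inputs,
+    max_new_tokens=512,
+    do_sample=True,  # make sure sampling is on so temperature / top_p / top_k apply
+    temperature=0.7, top_p=0.8, top_k=20,
+    repetition_penalty=1.0,  # 1.0 means no repetition penalty
+)
+# Decode only the newly generated tokens, skipping the echoed prompt.
+new_tokens = out[0][inputs["input_ids"].shape[1]:]
+print(tokenizer.decode(new_tokens, skip_special_tokens=True))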
+```
+
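+Recommended generation parameters per upstream T-Tech guidance:
+`temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.0`.
+Both `temperature` and `presence_penalty` should be set explicitly.
+Note that `presence_penalty` is a sampling parameter of OpenAI-compatible
+servers such as vLLM; Hugging Face `generate()` does not accept it, which is
+why the snippet above leaves `repetition_penalty` at its neutral value of 1.0.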
+
+**Streaming inference (6 GB GPU):** see the NOESIS `extract_kd_streaming.py`
+reference implementation.
+
+---
+
+## NOESIS context
+
+In NOESIS this model serves as the **Russian-language teacher** for several
+specialists during knowledge distillation:
+
+| Target specialist | Role | Proposed KD weight |
+| --- | --- | --- |
+| M2-DUB-LM-10B | Dubbing LM (Russian segments) | 0.18 |
+| M4-CHAT-10B | Chat / creative writing (Russian) | 0.18 |
+| M9-ORCH-4B | Orchestrator (Russian routing) | 0.15 |
+
+The vocabulary size (151936) matches the NOESIS base (Qwen3-8B), which enables
+**direct logit alignment** without cross-tokenizer projection. This is a
+critical property for clean KD on Russian shards.
+
+---
+
+## Quantization details
+
+Calibration distribution:
+- 70% Russian: chat, technical instruction, scientific exposition, creative writing
+- 20% English: technical & instructional
+- 10% Code: Python, Rust (RU and EN comments)
+
+Quantization performed on:
+- CPU: Intel i7-12700H (14 cores)
+- RAM: 64 GB DDR5
+- GPU: RTX 3060 6 GB (per-layer scale search)
+- Disk offload: NVMe (`B:\noesis_offload_tpro\`, freed after quantization)
+
+Wall time: ~3.5 hours.
+
+---
+
+## Acknowledgements & citation
+
+Base model:
+
+```bibtex
+@misc{stoianov2025tpro20,
+  title  = {T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground},
+  author = {Dmitrii Stoianov and Danil Taranets and Olga Tsymboi and others},
+  year   = {2025},
+  eprint = {2512.10430},
+  archivePrefix = {arXiv}
+}
+```
+
+Quantization & NOESIS integration:
+
+```bibtex
+@misc{noesis_v14,
+  title  = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
+  author = {Bolotnikov, Ilia},
+  year   = {2026},
+  publisher = {AMAImedia},
+  url    = {https://amaimedia.com}
+}
+```
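
The streaming figures quoted in the README above (about 7 ms of load overhead per layer, 64 layers) explain why the streaming path is recommended for batch work rather than interactive chat. The sketch below is a back-of-the-envelope cost model, not part of NOESIS: the `StreamingCost` class and its method names are hypothetical, and it assumes every forward pass streams all 64 layers once with no overlap between transfer and compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingCost:
    """Illustrative cost model for per-layer weight streaming (not a NOESIS API)."""

    num_layers: int = 64            # layer count from the model summary table
    load_ms_per_layer: float = 7.0  # README figure for the RTX 3060 setup

    def overhead_ms_per_pass(self) -> float:
        """Transfer time paid once per forward pass over all layers."""
        return self.num_layers * self.load_ms_per_layer

    def overhead_s_per_pass(self) -> float:
        """Same overhead expressed in seconds."""
        return self.overhead_ms_per_pass() / 1000.0

    def overhead_ms_per_token(self, tokens_per_pass: int) -> float:
        """Transfer time attributed to each token when one pass covers many tokens."""
        if tokens_per_pass <= 0:
            raise ValueError("tokens_per_pass must be positive")
        return self.overhead_ms_per_pass() / tokens_per_pass


cost = StreamingCost()
print(f"{cost.overhead_s_per_pass():.3f} s of load overhead per forward pass")
for n in (1, 64, 512):
    print(f"{n:>4} tokens per pass: {cost.overhead_ms_per_token(n):.3f} ms per token")
```

Under these assumptions, a pass that processes a single token (autoregressive decoding at batch size 1) pays the full load overhead for that one token, while a pass over several hundred prompt tokens spreads the same overhead thinly. This is consistent with the README's advice to keep streaming for KD logit extraction and other batch workloads.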