---
license: apache-2.0
language:
- ru
- en
- uk
- be
library_name: transformers
tags:
- awq
- int4
- quantization
- russian
- qwen3
- noesis
- dhcf-fno
base_model: t-tech/T-pro-it-2.1
quantized_by: AMAImedia
pipeline_tag: text-generation
---
# Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4
**AWQ INT4 quantization of [t-tech/T-pro-it-2.1](https://huggingface.co/t-tech/T-pro-it-2.1)
optimized for low-VRAM consumer hardware via streaming inference.**
Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform**
(framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).
- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.6
- **License:** Apache-2.0 (inherited from base model — fully permissive, commercial use allowed)
---
## Architecture clarification
T-pro-it-2.1 is a **dense Qwen3-32B model**, NOT a Mixture-of-Experts (MoE).
Upstream training used a SLERP merge of three GRPO-trained experts as a
**training-time technique**, but the resulting checkpoint is a single set of
dense weights with one forward pass and no router. This release follows
that architecture exactly — there are no expert layers, no gating networks,
and no conditional computation.

---
## Model summary
| Property | Value |
| --- | --- |
| Base model | t-tech/T-pro-it-2.1 |
| Underlying architecture | Qwen3-32B (decoder-only transformer, 64 layers, **dense**) |
| Original precision | BF16 safetensors (~64 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Languages | Russian (primary), English, Ukrainian, Belarusian |
| Disk footprint | ~8.5 GB |
| Inference VRAM (full-resident) | ~9 GB (does NOT fit 6 GB GPUs without streaming) |
| Inference VRAM (streaming) | ~3.4 GB peak (per-layer offload — fits 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 prompts (70% RU / 20% EN / 10% code), max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |
---
## Key feature: 6 GB GPU compatibility via streaming
A fully resident AWQ INT4 copy of a 32B model needs ~9 GB of VRAM, which excludes
6–8 GB cards of the RTX 3060 / 4060 class. **NOESIS ships a per-layer weight-streaming inference path**
in which individual transformer layers are streamed from CPU RAM onto the GPU
on demand, executed, and freed. Peak VRAM stays at **~3.4 GB**, well inside
the SEALED 4.5 GB NOESIS specialist window.

Throughput on an RTX 3060 6 GB (i7-12700H, DDR5-4800):

- Prefill: ~25 tok/s
- Per-layer load overhead: ~7 ms × 64 layers ≈ 0.45 s, amortized per batch

Suitable for: KD logits extraction, batch inference, offline summarization.
For low-latency interactive chat, use the same checkpoint on a 12 GB+ GPU
in standard AutoAWQ inference mode.

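
The load, execute, free cycle described in this section can be illustrated with a minimal PyTorch sketch. This is not the NOESIS implementation: the small `nn.Linear` blocks are hypothetical stand-ins for the 64 transformer layers, `streamed_forward` is an illustrative name, and a real streaming path would also need to handle AWQ kernels, pinned memory, and prefetching. The sketch falls back to the CPU when no GPU is present, so it only shows the control flow, not the VRAM savings.

```python
import torch
import torch.nn as nn


def streamed_forward(layers, hidden, device):
    """Apply `layers` in order with at most one layer's weights on `device`."""
    hidden = hidden.to(device)
    with torch.no_grad():
        for layer in layers:
            layer.to(device)        # stream this layer's weights in from CPU RAM
            hidden = layer(hidden)  # execute the layer
            layer.to("cpu")         # move the weights back, releasing device memory
    return hidden.cpu()


# Toy stand-ins for transformer blocks (hypothetical sizes, not Qwen3's).
torch.manual_seed(1729)
layers = [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(4)]
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(2, 8, 64)  # (batch, sequence, hidden)
out = streamed_forward(layers, x, device)
print(out.shape)  # torch.Size([2, 8, 64])
```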
---
## How to use
**Standard inference (12 GB+ GPU):**
```python
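from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "AMAImedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the whole AWQ checkpoint onto GPU 0 (needs a 12 GB+ card).
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

messages = [
    # "You are T-pro, a helpful assistant."
    {"role": "system", "content": "Ты T-pro, полезный ассистент."},
    # "Explain how a transformer works."
    {"role": "user", "content": "Объясни принцип работы трансформера."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")

out = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # make sure sampling is on so temperature/top_p/top_k take effect
    temperature=0.7, top_p=0.8, top_k=20,
    repetition_penalty=1.0,  # neutral value; generate() has no presence_penalty argument
)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))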
```
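Upstream T-Tech guidance recommends
`temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.0`,
with both `temperature` and `presence_penalty` set explicitly.
Note that `presence_penalty` is a sampling parameter of OpenAI-compatible
servers such as vLLM; `generate()` in Transformers does not accept it, so the
example above passes only the neutral `repetition_penalty=1.0`.
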
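
When the checkpoint is served behind an OpenAI-compatible endpoint, the recommended sampling parameters, including `presence_penalty`, can be sent in the request body. The sketch below only builds that body; it assumes a server such as vLLM that accepts `top_k` as an extra field (it is not part of the OpenAI schema), and the served model name is an assumption rather than something this card specifies.

```python
import json

# Hypothetical request body for an OpenAI-compatible /v1/chat/completions
# endpoint. The model name must match whatever name the server registers.
payload = {
    "model": "AMAImedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4",
    "messages": [
        {"role": "system", "content": "Ты T-pro, полезный ассистент."},
        {"role": "user", "content": "Объясни принцип работы трансформера."},
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,              # extra sampling field (assumed to be accepted by the server)
    "presence_penalty": 1.0,  # set explicitly, per the upstream recommendation
}

print(json.dumps(payload, ensure_ascii=False, indent=2))
```
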
**Streaming inference (6 GB GPU):** see the NOESIS `extract_kd_streaming.py`
reference implementation.

---
## NOESIS context
In NOESIS this model serves as the **Russian-language teacher** for several
specialists during knowledge distillation:
| Target specialist | Role | Proposed KD weight |
| --- | --- | --- |
| M2-DUB-LM-10B | Dubbing LM (Russian segments) | 0.18 |
| M4-CHAT-10B | Chat / creative writing (Russian) | 0.18 |
| M9-ORCH-4B | Orchestrator (Russian routing) | 0.15 |

The teacher's vocabulary size (151936) matches that of the NOESIS base (Qwen3-8B),
which enables **direct logit alignment** without cross-tokenizer projection.
This is a critical property for clean KD on Russian shards.

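
To show why a shared vocabulary matters, here is a minimal pure-Python sketch of a per-token distillation term. It is an assumption-laden illustration, not the NOESIS loss: a 4-entry toy vocabulary stands in for the 151936 entries, the plain KL divergence and the `temperature` value are choices made for the example, and the KD weights from the table are not modeled. Because teacher and student logits index the same vocabulary, they can be compared position by position.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one logit vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position over a shared vocabulary."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("vocab sizes differ; a cross-tokenizer projection would be needed")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy logits over a 4-entry vocabulary (hypothetical values).
teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.5, 0.7, -0.5, 0.1]
print(kd_kl(teacher, student, temperature=2.0))
```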
---
## Quantization details
Calibration distribution:

- 70% Russian: chat, technical instruction, scientific exposition, creative writing
- 20% English: technical & instructional
- 10% Code: Python, Rust (RU and EN comments)

Quantization performed on:

- CPU: Intel i7-12700H (14 cores)
- RAM: 64 GB DDR5
- GPU: RTX 3060 6 GB (per-layer scale search)
- Disk offload: NVMe (`B:\noesis_offload_tpro\`, freed after quantization)

Wall time: ~3.5 hours.

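
The settings listed in the model summary (4-bit weights, `group_size=128`, GEMM kernels, zero point enabled, 128 calibration prompts capped at 512 tokens) map onto an AutoAWQ run roughly as sketched below. This is a reconstruction under assumptions, not the script used for this release: `quantize_checkpoint`, `src`, `dst`, and `calib_texts` are illustrative names, and seeding and disk offload are not shown. The heavy imports are deferred into the function so the configuration can be inspected without AutoAWQ installed.

```python
# Quantization settings as listed in the model summary table.
quant_config = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM",
}


def quantize_checkpoint(src, dst, calib_texts):
    """Quantize the checkpoint at `src` with AutoAWQ and save it to `dst`.

    `calib_texts` is a list of calibration prompts (plain strings).
    """
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(src)
    model = AutoAWQForCausalLM.from_pretrained(src)
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calib_texts,
        max_calib_samples=128,   # calibration set size from the summary table
        max_calib_seq_len=512,   # max_seq_len from the summary table
    )
    model.save_quantized(dst)
    tokenizer.save_pretrained(dst)
```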
---
## Acknowledgements & citation
Base model:
```bibtex
@misc{stoianov2025tpro20,
  title = {T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground},
  author = {Dmitrii Stoianov and Danil Taranets and Olga Tsymboi and others},
  year = {2025},
  eprint = {2512.10430},
  archivePrefix = {arXiv}
}
```
Quantization & NOESIS integration:
```bibtex
@misc{noesis_v14,
  title = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author = {Bolotnikov, Ilia},
  year = {2026},
  publisher = {AMAImedia},
  url = {https://amaimedia.com}
}
```