---
license: apache-2.0
language:
- ru
- en
- uk
- be
library_name: transformers
tags:
- awq
- int4
- quantization
- russian
- qwen3
- noesis
- dhcf-fno
base_model: t-tech/T-pro-it-2.1
quantized_by: AMAImedia
pipeline_tag: text-generation
---

# Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4

**AWQ INT4 quantization of [t-tech/T-pro-it-2.1](https://huggingface.co/t-tech/T-pro-it-2.1), optimized for low-VRAM consumer hardware via streaming inference.**

Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform** (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).

- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.6
- **License:** Apache-2.0 (inherited from the base model — fully permissive, commercial use allowed)

---

## ℹ️ Architecture clarification

T-pro-it-2.1 is a **dense Qwen3-32B model**, NOT a Mixture-of-Experts (MoE). Upstream training used a SLERP merge of three GRPO-trained experts as a **training-time technique**, but the resulting checkpoint is a single set of dense weights with one forward pass and no router. This release follows that architecture exactly — there are no expert layers, no gating networks, and no conditional computation.
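
This is visible in the shipped configuration itself; a quick check (illustrative, assuming this repository's id) looks like:

```python
# Illustrative sanity check: the shipped config describes a dense Qwen3 stack.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("amaimedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4")
print(cfg.model_type)         # expected: "qwen3" (a MoE checkpoint would report "qwen3_moe")
print(cfg.num_hidden_layers)  # expected: 64
```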

---

## Model summary

| Property | Value |
| --- | --- |
| Base model | t-tech/T-pro-it-2.1 |
| Underlying architecture | Qwen3-32B (decoder-only transformer, 64 layers, **dense**) |
| Original precision | BF16 safetensors (~64 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Languages | Russian (primary), English, Ukrainian, Belarusian |
| Disk footprint | ~8.5 GB |
| Inference VRAM (full-resident) | ~9 GB (does NOT fit 6 GB GPUs without streaming) |
| Inference VRAM (streaming) | ~3.4 GB peak (per-layer offload — fits 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 prompts (70% RU / 20% EN / 10% code), max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |

---

## Key feature: 6 GB GPU compatibility via streaming

Standard AWQ-INT4 inference for a 32B model needs ~9 GB of VRAM, which excludes RTX 3060 / 4060-class hardware. **NOESIS ships a per-layer weight-streaming inference path** where individual transformer layers are streamed from CPU RAM onto the GPU on demand, executed, and freed. Peak VRAM stays at **~3.4 GB**, well inside the SEALED 4.5 GB NOESIS specialist window.
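
The actual streaming path ships with NOESIS (`extract_kd_streaming.py`, referenced under "How to use" below). The skeleton that follows only illustrates the mechanism with generic layer modules; it omits the attention-mask, rotary-embedding, and KV-cache plumbing a real decoder forward needs:

```python
# Illustrative skeleton of per-layer weight streaming (NOT the NOESIS
# implementation). Each layer's weights live in CPU RAM, are copied to the
# GPU just before use, and are moved back right after, so peak VRAM is
# roughly one layer plus activations.
import torch
import torch.nn as nn

@torch.inference_mode()
def streamed_forward(layers: nn.ModuleList, hidden: torch.Tensor,
                     device: str = "cuda") -> torch.Tensor:
    hidden = hidden.to(device)
    for layer in layers:        # all layers start out resident in CPU RAM
        layer.to(device)        # stream this layer's weights into VRAM
        hidden = layer(hidden)  # execute it
        layer.to("cpu")         # free VRAM before the next layer arrives
    torch.cuda.empty_cache()
    return hidden
```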

Throughput on an RTX 3060 (i7-12700H, DDR5-4800):

- Prefill: ~25 tok/s
- Per-layer load overhead: ~7 ms × 64 layers ≈ 0.45 s amortized per batch

Suitable for KD logits extraction, batch inference, and offline summarization. For low-latency interactive chat, use the same checkpoint on a 12 GB+ GPU in standard AutoAWQ inference mode.

---

## How to use

**Standard inference (12 GB+ GPU):**

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,
)

messages = [
    {"role": "system", "content": "Ты T-pro, полезный ассистент."},
    {"role": "user", "content": "Объясни принцип работы трансформера."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
out = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # sampling must be enabled for temperature/top_p/top_k to take effect
    temperature=0.7, top_p=0.8, top_k=20,
    repetition_penalty=1.0,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Recommended generation parameters per upstream T-Tech guidance: `temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.0`. Both `temperature` and `presence_penalty` should be set explicitly. Note that `transformers` `generate()` does not expose `presence_penalty`; the snippet above therefore passes the neutral `repetition_penalty=1.0`, while serving engines that implement `presence_penalty` can take the recommended value directly.
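
For example, with vLLM (not covered by this card; shown only as a hypothetical way to apply the full recommended parameter set, including `presence_penalty`):

```python
# Hypothetical vLLM usage; the engine choice and flags are assumptions, not
# part of this release. vLLM loads AWQ checkpoints and exposes presence_penalty.
from vllm import LLM, SamplingParams

llm = LLM(model="amaimedia/Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4", quantization="awq")
params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20,
                        presence_penalty=1.0, max_tokens=512)
# For chat-style prompts, apply the tokenizer's chat template first.
outputs = llm.generate(["Объясни принцип работы трансформера."], params)
print(outputs[0].outputs[0].text)
```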

**Streaming inference (6 GB GPU):** see the NOESIS `extract_kd_streaming.py` reference implementation.

---

## NOESIS context

In NOESIS this model serves as the **Russian-language teacher** for several specialists during knowledge distillation:

| Target specialist | Role | Proposed KD weight |
| --- | --- | --- |
| M2-DUB-LM-10B | Dubbing LM (Russian segments) | 0.18 |
| M4-CHAT-10B | Chat / creative writing (Russian) | 0.18 |
| M9-ORCH-4B | Orchestrator (Russian routing) | 0.15 |

The vocabulary (151936 tokens) is identical to that of the NOESIS base (Qwen3-8B), enabling **direct logit alignment** without cross-tokenizer projection — a critical property for clean KD on Russian shards.
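
A minimal sketch of what this buys (illustrative only; the loss mix, temperature, and the 0.18 weight from the table above are proposals, not a published NOESIS training script):

```python
# Because teacher and student share the same 151936-token vocabulary, the
# teacher's logits can supervise the student token-for-token, with no
# cross-tokenizer projection step.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, ce_loss, kd_weight=0.18, temp=2.0):
    """Blend hard-label cross-entropy with KL to the teacher's softened logits."""
    kl = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    ) * (temp * temp)
    return (1.0 - kd_weight) * ce_loss + kd_weight * kl
```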

---

## Quantization details

Calibration distribution:

- 70% Russian: chat, technical instruction, scientific exposition, creative writing
- 20% English: technical & instructional
- 10% Code: Python, Rust (RU and EN comments)
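
A sketch of how such a calibration set feeds AutoAWQ (illustrative; the prompt list, paths, and output directory are stand-ins, not the exact script used for this release):

```python
# Illustrative AutoAWQ 0.2.x quantization call with the settings from the
# model summary above (w_bit=4, group_size=128, GEMM kernels, zero_point=True).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

src = "t-tech/T-pro-it-2.1"
dst = "Qwen3-32B-T-pro-it-2.1-NOESIS-AWQ-INT4"
quant_config = {"w_bit": 4, "q_group_size": 128, "version": "GEMM", "zero_point": True}

tokenizer = AutoTokenizer.from_pretrained(src)
model = AutoAWQForCausalLM.from_pretrained(src, low_cpu_mem_usage=True, use_cache=False)

# Stand-in calibration prompts; the released checkpoint used 128 prompts
# (70% RU / 20% EN / 10% code) at max_seq_len=512.
calib_prompts = [
    "Объясни разницу между потоками и процессами в Python.",
    "Summarize the idea behind attention in two sentences.",
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_prompts)
model.save_quantized(dst)
tokenizer.save_pretrained(dst)
```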

Quantization performed on:

- CPU: Intel i7-12700H (14 cores)
- RAM: 64 GB DDR5
- GPU: RTX 3060 6 GB (per-layer scale search)
- Disk offload: NVMe (`B:\noesis_offload_tpro\`, freed after quantization)

Wall time: ~3.5 hours.

---

## Acknowledgements & citation

Base model:

```bibtex
@misc{stoianov2025tpro20,
  title         = {T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground},
  author        = {Dmitrii Stoianov and Danil Taranets and Olga Tsymboi and others},
  year          = {2025},
  eprint        = {2512.10430},
  archivePrefix = {arXiv}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```