---
language:
- en
license: other
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- quantization
- mixed-precision
- custom-code
- text-generation
- nanollm
model-index:
- name: nano_compact_3b_qkvfp16
  results:
  - task:
      type: text-generation
    dataset:
      name: Internal 4-prompt smoke suite
      type: internal
    metrics:
    - type: model_size_gb
      value: 2.3432
    - type: vram_load_gb
      value: 2.3432
    - type: vram_peak_generate_gb
      value: 2.44
    - type: baseline_true_8bit_load_gb
      value: 3.1703
    - type: baseline_true_8bit_peak_gb
      value: 3.21
---

# Nano Compact 3B QKV-FP16

`RthItalia/nano_compact_3b_qkvfp16` is the validated, self-contained compact variant derived from `Qwen/Qwen2.5-3B-Instruct`.

This release is not the original overlay artifact. It is the final exported, self-contained folder that loads directly with `transformers` and `trust_remote_code=True`.

## What This Variant Is

This model uses a mixed runtime policy:

- `q_proj`, `k_proj`, `v_proj`: stored and loaded in fp16
- `o_proj` and most of the remaining transformer body: stored in the Nano compact format
- `model.embed_tokens`: stored as a single quantized copy
- `lm_head`: tied custom head over the quantized embeddings
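
To sanity-check this policy on a loaded checkpoint, a quick inspection loop like the hypothetical sketch below can be used. It is not part of the shipped runtime: `model` refers to the object created in the loading example further down, and the module names follow the standard Qwen2.5 layout.

```python
# Hypothetical inspection snippet (not part of the shipped runtime).
# q/k/v_proj should report plain fp16 parameters, while o_proj should
# report a custom compact module from this repo's remote code.
for name, module in model.named_modules():
    if name.endswith(("q_proj", "k_proj", "v_proj", "o_proj")):
        dtypes = sorted({str(p.dtype) for p in module.parameters(recurse=False)})
        print(f"{name}: {type(module).__name__} {dtypes}")
```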

The objective of this policy is not maximum compression at any cost. It is the best validated tradeoff found between:

- disk size
- VRAM usage
- quality relative to the true 8-bit baseline

## Validated Runtime Envelope

Measured on the validated 3B run:

- model size on disk: 2.3432 GB
- VRAM allocated after load: 2.3432 GB
- peak VRAM during generation: ~2.44 GB

True 8-bit baseline used for comparison:

- VRAM allocated after load: 3.1703 GB
- peak VRAM during generation: ~3.21 GB

This winner variant therefore preserves a meaningful VRAM advantage over the 8-bit baseline (roughly 0.83 GB at load and 0.77 GB at peak) while recovering enough quality to pass the smoke comparison used during validation.
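
A minimal sketch of how comparable numbers can be collected with the `torch.cuda` memory counters is shown below, assuming a single CUDA device. Exact values will vary with the torch and transformers versions, the driver, and the prompts used.

```python
import torch
from transformers import AutoModelForCausalLM

# Reset the peak counter before loading so both the post-load allocation
# and the generation peak are attributable to this model alone.
torch.cuda.reset_peak_memory_stats()

model = AutoModelForCausalLM.from_pretrained(
    "RthItalia/nano_compact_3b_qkvfp16",
    trust_remote_code=True,
    device_map="cuda",
    dtype=torch.float16,
).eval()

print(f"allocated after load: {torch.cuda.memory_allocated() / 1024**3:.4f} GB")

# ... run a generation as in "Example Generation" below, then:
print(f"peak during generate: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```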

## Quality Claim

The quality claim for this release is intentionally narrow:

- it was compared against the true 8-bit baseline on a small internal prompt suite
- it is not claimed to match the full original model on all tasks
- it is not claimed to outperform the base model

During development, more aggressive variants such as:

- a fully tied quantized head (`tiedq`)
- fully quantized attention

reached better size and VRAM numbers but failed the quality gate against the true 8-bit reference.
`qkvfp16` was the first variant that restored acceptable behavior on the reference prompt set while keeping a substantial memory advantage.

## How To Load

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "RthItalia/nano_compact_3b_qkvfp16"

tok = AutoTokenizer.from_pretrained(
    repo_id,
    use_fast=True,
    trust_remote_code=True,
)

# trust_remote_code=True is required: the compact runtime ships as
# custom code inside the repository.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="cuda",
    dtype=torch.float16,
).eval()
```

## Example Generation

```python
messages = [
    {"role": "user", "content": "Explain what a neural network is in exactly 3 simple sentences."}
]

# Build the chat prompt and move the inputs to the model's device.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inp = tok(text, return_tensors="pt").to(next(model.parameters()).device)

with torch.no_grad():
    out = model.generate(
        **inp,
        max_new_tokens=120,
        do_sample=False,
        repetition_penalty=1.08,
        eos_token_id=tok.eos_token_id,
        pad_token_id=tok.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inp["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Requirements

```bash
pip install torch transformers accelerate safetensors
```

`bitsandbytes` is not required at runtime for this exported winner variant.

## Important Notes

- `trust_remote_code=True` is required.
- The custom runtime uses a `NanoTiedHead` implementation that ties the output logits to the quantized embedding table without registering the embedding module twice.
- The custom linear layers use chunked forward paths to keep peak VRAM under control; an illustrative sketch of both ideas follows this list.
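
The sketch below illustrates both ideas under stated assumptions: a head that reads the quantized embedding table through a plain attribute reference (so the embedding module is neither registered nor serialized twice) and a matmul chunked over the vocabulary so that only one slice of the table is converted to the activation dtype at a time. The class name and details are hypothetical, not the actual `NanoTiedHead` shipped with this repo's remote code.

```python
import torch
import torch.nn as nn

class TiedQuantizedHead(nn.Module):
    """Illustrative stand-in for NanoTiedHead; names and details are assumptions."""

    def __init__(self, embed_tokens: nn.Module, vocab_chunk: int = 8192):
        super().__init__()
        # Bypass nn.Module registration: keep a plain reference so the
        # quantized embedding weights do not appear twice in the state dict.
        object.__setattr__(self, "_embed", embed_tokens)
        self.vocab_chunk = vocab_chunk

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim). The real runtime dequantizes lazily;
        # here the table is treated as a low-precision tensor.
        weight = self._embed.weight
        chunks = []
        for start in range(0, weight.shape[0], self.vocab_chunk):
            # Convert only one vocabulary slice to the activation dtype at a
            # time, so a full fp16 copy of the table never materializes.
            w = weight[start:start + self.vocab_chunk].to(hidden.dtype)
            chunks.append(hidden @ w.t())
        return torch.cat(chunks, dim=-1)
```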

## Limitations

- Validation was narrow and engineering-driven, not a full benchmark suite.
- This release is tuned specifically around `Qwen/Qwen2.5-3B-Instruct`.
- It should be treated as a compact experimental runtime artifact, not as a drop-in scientific proof of broader architectural claims.

## License Note

This compact release is derived from `Qwen/Qwen2.5-3B-Instruct`, but it should follow the licensing and distribution terms chosen for this Nano release repository.
For that reason the model card metadata uses `license: other` instead of asserting Apache coverage for the full release package.

## Provenance

- base model: `Qwen/Qwen2.5-3B-Instruct`
- winner policy name: `qkvfp16`
- published repo: `RthItalia/nano_compact_3b_qkvfp16`
- synced from source: `RthItalia/NanoLLM-Qwen2.5-3B-v3.1`
