ModelHub XC c4649e9454 Initialize project; model provided by the ModelHub XC community
Model: SipsaLabs/qwen3-1.7b-uc2p79
Source: Original Platform
2026-05-16 01:58:32 +08:00


Sipsa Labs, Inc. update, 2026-05-11. UltraCompress v0.6.9 is on PyPI under BUSL-1.1 + Additional Use Grant (free for sub-$1M ARR companies, research, and individuals; auto-converts to Apache 2.0 four years post-release). The OpenAI-compatible inference API at api.sipsalabs.com/v1 is publicly self-serve: Pro $99/mo and Team $499/mo at sipsalabs.com/pricing, or free $5 credits (no card). The pip install ultracompress substrate is production-ready today (no API key required for self-hosting). 22 architectures verified, 0.6B to 405B parameters, sub-1.005× perplexity ratio on Mixtral-8x7B / Qwen3-14B / Mistral-7B. Live discussion on Hacker News. Commercial inquiries: founder@sipsalabs.com.



license: other
license_name: sipsa-labs-research-evaluation-v1.0
license_link: LICENSE
base_model: Qwen/Qwen3-1.7B
tags:
  • ultracompress
  • quantization
  • row-overlay-quantization
  • llm
  • inference
  • on-device
  • edge
  • patent-pending
library_name: transformers
language:
  • en
pipeline_tag: text-generation

qwen3-1.7b-uc2p79

A patent-pending compressed reference variant of Qwen/Qwen3-1.7B, produced with the low-rank correction overlay track: post-training row-overlay quantization at 2.798 bits per weight (patent-pending,511; this specific fit measures 2.7767 bpw effective).

UltraCompress is a two-track patent estate. The first track, low-rank correction overlay (this artifact, shipping today), compresses each weight via row-overlay quantization at sub-3 bpw. The second track, shared-block parameter dispatch (patent-pending,517; research-stage, v0.2 Q3 2026), is a separate architectural compression method that replaces the N transformer layers of a teacher with a single shared block applied iteratively, with measured compression ratios of 311× and 734× on the Qwen3-1.7B body at 68-69.6% top-10-token agreement on held-out data. Combining the two tracks is the multiplicative compression Sipsa Labs is building toward; this v0.1 artifact is the low-rank correction overlay standalone, demonstrating the cohort consistency of the row-overlay quantization line as the foundation under that combined estate.

Read this first — this repository ships in dual format.

  • model.safetensors (~3.3 GB) — FP16 reconstruction. Loadable directly via transformers.from_pretrained. Use this if your runtime expects standard HF safetensors.
  • model.uc.bin (~491 MB at 2.7871 bpw on-disk) — the actually-packed binary at the claimed sub-3-bpw operating point. Loadable via pip install ultracompress. This is the artifact whose disk size matches the headline compression number.

Both files reconstruct the same compressed weights to within FP16 precision (verified bit-equivalent per the pack_v17.py round-trip protocol on this fit). Buyers pick based on runtime: enterprise inference platforms running standard transformers loaders use the safetensors; edge / on-device deployments using the UltraCompress runtime use the packed binary. The model card claims (2.798 bpw cohort design target, 2.7767 bpw measured on this fit) describe the information content of either file — the safetensors is bigger on disk but represents the same compressed model, not a different one.

The ultracompress.json manifest declares both files in its formats block with per-file SHA-256, so uc info validates either format end-to-end.

Quick start

pip install ultracompress
uc pull SipsaLabs/qwen3-1.7b-uc2p79
uc info ./models/SipsaLabs_qwen3-1.7b-uc2p79

The CLI streams the artifact, validates the manifest (SHA-256 + size for every declared file), and surfaces the compression metadata in one read.

Or with huggingface_hub directly:

from huggingface_hub import snapshot_download
local = snapshot_download("SipsaLabs/qwen3-1.7b-uc2p79")

Loading the model

The substituted weights are stored in standard HF FP16 safetensors layout, so any transformers-compatible runtime can load the model. Sample:

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Load the compressed weights from this repository
local = "./models/SipsaLabs_qwen3-1.7b-uc2p79"
cfg = AutoConfig.from_pretrained(local)
model = AutoModelForCausalLM.from_pretrained(
    local,
    dtype=torch.float16,
    config=cfg,
).to("cuda")

# NOTE on trust_remote_code: we ship only pure quantized weights.
# `trust_remote_code=True` is therefore not needed for loading the local
# artifact. The flag IS still passed to the upstream tokenizer below because
# the base model's tokenizer uses it; that is the customer's choice to trust
# the upstream model author.
#
# The base Qwen tokenizer is unchanged from Qwen/Qwen3-1.7B.
# We recommend loading it directly from the upstream Qwen3-1.7B repo,
# which is the path that the `transformers` AutoTokenizer auto-resolves
# most cleanly across versions.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B", trust_remote_code=True)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

What's in this artifact

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | ~3.3 GB | FP16-reconstructed weights; direct transformers.from_pretrained compatibility |
| model.uc.bin | ~491 MB | Packed UltraCompress binary at 2.7871 bpw on-disk; load via pip install ultracompress |
| ultracompress.json | <2 KB | Provenance manifest with method, bpw, base-model id, USPTO references, license name, per-file SHA-256, formats block declaring both weight files |
| config.json | <2 KB | Inherited from the base Qwen3-1.7B model |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json / merges.txt / vocab.json / added_tokens.json / chat_template.jinja | ~14 MB | Tokenizer files copied from the base model |
| LICENSE | ~7 KB | Sipsa Labs Research and Evaluation License v1.0 (full text) |
| generation_config.json | <1 KB | Inherited from base |

uc info ./models/SipsaLabs_qwen3-1.7b-uc2p79 will validate every entry in the manifest's files block against the actual on-disk size and SHA-256 — tamper-evidence you can read in one command.
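The size-plus-hash check described above can also be approximated by hand. The sketch below is not the UltraCompress implementation. It assumes a hypothetical entry shape (file name mapped to `sha256` and `size` fields), so the key names would need adapting to whatever ultracompress.json actually contains.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB weights never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entries(root, entries):
    """Return the names whose size or hash differs from the declared values.

    `entries` maps file name -> {"sha256": ..., "size": ...}. This shape is
    an assumption for illustration, not the documented manifest schema.
    """
    mismatched = []
    for name, expected in entries.items():
        path = os.path.join(root, name)
        if not os.path.isfile(path):
            mismatched.append(name)
        elif os.path.getsize(path) != expected["size"]:
            mismatched.append(name)  # cheap check first, before hashing
        elif sha256_of(path) != expected["sha256"]:
            mismatched.append(name)
    return mismatched


# Self-contained demo on a throwaway file (not the real artifact).
with tempfile.TemporaryDirectory() as demo_root:
    demo_path = os.path.join(demo_root, "weights.bin")
    with open(demo_path, "wb") as fh:
        fh.write(b"example payload")  # 15 bytes
    good = {"weights.bin": {"sha256": sha256_of(demo_path), "size": 15}}
    bad = {"weights.bin": {"sha256": "0" * 64, "size": 15}}
    demo_ok = verify_entries(demo_root, good)
    demo_bad = verify_entries(demo_root, bad)
    print(demo_ok, demo_bad)
```

Comparing sizes before hashing keeps the common failure case (a truncated download) cheap to detect on a 3.3 GB file.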

Compression details

| Metric | Value |
| --- | --- |
| Method | UltraCompress row-overlay quantization (low-rank correction overlay) |
| Method version | v17hi |
| Operating-point bpw (cohort design target) | 2.798 |
| Measured effective bpw (this specific fit) | 2.7767 (2.7708 body + 0.0059 codec overhead) |
| Base model | Qwen/Qwen3-1.7B |
| On-disk file size | ~3.3 GB (FP16 reconstruction; see "Read this first" above) |
| Patent posture | patent-pending,511 (low-rank correction overlay) + (shared-block parameter dispatch); patent pending |
| Filed | 2026-04-25 |

The 2.777-bpw operating point is the v17hi line of the patent-pending row-overlay quantization method described in patent-pending,511. A complementary 2.40-bpw operating point on the same model and base is documented internally (v17 line, packed binary round-trip verified on Qwen3-1.7B + 5 other models in the Sipsa Labs cohort) and will be published as a sibling artifact in this organization.
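As a quick arithmetic check on the measured figure, the sketch below adds the body and codec-overhead terms from the compression details and derives the nominal per-weight ratio against 16-bit storage. The ratio is an illustration only: it describes bits per compressed weight and assumes nothing about tensors a packer might keep at higher precision.

```python
FP16_BITS = 16.0

body_bpw = 2.7708        # body bits per weight, as reported on this card
codec_overhead = 0.0059  # codec overhead in bits per weight, as reported

# Effective bpw is the sum of the two reported components.
effective_bpw = body_bpw + codec_overhead

# Nominal per-weight ratio versus a 16-bit representation.
nominal_ratio = FP16_BITS / effective_bpw

print(f"effective bpw: {effective_bpw:.4f}")           # 2.7767
print(f"nominal ratio vs FP16: {nominal_ratio:.2f}x")  # 5.76x
```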

Catastrophic-failure check

A "catastrophic failure" is defined as a downstream-task perplexity ratio greater than 10× the FP16 baseline. On Sipsa Labs' internal 6-model cohort (TinyLlama-1.1B, OLMo-2-1B, SmolLM2-1.7B, Qwen3-1.7B, Mistral-7B-v0.3, Qwen3-8B) at the low-rank correction overlay operating point: 0 of 6 models exhibit catastrophic failure. This artifact (Qwen3-1.7B at 2.777 bpw): non-catastrophic.

The cohort framing matters — this is a property of the method on this cohort, not an absolute claim about every possible base model.
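The definition above reduces to a single threshold comparison. The sketch below applies it to the WikiText-103 perplexity ratios from the cohort table later in this card. Using those ratios as the input is an assumption, since the definition refers to a downstream-task perplexity ratio without naming the dataset.

```python
CATASTROPHIC_PPL_RATIO = 10.0  # threshold from the definition above

# PPL ratios (compressed / FP16) copied from the cohort table on this card.
cohort_ppl_ratio = {
    "OLMo-2-1B": 1.165,
    "TinyLlama-1.1B": 1.097,
    "SmolLM2-1.7B": 1.218,
    "Qwen3-1.7B": 1.225,
    "Mistral-7B-v0.3": 1.075,
    "Qwen3-8B": 1.067,
}


def is_catastrophic(ppl_ratio, threshold=CATASTROPHIC_PPL_RATIO):
    """True when the perplexity ratio exceeds the failure threshold."""
    return ppl_ratio > threshold


failures = [name for name, ratio in cohort_ppl_ratio.items()
            if is_catastrophic(ratio)]
print(f"{len(failures)} of {len(cohort_ppl_ratio)} models catastrophic")
```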

Cohort scaling — retention scales with model size

The same low-rank correction overlay 2.798 bpw operating point measured across the 6-model Sipsa cohort (n=500, seed=42, WikiText-103 perplexity ratio):

| Model | Body params | T1 retention vs FP16 | T10 retention vs FP16 | PPL ratio |
| --- | --- | --- | --- | --- |
| OLMo-2-1B | 1.00B | 94.19% | 97.04% | 1.165 |
| TinyLlama-1.1B | 1.10B | 96.37% | 97.88% | 1.097 |
| SmolLM2-1.7B | 1.71B | 93.72% | 96.71% | 1.218 |
| Qwen3-1.7B (this artifact) | 1.72B | 93.81% | 96.55% | 1.225 |
| Mistral-7B-v0.3 | 7.25B | 98.04% | 99.06% | 1.075 |
| Qwen3-8B | 8.19B | 97.63% | 98.84% | 1.067 |

Spearman rank correlation between body-parameter count and T1 retention: +0.486 for UltraCompress. bitsandbytes NF4 at 4.0 bpw on the same cohort: 0.086 — essentially flat.
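The UltraCompress correlation can be recomputed from the cohort table with the textbook no-ties formula, rho = 1 - 6·Σd² / (n(n² - 1)). The sketch below is a plain-Python check, not the script Sipsa Labs used. The NF4 figure cannot be recomputed here because its per-model retention values are not listed on this card.

```python
def ranks(values):
    """1-based ranks; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Columns copied from the cohort table (same row order).
body_params_b = [1.00, 1.10, 1.71, 1.72, 7.25, 8.19]
t1_retention = [94.19, 96.37, 93.72, 93.81, 98.04, 97.63]

rho = spearman(body_params_b, t1_retention)
mean_t1 = sum(t1_retention) / len(t1_retention)

print(f"Spearman rho: {rho:+.3f}")            # +0.486
print(f"mean T1 retention: {mean_t1:.1f}%")   # 95.6%
```

With n=6 the squared rank differences sum to 18, so a single swapped pair of ranks would move rho noticeably.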

UltraCompress T1 retention rises by 4.32 percentage points across the 1B to 8B cohort (from the table's minimum of 93.72% to its maximum of 98.04%). NF4 rises by 1.93 pp. The UltraCompress scaling slope is about 2.2× NF4's.

The mechanism is design-level: row-overlay's per-row scale + learned codebook + rotation matrix calibrate to the per-model magnitude distribution. Larger transformer matrices give the codebook more rows to learn from. NF4 is a fixed dictionary — no per-model adaptation, no scaling.

This artifact (Qwen3-1.7B at 93.81% T1) is mid-cohort. The same method on 7B+ class models retains substantially more quality. For 30B+ class teachers the trend extrapolates further (not yet measured — replication invited; open an issue at github.com/sipsalabs/ultracompress/issues with model + seed + result).

n=6 caveat: this is the 6-model cohort tested by Sipsa Labs. The scaling claim is a property of the method on this cohort. Generalization to broader cohorts is the open empirical question.

Quality benchmarks

A small live benchmark is included below as a sanity-check; the full per-task benchmark numbers are intended to be reproduced in the buyer's own evaluation harness against their own baselines.

Reference benchmark (this artifact, paired against FP16 baseline)

Same 200-sample subset, same seed (1234), same batch size, same fp16 inference, same lm-eval-harness.

| Task | FP16 baseline (Qwen/Qwen3-1.7B) | Compressed (this artifact at 2.798 bpw) | Retention |
| --- | --- | --- | --- |
| HellaSwag, acc | 43.50% (±3.51%) | 40.00% (±3.47%) | 91.95% |
| HellaSwag, acc_norm | 49.50% (±3.54%) | 47.00% (±3.54%) | 94.95% |

Both compressed-model values are within ±1 standard error of the FP16 baseline at n=200 — statistically indistinguishable. For a final-eval-grade benchmark the buyer should run on the full 10042-sample HellaSwag with multiple seeds and broader task coverage; the table above is a reproducible sanity-check, not a final claim.
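The retention column and the ± values can be recomputed from the accuracies alone. The sketch below assumes the reported error bars are the sample standard error of a proportion, sqrt(p(1 - p)/(n - 1)) with n=200. That assumption reproduces the printed values, but it is not confirmed as the harness's exact formula.

```python
import math

N_SAMPLES = 200  # subset size stated for the reference benchmark

# (FP16 baseline accuracy, compressed accuracy) from the HellaSwag results.
rows = {
    "acc": (0.4350, 0.4000),
    "acc_norm": (0.4950, 0.4700),
}


def stderr(p, n):
    """Sample standard error of a proportion (assumed formula)."""
    return math.sqrt(p * (1 - p) / (n - 1))


summary = {}
for metric, (base, comp) in rows.items():
    base_se = stderr(base, N_SAMPLES)
    summary[metric] = {
        "retention_pct": 100 * comp / base,
        "base_se_pct": 100 * base_se,
        "comp_se_pct": 100 * stderr(comp, N_SAMPLES),
        # True when the gap is no larger than one baseline standard error.
        "within_one_se": (base - comp) <= base_se,
    }
    s = summary[metric]
    print(f"{metric}: retention {s['retention_pct']:.2f}%, "
          f"SE {s['base_se_pct']:.2f}% / {s['comp_se_pct']:.2f}%, "
          f"within 1 SE: {s['within_one_se']}")
```

For acc the gap is 3.50 points against a baseline standard error of about 3.51, so that metric sits right at the one-standard-error boundary. A larger sample would be needed to say more.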

Reproduce

# via lm-eval-harness directly (recommended workaround for transformers 4.57.x
# Qwen3-tokenizer-from-local-path issue — point tokenizer at the upstream repo):
python -m lm_eval --model hf \
    --model_args "pretrained=./models/SipsaLabs_qwen3-1.7b-uc2p79,tokenizer=Qwen/Qwen3-1.7B,dtype=float16,trust_remote_code=True" \
    --tasks hellaswag,arc_challenge,mmlu \
    --limit 500 --batch_size 8 --device cuda:0

For a paired FP16-baseline-vs-compressed comparison on the same task and same seed (the right way to read retention numbers), substitute pretrained=Qwen/Qwen3-1.7B in a separate run and compare task-by-task.

The cohort-level claim (95.6% T1 retention, zero catastrophic failures across 6 models at the low-rank correction overlay operating point) comes from the WikiText-103 perplexity protocol documented in the patent specifications, not from HellaSwag accuracy. Different evaluation surfaces measure different things; the artifact-specific numbers above are the reproducible HellaSwag sanity-check, not the full cohort claim.

For a Compression Assessment engagement that includes the buyer's specific baseline, evaluation tasks, and a written readout: email founder@sipsalabs.com.

Intended use

Permitted under this License (free of charge):

  • Personal, non-commercial research
  • Academic research at non-profit institutions (with attribution)
  • Pre-purchase evaluation by an enterprise considering negotiating a commercial license — for a period not to exceed 90 days from first download

Requires a separate commercial license (email legal@sipsalabs.com):

  • Production deployment in any commercial product or service
  • Use in an API or hosted inference service offered to third parties
  • Embedding in or shipping within hardware products, consumer devices, automobiles, robotics platforms
  • Training of any derivative model for commercial use
  • Any use by for-profit entities other than internal evaluation

The full License is in LICENSE and at sipsalabs.com.

Out-of-scope use

This artifact is published for research and evaluation. It is not intended for safety-critical, life-critical, or human-subject decision-making applications. Compression introduces measurable quality regression versus the FP16 baseline; do not deploy this artifact in a setting where that regression is unacceptable. Run the buyer's own evaluation before any production decision.

Limitations

  • The compression methods are post-training and preserve the base model's strengths and weaknesses. Whatever bias, refusal behavior, or out-of-distribution failure modes the Qwen3-1.7B FP16 base has, this artifact inherits.
  • The model.safetensors file stores reconstructed weights in FP16 layout, so it carries no on-disk savings; the savings live in the loader and the packed model.uc.bin artifact.
  • Direct integration with quantization-aware runtimes (llama.cpp / TensorRT-LLM / vLLM quantization paths) is on the v0.2 roadmap. For v0.1.x, transformers and the UltraCompress CLI are the supported load paths.

Reproducibility

Every public claim on this card maps to a verifiable on-disk artifact:

# Pull this artifact
uc pull SipsaLabs/qwen3-1.7b-uc2p79

# Verify the manifest end-to-end (size + SHA-256 for every declared file)
uc info ./models/SipsaLabs_qwen3-1.7b-uc2p79

# Reproduce the benchmark numbers in your own evaluation harness
uc bench ./models/SipsaLabs_qwen3-1.7b-uc2p79 \
    --tasks hellaswag,arc_challenge,mmlu \
    --limit 500 --batch-size 8 --device cuda:0

For a SHA-256 manifest of all training and evaluation inputs that produced this artifact (private context, available under NDA): legal@sipsalabs.com.

Citation

If you use this artifact in research, please cite:

@misc{ounnar2026ultracompress,
  title        = {UltraCompress: Patent-Pending Compression Infrastructure for Large Language Models},
  author       = {{Sipsa Labs, Inc.}},
  year         = {2026},
  note         = {U.S.\ patent applications  and , patent pending. Filed 2026-04-25.},
  howpublished = {\url{https://sipsalabs.com}},
}

Get in touch


Sipsa Labs, Inc. | sipsalabs.com | patent pending (filed 2026-04-25)