license, library_name, pipeline_tag, language, base_model, tags, datasets, model-index
license library_name pipeline_tag language base_model tags datasets model-index
apache-2.0 transformers text-generation
en
meta-llama/Llama-3.1-8B-Instruct
dystrio
sculpt
pruned
compressed
efficient
dense
runtime-agnostic
no-custom-kernels
hf-drop-in
drop-in-replacement
smaller
faster
llama
wikitext
name results
Dystrio Sculpt (Llama-3.1-8B-Instruct Default)
task dataset metrics
type
text-generation
name type
WikiText-103 (validation) wikitext
name type value
perplexity perplexity 14.7778
name type value
ppl_ratio ppl_ratio 1.0641

dystrio/Llama-3.1-8B-Instruct-sculpt-default

10% smaller, quality preserved (1.0641x PPL), drop-in replacement. No custom kernels. No runtime changes.

Dystrio Sculpt structurally compresses transformer models, producing dense models that load with standard transformers — no custom code, no new ops, no deployment friction.

This is the Default tier of Llama 3.1 8B Instruct.

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("dystrio/Llama-3.1-8B-Instruct-sculpt-default", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("dystrio/Llama-3.1-8B-Instruct-sculpt-default")

inputs = tokenizer("The future of AI inference is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Benchmark Results

All tiers compiled from Llama 3.1 8B Instruct on A100 80GB, bf16:

Model PPL PPL Ratio Weights (GB) Chat Prefill TPS RAG TTFT p95 (ms) Decode TPS
Baseline 13.8879 1.0 14.957527 10570.4 126.745 66.8
sculpt-default 14.7778 1.0641 13.457527 11418.6 116.957 65.5
sculpt-production 21.9236 1.5786 11.863777 12760.5 112.529 66.7
sculpt-throughput 27.7463 1.9979 11.020027 13408.6 104.086 67.5
sculpt-experimental 29.3853 2.1159 10.832527 13483.3 103.432 67.4

Key Metrics (this model)

Metric Value
Weights memory 13.457527 GB (10% smaller)
PPL ratio 1.0641
Chat prefill TPS 11418.6 (+8%)
RAG TTFT p95 116.957 ms (-8%)
Decode TPS 65.5 (flat)
Parameters 7.22B

All Sculpt Tiers

Tier HuggingFace Size PPL Ratio Use Case
default dystrio/Llama-3.1-8B-Instruct-sculpt-default 👈 this model 13.457527 GB 1.0641 Zero-regret: quality preserved, smaller footprint
production dystrio/Llama-3.1-8B-Instruct-sculpt-production 11.863777 GB 1.5786 Practical savings with modest quality tradeoff
throughput dystrio/Llama-3.1-8B-Instruct-sculpt-throughput 11.020027 GB 1.9979 Maximum usable compression for speed/edge
experimental dystrio/Llama-3.1-8B-Instruct-sculpt-experimental 10.832527 GB 2.1159 Boundary exploration, maximum structural compression

What is Dystrio Sculpt?

Dystrio Sculpt compiles transformer models into smaller, faster variants. Output models:

  • Are dense (not sparse) — standard architecture, fewer parameters
  • Load with standard HuggingFace Transformers — no custom code needed
  • Require no custom kernels and no runtime changes
  • Work as a one-step compile before deployment
  • Stack with quantization (AWQ, GPTQ, GGUF) for compound savings

Compatibility

  • HuggingFace Transformers
  • vLLM
  • TGI (Text Generation Inference)
  • llama.cpp / GGUF conversion
  • AWQ / GPTQ quantization
  • Any framework that loads standard safetensors

Benchmark Environment

  • GPU: NVIDIA A100-SXM4-80GB
  • dtype: bf16
  • Torch: 2.10.0+cu128
  • Transformers: 5.3.0
  • Deterministic: True
  • Single-GPU, standard HuggingFace Transformers, no custom kernels.

Metric Definitions

  • PPL ratio: WikiText-103 perplexity relative to baseline. <1.0 = quality improved.
  • Prefill TPS: Tokens per second during prompt encoding (higher = faster).
  • TTFT p95: Time to first token at 95th percentile (lower = faster).
  • Decode TPS: Tokens per second during generation (higher = faster).
  • Weights (GB): Model parameter memory (deterministic, runtime-independent).

Citation

@misc{dystrio_sculpt_2026,
  title={Dystrio Sculpt: Structural Compilation for Transformer LLMs},
  author={Dystrio},
  year={2026},
  url={https://huggingface.co/dystrio}
}

Downstream Benchmarks (lm-eval)

Evaluated with lm-eval-harness on A100-80GB, bf16, zero-shot.

Benchmark Baseline This Model Delta
ARC-Challenge 0.5358 0.4283 -0.1075
HellaSwag 0.5977 0.5416 -0.0561
MMLU 0.6844 0.5590 -0.1254
TruthfulQA MC2 0.5456 0.4824 -0.0632
Description
Model synced from source: dystrio/Llama-3.1-8B-Instruct-sculpt-default
Readme 29 KiB
Languages
Jinja 100%