license, library_name, pipeline_tag, language, base_model, tags, datasets, model-index
license
library_name
pipeline_tag
language
base_model
tags
datasets
model-index
apache-2.0
transformers
text-generation
mistralai/Mistral-7B-Instruct-v0.3
dystrio
sculpt
pruned
compressed
efficient
dense
runtime-agnostic
no-custom-kernels
hf-drop-in
drop-in-replacement
smaller
faster
mistral
name
results
Dystrio Sculpt (Mistral-7B-Instruct-v0.3 Throughput)
task
dataset
metrics
name
type
WikiText-103 (validation)
wikitext
name
type
value
perplexity
perplexity
16.3355
name
type
value
ppl_ratio
ppl_ratio
1.2966
dystrio/Mistral-7B-Instruct-v0.3-sculpt-throughput
23% smaller, +20% faster prefill, drop-in replacement. No custom kernels. No runtime changes.
Dystrio Sculpt structurally compresses transformer models, producing dense models that load with standard transformers — no custom code, no new ops, no deployment friction.
This is the Throughput tier of Mistral 7B Instruct v0.3 .
Quick Start
Benchmark Results
All tiers compiled from Mistral 7B Instruct v0.3 on A100 80GB, bf16:
Model
PPL
PPL Ratio
Weights (GB)
Chat Prefill TPS
RAG TTFT p95 (ms)
Decode TPS
Baseline
12.5983
1.0
13.500496
10557.3
133.325
66.8
sculpt-default
11.6283
0.923
12.000496
11594.3
123.069
65.3
sculpt-production
14.2859
1.134
11.250496
12093.9
120.842
66.0
sculpt-throughput
16.3355
1.2966
10.406746
12667.0
112.683
65.8
sculpt-experimental
25.1515
1.9964
9.562996
13595.9
110.293
66.5
Key Metrics (this model)
Metric
Value
Weights memory
10.406746 GB (23% smaller)
PPL ratio
1.2966
Chat prefill TPS
12667.0 (+20%)
RAG TTFT p95
112.683 ms (-15%)
Decode TPS
65.8 (flat)
Parameters
5.59B
All Sculpt Tiers
What is Dystrio Sculpt?
Dystrio Sculpt compiles transformer models into smaller, faster variants. Output models:
Are dense (not sparse) — standard architecture, fewer parameters
Load with standard HuggingFace Transformers — no custom code needed
Require no custom kernels and no runtime changes
Work as a one-step compile before deployment
Stack with quantization (AWQ, GPTQ, GGUF) for compound savings
Compatibility
✅ HuggingFace Transformers
✅ vLLM
✅ TGI (Text Generation Inference)
✅ llama.cpp / GGUF conversion
✅ AWQ / GPTQ quantization
✅ Any framework that loads standard safetensors
Benchmark Environment
GPU : NVIDIA A100-SXM4-80GB
dtype : bf16
Torch : 2.10.0+cu128
Transformers : 5.3.0
Deterministic : True
Single-GPU, standard HuggingFace Transformers, no custom kernels.
Metric Definitions
PPL ratio : WikiText-103 perplexity relative to baseline. <1.0 = quality improved.
Prefill TPS : Tokens per second during prompt encoding (higher = faster).
TTFT p95 : Time to first token at 95th percentile (lower = faster).
Decode TPS : Tokens per second during generation (higher = faster).
Weights (GB) : Model parameter memory (deterministic, runtime-independent).
Citation
Downstream Benchmarks (lm-eval)
Evaluated with lm-eval-harness on A100-80GB, bf16, zero-shot.
Benchmark
Baseline
This Model
Delta
ARC-Challenge
0.5794
0.3797
-0.1997
HellaSwag
0.6573
0.5075
-0.1498
MMLU
0.5975
0.3982
-0.1993
TruthfulQA MC2
0.5939
0.4860
-0.1079