This model is a 50% expert-pruned version of Qwen/Qwen3-30B-A3B, compressed using REAP (Router-weighted Expert Activation Pruning) from Cerebras Research.
REAP is a one-shot compression technique for Mixture-of-Experts (MoE) models that physically removes low-importance experts based on a saliency criterion combining router gate-values and expert activation norms. The method was published at ICLR 2026.
What Changed
| Property | Original | Pruned |
|---|---|---|
| Total Experts per Layer | 128 | 64 |
| Active Experts per Token | 8 | 8 (unchanged) |
| Model Size on Disk | 57 GB | 30 GB |
| Safetensor Shards | 16 | 7 |
| Architecture | Qwen3MoeForCausalLM | Qwen3MoeForCausalLM (unchanged) |
| Hidden Size | 2048 | 2048 (unchanged) |
| Layers | 48 | 48 (unchanged) |
| Precision | bfloat16 | bfloat16 (unchanged) |
The pruned model is a standard Hugging Face model and can be loaded directly with transformers; no custom code is required.
How REAP Works
The Problem
MoE models like Qwen3-30B-A3B use sparsely-activated expert networks: each token is routed to only 8 of 128 available experts per layer, so most experts sit idle for any given token. Experts that are rarely selected, or that contribute little when they are selected, are candidates for removal. REAP exploits this by identifying and removing the least important experts.
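To make the sparsity concrete, here is a minimal sketch of top-k routing for a single token, in plain Python. The function name `route_token` and the toy logits are illustrative assumptions, not the model's actual router code, and any renormalization of gate weights after top-k selection (which depends on the model config) is left out.

```python
import math

def route_token(router_logits, top_k=8):
    """Return {expert_index: gate_weight} for the top_k experts of one token.

    Hypothetical helper for illustration; not the Qwen3 implementation.
    """
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k experts with the highest probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return {i: probs[i] for i in chosen}

# Toy logits for 128 experts (made-up values, every expert gets a distinct logit).
logits = [((i * 37) % 128) / 16.0 for i in range(128)]
gates = route_token(logits, top_k=8)
print(f"{len(gates)} of {len(logits)} experts active "
      f"({len(gates) / len(logits):.2%} of the layer)")
```

With 8 of 128 experts selected, only 6.25% of a layer's experts run for a given token; after pruning to 64 experts the same 8 selections cover 12.5%.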
The REAP Saliency Criterion
REAP scores each expert using a dual criterion that captures both how often an expert is selected and how much it contributes when active:
REAP_score(expert_i) = mean over calibration tokens of:
router_weight(expert_i) * activation_norm(expert_i)
Where:
Router weight (router_weight): The softmax probability assigned by the gating network when selecting this expert. Higher means the router "prefers" this expert.
Expert Activation Norm (activation_norm): The L2 norm of the expert's output vector. Higher means the expert produces larger (more impactful) modifications to the hidden state.
The product favors experts that are both strongly selected and produce meaningful outputs. An expert with high router weight but low activation norm barely changes the hidden state even when it is chosen; one with high activation norm but low router weight is rarely or only weakly used, so its output is scaled down by the gate. REAP keeps the experts that matter on both dimensions.
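The scoring rule above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the REAP reference implementation: `reap_scores` and the record format are hypothetical, and the mean is taken over the tokens actually routed to each expert. Dividing by the total number of calibration tokens instead would additionally penalize rarely selected experts.

```python
from collections import defaultdict

def reap_scores(records, num_experts):
    """Average router_weight * activation_norm per expert.

    records: iterable of (expert_index, router_weight, activation_norm),
    one entry per (token, selected expert) pair seen during calibration.
    Hypothetical helper; the averaging set is an assumption (see lead-in).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert, weight, norm in records:
        totals[expert] += weight * norm
        counts[expert] += 1
    # Experts that were never selected receive a score of 0.
    return [totals[e] / counts[e] if counts[e] else 0.0
            for e in range(num_experts)]

# Made-up calibration records for a 4-expert toy layer.
records = [
    (0, 0.5, 2.0),   # strongly selected, sizeable output
    (0, 0.3, 4.0),
    (1, 0.9, 0.1),   # high router weight, tiny output
    (2, 0.05, 10.0), # large output, low router weight
]
scores = reap_scores(records, num_experts=4)
print(scores)
```

In this toy data, expert 0 ranks highest because it does well on both factors, while expert 1 (high weight, small norm) and expert 3 (never selected) end up at the bottom.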
Why Pruning Beats Merging
The REAP paper (ICLR 2026) demonstrates a key finding: expert pruning consistently outperforms expert merging for MoE compression on generative tasks. Merging (combining similar experts into one) degrades all participating experts, while pruning (removing entire experts) preserves the full capacity of remaining experts and the router's ability to select among them.
The Full Pipeline
1. Load Model
|
2. Attach Observer Hooks to every MoE layer
|
3. Forward Pass over calibration data (1024 samples)
|-- Record router weights per expert per token
|-- Record L2 norm of expert outputs per token
|
4. Compute REAP saliency score for each expert
|-- score = mean(router_weight * activation_norm)
|
5. Rank experts by saliency score (lowest = least important)
|
6. Prune bottom 50% of experts per layer
|-- Remove expert modules from ModuleList
|-- Slice router weight matrix to match
|
7. Update config.json (num_experts: 128 -> 64)
|
8. Save compressed model
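Steps 5 through 7 can be sketched for a single layer as follows. This is a simplified illustration using plain Python lists in place of a real ModuleList and router weight tensor; `prune_layer`, the toy scores, and the toy router rows are all hypothetical.

```python
def prune_layer(scores, experts, router_rows, keep_ratio=0.5):
    """Keep the highest-scoring experts of one MoE layer.

    scores:      one saliency score per expert.
    experts:     one expert module (here: any object) per expert.
    router_rows: one router weight row per expert; row i produces
                 the routing logit of expert i.
    Returns (kept_indices, kept_experts, kept_router_rows).
    """
    num_keep = int(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Sort the survivors by index so they keep their original relative order.
    kept = sorted(ranked[:num_keep])
    kept_experts = [experts[i] for i in kept]
    # Slice the router with the same indices so that row j still
    # corresponds to the j-th remaining expert.
    kept_router = [router_rows[i] for i in kept]
    return kept, kept_experts, kept_router

# Toy layer with 4 experts and a hidden size of 2 (made-up numbers).
scores = [1.1, 0.09, 0.5, 0.0]
experts = ["expert_0", "expert_1", "expert_2", "expert_3"]
router_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

kept, experts, router_rows = prune_layer(scores, experts, router_rows)
config = {"num_experts": len(scores)}
config["num_experts"] = len(experts)  # mirrors step 7: 4 -> 2 here
print(kept, experts, config)  # kept == [0, 2]
```

Using one index list for both the expert list and the router rows is the point of the sketch: if the two were filtered separately, the router could end up sending tokens to the wrong expert.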
All benchmarks run at 0-shot (no few-shot examples)
Evaluation performed on the base model (not instruction-tuned)
Evaluated using lm-eval-harness v0.4.11 with model="hf" backend
Model loaded with device_map="auto" across 2 GPUs
Usage
Direct Loading with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "harryadav3/Qwen3-30B-A3B-REAP-50"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."}
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
If you use this model, please cite the REAP paper:
```bibtex
@inproceedings{klasby2025reap,
  title     = {{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author    = {Mike Klasby and Thao Nguyen and Robert D Nowak},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://arxiv.org/abs/2510.13999}
}
```