Model: harryadav3/Qwen3-30B-A3B-REAP-50
Source: Original Platform

2026-04-26 14:48:05 +08:00

Qwen3-30B-A3B-REAP-50: 50% Expert-Pruned Qwen3 MoE

This model is a 50% expert-pruned version of Qwen/Qwen3-30B-A3B, compressed using REAP (Router-weighted Expert Activation Pruning) from Cerebras Research.

REAP is a one-shot compression technique for Mixture-of-Experts (MoE) models that physically removes low-importance experts based on a saliency criterion combining router gate-values and expert activation norms. The method was published at ICLR 2026.

What Changed

Property	Original	Pruned
Total Experts per Layer	128	64
Active Experts per Token	8	8 (unchanged)
Model Size on Disk	57 GB	30 GB
Safetensor Shards	16	7
Architecture	Qwen3MoeForCausalLM	Qwen3MoeForCausalLM (unchanged)
Hidden Size	2048	2048 (unchanged)
Layers	48	48 (unchanged)
Precision	bfloat16	bfloat16 (unchanged)

The pruned model is a standard HuggingFace model and can be loaded directly with transformers -- no custom code required.
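The size reduction in the table can be sanity-checked with a little arithmetic. The sketch below assumes each expert is a gated feed-forward block with three bias-free projection matrices (gate, up, down), which is how the stock Qwen3 MoE implementation in transformers lays out its experts. The dimensions are the ones listed under Model Configuration.

```python
# Back-of-the-envelope check of the disk savings from removing 64 experts per layer.
# Assumption: each expert holds three bias-free matrices (gate, up, down projections).
HIDDEN_SIZE = 2048
MOE_INTERMEDIATE_SIZE = 768
NUM_LAYERS = 48
EXPERTS_REMOVED_PER_LAYER = 128 - 64
BYTES_PER_PARAM = 2  # bfloat16

params_per_expert = 3 * HIDDEN_SIZE * MOE_INTERMEDIATE_SIZE
removed_params = NUM_LAYERS * EXPERTS_REMOVED_PER_LAYER * params_per_expert
removed_gib = removed_params * BYTES_PER_PARAM / 2**30

print(f"parameters per expert: {params_per_expert:,}")
print(f"parameters removed:    {removed_params:,}")
print(f"bfloat16 bytes saved:  {removed_gib:.1f} GiB")
```

Under these assumptions roughly 14.5 billion parameters (27 GiB in bfloat16) are removed, in line with the drop from 57 GB to 30 GB. The router rows removed alongside the experts are negligible by comparison.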

How REAP Works

The Problem

MoE models like Qwen3-30B-A3B use sparsely activated expert networks: each token is routed to only 8 of the 128 experts in each layer. Most experts therefore sit idle for any given token, which suggests that many of them may be redundant. REAP exploits this by identifying and removing the least important experts.

The REAP Saliency Criterion

REAP scores each expert using a dual criterion that captures both how often an expert is selected and how much it contributes when active:

REAP_score(expert_i) = mean over calibration tokens of:
    router_weight(expert_i) * activation_norm(expert_i)

Where:

Router weight (router_weight): The softmax probability assigned by the gating network when selecting this expert. Higher means the router "prefers" this expert.
Expert Activation Norm (activation_norm): The L2 norm of the expert's output vector. Higher means the expert produces larger (more impactful) modifications to the hidden state.

The product favors experts that are both strongly selected and produce meaningful outputs. An expert with a high router weight but a low activation norm barely changes the hidden state; one with a high activation norm but a low router weight is rarely or only weakly used. REAP keeps the experts that matter on both dimensions.
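The criterion above can be sketched in a few lines of plain Python. This is an illustration, not the REAP codebase: the function name `reap_scores` and the record layout are hypothetical. It also assumes the mean is taken over the tokens actually routed to each expert, since an activation norm only exists for tokens an expert processes.

```python
def reap_scores(routing_records, num_experts):
    """Average router_weight * activation_norm per expert.

    routing_records: one list per calibration token, holding
    (expert_id, router_weight, activation_norm) for each selected expert.
    Assumption: the mean runs over tokens routed to the expert.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for token in routing_records:
        for expert_id, router_weight, activation_norm in token:
            totals[expert_id] += router_weight * activation_norm
            counts[expert_id] += 1
    # Experts that were never selected get a score of 0 (least important).
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy example: 4 experts, top-2 routing, 3 calibration tokens.
records = [
    [(0, 0.7, 2.0), (1, 0.3, 1.0)],
    [(0, 0.6, 3.0), (2, 0.4, 0.5)],
    [(1, 0.5, 1.0), (2, 0.5, 0.5)],
]
scores = reap_scores(records, num_experts=4)
print(scores)
```

In the toy data, expert 0 scores highest because it combines large router weights with large output norms, while expert 3 is never selected and scores zero.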

Why Pruning Beats Merging

The REAP paper (ICLR 2026) demonstrates a key finding: expert pruning consistently outperforms expert merging for MoE compression on generative tasks. Merging (combining similar experts into one) degrades all participating experts, while pruning (removing entire experts) preserves the full capacity of remaining experts and the router's ability to select among them.

The Full Pipeline

1. Load Model
   |
2. Attach Observer Hooks to every MoE layer
   |
3. Forward Pass over calibration data (1024 samples)
   |-- Record router weights per expert per token
   |-- Record L2 norm of expert outputs per token
   |
4. Compute REAP saliency score for each expert
   |-- score = mean(router_weight * activation_norm)
   |
5. Rank experts by saliency score (lowest = least important)
   |
6. Prune bottom 50% of experts per layer
   |-- Remove expert modules from ModuleList
   |-- Slice router weight matrix to match
   |
7. Update config.json (num_experts: 128 -> 64)
   |
8. Save compressed model
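Steps 5 and 6 of the pipeline can be sketched without any framework. The helper names below are hypothetical, and plain lists stand in for the real objects: in a transformers Qwen3 MoE layer the experts would be an `nn.ModuleList` and the router an `nn.Linear` whose weight has one row per expert, which is an assumption about the implementation rather than something stated above.

```python
def experts_to_keep(scores, compression_ratio):
    """Indices of the experts that survive pruning, in their original order."""
    num_pruned = int(len(scores) * compression_ratio)
    # Rank from least to most salient; ties are broken by expert index.
    ranked = sorted(range(len(scores)), key=lambda i: (scores[i], i))
    return sorted(ranked[num_pruned:])


def prune_layer(experts, router_rows, keep):
    """Drop pruned experts and the matching router rows so indices stay aligned."""
    return [experts[i] for i in keep], [router_rows[i] for i in keep]


# Toy layer with 4 experts and a 50% compression ratio.
layer_scores = [1.6, 0.4, 0.225, 0.0]
keep = experts_to_keep(layer_scores, compression_ratio=0.5)
experts, router = prune_layer(
    ["e0", "e1", "e2", "e3"],
    [[0.1], [0.2], [0.3], [0.4]],
    keep,
)
print(keep, experts, router)
```

Keeping the surviving indices in their original order means the router rows and expert modules stay aligned, so the pruned router still selects among the remaining experts without retraining.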

Detailed Parameters Used

Model Configuration

Parameter	Value	Description
`model_name`	`Qwen/Qwen3-30B-A3B`	Base model: 30B total params, 3B active per token
`num_hidden_layers`	48	Number of transformer layers
`hidden_size`	2048	Hidden dimension
`num_attention_heads`	32	Multi-head attention heads
`num_key_value_heads`	4	GQA key-value heads
`head_dim`	128	Per-head dimension
`intermediate_size`	6144	Dense FFN intermediate size (unused here: Qwen3-30B-A3B has no dense or shared-expert FFN layers)
`moe_intermediate_size`	768	Per-expert FFN intermediate size
`num_experts`	128 -> 64	Experts per MoE layer (before -> after)
`num_experts_per_tok`	8	Top-K experts activated per token (unchanged)
`vocab_size`	151,936	Vocabulary size
`max_position_embeddings`	40,960	Maximum sequence length
`torch_dtype`	bfloat16	Model precision
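A downloaded checkpoint can be checked against this table by reading its config.json. The sketch below uses only the standard library; the helper names are hypothetical, and the path passed to `check_config_file` is whatever local directory the model was downloaded to.

```python
import json
from pathlib import Path

# Values taken from the Model Configuration table for the pruned checkpoint.
EXPECTED = {
    "num_hidden_layers": 48,
    "hidden_size": 2048,
    "moe_intermediate_size": 768,
    "num_experts": 64,
    "num_experts_per_tok": 8,
}


def config_mismatches(config: dict) -> dict:
    """Return {key: (expected, actual)} for every key that differs from EXPECTED."""
    return {
        key: (expected, config.get(key))
        for key, expected in EXPECTED.items()
        if config.get(key) != expected
    }


def check_config_file(path: str) -> dict:
    """Load a config.json from disk and compare it against EXPECTED."""
    return config_mismatches(json.loads(Path(path).read_text(encoding="utf-8")))
```

An empty result means the checkpoint matches; a config from the unpruned base model would report `num_experts` as the only mismatch.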

Pruning Configuration

Parameter	Value	Description
`prune_method`	`reap`	REAP saliency criterion (router_weight * activation_norm)
`compression_ratio`	0.50	Remove 50% of experts (128 -> 64 per layer)
`seed`	42	Random seed for reproducibility
`singleton_super_experts`	`false`	Do not force high-activation outlier experts into singleton clusters
`singleton_outlier_experts`	`false`	Do not force outlier experts into singleton clusters

Observer Configuration (Activation Collection)

Parameter	Value	Description
`samples_per_category`	1024	Number of calibration samples processed
`batch_size`	1	Samples per forward pass
`model_max_length`	2048	Maximum sequence length for calibration
`distance_measure`	`cosine`	Distance metric for expert similarity
`renormalize_router_weights`	`true`	Renormalize the top-k router weights after softmax so they sum to 1
`record_pruning_metrics_only`	`true`	Only collect metrics needed for pruning (skip merging metrics)
`overwrite_observations`	`false`	Do not overwrite existing observation files
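To make `renormalize_router_weights` concrete, here is a minimal sketch of top-k routing for a single token. It reflects the common MoE convention of rescaling the selected weights to sum to 1; how the REAP observer applies the flag internally is not shown in this card, so treat the function as an assumption-laden illustration with a hypothetical name.

```python
import math


def route_token(router_logits, top_k, renormalize=True):
    """Return {expert_id: weight} for the top_k experts of one token."""
    # Numerically stable softmax over all experts.
    max_logit = max(router_logits)
    exps = [math.exp(x - max_logit) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = {i: probs[i] for i in chosen}

    if renormalize:
        # Rescale so the selected weights sum to 1.
        selected_mass = sum(weights.values())
        weights = {i: w / selected_mass for i, w in weights.items()}
    return weights


weights = route_token([2.0, 1.0, 0.0, -1.0], top_k=2)
raw_weights = route_token([2.0, 1.0, 0.0, -1.0], top_k=2, renormalize=False)
print(weights, raw_weights)
```

Without renormalization the selected weights sum to less than 1, because part of the softmax mass sits on experts that were not chosen. That changes the scale of the router weight term in the saliency score.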

Calibration Dataset

Parameter	Value	Description
`dataset_name`	`theblackcat102/evol-codealpaca-v1`	Code instruction-following dataset
`split`	`train`	Dataset split used
`shuffle`	`true`	Shuffle before sampling

Clustering Configuration

Parameter	Value	Description
`cluster_method`	`agglomerative`	Hierarchical agglomerative clustering
`expert_sim`	`ttm`	Token-to-token similarity matrix for expert similarity
`linkage_method`	`average`	Average linkage for hierarchical clustering
`frequency_penalty`	`true`	Penalize frequently-used experts during clustering

Timing

Phase	Duration
Model loading	~5 seconds
Observer pass (1024 samples)	~6.5 hours
Expert pruning (all 48 layers)	< 1 second
Model saving	~26 seconds
Total	~6.5 hours

Evaluation Results (0-shot, lm-eval-harness v0.4.11)

Benchmark	Metric	Score
MMLU (57 subjects)	acc	49.42%
-- Humanities	acc	39.17%
-- Social Sciences	acc	60.38%
-- STEM	acc	56.68%
-- Other	acc	46.73%
ARC Challenge	acc	33.62%
ARC Challenge	acc_norm	38.65%
ARC Easy	acc	53.16%
ARC Easy	acc_norm	50.51%
HellaSwag	acc	37.70%
HellaSwag	acc_norm	47.64%
BoolQ	acc	74.22%
WinoGrande	acc	58.80%
OpenBookQA	acc	19.80%
OpenBookQA	acc_norm	31.20%
RTE	acc	58.48%

Evaluation Notes

All benchmarks run at 0-shot (no few-shot examples)
Evaluation performed on the base model (not instruction-tuned)
Evaluated using lm-eval-harness v0.4.11 with model="hf" backend
Model loaded with device_map="auto" across 2 GPUs
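A run along these lines could be set up through the lm-eval-harness Python API. This is a sketch rather than the exact command used for the table: the task list is inferred from the benchmark names, `batch_size=1` and `parallelize=True` are assumptions, and the helper names are hypothetical.

```python
# Task names as registered in lm-evaluation-harness (inferred from the results table).
TASKS = ["mmlu", "arc_challenge", "arc_easy", "hellaswag",
         "boolq", "winogrande", "openbookqa", "rte"]


def build_eval_kwargs(model_id: str = "harryadav3/Qwen3-30B-A3B-REAP-50") -> dict:
    """Keyword arguments for lm_eval.simple_evaluate matching the notes above."""
    return {
        "model": "hf",
        # parallelize=True asks the hf backend to shard the model across GPUs.
        "model_args": f"pretrained={model_id},dtype=bfloat16,parallelize=True",
        "tasks": TASKS,
        "num_fewshot": 0,
        "batch_size": 1,  # assumption: the card does not state the eval batch size
    }


def run_eval() -> dict:
    """Run the evaluation; needs lm-eval installed and enough GPU memory."""
    import lm_eval  # imported lazily so build_eval_kwargs works without it

    results = lm_eval.simple_evaluate(**build_eval_kwargs())
    return results["results"]
```

Calling `run_eval()` downloads the model and runs every task, so it is best started on the same kind of multi-GPU machine described in the notes.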

Usage

Direct Loading with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "harryadav3/Qwen3-30B-A3B-REAP-50"

# The checkpoint uses the stock Qwen3MoeForCausalLM class, so trust_remote_code is not needed.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spread layers across the available GPUs
    torch_dtype="auto",  # use the dtype stored in the checkpoint (bfloat16)
)

messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
# return_dict=True also returns the attention mask alongside the input IDs.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))

Serving with vLLM

vllm serve harryadav3/Qwen3-30B-A3B-REAP-50 \
    --tensor-parallel-size 2 \
    --port 8000
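Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library. It assumes the server runs on localhost with the port from the command above, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server runs locally on --port 8000
MODEL_ID = "harryadav3/Qwen3-30B-A3B-REAP-50"


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `ask("Write a Python function to check if a number is prime.")` returns the model's reply as a string once the server has finished loading.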

Reproducing This Model

# Clone REAP
git clone https://github.com/CerebrasResearch/reap.git
cd reap
git submodule init && git submodule update --recursive

# Install
uv venv .venv --seed --python 3.12
source .venv/bin/activate
uv pip install --editable . --native-tls --torch-backend auto

# Download base model
huggingface-cli download Qwen/Qwen3-30B-A3B

# Run REAP pruning
bash experiments/pruning-cli.sh \
    0,1 \
    "Qwen/Qwen3-30B-A3B" \
    "reap" \
    42 \
    0.50 \
    "theblackcat102/evol-codealpaca-v1" \
    false false false false false false false

Citation

If you use this model, please cite the REAP paper:

@inproceedings{lasby2026reap,
    title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
    author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://arxiv.org/abs/2510.13999}
}
