ModelHub XC b1d3ea69a0 Initialize project; model provided by the ModelHub XC community
Model: Mercity/Qwen3-8B-LaCo-Pruned
Source: Original Platform
2026-05-12 06:01:10 +08:00

---
license: apache-2.0
base_model: Qwen/Qwen3-8B-Base
arxiv: arxiv:2507.02279
tags:
- pruning
- layer-pruning
- laco
- compressed
- qwen3
- llm
- efficient
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
- multilingual
datasets:
- wikipedia
model-index:
- name: Qwen3-8B-LaCo-Pruned
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy_norm
      value: 48.52
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PIQA
      type: piqa
    metrics:
    - type: accuracy_norm
      value: 65.67
      name: Accuracy (Normalized)
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BoolQ
      type: boolq
    metrics:
    - type: accuracy
      value: 61.77
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 25.12
      name: Accuracy (5-shot)
---
# Qwen3-8B-LaCo-Pruned
This model is a **layer-pruned** version of [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) using the [LaCo (Layer Collapse)](https://arxiv.org/abs/2402.11187) structured pruning method.
## Model Summary
| Attribute | Value |
|-----------|-------|
| **Base Model** | [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) |
| **Pruning Method** | LaCo (Layer Collapse) |
| **Original Layers** | 36 |
| **Pruned Layers** | 26 |
| **Layers Removed** | 10 |
| **Compression** | 27.8% |
| **Parameters** | ~5.8B (reduced from ~8B) |
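
The compression figure in the table follows directly from the layer counts. A minimal sketch of that arithmetic (the function name `compression_ratio` is illustrative, not part of any library):

```python
def compression_ratio(original_layers: int, pruned_layers: int) -> float:
    """Fraction of decoder layers removed by pruning."""
    return (original_layers - pruned_layers) / original_layers


# Layer counts taken from the Model Summary table
ratio = compression_ratio(original_layers=36, pruned_layers=26)
print(f"{ratio:.1%}")  # prints 27.8%
```

Note that this is compression measured in layers. The parameter reduction is smaller in relative terms because the embedding and output layers are not pruned.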
## Intended Use
- **Research** on model compression and efficiency
- **Fine-tuning base** for domain-specific applications
- **Inference optimization** where speed/memory matters more than factual accuracy
- **Edge deployment** scenarios with limited computational resources
## ⚠️ Important Limitations
This pruned model has **significantly reduced factual knowledge capabilities**. It performs at near-random levels on knowledge-intensive benchmarks like MMLU.
| Use Case | Status |
|----------|--------|
| Physical reasoning tasks | ✅ Good (82.6% retained) |
| Reading comprehension | ⚠️ Acceptable (74.3% retained) |
| Common sense reasoning | ⚠️ Degraded (61.8% retained) |
| Factual question answering | ❌ Not recommended |
| Knowledge-intensive tasks | ❌ Not recommended |
**Recommendation:** Fine-tune this model on your target domain before deployment.

---
## Pruning Details
### LaCo Hyperparameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| MERGE_LAYERS (C) | 3 | Layers merged per operation |
| LOWEST_LAY (L) | 4 | Minimum layer index for merging |
| HIGHEST_LAY (H) | 28 | Maximum layer index for merging |
| INTERVAL (I) | 2 | Minimum gap between merge points |
| THRESHOLD (T) | 0.85 | Cosine similarity threshold |
| MAX_COMPRESSION | 30% | Maximum allowed compression |
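
To make the hyperparameters concrete, here is a toy sketch of a single LaCo-style merge step, following the parameter-difference merge described in the LaCo paper: the base layer absorbs the differences between itself and the `m` layers that follow it, and those layers are then dropped. The merge is accepted only if the hidden states of the merged model stay similar enough to the original ones. Everything here is an illustrative assumption rather than the code used to produce this model: layers are plain dicts of float lists, and `merge_layers`, `cosine_similarity` and `accept_merge` are hypothetical helper names. The reported statistics (5 successful merges, 10 layers removed) would be consistent with each merge collapsing 3 layers into 1, so the toy example uses `m = 2`; that mapping is an inference, not something the card states.

```python
import math


def merge_layers(layers, base, m):
    """Collapse layers base+1 .. base+m into layer `base`.

    Each layer is a dict mapping a parameter name to a flat list of floats.
    Merged parameters: theta_base + sum_k (theta_{base+k} - theta_base).
    """
    merged = {}
    for name, theta_base in layers[base].items():
        new = list(theta_base)
        for k in range(1, m + 1):
            theta_k = layers[base + k][name]
            for i in range(len(new)):
                new[i] += theta_k[i] - theta_base[i]
        merged[name] = new
    # Keep everything before the base layer, the merged layer, and the rest
    return layers[:base] + [merged] + layers[base + m + 1:]


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def accept_merge(hidden_original, hidden_candidate, threshold=0.85):
    """Accept a merge only if the hidden states remain similar enough."""
    return cosine_similarity(hidden_original, hidden_candidate) > threshold


# Toy model with four "layers", each holding a single two-element parameter
toy = [{"w": [1.0, 2.0]}, {"w": [1.5, 2.5]}, {"w": [0.5, 1.0]}, {"w": [3.0, 3.0]}]
collapsed = merge_layers(toy, base=0, m=2)
print(len(collapsed), collapsed[0]["w"])  # prints 2 [1.0, 1.5]
```

In the full procedure the search moves over layer indices between `LOWEST_LAY` and `HIGHEST_LAY`, respecting `INTERVAL` between merge points, and stops once `MAX_COMPRESSION` would be exceeded.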
### Pruning Statistics
| Metric | Value |
|--------|-------|
| Successful Merges | 5 |
| Rejected Merges | 0 |
| Total Iterations | 6 |
| Final Compression | 27.8% |
### Hidden State Similarity (Calibration Set)
| Metric | Value |
|--------|-------|
| Average | 0.9680 |
| Min | 0.9492 |
| Max | 0.9766 |
Individual similarities: `[0.9492, 0.9727, 0.9609, 0.9766, 0.9688, 0.9648, 0.9648, 0.9766, 0.9727, 0.9727]`
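
The summary statistics can be reproduced from the individual values listed above. A short check using only the standard library:

```python
from statistics import mean

# Individual hidden-state similarities as listed in this card
similarities = [0.9492, 0.9727, 0.9609, 0.9766, 0.9688,
                0.9648, 0.9648, 0.9766, 0.9727, 0.9727]

print(f"average: {mean(similarities):.4f}")  # prints average: 0.9680
print(f"min:     {min(similarities):.4f}")   # prints min:     0.9492
print(f"max:     {max(similarities):.4f}")   # prints max:     0.9766

# Every value clears the 0.85 threshold used during pruning
print(all(s > 0.85 for s in similarities))   # prints True
```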
### Perplexity Results
| Model | Perplexity | Ratio |
|-------|------------|-------|
| Original (Qwen3-8B-Base) | 26.19 | 1.00× |
| Pruned (this model) | 71.48 | **2.73×** |
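
For reference, perplexity is the exponential of the mean negative log-likelihood per token, and the ratio column divides the pruned model's perplexity by the original's. The sketch below shows both computations; the card does not state the evaluation text or tokenization settings, so the `perplexity` helper and its toy input are assumptions for illustration only.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Toy input: a model that assigns probability 0.5 to every token has perplexity 2
print(round(perplexity([math.log(0.5)] * 4), 6))  # prints 2.0

# Ratio reported in the table above
ppl_ratio = 71.48 / 26.19
print(f"{ppl_ratio:.2f}x")  # prints 2.73x
```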
---
## Benchmark Results
### Comparison with Original Qwen3-8B-Base
| Benchmark | Original | Pruned | Retention | Status |
|-----------|----------|--------|-----------|--------|
| **PIQA** | 79.54% | 65.67% | 82.6% | ✅ Good |
| **BoolQ** | 83.09% | 61.77% | 74.3% | ⚠️ Acceptable |
| **HellaSwag** | 78.55% | 48.52% | 61.8% | ⚠️ Degraded |
| **MMLU (5-shot)** | 76.89% | 25.12% | 32.7% | ❌ Near random |
*Original scores are from the [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388).*
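
The Retention column is the pruned score divided by the original score. A minimal sketch that recomputes it from the table values (the `retention` helper is an illustrative name):

```python
# (original, pruned) accuracy in percent, taken from the comparison table
scores = {
    "PIQA": (79.54, 65.67),
    "BoolQ": (83.09, 61.77),
    "HellaSwag": (78.55, 48.52),
    "MMLU (5-shot)": (76.89, 25.12),
}


def retention(original: float, pruned: float) -> float:
    """Share of the original score kept after pruning, in percent."""
    return pruned / original * 100


for name, (original, pruned) in scores.items():
    print(f"{name}: {retention(original, pruned):.1f}%")
```

Retention does not account for the chance baseline. For a 4-way multiple-choice benchmark such as MMLU, random guessing already scores 25%, so 32.7% retention corresponds to essentially no remaining signal.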
### Key Findings
1. **Physical reasoning preserved:** PIQA retained 82.6% of original performance
2. **Factual knowledge destroyed:** MMLU collapsed to random chance (25% for 4-way multiple choice)
3. **Perplexity underestimates damage:** a 2.73× perplexity ratio alone does not predict the near-chance MMLU score
4. **Layer-specific knowledge:** the MMLU collapse suggests that factual knowledge is concentrated in the layers that were merged away
---
## Usage
### Basic Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

# Load the tokenizer and the pruned model; dtype and device placement are picked automatically
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Text generation
prompt = "The process of photosynthesis"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With 4-bit Quantization (Further Compression)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights with float16 compute (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "Mercity/Qwen3-8B-LaCo-Pruned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
```
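
As a rough guide to what quantization buys on top of pruning, the weight footprint can be estimated from the parameter count and the bits stored per parameter. This is a back-of-the-envelope sketch using the ~5.8B figure from this card; it ignores quantization constants, any modules left unquantized, activations, and the KV cache, so real memory use will be higher. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 5.8e9  # approximate parameter count reported in this card

print(f"bfloat16: {weight_memory_gb(params, 16):.1f} GB")  # prints bfloat16: 11.6 GB
print(f"4-bit:    {weight_memory_gb(params, 4):.1f} GB")   # prints 4-bit:    2.9 GB
```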
---
## Recovery Recommendations
To restore performance after pruning:
### Option 1: LoRA Fine-tuning (Recommended)
```python
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters to all attention and MLP projections
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# `model` is the pruned model loaded as in the Basic Inference example
model = get_peft_model(model, lora_config)
# Fine-tune on OpenOrca, Alpaca, or domain-specific data
```
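
To get a feel for how small this adapter is, the number of trainable LoRA parameters can be estimated from the shapes in the Technical Specifications table: each adapted linear layer adds `r * (in_features + out_features)` weights. The sketch assumes a head dimension of 128 (hidden size 4096 divided by 32 query heads), which gives a key/value projection width of 8 × 128 = 1024; that value is an assumption, not stated in this card.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable weights added by one LoRA adapter (A: r x in, B: out x r)."""
    return r * (in_features + out_features)


HIDDEN = 4096
KV_DIM = 8 * 128        # 8 KV heads x assumed head dimension of 128
INTERMEDIATE = 12288
NUM_LAYERS = 26
RANK = 32

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
# prints 2,424,832 per layer, 63,045,632 total
```

Under these assumptions the adapter trains roughly 63M parameters, about 1% of the pruned model's size.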
### Option 2: Knowledge Distillation
Use the original Qwen3-8B-Base as a teacher model and train the pruned model to match its outputs, transferring knowledge back.
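
One common way to set this up is soft-label distillation: the student is trained to match the teacher's temperature-softened next-token distribution with a KL-divergence loss. The sketch below shows that loss for a single token position in plain Python. It is a generic illustration, not a recipe validated on this model; the function names and the temperature of 2.0 are assumptions. Because layer pruning leaves the tokenizer unchanged, teacher and student logits can be compared directly.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
print(distillation_loss(teacher, teacher))               # identical logits give 0.0
print(distillation_loss([1.0, 1.0, 1.0], teacher) > 0)   # prints True
```

In practice this term is usually averaged over all token positions and often combined with the ordinary next-token cross-entropy loss.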
### Expected Recovery
- With fine-tuning: +15-25% on MMLU
- With knowledge distillation: +25-35% on MMLU
---
## Technical Specifications
| Attribute | Value |
|-----------|-------|
| Architecture | Transformer decoder-only |
| Parameters | ~5.8B |
| Layers | 26 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |
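
Removing layers also shrinks the KV cache, since every decoder layer stores its own keys and values. The estimate below uses the figures from the table and assumes a head dimension of 128 (hidden size 4096 divided by 32 query heads) and 2 bytes per value for bfloat16; the helper name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


HEAD_DIM = 4096 // 32   # assumed: hidden size / query heads
CONTEXT = 32768         # maximum context length from the table

pruned = kv_cache_bytes(26, 8, HEAD_DIM, CONTEXT)
original = kv_cache_bytes(36, 8, HEAD_DIM, CONTEXT)

print(pruned / 2**30, original / 2**30)  # prints 3.25 4.5 (GiB at full context)
```

The saving is proportional to the number of layers removed, so the cache at full context is about 27.8% smaller than for the 36-layer base model.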
---
## Citation
If you use this model, please cite the original LaCo paper and Qwen3:
```bibtex
@article{yang2024laco,
title={LaCo: Large Language Model Pruning via Layer Collapse},
author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
journal={arXiv preprint arXiv:2402.11187},
year={2024}
}
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388}
}
```
## References
- [LaCo Paper](https://arxiv.org/abs/2402.11187)
- [LaCo Official Implementation](https://github.com/yangyifei729/LaCo)
- [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)
- [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)
## License
Apache 2.0 (same as base Qwen3 model)
## Acknowledgments
- Qwen Team for the excellent Qwen3-8B-Base model
- LaCo authors for the pruning methodology
- Hugging Face for model hosting