Inference optimization where speed/memory matters more than factual accuracy
Edge deployment scenarios with limited computational resources
⚠️ Important Limitations
This pruned model has significantly reduced factual knowledge capabilities. It performs at near-random levels on knowledge-intensive benchmarks like MMLU.
| Use Case | Status |
| --- | --- |
| Physical reasoning tasks | ✅ Good (82.6% retained) |
| Reading comprehension | ⚠️ Acceptable (74.3% retained) |
| Common sense reasoning | ⚠️ Degraded (61.8% retained) |
| Factual question answering | ❌ Not recommended |
| Knowledge-intensive tasks | ❌ Not recommended |
Recommendation: Fine-tune this model on your target domain before deployment.
Physical reasoning preserved: PIQA retained 82.6% of original performance
Factual knowledge destroyed: MMLU collapsed to random chance (25% for 4-way multiple-choice questions)
Perplexity underestimates damage: the 2.73× PPL ratio does not predict the benchmark collapse
Layer-specific knowledge: Factual knowledge appears to be encoded in the specific layers that were removed
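
The "retained" percentages and the random-chance baseline quoted above are simple ratios. The sketch below shows one way they could be computed. The function names are hypothetical, and the example scores are placeholders chosen only so the arithmetic is easy to follow; they are not measured results for this model.

```python
def retention(pruned_score: float, original_score: float) -> float:
    """Percentage of the original benchmark score kept after pruning."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return 100.0 * pruned_score / original_score


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy (in %) of uniform random guessing on a multiple-choice task."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 100.0 / num_choices


def is_near_chance(score: float, num_choices: int, margin: float = 2.0) -> bool:
    """True if `score` (in %) lies within `margin` points of random guessing."""
    return abs(score - chance_accuracy(num_choices)) <= margin


# Placeholder scores (NOT measured values), in percent accuracy.
example_original, example_pruned = 80.0, 60.0
print(f"retained: {retention(example_pruned, example_original):.1f}%")
print(f"4-way chance level: {chance_accuracy(4):.1f}%")
print(f"near chance: {is_near_chance(25.5, num_choices=4)}")
```

A retention ratio alone can hide a collapse: a score that sits at the chance level still "retains" a nonzero fraction of the original, which is why the check against the chance baseline is kept separate.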
Usage
Basic Inference

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

    # Load the tokenizer and the pruned model
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )

    # Text generation
    prompt = "The process of photosynthesis"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 1: LoRA Fine-Tuning

    from peft import LoraConfig, get_peft_model

    # `model` is the pruned model loaded as in Basic Inference
    lora_config = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
    )
    model = get_peft_model(model, lora_config)

    # Fine-tune on OpenOrca, Alpaca, or domain-specific data

Option 2: Knowledge Distillation
Use the original Qwen3-8B-Base as a teacher to transfer knowledge back into the pruned model.
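
The model card does not specify a distillation objective. As an illustration only, the sketch below uses one common formulation: the KL divergence between temperature-softened teacher and student next-token distributions, scaled by the square of the temperature. The function names and the default temperature are assumptions, and the code is plain Python over a single token position to make the arithmetic explicit; a real training loop would compute the same quantity on batched tensors.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,  # assumed value, not taken from the model card
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary size")
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature**2


# Toy logits over a 4-token vocabulary (illustrative values only).
teacher = [2.0, 0.5, -1.0, 0.0]
student = [0.1, 0.2, 0.0, -0.1]
print(distillation_loss(student, teacher))
print(distillation_loss(teacher, teacher))  # identical distributions give zero loss
```

In practice this term is often mixed with the ordinary next-token cross-entropy loss; the mixing weight is another hyperparameter the card leaves open.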
Expected Recovery
With fine-tuning: +15-25% on MMLU
With knowledge distillation: +25-35% on MMLU
Technical Specifications
| Attribute | Value |
| --- | --- |
| Architecture | Transformer decoder-only |
| Parameters | ~5.8B |
| Layers | 26 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |
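
For memory-constrained deployment, the specifications above allow a rough estimate of the key/value cache size during generation. The sketch below is a back-of-the-envelope calculation under stated assumptions: the head dimension is taken to be hidden size divided by the number of query heads (not listed in the table), the cache is stored in bfloat16 without quantization, and every layer caches keys and values for the full context. The function name is hypothetical.

```python
# Values copied from the Technical Specifications table.
NUM_LAYERS = 26
NUM_KV_HEADS = 8
NUM_Q_HEADS = 32
HIDDEN_SIZE = 4096
BYTES_PER_VALUE = 2  # bfloat16

# Assumption: head_dim = hidden_size / query heads (not stated in the table).
HEAD_DIM = HIDDEN_SIZE // NUM_Q_HEADS


def kv_cache_bytes(num_tokens: int, batch_size: int = 1) -> int:
    """Estimated KV-cache size in bytes for `num_tokens` cached positions."""
    # Factor 2 accounts for storing both keys and values in every layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens * batch_size


print(kv_cache_bytes(1))                 # 106496 bytes per token (104 KiB)
print(kv_cache_bytes(32_768) / 2**30)    # 3.25 GiB at the maximum context length
```

This covers the cache only; model weights and activations add to the total footprint.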
Citation
If you use this model, please cite the original LaCo paper and Qwen3:

    @article{yang2024laco,
      title={LaCo: Large Language Model Pruning via Layer Collapse},
      author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
      journal={arXiv preprint arXiv:2402.11187},
      year={2024}
    }

    @misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}
    }