---
tags:
- merge
- mergekit
- qwen2.5
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Xiaojian9992024/Qwen2.5-Dyanka-7B-Preview
- Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
- suayptalha/Clarus-7B-v0.3
- gz987/qwen2.5-7b-cabs-v0.3
---

# 7B Linear Merge (Qwen2.5)

A linear merge of four Qwen2.5-7B fine-tunes, with mixing weights chosen by random search over the simplex (30 Dirichlet samples) and selected against a small held-out eval set.

This was a learning project to build an end-to-end merge-and-evaluation pipeline. The numbers below are honest results: the merge is competent but not state-of-the-art among Qwen2.5-7B fine-tunes.

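For quick testing, here is a minimal inference sketch using `transformers`. It is illustrative only: it assumes the merged model inherits the Qwen2.5 chat template, and the repo id is taken from the eval command further down.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jagan666/7B-merge-champion"  # repo id as used in the eval command below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assumes the Qwen2.5 chat template survived the merge (the tokenizer
# comes from the source models, so this is usually the case).
messages = [{"role": "user", "content": "Explain linear model merging in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
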
## Source models

- Xiaojian9992024/Qwen2.5-Dyanka-7B-Preview
- Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
- suayptalha/Clarus-7B-v0.3
- gz987/qwen2.5-7b-cabs-v0.3

## Method

Linear merge via [mergekit](https://github.com/arcee-ai/mergekit). Mixing weights were selected by sampling 30 weight vectors from a Dirichlet prior, evaluating each merged candidate on a 20-example proxy eval (mixed MMLU + IFEval-style instruction following), and keeping the best-scoring weights. The proxy eval was small, and the search was random sampling rather than a true evolutionary algorithm; both are limitations worth noting for anyone building on this.

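A minimal sketch of that search loop, assuming a flat Dirichlet prior (alpha = 1) and a hypothetical `run_proxy_eval` scorer; the actual scoring code and the winning weights are not reproduced here:

```python
import subprocess
import numpy as np
import yaml

MODELS = [
    "Xiaojian9992024/Qwen2.5-Dyanka-7B-Preview",
    "Xiaojian9992024/Qwen2.5-THREADRIPPER-Small",
    "suayptalha/Clarus-7B-v0.3",
    "gz987/qwen2.5-7b-cabs-v0.3",
]

def write_config(weights, path="config.yml"):
    """Write a mergekit linear-merge config for one weight vector."""
    config = {
        "merge_method": "linear",
        "dtype": "bfloat16",
        "models": [
            {"model": m, "parameters": {"weight": round(float(w), 4)}}
            for m, w in zip(MODELS, weights)
        ],
    }
    with open(path, "w") as f:
        yaml.safe_dump(config, f)
    return path

rng = np.random.default_rng(seed=0)
best_score, best_weights = float("-inf"), None

for i in range(30):
    # A flat Dirichlet draw is a uniform sample from the simplex:
    # non-negative weights that sum to 1, i.e. a convex combination.
    w = rng.dirichlet(np.ones(len(MODELS)))
    cfg = write_config(w)
    out_dir = f"candidates/merge-{i:02d}"
    subprocess.run(["mergekit-yaml", cfg, out_dir], check=True)
    score = run_proxy_eval(out_dir)  # hypothetical 20-example MMLU/IFEval scorer
    if score > best_score:
        best_score, best_weights = score, w

print("best weights:", best_weights, "proxy score:", best_score)
```

Each candidate stays a convex combination of the four checkpoints, which is what `merge_method: linear` computes; the downside, as noted above, is that 30 samples scored on 20 examples can easily overfit the proxy.
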
## Evaluation

Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) on the Open LLM Leaderboard v2 task suite (single H100, vLLM backend, bf16).

| Benchmark         | Metric                  | Score |
|-------------------|-------------------------|------:|
| IFEval            | prompt_level_strict_acc | 38.63 |
| IFEval            | inst_level_strict_acc   | 52.76 |
| BBH               | acc_norm                | 55.55 |
| MATH-Lvl-5 (hard) | exact_match             | 36.93 |
| GPQA              | acc_norm                | 32.30 |
| MuSR              | acc_norm                | 44.58 |
| MMLU-Pro          | acc                     | 44.92 |

### Observations

- **Strong:** MATH-Hard (36.9, with algebra-hard at 63.5%), likely inherited from Clarus and qwen2.5-7b-cabs.
- **Weak:** IFEval at 38.6 prompt-level strict is below what individual strong Qwen2.5-7B fine-tunes achieve. Linear merging appears to dilute instruction-following behavior when the source models disagree on response formatting.
- **Average:** BBH, MMLU-Pro, GPQA, and MuSR all land in the typical mid-range for 7B models.

### Reproduce

```bash
lm_eval \
  --model vllm \
  --model_args pretrained=Jagan666/7B-merge-champion,dtype=bfloat16,gpu_memory_utilization=0.9,max_model_len=4096 \
  --tasks leaderboard \
  --batch_size auto \
  --output_path ./eval_results \
  --log_samples
```

## Limitations

- Linear merge: simple, but can dilute task-specific behaviors (especially instruction following).
- The search was random Dirichlet sampling against a small proxy eval, so the selected weights likely overfit to that proxy.
- No safety or alignment evaluation was performed beyond the leaderboard tasks.

## License

Apache 2.0, inherited from the source models.