Initialize project; model provided by the ModelHub XC community
Model: Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1 Source: Original Platform
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3
- microsoft/Phi-3-mini-4k-instruct
- microsoft/phi-2
- HuggingFaceTB/SmolLM2-1.7B-Instruct
- ibm-granite/granite-3.0-2b-instruct
- EleutherAI/pythia-2.8b
- EleutherAI/pythia-1.4b
- facebook/opt-2.7b
base_model_relation: merge
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- merge
- model-merge
- cross-architecture
- cross-family
- cross-family-merge
- weight-merge
- training-free
- training-free-merge
- procrustes
- canonical-key-namespace
- svd-filter
- crdt-merge
- qwen
- qwen2
- qwen2.5
- instruction-tuned
- reasoning
- mathematical-reasoning
- instruction-following
- 7b
- text-generation
- llama-factory-compatible
- vllm-compatible
- llama-cpp-compatible
model-index:
- name: Qwen2.5-7B-Instruct-borg-merge-v1
  results:
  - task:
      type: text-generation
      name: Grade School Math
    dataset:
      name: GSM8K
      type: gsm8k
      split: test
    metrics:
    - type: exact_match
      value: 0.8446
      name: exact_match (strict-match)
      verified: false
  - task:
      type: text-generation
      name: AI2 Reasoning Challenge
    dataset:
      name: ARC-Challenge
      type: ai2_arc
      split: test
    metrics:
    - type: acc_norm
      value: 0.5572
      name: acc_norm
      verified: false
  - task:
      type: text-generation
      name: Instruction Following
    dataset:
      name: IFEval
      type: ifeval
      split: test
    metrics:
    - type: inst_level_strict_acc
      value: 0.6811
      name: instruction-level strict accuracy
      verified: false
  - task:
      type: text-generation
      name: Massive Multitask Language Understanding
    dataset:
      name: MMLU
      type: cais/mmlu
      split: test
    metrics:
    - type: acc
      value: 0.7094
      name: acc
      verified: false
  - task:
      type: text-generation
      name: TruthfulQA
    dataset:
      name: TruthfulQA mc2
      type: truthful_qa
      split: validation
    metrics:
    - type: acc
      value: 0.6285
      name: mc2
      verified: false
  - task:
      type: text-generation
      name: Commonsense Reasoning
    dataset:
      name: HellaSwag
      type: hellaswag
      split: validation
    metrics:
    - type: acc
      value: 0.6830
      name: acc
      verified: false
  - task:
      type: text-generation
      name: Physical Commonsense
    dataset:
      name: PIQA
      type: ybisk/piqa
      split: validation
    metrics:
    - type: acc
      value: 0.8014
      name: acc
      verified: false
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: HumanEval
      type: openai_humaneval
      split: test
    metrics:
    - type: pass@1
      value: 0.5854
      name: pass@1 (greedy)
      verified: false
---

# Qwen2.5-7B-Instruct-borg-merge-v1

**A training-free cross-family weight merge of Qwen2.5-7B-Instruct with 8 donors from 4 architecture families. Lifts GSM8K +3.3 pp, ARC-Challenge +3.2 pp, and IFEval +2.6 pp absolute over the unmerged anchor. No fine-tuning. No distillation. No router. Drop-in `safetensors`.**

| Task | Anchor (unmerged) | This model | Δ |
|---|---:|---:|---:|
| **GSM8K** (`exact_match,strict-match`) | 0.8120 | **0.8446** | **+0.0326** |
| **ARC-Challenge** (`acc_norm,none`) | 0.5256 | **0.5572** | **+0.0316** |
| **IFEval** (`inst_level_strict_acc,none`) | 0.6547 | **0.6811** | **+0.0264** |
| MMLU (`acc,none`) | 0.7180 | 0.7094 | -0.0086 |
| TruthfulQA mc2 (`acc,none`) | 0.6475 | 0.6285 | -0.0190 |
| HellaSwag (`acc,none`) | 0.6895 | 0.6830 | -0.0065 |
| PIQA (`acc,none`) | 0.8030 | 0.8014 | -0.0016 |
| HumanEval (`pass@1,greedy`) | 0.6463 | 0.5854 | -0.0610 |

The merge improves on **3 of 8 standard benchmarks** relative to the unmerged anchor -- specifically the tasks where the donor pool's competence is concentrated (instruction following and broad reasoning). It regresses on the other five, most of all on HumanEval (-6.10 pp), where the donor pool was code-light by design. The regression structure is itself a falsifiable prediction about the recipe.
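
The Δ column can be recomputed from the two score columns. The sketch below hard-codes the rounded scores from the table above; the `scores` dict and the `delta_pp` helper are illustrative names, not part of any released tooling. Because the inputs are rounded to four decimals, a recomputed delta can differ from the table in the last digit: HumanEval comes out at -6.09 pp rather than -6.10 pp, presumably because the table's delta was taken from unrounded scores.

```python
# Rounded scores copied from the results table: task -> (anchor, merged).
scores = {
    "GSM8K": (0.8120, 0.8446),
    "ARC-Challenge": (0.5256, 0.5572),
    "IFEval": (0.6547, 0.6811),
    "MMLU": (0.7180, 0.7094),
    "TruthfulQA mc2": (0.6475, 0.6285),
    "HellaSwag": (0.6895, 0.6830),
    "PIQA": (0.8030, 0.8014),
    "HumanEval": (0.6463, 0.5854),
}


def delta_pp(anchor: float, merged: float) -> float:
    """Absolute change in percentage points (merged minus anchor)."""
    return (merged - anchor) * 100.0


deltas = {task: delta_pp(a, m) for task, (a, m) in scores.items()}
lifted = [task for task, d in deltas.items() if d > 0]

for task, d in deltas.items():
    print(f"{task:<15} {d:+.2f} pp")
print(f"{len(lifted)} of {len(scores)} tasks improve: {', '.join(lifted)}")
```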

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1")

prompt = "Q: What is 17 multiplied by 23? Show your work.\nA:"
ids = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Compatible with `vLLM`, `llama.cpp` (after GGUF conversion), `text-generation-inference`, `text-generation-webui`, and any standard HuggingFace inference stack.

## What's special about this merge

Cross-family weight merging across architecture families (Llama, Phi, NeoX, OPT) is conventionally considered impossible -- different attention head dimensions, different FFN expansion factors, different vocabularies. A naive linear interpolation between, say, a Qwen attention block and a Mistral attention block does not even type-check.

This model is the result of a training-free pipeline that solves this:

1. **Canonicalize** each donor's tensors into a shared key namespace via per-architecture detectors (10 architecture families covered: BERT, RoBERTa, Llama/Qwen, Mistral, Pythia, OPT, Phi, T5, w2v-bert, and more).
2. **Procrustes-align** each donor's basis to the anchor via per-tensor orthogonal rotation (smaller-side SVD).
3. **Compute donor deltas** in canonical space; filter via per-role tolerance (asymmetric: `τ_attn=0.05`, `τ_ffn=0.20`); keep top-3 SVD components.
4. **Absorb** the rotated, filtered, low-rank delta into the anchor with anchor blend `β=0.60`.
5. **Decanonicalize** to the anchor's native key namespace; save as standard `safetensors`.

This is the **asymmetric tolerance recipe**: tight on attention to preserve circuits, loose on FFN to absorb knowledge.
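
Steps 2 to 4 can be illustrated on a single pair of same-shaped matrices. The NumPy sketch below is not the pipeline's actual code and rests on several assumptions that this card does not spell out: canonicalization has already produced tensors of identical shape; the rotation is the classic orthogonal Procrustes solution; the tolerance `tau` is read as a cap on the delta's Frobenius norm relative to the anchor; and the anchor blend `beta` is read as `merged = anchor + (1 - beta) * delta`. All function names are hypothetical.

```python
import numpy as np


def procrustes_rotation(donor: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||donor @ R - anchor||_F (orthogonal Procrustes)."""
    # SVD of the (cols x cols) cross-covariance; R = U @ V^T is the minimizer.
    u, _, vt = np.linalg.svd(donor.T @ anchor, full_matrices=False)
    return u @ vt


def low_rank(delta: np.ndarray, rank: int = 3) -> np.ndarray:
    """Keep only the top-`rank` singular components of `delta`."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def absorb(anchor, donor, tau, beta=0.60, rank=3):
    """Align donor to anchor, filter the delta, and blend it into the anchor."""
    rotation = procrustes_rotation(donor, anchor)
    delta = low_rank(donor @ rotation - anchor, rank)
    # ASSUMPTION: tolerance caps the delta's norm relative to the anchor's norm.
    rel = np.linalg.norm(delta) / np.linalg.norm(anchor)
    if rel > tau:
        delta = delta * (tau / rel)
    # ASSUMPTION: beta is the weight kept on the anchor; the delta gets the rest.
    return anchor + (1.0 - beta) * delta


rng = np.random.default_rng(0)
anchor = rng.standard_normal((16, 8))
basis, _ = np.linalg.qr(rng.standard_normal((8, 8)))
rotated_copy = anchor @ basis.T            # same weights in a rotated basis
unrelated = rng.standard_normal((16, 8))   # a genuinely different donor

# After alignment the rotated copy contributes no delta, so the anchor is unchanged.
print(np.allclose(absorb(anchor, rotated_copy, tau=0.05), anchor))
# A different donor moves the anchor by at most (1 - beta) * tau in relative norm.
moved = absorb(anchor, unrelated, tau=0.05)
print(np.linalg.norm(moved - anchor) / np.linalg.norm(anchor))
```

Under this reading, a tight `tau` (as for attention) bounds how far any donor can move a tensor, while a looser `tau` (as for FFN) lets more of the low-rank delta through.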

## Donor pool (8 donors, 4 architecture families)

| Source | Family | License |
|---|---|---|
| Qwen/Qwen2.5-7B-Instruct (anchor) | Qwen / Llama-arch | Apache 2.0 |
| mistralai/Mistral-7B-Instruct-v0.3 | Mistral / Llama-arch | Apache 2.0 |
| microsoft/Phi-3-mini-4k-instruct | Phi (new) | MIT |
| microsoft/phi-2 | Phi (old) | MIT |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | Llama-arch (small) | Apache 2.0 |
| ibm-granite/granite-3.0-2b-instruct | Llama-arch (Granite tweaks) | Apache 2.0 |
| EleutherAI/pythia-2.8b | NeoX | Apache 2.0 |
| EleutherAI/pythia-1.4b | NeoX | Apache 2.0 |
| facebook/opt-2.7b | OPT | OPT license |

## Verification

- **Cross-run reproducibility**: an independent preflight evaluation two days prior to the headline run produces byte-identical scores to all 16 decimal places across every overlapping (variant, task) cell. The merge is fully deterministic.
- **Pre-flight gates**: G1 round-trip across all 6 cross-family canonicalization tests reports `r_max=0.0`, `n_bad=0` (lossless canonical key namespace). G3 multi-seed slice-bias on the anchor MMLU 200-sample slice returns `0.7480126320374605` to 16 decimal places across seeds 7, 42, 1337. G4 anchor MMLU full matches the published Qwen2.5-7B-Instruct leaderboard reference.
- **Behavioural inspection**: 5 reasoning-heavy prompts (math word problem, French translation, long-multiplication, recursive Fibonacci, factual enumeration) produce coherent, instruction-following, mathematically-correct output with no gibberish, no tokenizer drift, no instruction-format collapse.
- **Eval framework**: `lm-eval-harness` 0.4.4 with `transformers` 4.55.0, `tokenizers` 0.21.4, `datasets` >=2.20 <4.0, fp16, batch 2, single A100 80GB.
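
A cross-run reproducibility check of this kind amounts to exact float equality over the cells that two runs share. The sketch below assumes each run is available as a nested dict of task to metric to score, similar in shape to the `results` section of an `lm-eval-harness` output file; `diff_results` and the sample values (rounded scores from the results table) are illustrative, not the project's actual tooling.

```python
def diff_results(run_a: dict, run_b: dict) -> list:
    """Return (task, metric, a, b) for every overlapping cell whose scores differ."""
    mismatches = []
    for task in sorted(run_a.keys() & run_b.keys()):
        for metric in sorted(run_a[task].keys() & run_b[task].keys()):
            a, b = run_a[task][metric], run_b[task][metric]
            if a != b:  # exact comparison on purpose: no tolerance
                mismatches.append((task, metric, a, b))
    return mismatches


# Illustrative inputs: the preflight run covers a subset of the headline run.
preflight = {"gsm8k": {"exact_match,strict-match": 0.8446}}
headline = {
    "gsm8k": {"exact_match,strict-match": 0.8446},
    "arc_challenge": {"acc_norm,none": 0.5572},
}
print(diff_results(preflight, headline))  # prints []
```

Only cells present in both runs are compared, which matches the "every overlapping (variant, task) cell" wording; an empty list means the shared cells are bit-identical.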

## Comparison to recent work in the model-merging landscape

For a comprehensive map of model-merging methods, theory, and applications, see Yang et al.'s curated survey **Awesome-Model-Merging-Methods-Theories-Applications** (forthcoming *ACM Computing Surveys 2026*).

Closest direct relatives:

- **Transport and Merge** (Cui et al., Feb 2026) -- cross-architecture merging via activation-space optimal transport. Different problem class: theirs produces a runtime-aligned composition; this model is a permanent merged checkpoint.
- **Unconstrained Model Merging for Enhanced LLM Reasoning** (Zhang et al., Oct 2024) -- closest direct relative on substrate scale (7B-class) and donor count (9 reasoning-optimized LLMs). The result above extends this lineage with absolute benchmark deltas against a state-competitive instruction-tuned anchor.
- **Git Re-Basin** (Ainsworth, Hayase & Srinivasa, ICLR 2023) -- same-architecture merging modulo permutation symmetries. The pipeline above is essentially the cross-architecture generalization (continuous Procrustes rotation rather than discrete permutation matching).
- **OT-Fusion** (Singh & Jaggi, NeurIPS 2020) -- same-architecture optimal transport on weight rows. Spiritual ancestor of Cui et al.'s 2026 cross-architecture extension.
- **REPAIR** (Jordan et al., 2022) -- re-normalization to address variance collapse after permutation interpolation. The pipeline above sidesteps this by using anchor-plus-delta absorption rather than midpoint interpolation.

## Limitations

- **Code generation regresses** by 6.10 pp on HumanEval. The donor pool was reasoning-heavy and instruction-tuned; it contained no code-specialist models (CodeLlama, StarCoder, Qwen2.5-Coder). This is documented as a falsifiable prediction: a code-heavy donor pool should restore HumanEval while preserving the GSM8K, ARC-Challenge, and IFEval gains. This is the explicit subject of the next research cycle.
- **Mild MMLU regression** (-0.86 pp). The merge trades some broad knowledge for instruction-following and reasoning concentration. The regressions on TruthfulQA mc2 (-1.90 pp), HellaSwag (-0.65 pp), and PIQA (-0.16 pp) are within typical eval noise.
- **Single substrate tested**: results are on Qwen2.5-7B-Instruct. Generalization to other instruction-tuned 7B-class anchors (Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3 as anchor, etc.) is the next experiment.
- **HumanEval pass@1 measured via custom isolated-subprocess scorer**, not via lm-eval (the pinned `lm-eval-harness 0.4.4` does not ship the humaneval task). Greedy decoding, 164 problems, no temperature sweep. Identical methodology to bigcode-evaluation-harness with subprocess-isolated test execution.
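
For readers unfamiliar with subprocess-isolated scoring, the general idea is sketched below. This is not the custom scorer used for the numbers above; `passes` and `pass_at_1` are hypothetical helpers, and the toy programs stand in for the HumanEval convention of concatenating the prompt, the model's completion, the problem's `check` function, and a final `check(entry_point)` call.

```python
import subprocess
import sys


def passes(program: str, timeout_s: float = 10.0) -> bool:
    """Run one candidate-plus-tests program in a fresh Python interpreter."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang counts as a failure
    # A failed assertion or any uncaught exception gives a non-zero exit code.
    return proc.returncode == 0


def pass_at_1(programs: list) -> float:
    """Fraction of problems whose single (greedy) sample passes its tests."""
    return sum(passes(p) for p in programs) / len(programs)


# Toy stand-ins for two problems: one correct completion, one wrong.
TESTS = "\ndef check(candidate):\n    assert candidate(2, 3) == 5\n\ncheck(add)\n"
good = "def add(a, b):\n    return a + b\n" + TESTS
bad = "def add(a, b):\n    return a - b\n" + TESTS

print(pass_at_1([good, bad]))  # prints 0.5
```

Running each program in its own process keeps a crash or infinite loop in one sample from affecting the others. It is process isolation only, not a security sandbox, so untrusted model output should still be executed in a container or VM.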

## Intended use

- Research and evaluation of cross-family weight-merging techniques.
- Drop-in replacement for `Qwen/Qwen2.5-7B-Instruct` in workflows where the trade-off (GSM8K / ARC-Challenge / IFEval lifts vs. a 6.10 pp HumanEval regression) is favorable.
- Compatible with vLLM, llama.cpp (after GGUF conversion), TGI, text-generation-webui, and any standard HuggingFace inference stack.

## Out of scope

- Code generation as primary use case -- use `Qwen/Qwen2.5-Coder-7B-Instruct` instead, or wait for the next merge variant which targets a code-heavy donor pool.
- Production deployment without your own evaluation on your specific task distribution.

## Citation

If you use this model, please cite:

```bibtex
@misc{borg-merge-v1-2026,
  title  = {Conflict-Free Replicated Datatypes for Neural Network Model Merging},
  author = {Optitransfer},
  year   = {2026},
  url    = {https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1}
}
```

## Contact

- `rgillespie83@icloud.com`
- `data@optitransfer.ch`

For arXiv endorsement requests on the full technical paper covering cross-family weight merging (cs.LG / secondary cs.CL): same contacts, subject line *"arXiv endorsement: cross-family weight merging"*.