---
base_model:
- Lambent/Qwen3-4B-Base-Continued-GRPO-Wave
library_name: transformers
tags:
- mergekit
- merge
license: apache-2.0
---

For this one, I (over)trained a SmolLM2-360M proxy model on each of the target domains to fit its style (5 epochs, at a swept-for LR and rank), then rewarded the model for lowering perplexity on the proxy model.

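As a rough illustration of the reward idea above, here is a minimal sketch that turns per-token log-probabilities into a perplexity and then into a scalar reward. The function names, the negative-log-perplexity reward shape, and the example numbers are all assumptions for illustration; the actual reward used in training is not specified here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def style_reward(token_logprobs):
    """Hypothetical reward: negative log-perplexity (higher means a better fit)."""
    return -math.log(perplexity(token_logprobs))


# Made-up log-probs a proxy model might assign to two 4-token completions.
on_style = [-0.5, -0.7, -0.4, -0.6]
off_style = [-2.0, -3.1, -1.8, -2.5]

print(f"on-style reward:  {style_reward(on_style):.3f}")
print(f"off-style reward: {style_reward(off_style):.3f}")
```

In a real setup the log-probabilities would come from scoring each sampled completion with the domain's proxy model.
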
In this case, I trained an adapter per domain and then Karcher-merged them.
I'm not sure whether any of the domains had a notably different effect; they all had basically the same result on evals.
However, the Karcher combination of them seems to have significantly lowered perplexity on lambada_openai, which is interesting enough to publish.

Additionally, I attempted to implement MARA (https://im-ant.github.io/mara/) on the GRPO side to help preserve distribution entropy, though I'm unsure how correctly or usefully I did so.

| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Δ vs Base | GRPO-Wave | Δ vs Base | Δ vs Merge | Style-Karcher | Δ vs Base | Δ vs Wave |
|:-----|:-------|:-------------:|:----------:|:------:|:---------:|:------:|:-------:|:-------------:|:------:|:------:|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% | 0.7883 | -0.10% | -0.37% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% | 0.7576 | -0.43% | -1.04% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% | **0.7087** | **+2.53%** | +1.16% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% | **3.8343** | **-9.63%** | -3.21% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% | 0.3160 | ±0.00% | -0.63% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% | 0.4080 | -0.49% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% | 0.7786 | -0.14% | -0.35% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% | 0.7807 | ±0.00% | -0.08% |

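The Δ columns in the table appear to be relative changes against the named model rather than absolute differences in points. A minimal sketch of that calculation, assuming `(new - old) / old` and two-decimal rounding (the helper names are hypothetical):

```python
def rel_delta(new, old):
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


def fmt(delta):
    """Format like the table: signed, two decimals, with ± for zero."""
    return "±0.00%" if round(delta, 2) == 0 else f"{delta:+.2f}%"


# lambada_openai acc, Style-Karcher vs Qwen3-4B-Base
print(fmt(rel_delta(0.7087, 0.6912)))  # +2.53%
# arc_easy acc, Style-Karcher vs Qwen3-4B-Base
print(fmt(rel_delta(0.7883, 0.7891)))  # -0.10%
```

Values recomputed from the rounded table entries can differ in the last digit from the published deltas, which were presumably computed from unrounded scores.
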
Some very interesting results on diversity as well:

**Diversity Metrics (Qwen3-4B-Base vs Style-Karcher, temperature=1.0, 8 completions per prompt)**

| Domain | Metric | Base | Karcher | Δ |
|--------|--------|:----:|:-------:|:-:|
| ao3_english | Prefix entropy | 3.309 | 3.238 | -2.1% |
| ao3_english | Distinct-1 | 0.618 | **0.683** | **+10.5%** |
| ao3_english | Distinct-2 | 0.962 | **0.984** | +2.3% |
| ao3_english | Pairwise diversity | 0.919 | **0.932** | +1.4% |
| github_python | Prefix entropy | 1.514 | 1.456 | -3.8% |
| github_python | Distinct-1 | 0.610 | **0.624** | +2.3% |
| github_python | Distinct-2 | 0.890 | 0.876 | -1.6% |
| github_python | Pairwise diversity | 0.933 | 0.933 | ±0.0% |
| wikipedia_english | Prefix entropy | 1.974 | 1.892 | -4.2% |
| wikipedia_english | Distinct-1 | 0.599 | 0.559 | -6.7% |
| wikipedia_english | Distinct-2 | 0.932 | 0.898 | -3.6% |
| wikipedia_english | Pairwise diversity | 0.907 | 0.900 | -0.8% |
| bbc_news | Prefix entropy | 2.252 | 2.186 | -2.9% |
| bbc_news | Distinct-1 | 0.557 | **0.577** | +3.6% |
| bbc_news | Distinct-2 | 0.949 | **0.951** | +0.3% |
| bbc_news | Pairwise diversity | 0.901 | **0.908** | +0.8% |
| arxiv_cs | Prefix entropy | 2.455 | 2.346 | -4.4% |
| arxiv_cs | Distinct-1 | 0.555 | **0.567** | +2.3% |
| arxiv_cs | Distinct-2 | 0.905 | **0.906** | +0.2% |
| arxiv_cs | Pairwise diversity | 0.895 | **0.901** | +0.7% |

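For readers unfamiliar with the Distinct-n rows, here is a minimal sketch of the usual definition: unique n-grams divided by total n-grams. Whitespace tokenization and pooling n-grams across all completions for a prompt are assumptions; the exact tokenization and aggregation behind the table are not stated.

```python
def distinct_n(texts, n):
    """Fraction of n-grams that are unique, pooled over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: simple whitespace tokenization
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


completions = ["the cat sat", "the cat ran"]
print(distinct_n(completions, 1))  # 4 unique unigrams out of 6
print(distinct_n(completions, 2))  # 3 unique bigrams out of 4
```

Higher values mean less repetition across the sampled completions.
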
Additional experiment (done after quantization, so it should affect further training but not existing quants):
initializing the \<think\> and \</think\> tokens in embedding space.

Before: the two embeddings were untrained and identical (cos=1.0), at 0.3x the average embedding norm.

They were optimized via AdamW on GSM8K reasoning traces with a 3-shot prefix, with loss on the
reasoning+answer tokens and the norm clamped to 1.5x the average embedding norm.

After: two distinct vectors (cos=0.07) at 1.5x norm.
GSM8K 3-shot accuracy: 96.7% (29/30) vs 90.0% with the original embeddings.
Cross-entropy loss improved by 7.8% on held-out eval.

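The norm clamp and the cosine check described above can be sketched in a few lines. This is only an illustration on toy 2-D vectors: the helper names are hypothetical, and applying the clamp after each optimizer step is an assumption about how such a constraint would typically be enforced.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def clamp_norm(v, max_norm):
    """Rescale v so its L2 norm does not exceed max_norm."""
    n = norm(v)
    if n <= max_norm:
        return list(v)
    scale = max_norm / n
    return [x * scale for x in v]


avg_embedding_norm = 1.0            # toy value
max_norm = 1.5 * avg_embedding_norm

think = clamp_norm([3.0, 0.0], max_norm)       # too long, gets rescaled
end_think = clamp_norm([0.0, 0.5], max_norm)   # already within the limit

print(think, end_think, cosine(think, end_think))
```
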
This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).

## Merge Details
### Merge Method

This model was merged using the [Karcher Mean](https://en.wikipedia.org/wiki/Karcher_mean) merge method.

### Models Merged

The following models were included in the merge:
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_javascript-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_cs-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-general-ao3style-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-ao3_english-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_math-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_python-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-wikipedia_english-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_physics-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_cpp-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-bbc_news-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_markdown-mara-360m

### Configuration

The following YAML configuration was used to produce this model:

```yaml
models:
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-ao3_english-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_cs-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_math-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_physics-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-bbc_news-mara-360m

- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_cpp-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_javascript-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_markdown-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_python-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-wikipedia_english-mara-360m

- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-general-ao3style-360m
merge_method: karcher
dtype: bfloat16
tokenizer_source: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave
```
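
Each `model:` entry above pairs the same base checkpoint with one per-domain adapter, joined by `+`. A config of this shape could be generated rather than typed by hand; the sketch below is one hypothetical way to do that (the `build_config` helper and its parameters are illustrative, and the blank grouping lines of the original are not reproduced).

```python
BASE = "../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave"
ADAPTERS = [
    "grpo-ao3_english-mara-360m",
    "grpo-arxiv_cs-mara-360m",
    "grpo-arxiv_math-mara-360m",
    "grpo-arxiv_physics-mara-360m",
    "grpo-bbc_news-mara-360m",
    "grpo-github_cpp-mara-360m",
    "grpo-github_javascript-mara-360m",
    "grpo-github_markdown-mara-360m",
    "grpo-github_python-mara-360m",
    "grpo-wikipedia_english-mara-360m",
    "grpo-general-ao3style-360m",
]


def build_config(base, adapters, adapter_dir="../rlvr-envs"):
    """Return a merge config as YAML text, one base+adapter entry per domain."""
    lines = ["models:"]
    for name in adapters:
        lines.append(f"- model: {base}+{adapter_dir}/{name}")
    lines.append("merge_method: karcher")
    lines.append("dtype: bfloat16")
    lines.append(f"tokenizer_source: {base}")
    return "\n".join(lines) + "\n"


config_text = build_config(BASE, ADAPTERS)
print(config_text)
```
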