---
base_model:
- Lambent/Qwen3-4B-Base-Continued-GRPO-Wave
library_name: transformers
tags:
- mergekit
- merge
license: apache-2.0
---

For this one...

... I (over)trained a SmolLM2-360M proxy for 5 epochs on each of the target domains, at a swept-for LR and rank, to fit that domain's style,
then rewarded the model for lowering perplexity on the proxy model.
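
The reward signal described above can be sketched roughly as follows. This is a minimal illustration, not the training code: it assumes per-token natural-log probabilities for a completion have already been obtained by scoring it with the proxy, and it uses negative log-perplexity as the reward, which is only one plausible shaping. `perplexity` and `style_reward` are hypothetical names.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence given its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def style_reward(token_logprobs: Sequence[float]) -> float:
    """Assumed reward shaping: higher when the proxy finds the text less surprising.

    Equal to -log(perplexity), i.e. the mean token log-probability.
    """
    return -math.log(perplexity(token_logprobs))


# Toy log-probs standing in for proxy scores of two completions.
in_style = [-1.0, -1.0, -1.0]
off_style = [-3.0, -3.0, -3.0]
print(style_reward(in_style) > style_reward(off_style))  # True
```

Working in log space keeps the reward scale comparable across completions of different lengths, since it is a per-token average.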

In this case, I trained one adapter per domain and then Karcher-merged them.
I'm not sure whether any of the domains had a notably different effect; they all had basically the same result on evals.
However, the Karcher combination of them seems to have lowered perplexity on lambada_openai noticeably (3.8343 vs 4.2433 for the base model), which is interesting enough to publish.

Additionally, I attempted to implement MARA (https://im-ant.github.io/mara/) on the GRPO side to help preserve distribution entropy, though I'm unsure how correctly or usefully that was done.

| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Δ Base | GRPO-Wave | Δ Base | Δ Merge | Style-Karcher | Δ Base | Δ Wave |
|:-----|:-------|:-------------:|:----------:|:------:|:---------:|:------:|:-------:|:-------------:|:------:|:------:|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% | 0.7883 | -0.10% | -0.37% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% | 0.7576 | -0.43% | -0.88% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% | **0.7087** | **+2.53%** | +1.16% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% | **3.8343** | **-9.63%** | -3.21% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% | 0.3160 | ±0.00% | -0.63% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% | 0.4080 | -0.49% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% | 0.7786 | -0.14% | -0.35% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% | 0.7807 | ±0.00% | -0.08% |                                                                                                                
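
The Δ columns in the table above read as relative changes against the named reference model. A small sketch of that arithmetic, assuming the usual `(value - reference) / reference` definition (`relative_change` is a name chosen here, and the table's own deltas may have been computed from unrounded scores):

```python
def relative_change(reference: float, value: float) -> float:
    """Relative change of `value` against `reference`, in percent."""
    return (value - reference) / reference * 100.0


# lambada_openai acc, Qwen3-4B-Base -> Style-Karcher
print(f"{relative_change(0.6912, 0.7087):+.2f}%")  # +2.53%

# arc_easy acc, Qwen3-4B-Base -> Style-Karcher
print(f"{relative_change(0.7891, 0.7883):+.2f}%")  # -0.10%
```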
                                                            

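The diversity results are also interesting: prefix entropy drops slightly in every domain, while Distinct-1 and pairwise diversity rise or hold in every domain except wikipedia_english.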

**Diversity Metrics (Qwen3-4B-Base vs Style-Karcher, temperature=1.0, 8 completions per prompt)**

  | Domain | Metric | Base | Karcher | Δ |
  |--------|--------|:----:|:-------:|:-:|
  | ao3_english | Prefix entropy | 3.309 | 3.238 | -2.1% |
  | ao3_english | Distinct-1 | 0.618 | **0.683** | **+10.5%** |
  | ao3_english | Distinct-2 | 0.962 | **0.984** | +2.3% |
  | ao3_english | Pairwise diversity | 0.919 | **0.932** | +1.4% |
  | github_python | Prefix entropy | 1.514 | 1.456 | -3.8% |
  | github_python | Distinct-1 | 0.610 | **0.624** | +2.3% |
  | github_python | Distinct-2 | 0.890 | 0.876 | -1.6% |
  | github_python | Pairwise diversity | 0.933 | 0.933 | ±0.0% |
  | wikipedia_english | Prefix entropy | 1.974 | 1.892 | -4.2% |
  | wikipedia_english | Distinct-1 | 0.599 | 0.559 | -6.7% |
  | wikipedia_english | Distinct-2 | 0.932 | 0.898 | -3.6% |
  | wikipedia_english | Pairwise diversity | 0.907 | 0.900 | -0.8% |
  | bbc_news | Prefix entropy | 2.252 | 2.186 | -2.9% |
  | bbc_news | Distinct-1 | 0.557 | **0.577** | +3.6% |
  | bbc_news | Distinct-2 | 0.949 | **0.951** | +0.3% |
  | bbc_news | Pairwise diversity | 0.901 | **0.908** | +0.8% |
  | arxiv_cs | Prefix entropy | 2.455 | 2.346 | -4.4% |
  | arxiv_cs | Distinct-1 | 0.555 | **0.567** | +2.3% |
  | arxiv_cs | Distinct-2 | 0.905 | **0.906** | +0.2% |
  | arxiv_cs | Pairwise diversity | 0.895 | **0.901** | +0.7% |
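
For readers unfamiliar with these metrics, here is a rough sketch of how Distinct-n and a pairwise diversity score are commonly computed over a group of completions for one prompt. The exact definitions behind the table are not stated here, so everything below is an assumption: whitespace tokenization, n-grams pooled over the whole group, and pairwise diversity taken as one minus the mean Jaccard similarity of token sets. Prefix entropy is left out because its definition is not given.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(completions: Sequence[Sequence[str]], n: int) -> float:
    """Unique n-grams divided by total n-grams, pooled over all completions."""
    grams = [g for tokens in completions for g in ngrams(tokens, n)]
    return len(set(grams)) / len(grams) if grams else 0.0


def pairwise_diversity(completions: Sequence[Sequence[str]]) -> float:
    """Assumed definition: 1 - mean Jaccard similarity over all completion pairs."""
    sims = []
    for a, b in combinations(completions, 2):
        set_a, set_b = set(a), set(b)
        union = set_a | set_b
        sims.append(len(set_a & set_b) / len(union) if union else 1.0)
    return 1.0 - sum(sims) / len(sims) if sims else 0.0


# Toy group of two "completions" for a single prompt.
samples = [text.split() for text in ["a b a", "a c"]]
print(distinct_n(samples, 1))       # 3 unique / 5 total unigrams
print(distinct_n(samples, 2))       # 3 unique / 3 total bigrams
print(pairwise_diversity(samples))  # 1 - 1/3
```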


Additional experiment (done after quantization, so it should affect further training but not existing quants):
initializing the `<think>` and `</think>` tokens in embedding space.

The original embeddings were untrained: identical to each other (cos=1.0) and at 0.3x the average embedding norm.

Optimized via AdamW on GSM8k reasoning traces with 3-shot prefix, loss on
reasoning+answer tokens, norm clamped to 1.5x avg embedding norm.

After: two distinct vectors (cos=0.07) at 1.5x norm.
GSM8k 3-shot accuracy: 96.7% (29/30) vs 90.0% with original embeddings.
CE loss improvement: +7.8% on held-out eval.
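
The two quantities reported above, the cosine similarity between the two token vectors and the norm cap, can be illustrated with a small sketch. It assumes the clamp is a projection applied to each vector after an optimizer step, which is a guess about where it sits in the loop; the helper names and toy numbers are made up for illustration.

```python
import math
from typing import List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


def avg_row_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Average L2 norm over the rows of an embedding table."""
    return sum(l2_norm(row) for row in matrix) / len(matrix)


def clamp_norm(v: Sequence[float], max_norm: float) -> List[float]:
    """Rescale v so its norm does not exceed max_norm; direction is unchanged."""
    n = l2_norm(v)
    if n <= max_norm:
        return list(v)
    return [x * (max_norm / n) for x in v]


# Toy 2-d "embedding table" whose rows both have norm 5.
table = [[3.0, 4.0], [0.0, 5.0]]
cap = 1.5 * avg_row_norm(table)  # 1.5x average embedding norm

updated = [12.0, 16.0]           # pretend result of an optimizer step
clamped = clamp_norm(updated, cap)
print(clamped, l2_norm(clamped))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```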

This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).

## Merge Details
### Merge Method

This model was merged using the [Karcher Mean](https://en.wikipedia.org/wiki/Karcher_mean) merge method.
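
As background, a Karcher mean on the unit sphere can be computed with a simple fixed-point iteration: map every point into the tangent space at the current estimate, average there, and map the average back. The sketch below shows that generic algorithm only. It is not mergekit's implementation, it ignores how weight tensors and their magnitudes are handled, and `max_iter` and `tol` are names chosen here.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def unit(a: Sequence[float]) -> Vector:
    n = norm(a)
    return [x / n for x in a]


def log_map(p: Vector, q: Vector) -> Vector:
    """Tangent vector at unit point p toward unit point q, length = geodesic angle."""
    c = max(-1.0, min(1.0, dot(p, q)))
    ortho = [qi - c * pi for pi, qi in zip(p, q)]  # part of q orthogonal to p
    on = norm(ortho)
    if on < 1e-12:  # q coincides with p (or is antipodal): no usable direction
        return [0.0] * len(p)
    theta = math.acos(c)
    return [theta * o / on for o in ortho]


def exp_map(p: Vector, v: Vector) -> Vector:
    """Walk from unit point p along tangent vector v on the sphere."""
    t = norm(v)
    if t < 1e-12:
        return list(p)
    return [math.cos(t) * pi + math.sin(t) * vi / t for pi, vi in zip(p, v)]


def karcher_mean(points: Sequence[Sequence[float]],
                 max_iter: int = 50, tol: float = 1e-10) -> Vector:
    """Unweighted Karcher mean of the directions of `points` on the unit sphere."""
    pts = [unit(p) for p in points]
    mean = list(pts[0])  # start from the first point
    for _ in range(max_iter):
        tangents = [log_map(mean, q) for q in pts]
        step = [sum(col) / len(tangents) for col in zip(*tangents)]
        if norm(step) < tol:
            break
        mean = unit(exp_map(mean, step))
    return mean


# Two orthogonal directions: the mean lies halfway along the arc between them.
print(karcher_mean([[1.0, 0.0], [0.0, 1.0]]))
```

Unlike a plain arithmetic average, the result stays on the sphere, so averaging several nearby directions does not shrink the vector toward the origin.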

### Models Merged

The following models were included in the merge:
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_javascript-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_cs-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-general-ao3style-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-ao3_english-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_math-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_python-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-wikipedia_english-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_physics-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_cpp-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-bbc_news-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_markdown-mara-360m

### Configuration

The following YAML configuration was used to produce this model:

```yaml
models:
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-ao3_english-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_cs-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_math-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_physics-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-bbc_news-mara-360m

  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_cpp-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_javascript-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_markdown-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_python-mara-360m
  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-wikipedia_english-mara-360m

  - model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-general-ao3style-360m
merge_method: karcher
dtype: bfloat16
tokenizer_source: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave

```