daydreamwarrior/Nemotron-Research-GooseReason-4B-Instruct-heretic-v2

ModelHub XC 73959dd3af Initialize project; model provided by the ModelHub XC community

Model: daydreamwarrior/Nemotron-Research-GooseReason-4B-Instruct-heretic-v2
Source: Original Platform

2026-05-17 23:58:16 +08:00

Files (each added in the initial commit "Initialize project; model provided by the ModelHub XC community", 2026-05-17 23:58:16 +08:00):

.gitattributes
added_tokens.json
chat_template.jinja
config.json
generation_config.json
merges.txt
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
README.md
special_tokens_map.json
tokenizer_config.json
tokenizer.json
vocab.json

README.md

license, language, base_model, pipeline_tag, library_name, tags

This is a decensored version of nvidia/Nemotron-Research-GooseReason-4B-Instruct, made using Heretic v1.2.0.

Abliteration parameters

Parameter	Value
direction_index	19.04
attn.o_proj.max_weight	1.25
attn.o_proj.max_weight_position	23.36
attn.o_proj.min_weight	1.15
attn.o_proj.min_weight_distance	18.21
mlp.down_proj.max_weight	1.05
mlp.down_proj.max_weight_position	27.34
mlp.down_proj.min_weight	1.01
mlp.down_proj.min_weight_distance	15.35
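The card does not explain how these parameters are applied. One plausible reading, sketched below, is a per-layer weight kernel: the weight peaks at `max_weight` on the layer nearest `max_weight_position`, falls off linearly to `min_weight` at a distance of `min_weight_distance` layers, and is zero beyond that. The linear falloff, the helper name `ablation_weight`, and the layer count of 36 are all assumptions made for illustration, not statements about Heretic's actual implementation.

```python
def ablation_weight(layer_index, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Hypothetical kernel: weight applied to one layer, 0.0 outside the kernel."""
    distance = abs(layer_index - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # layer is too far from the peak to be modified
    # Linear interpolation from max_weight (distance 0) to min_weight (kernel edge).
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# Values copied from the attn.o_proj rows of the table above.
ATTN_O_PROJ = dict(max_weight=1.25, max_weight_position=23.36,
                   min_weight=1.15, min_weight_distance=18.21)

NUM_LAYERS = 36  # assumption: number of transformer layers in the 4B model

weights = [ablation_weight(i, **ATTN_O_PROJ) for i in range(NUM_LAYERS)]
for index, weight in enumerate(weights):
    print(f"layer {index:2d}: {weight:.3f}")
```

Under this reading, the earliest layers receive no modification for `attn.o_proj`, and every affected layer gets a weight between 1.15 and 1.25.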

Performance

Metric	This model	Original model (nvidia/Nemotron-Research-GooseReason-4B-Instruct)
KL divergence	0.0426	0 (by definition)
Refusals	5/100	99/100
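As a rough illustration of the first metric, the sketch below computes the KL divergence between two discrete next-token distributions. The probabilities are made up, and the direction KL(original || modified) and the use of nats are assumptions; the card does not say how the reported figure was computed or over which prompts it was averaged.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token probabilities over a three-token vocabulary.
original = [0.70, 0.20, 0.10]
modified = [0.60, 0.25, 0.15]

print(kl_divergence(original, original))  # identical distributions
print(kl_divergence(original, modified))  # small positive value

# The refusal figure in the table is a plain ratio of refused prompts.
refusal_rate = 5 / 100
print(refusal_rate)
```

Comparing a distribution with itself yields zero, which is why the original model's entry in the table is 0 by definition.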

GooseReason-4B-Instruct

Trained with Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

GooseReason‑4B‑Instruct is a state-of-the-art 4B reasoning model trained via Reinforcement Learning with Verifiable Rewards (RLVR) on GooseReason-0.7M, a large-scale dataset synthesized by the Golden Goose pipeline. Starting from Qwen3‑4B‑Instruct and applying the ProRLv2 RL recipe augmented with GooseReason-0.7M data, GooseReason-4B-Instruct achieves new state-of-the-art results among 4B-Instruct models across 15 diverse benchmarks, spanning mathematics, programming, STEM reasoning, instruction following, and logical puzzles.

This model is for research and development only.

Golden Goose

Scaling up RLVR is bottlenecked by the scarcity of verifiable training data, where improvements increasingly saturate after prolonged training on existing datasets. Golden Goose is a simple, scalable pipeline that synthesizes unlimited RLVR tasks from reasoning-rich but unverifiable internet text—corpora such as science textbooks, Olympiad math forums, and cybersecurity web scrapes that were previously excluded from RLVR data construction due to the difficulty of automatic verification.

The key idea: given a source text S, we prompt an LLM to identify a contiguous span t of crucial reasoning steps and replace it with a [MASK] token, constructing a masked context S_mask. Treating t as the ground-truth answer, the LLM then generates a set of diverse, plausible distractors D = {d₁, ..., dₖ} that are similar in style and length to the removed span yet incorrect in context, forming a multiple-choice question: Q = (S_mask, {t} ∪ D)

Verification during RL simply checks whether the model's prediction matches the ground-truth option—no external judge or test execution needed. This formulation unlocks reasoning-rich corpora that were previously unusable for RLVR: Olympiad-level theorem proving from AoPS-Instruct, free-form textbook QA from MegaScience, and coding problems without test cases from rStar-Coder.
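The construction and the verification step can be sketched in a few lines. In the pipeline described above an LLM chooses the span and writes the distractors; in this minimal sketch both are supplied by hand, and the names `MultipleChoiceTask`, `build_task`, and `reward` are hypothetical.

```python
import random
from dataclasses import dataclass

MASK = "[MASK]"


@dataclass
class MultipleChoiceTask:
    masked_context: str   # S_mask: the source text with the span replaced
    options: list         # {t} ∪ D, shuffled
    answer_index: int     # position of the ground-truth span t in options


def build_task(source, span, distractors, seed=0):
    """Build Q = (S_mask, {t} ∪ D) from a source text and a chosen span."""
    if source.count(span) != 1:
        raise ValueError("span must occur exactly once in the source text")
    masked_context = source.replace(span, MASK)
    options = [span] + list(distractors)
    random.Random(seed).shuffle(options)  # hide the position of the answer
    return MultipleChoiceTask(masked_context, options, options.index(span))


def reward(task, predicted_index):
    """Verifiable reward: 1.0 if the chosen option is the removed span, else 0.0."""
    return 1.0 if predicted_index == task.answer_index else 0.0


source = ("The sum of the first n odd numbers is n^2. "
          "For n = 4 the sum is 1 + 3 + 5 + 7 = 16.")
span = "1 + 3 + 5 + 7 = 16"
distractors = ["1 + 3 + 5 + 7 = 15", "2 + 4 + 6 + 8 = 20", "1 + 2 + 3 + 4 = 10"]

task = build_task(source, span, distractors)
print(task.masked_context)
print(task.options)
print(reward(task, task.answer_index))
```

Because the reward is an index comparison, no judge model or test harness is needed at training time, which is the property the paragraph above relies on.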

GooseReason-0.7M Dataset

Using the Golden Goose pipeline, we synthesize GooseReason-0.7M, a large-scale RLVR dataset with roughly 0.7 million tasks (673,125 across the three domains listed below) spanning mathematics, programming, and general scientific domains. The dataset is constructed from the following source corpora:

Domain	# Examples	Source	Description
Math	235,836	AoPS‑Instruct	~600K QA pairs from the Art of Problem Solving forum, predominantly featuring Olympiad-level math problems with community-driven solutions
Code	281,793	rStar‑Coder	~418K coding problems from competitive programming platforms; we use the `synthetic_sft` split (questions + teacher model solutions without test cases), which is not directly usable for RL training
STEM	155,496	MegaScience	~650K QA pairs from ~12K university-level scientific textbooks spanning physics, biology, chemistry, medicine, computer science, mathematics, and economics

The data mixing ratio used to train GooseReason-4B-Instruct is 55% ProRL data, 15% GooseReason-0.7M Math, 15% GooseReason-0.7M Code, and 15% GooseReason-0.7M STEM.
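One simple way to realize such a mix is to draw the source of each training example according to the stated weights. The sketch below does that with the standard library; whether the actual recipe mixes per example, per batch, or by pre-building a fixed pool is not stated here, so treat the per-example sampling and the name `sample_sources` as assumptions.

```python
import random
from collections import Counter

# Mixing weights taken from the sentence above.
MIX = {
    "ProRL": 0.55,
    "GooseReason-0.7M Math": 0.15,
    "GooseReason-0.7M Code": 0.15,
    "GooseReason-0.7M STEM": 0.15,
}


def sample_sources(num_examples, seed=0):
    """Draw a data source for each example in proportion to MIX."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=num_examples)


counts = Counter(sample_sources(10_000))
for name in MIX:
    print(f"{name}: {counts[name] / 10_000:.3f}")
```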

Evaluation Results

GooseReason-4B-Instruct is evaluated on 15 diverse benchmarks following the ProRL evaluation protocol. Math performance is measured on AIME 2024/2025, AMC, MATH, Minerva, and Olympiad Bench. Code performance is measured on APPS, CodeContests, CodeForces, TACO, HumanEvalPlus, and LiveCodeBench. STEM and reasoning tasks are measured via GPQA Diamond, IFEval, and Reasoning Gym (logical puzzles). The Qwen3‑30B‑Instruct results (the last row of each table, set in italics in the original) are provided as a reference.
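The tables report pass@1. For readers unfamiliar with the metric, the sketch below shows the commonly used unbiased pass@k estimator, which for k = 1 reduces to the fraction of sampled completions that are correct. The number of samples drawn per problem is not stated here, so the values in the example are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 16 sampled completions, 4 of them correct.
print(pass_at_k(16, 4, 1))
```

A benchmark score is then the mean of this per-problem estimate over all problems.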

Table 1. Performance (pass@1) comparison across math benchmarks. Adding GooseReason-0.7M revives the saturated model and enables further RL scaling, raising the average by 2.18 points over the 333-step ProRL checkpoint (71.65 to 73.83), whereas continuing on ProRL data alone lowers it by 1.29 points (71.65 to 70.36).

Model	RL Data	RL Steps	AIME24	AIME25	AMC	MATH	Minerva	Olympiad	Avg
Qwen3‑4B‑Instruct	—	—	64.79	48.75	85.17	94.66	50.09	65.83	68.21
Qwen3‑4B‑Instruct	ProRL Dataset	333	66.46	57.29	87.80	96.41	53.72	68.24	71.65
Qwen3‑4B‑Instruct	ProRL Dataset	+156	62.29	55.21	87.65	96.54	53.33	67.19	70.36
GooseReason‑4B‑Instruct	+GooseReason‑0.7M	+270	70.00	63.96	89.16	96.70	54.37	68.79	73.83
Qwen3‑30B‑Instruct	—	—	76.66	63.74	91.64	97.10	51.99	70.05	75.20

Table 2. Performance (pass@1) comparison across coding benchmarks. GooseReason-4B-Instruct achieves a 2.24-point absolute gain in coding average over the 333-step ProRL checkpoint (53.46 to 55.70) and outperforms Qwen3‑30B‑Instruct by 4.76 points (55.70 vs. 50.94).

Model	RL Data	RL Steps	APPS	CodeContests	CodeForces	TACO	HumanEvalPlus	LiveCodeBench	Avg
Qwen3‑4B‑Instruct	—	—	47.01	42.08	33.69	23.69	77.56	31.74	42.63
Qwen3‑4B‑Instruct	ProRL Dataset	333	57.92	52.55	51.67	33.13	84.24	41.28	53.46
Qwen3‑4B‑Instruct	ProRL Dataset	+156	58.45	52.88	54.47	32.80	84.20	40.56	53.89
GooseReason‑4B‑Instruct	+GooseReason‑0.7M	+270	60.48	54.66	55.59	35.37	86.46	41.64	55.70
Qwen3‑30B‑Instruct	—	—	55.37	49.70	47.76	29.05	80.56	43.20	50.94
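The Avg column in Table 2 appears to be the unweighted mean of the six coding benchmarks. The short check below recomputes it for three rows and derives the differences between them; the dictionary keys are shorthand labels, and the comparison against the 333-step ProRL checkpoint is an assumption about which baseline the reported gain refers to.

```python
# Per-benchmark scores copied from Table 2, in column order:
# APPS, CodeContests, CodeForces, TACO, HumanEvalPlus, LiveCodeBench.
ROWS = {
    "prorl_333": [57.92, 52.55, 51.67, 33.13, 84.24, 41.28],
    "goosereason": [60.48, 54.66, 55.59, 35.37, 86.46, 41.64],
    "qwen3_30b": [55.37, 49.70, 47.76, 29.05, 80.56, 43.20],
}


def mean(scores):
    return sum(scores) / len(scores)


averages = {name: mean(scores) for name, scores in ROWS.items()}
gain_over_prorl = averages["goosereason"] - averages["prorl_333"]
margin_over_30b = averages["goosereason"] - averages["qwen3_30b"]

for name, value in averages.items():
    print(f"{name}: {value:.3f}")
print(f"gain over ProRL checkpoint: {gain_over_prorl:.3f}")
print(f"margin over Qwen3-30B-Instruct: {margin_over_30b:.3f}")
```

The recomputed averages agree with the Avg column to within rounding of the last digit.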

Table 3. Performance (pass@1) on STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym). Tasks in Reasoning Gym are grouped into four categories: Math, Algorithmic, Cognition, and Logic.

Model	RL Data	RL Steps	GPQA	IFEval	Math	Algorithmic	Cognition	Logic	Avg. Gym
Qwen3‑4B‑Instruct	—	—	60.26	72.36	43.69	19.46	34.92	57.26	33.98
Qwen3‑4B‑Instruct	ProRL Dataset	333	64.39	79.11	92.66	80.47	60.07	86.90	80.10
Qwen3‑4B‑Instruct	ProRL Dataset	+156	62.87	76.24	92.71	83.24	60.75	87.71	81.06
GooseReason‑4B‑Instruct	+GooseReason‑0.7M	+270	66.79	76.39	92.76	83.91	60.24	87.80	81.28
Qwen3‑30B‑Instruct	—	—	70.40	82.73	53.86	38.51	28.60	32.89	43.56

How to Use

Requirements: transformers >= 4.51.0

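from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub ID of the original model as named at the top of this card; substitute
# this repository's ID to load the decensored variant instead.
model_name = "nvidia/Nemotron-Research-GooseReason-4B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    device_map="auto",   # place the weights on the available device(s)
)

prompt = "Find all positive integers n such that n² + 3n + 5 is a perfect square."
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation with the chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,    # sample instead of decoding greedily
    temperature=0.6,   # recommended sampling temperature
)
# Drop the prompt tokens so that only the newly generated text is decoded.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)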

Recommended sampling parameters: temperature=0.6, max_new_tokens=32768.
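To give a feel for what a temperature of 0.6 does, the sketch below applies temperature scaling to a small set of made-up logits before the softmax. It is a standalone illustration of the general technique, not the sampling code used by any particular library, and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits

baseline = softmax_with_temperature(logits, 1.0)
sharpened = softmax_with_temperature(logits, 0.6)
print([round(p, 3) for p in baseline])
print([round(p, 3) for p in sharpened])
```

A temperature below 1 concentrates probability on the highest-scoring tokens, so sampling stays varied but drifts less than at a temperature of 1.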

Citation

If you find this model or the Golden Goose paper helpful, please cite:

@article{lu2026goldengoose,
  title={Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text},
  author={Lu, Ximing and Acuna, David and Jung, Jaehun and Hu, Jian and Zhang, Di and Diao, Shizhe and Zou, Yunheng and Zhang, Shaokun and Cui, Brandon and Liu, Mingjie and Kim, Hyunwoo and Ammanabrolu, Prithviraj and Kautz, Jan and Dong, Yi and Choi, Yejin},
  journal={arXiv preprint arXiv:2601.22975},
  year={2026}
}
