---
pretty_name: KnowRL-Nemotron-1.5B
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- knowrl
- rlvr
- reasoning
- math
- knowledge-points
- reinforcement-learning
base_model: nvidia/OpenMath-Nemotron-1.5B
model-index:
- name: KnowRL-Nemotron-1.5B
  results:
  - task:
      type: mathematical-reasoning
    dataset:
      name: AIME 2024
      type: aime24
    metrics:
    - name: Accuracy (w/o KP)
      type: accuracy
      value: 69.79
    - name: Accuracy (CSS)
      type: accuracy
      value: 74.58
  - task:
      type: mathematical-reasoning
    dataset:
      name: AIME 2025
      type: aime25
    metrics:
    - name: Accuracy (w/o KP)
      type: accuracy
      value: 64.69
    - name: Accuracy (CSS)
      type: accuracy
      value: 65.21
  - task:
      type: mathematical-reasoning
    dataset:
      name: MATH-500
      type: math-500
    metrics:
    - name: Accuracy (w/o KP)
      type: accuracy
      value: 95.70
    - name: Accuracy (CSS)
      type: accuracy
      value: 96.20
  - task:
      type: mathematical-reasoning
    dataset:
      name: Olympiad Bench
      type: olympiad-bench
    metrics:
    - name: Accuracy (w/o KP)
      type: accuracy
      value: 80.23
    - name: Accuracy (CSS)
      type: accuracy
      value: 82.44
---

KnowRL-Nemotron-1.5B

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

arXiv GitHub Collection Training Data KP Annotations

Model Summary

KnowRL-Nemotron-1.5B is a 1.5B-parameter math reasoning model trained with reinforcement learning (DAPO/GRPO) under minimal-sufficient knowledge point (KP) guidance. It is fine-tuned from nvidia/OpenMath-Nemotron-1.5B and achieves state-of-the-art results among 1.5B-scale models on competition-level math benchmarks.

Instead of injecting long solution hints or full reasoning templates, KnowRL decomposes guidance into atomic knowledge points (KPs) and identifies the minimal subset required to unlock reward learning — achieving more with less.

Key Highlights

  • 74.16 average accuracy (CSS) across 8 competition-level math benchmarks — new SOTA at 1.5B scale
  • 70.08 average accuracy even without KP hints at inference, demonstrating genuine policy improvement (+9.63 over baseline)
  • Trained with ~38% fewer KPs than full-KP injection via the CSS (Constrained Subset Search) selection strategy
  • Reward sparsity reduced from 41.21% zero-correct to 13.00% during training

Results

| Benchmark | w/o KP | CBRS | CSS |
|---|---|---|---|
| AIME 2024 | 69.79 | 75.52 | 74.58 |
| AIME 2025 | 64.69 | 65.00 | 65.21 |
| BRUMO 2025 | 69.48 | 78.33 | 78.12 |
| HMMT 2025 | 41.04 | 45.00 | 48.75 |
| AMC 2023 | 95.55 | 95.78 | 95.70 |
| CMIMC 2025 | 44.14 | 49.22 | 52.19 |
| MATH-500 | 95.70 | 96.45 | 96.20 |
| Olympiad Bench | 80.23 | 82.34 | 82.44 |
| **Average** | **70.08** | **73.46** | **74.16** |

w/o KP: No knowledge point hints at inference. CBRS / CSS: KP hints selected by the respective strategy are prepended to the prompt at inference.

Usage

Basic Inference (without KP hints)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HasuerYu/KnowRL-Nemotron-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

problem = "Find the sum of all positive integers n such that n^2 - 19n + 99 is a perfect square."
prompt = f"{problem}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=32768, temperature=0.6, top_p=0.95)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```

Inference with KP Hints

For best performance, prepend selected knowledge points as a hint section in the prompt:

```python
knowledge_points = [
    "If n^2 - 19n + 99 = m^2, then (2n - 19)^2 - 4m^2 = -15.",
]
hint = "## Hint\n" + "\n".join(f"- {kp}" for kp in knowledge_points)
prompt = f"{problem}\n{hint}\nPlease reason step by step, and put your final answer within \\boxed{{}}."
```
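The hinted prompt then goes through the same chat-template and generation path as the basic example. As a self-contained sketch of the prompt assembly (no model download required; `build_hinted_prompt` is a helper introduced here, not part of the released code):

```python
def build_hinted_prompt(problem: str, knowledge_points: list[str]) -> str:
    """Prepend selected KPs as a '## Hint' section between the problem and the instruction."""
    hint = "## Hint\n" + "\n".join(f"- {kp}" for kp in knowledge_points)
    return (f"{problem}\n{hint}\n"
            "Please reason step by step, and put your final answer within \\boxed{}.")
```

The resulting string is passed as the user message exactly as in the basic-inference snippet above.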

vLLM Serving

```bash
vllm serve HasuerYu/KnowRL-Nemotron-1.5B \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```
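Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client sketch (the host/port are vLLM defaults, and the sampling parameters mirror the basic-inference example; this is an illustration, not part of the released code):

```python
import json
import urllib.request

def build_request(problem: str, base_url: str = "http://localhost:8000") -> tuple[str, bytes]:
    """Build an OpenAI-compatible chat-completions request for the served model."""
    payload = {
        "model": "HasuerYu/KnowRL-Nemotron-1.5B",
        "messages": [{
            "role": "user",
            "content": f"{problem}\nPlease reason step by step, and put your final answer within \\boxed{{}}.",
        }],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 32768,
    }
    return f"{base_url}/v1/chat/completions", json.dumps(payload).encode()

# To send it (requires the server to be running):
#   url, body = build_request("Compute 1 + 1.")
#   req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
#   print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"])
```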

Training Details

| Parameter | Value |
|---|---|
| Base model | nvidia/OpenMath-Nemotron-1.5B |
| Algorithm | DAPO / GRPO |
| Framework | verl + Ray |
| Learning rate | 1e-6 |
| Batch size | 256 |
| Max prompt length | 8,192 |
| Max response length | 32,768 |
| Samples per prompt | 8 |
| Total training steps | 2,960 |
| Hardware | 8× NVIDIA H100 nodes (64 GPUs) |

An entropy annealing strategy is applied: after step 2,590, the clip upper bound is reduced from 0.28 to 0.26 to encourage the policy to shift from exploration to exploitation, contributing +0.74 average accuracy.
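The annealed upper clip bound slots into the standard clipped surrogate objective. A minimal sketch of the schedule and the per-token objective (only the 0.28 → 0.26 change after step 2,590 comes from this card; the lower bound `eps_low = 0.2` and the function names are illustrative assumptions):

```python
def clip_bounds(step: int, anneal_step: int = 2590,
                eps_low: float = 0.2, eps_high: float = 0.28,
                eps_high_late: float = 0.26) -> tuple[float, float]:
    """DAPO-style asymmetric clip range; the upper bound tightens after anneal_step.

    eps_low is an assumed value -- the card only specifies the upper bound.
    """
    high = eps_high if step <= anneal_step else eps_high_late
    return 1.0 - eps_low, 1.0 + high

def clipped_objective(ratio: float, advantage: float, step: int) -> float:
    """PPO/GRPO clipped surrogate for one token: min(r*A, clip(r)*A)."""
    lo, hi = clip_bounds(step)
    clipped_ratio = max(lo, min(ratio, hi))
    return min(ratio * advantage, clipped_ratio * advantage)
```

A tighter upper bound caps how far the policy can move toward high-advantage tokens in one update, which is what shifts training from exploration to exploitation late in the run.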

How KnowRL Works

  1. KP Extraction: Decompose solution guidance into atomic knowledge points (KPs)
  2. KP Selection: Apply selection strategies (CSS, CBRS) to identify the minimal-sufficient subset of KPs per problem
  3. RL Training: Train with DAPO/GRPO, injecting selected KPs as hints in the prompt during rollout
  4. Inference: The trained model can be used with or without KP hints — even without hints, it significantly outperforms the baseline
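The "minimal-sufficient subset" idea in step 2 can be sketched as a greedy elimination search: start from the full KP set and drop each KP that the model can still succeed without. This is only an illustration of the concept under stated assumptions (the paper's actual CSS algorithm may differ, and `is_sufficient` would in practice run rollouts and check the verifiable reward):

```python
from typing import Callable, FrozenSet, Optional, Sequence

def constrained_subset_search(
    kps: Sequence[str],
    is_sufficient: Callable[[FrozenSet[str]], bool],
) -> Optional[FrozenSet[str]]:
    """Greedy sketch: keep removing KPs while the remaining subset still
    lets the policy earn the reward; returns None if even the full set fails."""
    current = frozenset(kps)
    if not is_sufficient(current):
        return None  # even full-KP injection does not unlock the reward
    for kp in kps:
        reduced = current - {kp}
        if is_sufficient(reduced):
            current = reduced  # this KP was not needed; drop it
    return current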

Resources

| Resource | Link |
|---|---|
| KnowRL Collection | HasuerYu/knowrl |
| Training Data | HasuerYu/KnowRL-Train-Data |
| KP Annotations | HasuerYu/KnowRL-KP-Annotations |

Limitations

  • Optimized for competition-level math reasoning; performance on other domains is not evaluated
  • KP hint quality at inference depends on upstream KP extraction and selection pipelines
  • The model inherits limitations from the base model (nvidia/OpenMath-Nemotron-1.5B)

Citation

If you find this model helpful, please cite:

```bibtex
@misc{yu2026knowrlboostingllmreasoning,
      title={KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance},
      author={Linhao Yu and Tianmeng Yang and Siyu Ding and Renren Jin and Naibin Gu and Xiangzhao Hao and Shuaiyi Nie and Deyi Xiong and Weichong Yin and Yu Sun and Hua Wu},
      year={2026},
      eprint={2604.12627},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.12627},
}
```

License

This model is released under the Apache 2.0 License.
