---
library_name: transformers
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
pipeline_tag: text-generation
language:
- en
tags:
- nvidia
- Nemotron-Cascade
- reasoning
- general-purpose
- SFT
- RL
- pytorch
---


# Nemotron-Cascade-8B-Thinking
<p align="center">

[arXiv:2512.13607](https://arxiv.org/abs/2512.13607)
[Nemotron-Cascade collection on Hugging Face](https://huggingface.co/collections/nvidia/nemotron-cascade)
</p>

<img src="fig/nemotron-cascade-8b-thinking-results.png" alt="main_fig" style="width: 1000px; max-width: 100%;" />


## Introduction
We're excited to introduce [Nemotron-Cascade-8B-Thinking](https://huggingface.co/nvidia/Nemotron-Cascade-8B-Thinking), a powerful general-purpose model trained through sequential, domain-wise reinforcement learning. Nemotron-Cascade-8B-Thinking is post-trained from the [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) model and achieves best-in-class performance across a wide range of benchmarks. Unlike [Nemotron-Cascade-8B](https://huggingface.co/nvidia/Nemotron-Cascade-8B), Nemotron-Cascade-8B-Thinking is designed exclusively for the ***thinking*** mode.


## Training Pipeline
<img src="fig/pipeline.png" alt="train_pipeline_fig" style="width: 1000px; max-width: 100%;" />

The training pipeline for Nemotron-Cascade begins with a multi-stage supervised fine-tuning (SFT) phase that equips the model with foundational skills. Cascade RL is then applied across multiple domains to further enhance the model’s performance in these areas.

Notably, RLHF for alignment, when used as a pre-step, boosts the model’s complex reasoning ability, a benefit that goes far beyond mere preference optimization. Subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it, as illustrated in the figure below.

<figure style="margin: 0; padding: 0;">
<img src="fig/lcb_through_cascade_rl.png" alt="lcb_through_cascade_rl_fig" style="width: 100%; max-width: 100%; margin: 0; padding: 0;">
<figcaption>The LiveCodeBench v6 (08/24–05/25) performance of the Nemotron-Cascade-14B-Thinking model throughout the Cascade RL process.</figcaption>
</figure>


## Results
- We evaluate our model against competitive reasoning models on a diverse set of benchmarks, covering general-knowledge reasoning, alignment and instruction following, mathematical reasoning, competitive programming, software engineering, and tool-use proficiency.
- For Nemotron-Cascade models, we use a maximum generation length of 64K tokens and set the temperature to 0.6 and top-p to 0.95 for reasoning tasks.
- Our Nemotron-Cascade models achieve best-in-class performance across almost all benchmarks. Remarkably, Nemotron-Cascade-8B and Nemotron-Cascade-8B-Thinking achieve LiveCodeBench (LCB) and LCB Pro scores comparable to those of DeepSeek-R1-0528 (671B).

| **Benchmark<br>Metric: Pass@1** | **Qwen3-8B** | **Nemotron-Nano-9B-v2** | **DeepSeek-R1-0528 671B** | **Gemini-2.5-Flash-Thinking** | **Nemotron-<br>Cascade-8B-<br>Thinking** | **Nemotron-<br>Cascade-8B** |
| :---- | :---: | :---: | :---: | :---: | :---: | :---: |
| ***Knowledge Reasoning*** |
| MMLU | 83.0 | 82.6 | 89.9 | - | 84.0 | 83.7 |
| MMLU Pro | 75.1 | 73.3 | 85.0 | 81.9 | 75.5 | 75.7 |
| GPQA-Diamond | 62.0 | 64.0 | 81.0 | 82.8 | 66.7 | 66.5 |
| ***Alignment*** |
| ArenaHard | 85.8 | 74.6 | 95.1 | 95.7 | 85.8 | 87.9 |
| IFEval (Strict Prompt) | 85.0 | 86.1 | 84.1 | 89.8 | 83.7 | 90.2 |
| IFBench | 34.4 | 37.4 | 38.0 | 36.1 | 41.4 | 40.8 |
| ***Math*** |
| AIME 2024 | 76.0 | 81.9 | 91.4 | 82.3 | 88.8 | 89.5 |
| AIME 2025 | 67.3 | 72.0 | 87.5 | 72.0 | 81.4 | 80.1 |
| ***Code*** |
| LCB v5 (08/24-02/25) | 61.2 | 68.2 | 74.8 | 63.4 | 74.5 | 74.3 |
| LCB v6 (08/24-05/25) | 58.3 | 65.3 | 73.3 | 61.9 | 71.4 | 71.1 |
| LCB Pro 25Q2 (Easy) | 46.1 | 59.3 | 63.9 | 47.4 | 64.8 | 65.7 |
| LCB Pro 25Q2 (Med) | 2.2 | 4.8 | 7.0 | 1.8 | 6.1 | 6.4 |
| SWE Verified (Agentless) | 20.5 | - | 57.6 | 48.9 | 38.5 | 37.2 |
| ***Tool Calling*** |
| BFCL V3 | 68.1 | 66.9 | 67.9 | 68.6 | 67.0 | 64.4 |


## Usage Recommendations
For local deployment, we recommend setting the sampling parameters to temperature = 0.6 and top_p = 0.95. For better long-context support, we also recommend RoPE scaling with the [YaRN](https://arxiv.org/abs/2309.00071) method, which can be enabled by updating the model’s `config.json` as shown below:
```json
{
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 2.0,
        "original_max_position_embeddings": 32768
    }
}
```
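The `config.json` change above can also be applied with a few lines of Python. This is a minimal sketch, assuming a local copy of the checkpoint whose `config.json` is writable; `enable_yarn` and `nominal_context_length` are hypothetical helper names, not part of any library. The second helper assumes the usual YaRN reading that the extended window is `factor` times `original_max_position_embeddings`, so a factor of 2.0 on 32768 positions gives 65,536 tokens (64K). The demo writes to a throwaway file so that it can run without a downloaded model.

```python
import json
import tempfile
from pathlib import Path


def enable_yarn(config_path, factor=2.0, original_max_position_embeddings=32768):
    """Write the YaRN `rope_scaling` block into a config.json and return the updated dict."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Same keys and default values as the JSON snippet in this section.
    config["rope_scaling"] = {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max_position_embeddings,
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


def nominal_context_length(config):
    """Nominal window implied by the scaling block (assumption: factor x original length)."""
    scaling = config["rope_scaling"]
    return int(scaling["factor"] * scaling["original_max_position_embeddings"])


# Demo on a placeholder file; point `enable_yarn` at the real config.json instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(json.dumps({"model_type": "placeholder"}), encoding="utf-8")
    updated = enable_yarn(demo_path, factor=2.0)
    print(nominal_context_length(updated))  # 65536
```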
- **Nemotron-Cascade-14B-Thinking**: use `factor: 3.0` to extend the context length to 90K tokens for SWE Verified (Agentless), and `factor: 2.0` to extend the context length to 64K tokens for other benchmarks.
- **Nemotron-Cascade-8B** and **Nemotron-Cascade-8B-Thinking**: use `factor: 2.0` across all benchmarks.


## Evaluation Toolkit
To reproduce our results, see the evaluation code, scripts, and cached prediction files described in https://huggingface.co/nvidia/Nemotron-Cascade-8B-Thinking/blob/main/evaluation/README.md

## Chat Template
Nemotron-Cascade-8B-Thinking follows the Qwen3-style ChatML template and is designed exclusively for the ***thinking*** mode. To align with the template used in [Nemotron-Cascade-8B](https://huggingface.co/nvidia/Nemotron-Cascade-8B), the `" /think"` tag should be appended to the end of the user input. Note that the tag includes a leading space to ensure correct tokenization.

To reduce the context length in a multi-turn conversation, we include only the final summary of the model’s output in the conversation history and change the `" /think"` tag of earlier user turns to `" /no_think"`.

A brief example is shown below:
```python
from transformers import AutoTokenizer

model_name = 'nvidia/Nemotron-Cascade-8B-Thinking'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Single-turn example
messages = [
    {"role": "user", "content": "calculate 1+1?"}
]

# only thinking mode is supported (enable_thinking=True)
prompt_thinking = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
# prompt_thinking = '<|im_start|>system\nYou are a helpful and harmless assistant.<|im_end|>\n<|im_start|>user\ncalculate 1+1? /think<|im_end|>\n<|im_start|>assistant\n'


# Multi-turn example
messages = [
    {"role": "user", "content": "calculate 1+1?"},
    {"role": "assistant", "content": "<think>THINKING_CONTENT</think>\nTo calculate \\(1 + 1\\):\n\n1. **Identify the operation**: This is a basic addition problem involving two integers.\n2. **Perform the addition**: \n \\(1 + 1 = 2\\).\n\n**Result**: \\(\\boxed{2}\\)"},
    {"role": "user", "content": "what about 2+2"}
]

# only thinking mode is supported (enable_thinking=True)
prompt_thinking = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
# prompt_thinking = '<|im_start|>system\nYou are a helpful and harmless assistant.<|im_end|>\n<|im_start|>user\ncalculate 1+1? /no_think<|im_end|>\n<|im_start|>assistant\nTo calculate \\(1 + 1\\):\n\n1. **Identify the operation**: This is a basic addition problem involving two integers.\n2. **Perform the addition**: \n \\(1 + 1 = 2\\).\n\n**Result**: \\(\\boxed{2}\\)<|im_end|>\n<|im_start|>user\nwhat about 2+2 /think<|im_end|>\n<|im_start|>assistant\n'
```
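The history-trimming rule described in this section can also be applied by hand, for example when prompts are assembled outside the tokenizer. The sketch below is an approximation inferred from the two commented prompts above, not the template shipped with the model. `strip_thinking` and `build_prompt` are hypothetical helper names, the system prompt string is copied from the example output, and the helper assumes the message list contains only user and assistant turns with at least one user turn.

```python
import re

# Assumption: system prompt copied from the commented example output above.
SYSTEM_PROMPT = "You are a helpful and harmless assistant."
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text):
    """Drop the <think>...</think> block so only the final summary is kept."""
    return THINK_BLOCK.sub("", text, count=1).strip()


def build_prompt(messages):
    """Hand-rolled approximation of the prompt layout shown in the example above."""
    # Only the latest user turn keeps " /think"; earlier ones get " /no_think".
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    parts = [f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"]
    for i, message in enumerate(messages):
        if message["role"] == "user":
            tag = " /think" if i == last_user else " /no_think"
            content = message["content"] + tag
        else:
            content = strip_thinking(message["content"])
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)


history = [
    {"role": "user", "content": "calculate 1+1?"},
    {"role": "assistant", "content": "<think>THINKING_CONTENT</think>\nThe answer is 2."},
    {"role": "user", "content": "what about 2+2"},
]
print(build_prompt(history))
```

When the tokenizer is available, `tokenizer.apply_chat_template` remains the reference; a helper like this is mainly useful for checking that a custom serving stack produces the same layout.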
## Release Date
Dec 08, 2025


## License
Your use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).


## Citation
```
@article{Nemotron_Cascade_Scaling_Cascaded_Reinforcement_Learning,
  title={Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models},
  author={Wang, Boxin and Lee, Chankyu and Lee, Nayeon and Lin, Sheng-Chieh and Dai, Wenliang and Chen, Yang and Chen, Yangyi and Yang, Zhuolin and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  year={2025}
}
```