
---
base_model:
- Qwen/Qwen3-8B-Base
datasets:
- Elliott/Openr1-Math-46k-8192
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---
# LTE-Qwen3-8B-Base
[![arXiv](https://img.shields.io/badge/arXiv-2510.26109-b31b1b.svg?logo=arxiv)](https://arxiv.org/abs/2510.26109)
[![GitHub](https://img.shields.io/badge/GitHub-LTE-blue?logo=github)](https://github.com/JamyDon/LTE)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?logo=apache)](LICENSE)
## Introduction
LTE (Learning to reason from Trial and Error) is an RLVR (Reinforcement Learning with Verifiable Rewards) approach presented in the paper [Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error](https://huggingface.co/papers/2510.26109).
LTE mitigates the exploration stagnation of Language Models (LMs) by using the models' own previous mistakes as hints, requiring no external expert guidance. It raises the performance upper bound of LMs and strengthens both exploitation and exploration during training.
## Key Highlights
- **Self-generated Hints**: LTE uses the errors that the LMs themselves produce during training as hints for subsequent rollouts.
- **No External Expert Guidance**: LTE relies only on these self-generated signals; no external expert supervision is needed to mitigate exploration stagnation.
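To make the self-hinting idea concrete, here is a minimal prompt-construction sketch. Note that the function name `build_hinted_prompt` and the exact hint wording are illustrative assumptions for exposition only, not the paper's actual training template:

```python
from typing import Optional


def build_hinted_prompt(question: str, failed_attempt: Optional[str] = None) -> list:
    """Build a chat-style message list, optionally prefixing the question
    with a previously failed attempt as a hint (illustrative format only)."""
    if failed_attempt is None:
        content = question
    else:
        # Hypothetical hint format: show the model its own earlier mistake
        # so the next rollout can avoid stepping into the same river twice.
        content = (
            f"{question}\n\n"
            f"A previous attempt was incorrect:\n{failed_attempt}\n"
            "Avoid repeating this mistake."
        )
    return [{"role": "user", "content": content}]


# A wrong rollout from an earlier step becomes a hint for the next one
messages = build_hinted_prompt(
    "Which number is larger, 9.11 or 9.9?",
    failed_attempt="9.11 is larger because 11 > 9.",
)
print(messages[0]["content"])
```

The key property is that the hint comes from the model's own failed rollout rather than from an external expert, so no additional supervision is required.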
## Inference
Here is an example of using LTE models for inference:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "JamyDohrn/LTE-Qwen3-8B-Base"
question = "Which number is larger, 9.11 or 9.9?"

# Build the chat-formatted prompt
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=32768)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
```
## Acknowledgements
LTE is built on the following repositories, and we thank their teams for their valuable contributions to the community:
- [verl](https://github.com/volcengine/verl)
- [LUFFY](https://github.com/ElliottYan/LUFFY)
- [LIMO](https://github.com/GAIR-NLP/LIMO)
## Citation
If you find our work useful, please consider citing our paper:
```bibtex
@misc{tang2026steprivertwicelearning,
      title={Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error},
      author={Chenming Tang and Hsiu-Yuan Huang and Weijie Liu and Clive Bai and Saiyong Yang and Yunfang Wu},
      year={2026},
      eprint={2510.26109},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.26109},
}
```