---
library_name: transformers
license: mit
---

# Model Card for qwen3_1.7B_Base_MaxRL_Polaris_1000_steps

This is a saved checkpoint from fine-tuning [Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) with the MaxRL objective from [**"Maximum Likelihood Reinforcement Learning"**](https://arxiv.org/abs/2602.02710).
In that work, we introduce MaxRL, a framework for optimizing maximum likelihood in RL settings.


## Model Details

### Model Description

This is the model card for a Qwen/Qwen3-1.7B-Base model fine-tuned with MaxRL.

- **Fine-tuned from model:** [Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base)
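
Since the card declares `library_name: transformers`, the checkpoint should load like any other causal language model in that library. The following is a minimal sketch, not an official usage recipe: the repository ID in `MODEL_ID` is an assumption (substitute the actual Hub path or a local directory), and `load_checkpoint` and `complete` are hypothetical helper names introduced only for illustration.

```python
# Assumed Hub repository ID for this checkpoint; replace with the real path
# or a local directory containing the saved weights.
MODEL_ID = "ftajwar/qwen3_1.7B_Base_MaxRL_Polaris_1000_steps"


def load_checkpoint(model_id: str = MODEL_ID):
    """Load the tokenizer and causal LM weights for the checkpoint."""
    # Imported inside the function so the heavy dependency is only
    # required when the model is actually loaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def complete(prompt: str, max_new_tokens: int = 512) -> str:
    """Greedy-decode a continuation of `prompt` and return only the new text."""
    tokenizer, model = load_checkpoint()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Drop the prompt tokens so only the generated continuation is decoded.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `complete("Question: What is 12 * 13?\nAnswer:")` would download the weights on first use and return the model's continuation. The prompt format here is only a placeholder; the format used during training is defined in the training script.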

### Model Sources

- **Repository:** [Official Code Release for the paper "Maximum Likelihood Reinforcement Learning"](https://github.com/tajwarfahim/maxrl)
- **Paper:** [Maximum Likelihood Reinforcement Learning](https://arxiv.org/abs/2602.02710)
- **Project Website:** [Project Website](https://zanette-labs.github.io/MaxRL/)

## Training Details

### Training Data

We train on the [POLARIS-53K](https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K) dataset to produce this checkpoint.
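
To inspect the training data before reproducing the run, the dataset can be pulled with the `datasets` library. This is a rough sketch under the assumption that the dataset loads with its default configuration; `load_training_data` and `summarize` are hypothetical helpers, and no split or column names are assumed.

```python
# Dataset ID taken from the Hugging Face link in this section.
DATASET_ID = "POLARIS-Project/Polaris-Dataset-53K"


def load_training_data(dataset_id: str = DATASET_ID):
    """Download the dataset and return all of its splits."""
    # Imported inside the function so the dependency is only
    # required when the data is actually fetched.
    from datasets import load_dataset

    return load_dataset(dataset_id)


def summarize(dataset_dict) -> dict:
    """Map each split name to (number of rows, column names)."""
    return {
        name: (split.num_rows, split.column_names)
        for name, split in dataset_dict.items()
    }
```

Running `summarize(load_training_data())` reports which splits and columns are present, which is useful to check against what the training script expects.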

### Training Procedure

To reproduce training of this checkpoint, use the [training script](https://github.com/tajwarfahim/maxrl/blob/main/qwen3_experiments/run_qwen3_training.sh) or, more generally, the published [codebase](https://github.com/tajwarfahim/maxrl). Hyperparameters and other details are provided in the training script.

Due to computational constraints, we trained for 1000 steps and released the final checkpoint.


#### Hardware

This model was fine-tuned on 32 NVIDIA H200 GPUs (4 nodes with 8 H200 GPUs each).


## Citation

**BibTeX:**

```
@misc{tajwar2026maximumlikelihoodreinforcementlearning,
      title={Maximum Likelihood Reinforcement Learning}, 
      author={Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov and Haiwen Feng and Andrea Zanette},
      year={2026},
      eprint={2602.02710},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02710}, 
}
```

## Model Card Contact

[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)