---
library_name: transformers
license: mit
---

# Model Card for Model ID

This is a saved checkpoint from fine-tuning a Qwen/Qwen3-1.7B-Base model with the MaxRL objective from [**"Maximum Likelihood Reinforcement Learning"**](https://arxiv.org/abs/2602.02710).
In our work, we introduce MaxRL, a framework for optimizing maximum likelihood in RL settings.

## Model Details

### Model Description

This model is Qwen/Qwen3-1.7B-Base fine-tuned with MaxRL.

- **Finetuned from model:** [Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base)
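
Since the card metadata lists `transformers` as the library, the checkpoint can presumably be loaded like any other causal language model. The sketch below is a minimal illustration under that assumption, not an official usage recipe: `REPO_ID` is a hypothetical placeholder (this card does not state the Hub repository name), and prompting with plain text is also an assumption, since the prompt format used during training is defined in the training script.

```python
# Hypothetical placeholder: replace with the actual Hub repository of this checkpoint.
REPO_ID = "your-org/your-maxrl-checkpoint"


def load_checkpoint(repo_id: str):
    """Load the tokenizer and model of a causal-LM checkpoint from the Hub."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a continuation of `prompt` and return only the new text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the continuation is decoded.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example usage (downloads the weights, so it is left commented out):
# tokenizer, model = load_checkpoint(REPO_ID)
# print(generate(tokenizer, model, "Question: What is 12 * 7?\nAnswer:"))
```

The model is loaded with default settings here; options such as a reduced-precision dtype or device placement depend on the available hardware and are left to the reader.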

### Model Sources

- **Repository:** [Official Code Release for the paper "Maximum Likelihood Reinforcement Learning"](https://github.com/tajwarfahim/maxrl)
- **Paper:** [Maximum Likelihood Reinforcement Learning](https://arxiv.org/abs/2602.02710)
- **Project Website:** [Project Website](https://zanette-labs.github.io/MaxRL/)

## Training Details

### Training Data

We train on the [POLARIS-53K](https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K) dataset to produce this checkpoint.
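
To inspect the training data, the dataset can likely be pulled with the Hugging Face `datasets` library. This is a minimal sketch rather than the preprocessing used for training: the dataset ID comes from the link above, while the `"train"` split name is an assumption, and the `preview` helper is hypothetical and deliberately makes no assumption about column names.

```python
# Dataset ID taken from the POLARIS-53K link in this section.
DATASET_ID = "POLARIS-Project/Polaris-Dataset-53K"


def load_polaris(split: str = "train"):
    """Load one split of the dataset; the split name "train" is an assumption."""
    # Imported lazily so that defining this helper does not require datasets.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def preview(rows, n: int = 3):
    """Return the first n records as a list, without assuming column names."""
    return [rows[i] for i in range(min(n, len(rows)))]


# Example usage (downloads the data, so it is left commented out):
# for record in preview(load_polaris()):
#     print(record)
```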

### Training Procedure

To reproduce this checkpoint, use the [training script](https://github.com/tajwarfahim/maxrl/blob/main/qwen3_experiments/run_qwen3_training.sh) or, more generally, the published [codebase](https://github.com/tajwarfahim/maxrl). Hyperparameters and other details are provided in the training script.

Due to computational constraints, we trained for 1,000 steps and release the final checkpoint.

#### Hardware

This model was fine-tuned on 32 NVIDIA H200 GPUs (4 nodes with 8 H200 GPUs each).

## Citation

**BibTeX:**

```
@misc{tajwar2026maximumlikelihoodreinforcementlearning,
      title={Maximum Likelihood Reinforcement Learning},
      author={Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov and Haiwen Feng and Andrea Zanette},
      year={2026},
      eprint={2602.02710},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02710},
}
```

## Model Card Contact

[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)