Initialize project; model provided by the ModelHub XC community

Model: ftajwar/qwen3_1.7B_Base_MaxRL_Polaris_1000_steps Source: Original Platform
2026-05-12 18:01:36 +08:00
commit 7a6cd2f1cb
12 changed files with 151950 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,70 @@
+---
+library_name: transformers
+license: mit
+---
+
+# Model Card for qwen3_1.7B_Base_MaxRL_Polaris_1000_steps
+
+This is a saved checkpoint from fine-tuning the Qwen/Qwen3-1.7B-Base model with the MaxRL objective, introduced in [**"Maximum Likelihood Reinforcement Learning"**](https://arxiv.org/abs/2602.02710).
+In our work, we introduce MaxRL, a framework for optimizing maximum likelihood in RL settings.
+
+
+## Model Details
+
+### Model Description
+
+<!-- Provide a longer summary of what this model is. -->
+
+This is the model card for a Qwen/Qwen3-1.7B-Base model fine-tuned using MaxRL.
+
+- **Finetuned from model:** [Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base)
+
+### Model Sources
+
+<!-- Provide the basic links for the model. -->
+
+- **Repository:** [Official Code Release for the paper "Maximum Likelihood Reinforcement Learning"](https://github.com/tajwarfahim/maxrl)
+- **Paper:** [Maximum Likelihood Reinforcement Learning](https://arxiv.org/abs/2602.02710)
+- **Project Website:** [Project Website](https://zanette-labs.github.io/MaxRL/)
+
+## Training Details
+
+### Training Data
+
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+We train on the [POLARIS-53K](https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K) dataset to produce this checkpoint.
+
+### Training Procedure
+
+To reproduce this checkpoint, use the [given script](https://github.com/tajwarfahim/maxrl/blob/main/qwen3_experiments/run_qwen3_training.sh) or, more generally, the published [codebase](https://github.com/tajwarfahim/maxrl). Hyperparameters and other details are provided in the training script.
+
+Due to computational constraints, we trained for 1000 steps and released the final checkpoint.
+
+
+#### Hardware
+
+This model was fine-tuned on 32 NVIDIA H200 GPUs (4 nodes with 8 H200 GPUs each).
+
+
+## Citation
+
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+**BibTeX:**
+
+```
+@misc{tajwar2026maximumlikelihoodreinforcementlearning,
+      title={Maximum Likelihood Reinforcement Learning}, 
+      author={Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov and Haiwen Feng and Andrea Zanette},
+      year={2026},
+      eprint={2602.02710},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2602.02710}, 
+}
+```
+
+## Model Card Contact
+
+[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)