---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE
base_model: Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: text-generation
language:
- en
tags:
- qwen3
- text-generation
- continued-pretraining
- agent
- tool-use
---

<div align="center">

# Qwen3-4B-EnvTuning-Base

[Model](https://huggingface.co/IcyFish/Qwen3-4B-EnvTuning-Base)
[Base model](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
[arXiv](https://arxiv.org/abs/2510.10197)
[OpenReview](https://openreview.net/forum?id=nzodtGccEM)

</div>

## Overview

`Qwen3-4B-EnvTuning-Base` is a continued-training checkpoint built on top of [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

This model follows the training approach of the paper **Don't Just Fine-tune the Agent, Tune the Environment**, which shifts agent learning from static trajectory imitation to **environment-based exploration**. The core idea is to improve agent capability by tuning the learning environment itself, rather than relying only on fine-tuning the policy on pre-collected demonstrations.

- Base model: `Qwen/Qwen3-4B-Instruct-2507`
- Released model: [`IcyFish/Qwen3-4B-EnvTuning-Base`](https://huggingface.co/IcyFish/Qwen3-4B-EnvTuning-Base)
- Model type: Causal Language Model
- Training style: continued training based on the Environment Tuning paradigm

## Introduction

The paper studies agent training under **extreme data scarcity**. In multi-turn tool-use settings, plain SFT on synthetic trajectories often overfits, while direct RL tends to suffer from cold-start problems and unstable optimization. Environment Tuning addresses this by redesigning the interaction loop between agent and environment so that exploration becomes more learnable.

The method centers on three ingredients:

- **Structured curriculum**: train the agent from easy skills to harder multi-turn tool-use behaviors.
- **Actionable environment augmentation**: replace vague failures with corrective hints that reveal tool dependencies and constraints.
- **Fine-grained progress rewards**: provide denser turn-level learning signals instead of only sparse episode-level success.

The paper reports that this paradigm can train competitive agents from only a small number of problem instances, with better out-of-distribution generalization than pure SFT baselines.

The original paper includes an introduction figure illustrating the difference between static SFT, standard RL, and Environment Tuning. To keep this Hugging Face repository lightweight and easy to push, the figure is not embedded here as a local binary asset.
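
The contrast between a sparse episode-level signal and a denser turn-level signal, as described in the third ingredient above, can be illustrated with a minimal sketch. The function names and the per-turn boolean scoring are assumptions made for illustration; this is not the paper's exact reward definition.

```python
from typing import Sequence


def episode_reward(turn_correct: Sequence[bool]) -> float:
    """Sparse signal: 1.0 only if every turn in the episode succeeded."""
    return 1.0 if turn_correct and all(turn_correct) else 0.0


def progress_reward(turn_correct: Sequence[bool]) -> float:
    """Denser signal: fraction of turns judged correct (hypothetical scoring)."""
    if not turn_correct:
        return 0.0
    return sum(turn_correct) / len(turn_correct)


# An episode in which the agent gets 3 of 4 turns right.
turns = [True, True, False, True]
print(episode_reward(turns))   # 0.0
print(progress_reward(turns))  # 0.75
```

With the sparse signal, a nearly successful episode is indistinguishable from a complete failure; the turn-level variant still credits the three correct turns.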

## Training Pipeline

This checkpoint is a Qwen3-4B-based release inspired by the training pipeline proposed in the paper. At a high level, the recipe consists of the following steps:

1. Start from a strong instruction-tuned base model.
2. Train with a staged curriculum rather than optimizing the full task from the beginning.
3. Use augmented environment feedback in the middle stages to turn failed tool interactions into useful supervision.
4. Use fine-grained progress rewards to stabilize long-horizon learning.
5. Remove the extra environment assistance in the final stage to better match real evaluation conditions.

The paper also provides a pipeline figure showing the curriculum stages, augmented feedback, and the agent learning loop. This repository keeps the README text-only for compatibility with the current Hugging Face push restrictions on binary assets.
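
As a rough illustration of steps 2, 3, and 5 above, the sketch below models a staged curriculum in which corrective hints are switched on for a middle stage and removed for the final one. The stage names, the number of stages, and the tool names in the hint are all hypothetical; the actual schedule used for this checkpoint is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    augmented_feedback: bool  # whether the environment adds corrective hints


# Hypothetical schedule: stage names and count are illustrative only.
CURRICULUM = [
    Stage("basic_tool_calls", augmented_feedback=False),
    Stage("multi_turn_with_hints", augmented_feedback=True),
    Stage("multi_turn_no_hints", augmented_feedback=False),
]


def environment_feedback(stage: Stage, raw_error: str, hint: str) -> str:
    """Return what the agent observes after a failed tool call."""
    if stage.augmented_feedback:
        return f"{raw_error} Hint: {hint}"
    return raw_error


for stage in CURRICULUM:
    observation = environment_feedback(
        stage,
        "Error: invalid input.",
        "call `login` before `post_message`.",  # made-up tool names
    )
    print(f"{stage.name}: {observation}")
```

Only the middle stage appends the hint, so the final stage exposes the agent to the same bare error messages it would see at evaluation time.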

## Training Setup and Evaluation

This checkpoint was **not** evaluated in the original paper. It is a follow-up model release that keeps the same training philosophy and core method, but uses a different concrete training setup.

- Training data used for this checkpoint: **100 BFCL V3 base training instances**
- Naming note: the `-Base` suffix indicates that this model was trained on the **BFCL V3 base split only**
- Evaluation setting: tested on **400 held-out BFCL V3 instances**, none of which were used for training

| Category | Correct | Total | Accuracy |
| --- | ---: | ---: | ---: |
| `multi_turn_base` | 75 | 100 | 75.00% |
| `multi_turn_long_context` | 69 | 100 | 69.00% |
| `multi_turn_miss_func` | 58 | 100 | 58.00% |
| `multi_turn_miss_param` | 38 | 100 | 38.00% |
| `OVERALL` | 240 | 400 | 60.00% |

These numbers are the evaluation results of **this released checkpoint**, not results reported in the original paper.
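
The `OVERALL` row is the pooled accuracy over all four categories. Because every category has the same number of instances, it also equals the plain average of the per-category accuracies. The short check below recomputes it from the table values:

```python
# (correct, total) per category, copied from the table above.
results = {
    "multi_turn_base": (75, 100),
    "multi_turn_long_context": (69, 100),
    "multi_turn_miss_func": (58, 100),
    "multi_turn_miss_param": (38, 100),
}

for name, (correct, total) in results.items():
    print(f"{name}: {100 * correct / total:.2f}%")

overall_correct = sum(correct for correct, _ in results.values())
overall_total = sum(total for _, total in results.values())
overall_accuracy = 100 * overall_correct / overall_total
print(f"OVERALL: {overall_correct}/{overall_total} = {overall_accuracy:.2f}%")
```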

## Model Details

Unless otherwise noted, this checkpoint keeps the same underlying architecture as `Qwen3-4B-Instruct-2507`:

- Architecture: `Qwen3ForCausalLM`
- Parameters: 4.0B
- Non-embedding parameters: 3.6B
- Layers: 36
- Attention heads: 32 for Q and 8 for KV
- Native context length: 262,144

For the original architecture and upstream model information, please refer to:

- Qwen blog: https://qwenlm.github.io/blog/qwen3/
- Qwen GitHub: https://github.com/QwenLM/Qwen3
- Qwen documentation: https://qwen.readthedocs.io/en/latest/

## Quick Start

Use the model with a recent version of `transformers` (4.51.0 or later). With `transformers<4.51.0`, the `qwen3` model type is not recognized and you may encounter:

```text
KeyError: 'qwen3'
```
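
To catch this before loading the model, the installed version can be compared against the 4.51.0 threshold mentioned above. The helper below is a minimal sketch; `supports_qwen3` is a hypothetical name, and the parsing is deliberately simple (it looks only at the numeric major, minor, and patch parts and ignores pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version as installed_version


def supports_qwen3(version_string: str) -> bool:
    """Return True if a transformers version string is at least 4.51.0."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as "rc1" or "dev0"
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= (4, 51, 0)


try:
    current = installed_version("transformers")
except PackageNotFoundError:
    print("transformers is not installed")
else:
    status = "OK" if supports_qwen3(current) else "too old, upgrade to >=4.51.0"
    print(f"transformers {current}: {status}")
```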

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IcyFish/Qwen3-4B-EnvTuning-Base"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)

# Drop the prompt tokens and decode only the newly generated ones.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
```

For serving:

```bash
python -m sglang.launch_server \
    --model-path IcyFish/Qwen3-4B-EnvTuning-Base \
    --context-length 262144
```

```bash
vllm serve IcyFish/Qwen3-4B-EnvTuning-Base --max-model-len 262144
```
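
Both servers expose an OpenAI-compatible HTTP API, so a running instance can be queried with a plain HTTP client. The sketch below uses only the Python standard library; the helper names are hypothetical, and the base URL assumes the default ports (8000 for vLLM, 30000 for SGLang) on the local machine.

```python
import json
import urllib.request

MODEL = "IcyFish/Qwen3-4B-EnvTuning-Base"


def build_chat_request(base_url: str, prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(base_url, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("http://localhost:8000", "List three uses of tool calling.")` would query a local vLLM server started with the command above, assuming it is listening on its default port.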

If the server runs out of memory at the full 262,144-token context, consider reducing the context length (`--context-length` for SGLang, `--max-model-len` for vLLM), for example to `32768`.
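
A back-of-the-envelope estimate shows why the context length matters so much for memory. The sketch below computes the key/value cache size for a single sequence from the layer and KV-head counts listed under Model Details. The head dimension of 128 and the 2 bytes per value (16-bit precision) are assumptions not stated in this README, and real servers add their own overheads and paging strategies.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 36,          # from Model Details
    kv_heads: int = 8,         # from Model Details
    head_dim: int = 128,       # assumption, not stated in this README
    bytes_per_value: int = 2,  # assumption: 16-bit cache
) -> int:
    """Approximate KV-cache size for one sequence (keys and values, all layers)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for context in (262144, 32768):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions, a single full-length sequence needs about 36 GiB of cache on top of the model weights, while a 32,768-token limit brings that down to about 4.5 GiB.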

## Notes

- This repository releases a **derived checkpoint**, not the original upstream Qwen release.
- This checkpoint follows the **same method family** as the paper, but it is **not** one of the exact models reported in the paper's main experiments.
- The figures from the paper are described in the README text only; local binary image assets are intentionally omitted to keep the repository easy to publish on Hugging Face.
- The BFCL V3 results reported above are specific to this checkpoint and should not be confused with either upstream Qwen3 results or the results of the models reported in the original paper.

## License

This model is released under the same license as the upstream Qwen checkpoint (Apache 2.0):

- https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE

Please review the upstream license terms before downstream use.

## Citation

If you use this model, please consider citing both the Environment Tuning paper and the original Qwen3 technical report.

```bibtex
@article{lu2025dont,
  title={Don't Just Fine-tune the Agent, Tune the Environment},
  author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
  journal={arXiv preprint arXiv:2510.10197},
  year={2025},
  url={https://arxiv.org/abs/2510.10197}
}
```

```bibtex
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```