---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE
base_model: Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: text-generation
language:
- en
tags:
- qwen3
- text-generation
- continued-pretraining
- agent
- tool-use
---

<div align="center">

# Qwen3-4B-EnvTuning-Base

[![Hugging Face Model](https://img.shields.io/badge/Hugging%20Face-Model-orange)](https://huggingface.co/IcyFish/Qwen3-4B-EnvTuning-Base)
[![Base Model](https://img.shields.io/badge/Base%20Model-Qwen3--4B--Instruct--2507-blue)](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
[![arXiv](https://img.shields.io/badge/arXiv-2510.10197-b31b1b.svg)](https://arxiv.org/abs/2510.10197)
[![OpenReview](https://img.shields.io/badge/OpenReview-ICLR%202026%20Poster-8c1aff)](https://openreview.net/forum?id=nzodtGccEM)

</div>

## Overview

`Qwen3-4B-EnvTuning-Base` is a continued-training checkpoint built on top of [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

This model follows the approach of the paper **Don't Just Fine-tune the Agent, Tune the Environment**, which shifts agent learning from static trajectory imitation to **environment-based exploration**. Rather than relying only on fine-tuning the policy with pre-collected demonstrations, the method improves agent capability by tuning the learning environment itself.

- Base model: `Qwen/Qwen3-4B-Instruct-2507`
- Released model: [`IcyFish/Qwen3-4B-EnvTuning-Base`](https://huggingface.co/IcyFish/Qwen3-4B-EnvTuning-Base)
- Model type: Causal Language Model
- Training style: continued training based on the Environment Tuning paradigm

## Introduction

The paper studies agent training under **extreme data scarcity**. In multi-turn tool-use settings, plain SFT on synthetic trajectories often overfits, while direct RL tends to suffer from cold-start problems and unstable optimization. Environment Tuning addresses this by redesigning the interaction loop between agent and environment so that exploration becomes more learnable.

The method centers on three ingredients:

- **Structured curriculum**: train the agent from easy skills to harder multi-turn tool-use behaviors.
- **Actionable environment augmentation**: replace vague failures with corrective hints that reveal tool dependencies and constraints.
- **Fine-grained progress rewards**: provide denser turn-level learning signals instead of only sparse episode-level success.
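
To make the third ingredient concrete, the following minimal sketch contrasts a sparse episode-level reward with a denser turn-level one. It assumes each turn is judged by a simple pass/fail check; the function names and the averaging rule are illustrative assumptions, not the paper's exact reward definition.

```python
from typing import Sequence


def episode_reward(turn_passed: Sequence[bool]) -> float:
    """Sparse signal: 1.0 only if every turn in the episode succeeded."""
    return 1.0 if turn_passed and all(turn_passed) else 0.0


def progress_reward(turn_passed: Sequence[bool]) -> float:
    """Dense signal (assumed form): fraction of turns that succeeded."""
    if not turn_passed:
        return 0.0
    return sum(turn_passed) / len(turn_passed)


# Hypothetical 4-turn episode: the agent solves turns 1-3 and fails turn 4.
turns = [True, True, True, False]
print(episode_reward(turns))   # 0.0
print(progress_reward(turns))  # 0.75
```

Under the sparse reward this episode is indistinguishable from one that fails at the first turn, whereas the turn-level reward still credits the three successful turns.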

The paper reports that this paradigm can train competitive agents from only a small number of problem instances, with better out-of-distribution generalization than pure SFT baselines.

The original paper includes an introduction figure illustrating the difference between static SFT, standard RL, and Environment Tuning. To keep this Hugging Face repository lightweight and push-friendly, the figure is not embedded as a local binary asset here.

## Training Pipeline

This checkpoint is a Qwen3-4B-based release inspired by the training pipeline proposed in the paper. At a high level, the recipe consists of:

1. Start from a strong instruction-tuned base model.
2. Train with a staged curriculum rather than optimizing the full task from the beginning.
3. Use augmented environment feedback in the middle stages to turn failed tool interactions into useful supervision.
4. Use fine-grained progress rewards to stabilize long-horizon learning.
5. Remove the extra environment assistance in the final stage to better match real evaluation conditions.
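
Steps 2, 3, and 5 can be pictured as a stage schedule that switches environment hints on and off. The sketch below is only an illustration under stated assumptions: the stage names, the number of stages, the tool name `get_airport_code`, and the hint text are all hypothetical and are not taken from the paper or from this checkpoint's training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str                 # illustrative label, not taken from the paper
    augmented_feedback: bool  # whether the environment adds corrective hints


# Hypothetical schedule: hints are on in the middle stages and off at the end.
CURRICULUM = [
    Stage("warmup", augmented_feedback=False),
    Stage("core_skills", augmented_feedback=True),
    Stage("hard_scenarios", augmented_feedback=True),
    Stage("no_assistance", augmented_feedback=False),
]

# Hypothetical hint table keyed by the raw tool error message.
HINTS = {
    "invalid input": (
        "Hint: call `get_airport_code` first; "
        "this tool expects an airport code, not a city name."
    ),
}


def tool_feedback(raw_error: str, stage: Stage) -> str:
    """Return what the agent observes after a failed tool call."""
    hint = HINTS.get(raw_error)
    if stage.augmented_feedback and hint:
        return f"{raw_error}. {hint}"
    return raw_error


for stage in CURRICULUM:
    print(f"{stage.name}: {tool_feedback('invalid input', stage)}")
```

In the final stage the agent sees only the raw error, which mirrors the evaluation setting where no extra assistance is available.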

The paper also provides a pipeline figure showing the curriculum stages, augmented feedback, and the agent learning loop. This repository keeps the README text-only for compatibility with the current Hugging Face push restrictions on binary assets.

## Training Setup and Evaluation

This checkpoint was **not** evaluated in the original paper. It is a follow-up model release that keeps the same training philosophy and core method, but uses a different concrete training setup.

- Training data used for this checkpoint: **100 BFCL V3 base training instances**
- Naming note: the `-Base` suffix indicates that this model was trained on the **BFCL V3 base split only**
- Evaluation setting: **400 held-out BFCL V3 instances** (100 per category), none of which were used for training

| Category | Correct | Total | Accuracy |
| --- | ---: | ---: | ---: |
| `multi_turn_base` | 75 | 100 | 75.00% |
| `multi_turn_long_context` | 69 | 100 | 69.00% |
| `multi_turn_miss_func` | 58 | 100 | 58.00% |
| `multi_turn_miss_param` | 38 | 100 | 38.00% |
| `OVERALL` | 240 | 400 | 60.00% |

These numbers should be understood as the evaluation results of **this released checkpoint**, rather than results reported in the original paper.
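
As a quick consistency check, the short script below recomputes the accuracies from the counts in the table above. Because every category contains 100 instances, the overall figure is both the micro-average and the mean of the four per-category accuracies.

```python
# (correct, total) per category, copied from the table above.
results = {
    "multi_turn_base": (75, 100),
    "multi_turn_long_context": (69, 100),
    "multi_turn_miss_func": (58, 100),
    "multi_turn_miss_param": (38, 100),
}

for name, (correct, total) in results.items():
    print(f"{name}: {correct / total:.2%}")

overall_correct = sum(c for c, _ in results.values())
overall_total = sum(t for _, t in results.values())
overall_accuracy = overall_correct / overall_total
print(f"OVERALL: {overall_correct}/{overall_total} = {overall_accuracy:.2%}")
# OVERALL: 240/400 = 60.00%
```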

## Model Details

Unless otherwise noted, this checkpoint keeps the same underlying architecture as `Qwen3-4B-Instruct-2507`:

- Architecture: `Qwen3ForCausalLM`
- Parameters: 4.0B
- Non-embedding parameters: 3.6B
- Layers: 36
- Attention heads: 32 for Q and 8 for KV
- Native context length: 262,144

For the original architecture and upstream model information, please refer to:

- Qwen blog: https://qwenlm.github.io/blog/qwen3/
- Qwen GitHub: https://github.com/QwenLM/Qwen3
- Qwen documentation: https://qwen.readthedocs.io/en/latest/

## Quick Start

Use a recent version of `transformers` (4.51.0 or later). Older versions do not recognize the `qwen3` model type and may raise:

```text
KeyError: 'qwen3'
```

```python
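from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IcyFish/Qwen3-4B-EnvTuning-Base"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
# Note: device_map="auto" requires the `accelerate` package.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]

# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# max_new_tokens is an upper bound; generation stops earlier at the end-of-turn token.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)

# Drop the prompt tokens so that only the newly generated reply is decoded.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)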
```

To serve the model through an OpenAI-compatible API, launch it with SGLang or vLLM:

```bash
python -m sglang.launch_server \
  --model-path IcyFish/Qwen3-4B-EnvTuning-Base \
  --context-length 262144
```

```bash
vllm serve IcyFish/Qwen3-4B-EnvTuning-Base --max-model-len 262144
```
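
Since this checkpoint targets multi-turn tool use, requests will usually carry tool definitions. The sketch below builds such a request body, assuming the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. The `get_weather` tool and the helper `build_chat_request` are hypothetical and exist only for illustration; the host, port, and any server options needed to enable tool-call parsing depend on the serving framework and are not shown.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_chat_request(user_message: str, tools: list[dict]) -> dict:
    """Build the JSON body for a POST to /v1/chat/completions."""
    return {
        "model": "IcyFish/Qwen3-4B-EnvTuning-Base",
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
    }


payload = build_chat_request("What is the weather in Paris?", [GET_WEATHER_TOOL])
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent with any HTTP client to the endpoint of the server started above.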

If you encounter out-of-memory issues, consider reducing the effective context length, for example to `32768`.
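
A rough estimate shows why the context length matters so much for memory. The sketch below computes the key-value cache size for a single fully filled sequence, using the layer and KV-head counts from the model details above. The head dimension of 128 and the 16-bit cache are assumptions not stated in this README, and real servers allocate memory differently, so treat the numbers as an order-of-magnitude guide only.

```python
def kv_cache_bytes(
    context_len: int,
    num_layers: int = 36,     # from the model details above
    num_kv_heads: int = 8,    # from the model details above
    head_dim: int = 128,      # assumption: not stated in this README
    bytes_per_elem: int = 2,  # assumption: 16-bit cache (bf16/fp16)
) -> int:
    """Approximate KV-cache size for one sequence of `context_len` tokens."""
    # The factor 2 accounts for storing both keys and values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len


for context_len in (262_144, 32_768):
    gib = kv_cache_bytes(context_len) / 2**30
    print(f"{context_len} tokens: {gib:.1f} GiB")
# 262144 tokens: 36.0 GiB
# 32768 tokens: 4.5 GiB
```

Under these assumptions, a full 262,144-token sequence needs about 36 GiB for the cache alone, in addition to the model weights, while 32,768 tokens need about 4.5 GiB.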

## Notes

- This repository releases a **derived checkpoint**, not the original upstream Qwen release.
- This checkpoint follows the **same method family** as the paper, but it is **not** one of the exact models reported in the paper's main experiments.
- The figures from the paper are referenced conceptually in the README, but local binary image assets are intentionally omitted to keep the repository easy to publish on Hugging Face.
- The BFCL V3 results reported above are model-specific numbers for this checkpoint and should not be confused with either upstream Qwen3 results or the original paper's reported models.

## License

This model is released under the Apache 2.0 license, following the upstream Qwen checkpoint. The license text is available at:

- https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE

Please review the upstream license terms before downstream use.

## Citation

If you use this model, please consider citing both the Environment Tuning paper and the original Qwen3 technical report.

```bibtex
@article{lu2025dont,
  title={Don't Just Fine-tune the Agent, Tune the Environment},
  author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
  journal={arXiv preprint arXiv:2510.10197},
  year={2025},
  url={https://arxiv.org/abs/2510.10197}
}
```

```bibtex
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```