Initialize project; model provided by the ModelHub XC community
Model: IcyFish/Qwen3-4B-EnvTuning Source: Original Platform
193
README.md
Normal file
@@ -0,0 +1,193 @@
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE
base_model: Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: text-generation
language:
- en
tags:
- qwen3
- text-generation
- continued-pretraining
- agent
- tool-use
---

<div align="center">

# Qwen3-4B-EnvTuning

[Model](https://huggingface.co/IcyFish/Qwen3-4B-EnvTuning)
[Base Model](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
[arXiv](https://arxiv.org/abs/2510.10197)
[OpenReview](https://openreview.net/forum?id=nzodtGccEM)

</div>

## Overview

`Qwen3-4B-EnvTuning` is a continued-training checkpoint built on top of [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

This model follows the training idea in the paper **Don't Just Fine-tune the Agent, Tune the Environment**, which shifts agent learning from static trajectory imitation to **environment-based exploration**. The core idea is to improve agent capability by tuning the learning environment itself, instead of relying only on fine-tuning the policy with pre-collected demonstrations.

- Base model: `Qwen/Qwen3-4B-Instruct-2507`
- Released model: [`IcyFish/Qwen3-4B-EnvTuning`](https://huggingface.co/IcyFish/Qwen3-4B-EnvTuning)
- Model type: Causal Language Model
- Training style: continued training based on the Environment Tuning paradigm

## Introduction

The paper studies agent training under **extreme data scarcity**. In multi-turn tool-use settings, plain SFT on synthetic trajectories often overfits, while direct RL tends to suffer from cold-start and unstable optimization. Environment Tuning addresses this by redesigning the interaction loop between agent and environment so that exploration becomes more learnable.

The method centers on three ingredients:

- **Structured curriculum**: train the agent from easy skills to harder multi-turn tool-use behaviors.
- **Actionable environment augmentation**: replace vague failures with corrective hints that reveal tool dependencies and constraints.
- **Fine-grained progress rewards**: provide denser turn-level learning signals instead of only sparse episode-level success.

The paper reports that this paradigm can train competitive agents from only a small number of problem instances, with better out-of-distribution generalization than pure SFT baselines.

The original paper includes an introduction figure illustrating the difference between static SFT, standard RL, and Environment Tuning. To keep this Hugging Face repository lightweight and push-friendly, the figure is not embedded as a local binary asset here.

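
To make the contrast between sparse episode-level success and denser turn-level signals more concrete, here is a minimal Python sketch. It is not the paper's reward definition: the function names, the per-turn pass/fail representation, and the simple averaging rule are all assumptions chosen for illustration.

```python
from typing import List


def episode_reward(turn_ok: List[bool]) -> float:
    """Sparse signal: 1.0 only if every turn succeeded, otherwise 0.0."""
    return 1.0 if turn_ok and all(turn_ok) else 0.0


def progress_reward(turn_ok: List[bool]) -> float:
    """Denser signal (assumed form): fraction of turns that succeeded."""
    if not turn_ok:
        return 0.0
    return sum(turn_ok) / len(turn_ok)


# Hypothetical 4-turn episode in which only the last turn fails.
turns = [True, True, True, False]
print(episode_reward(turns))   # 0.0
print(progress_reward(turns))  # 0.75
```

Under these assumptions, an episode that gets three of four turns right still yields a non-zero learning signal, whereas the episode-level reward treats it the same as a complete failure.
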
## Training Pipeline

This checkpoint is a Qwen3-4B-based release inspired by the training pipeline proposed in the paper. At a high level, the recipe consists of:

1. Start from a strong instruction-tuned base model.
2. Train with a staged curriculum rather than optimizing the full task from the beginning.
3. Use augmented environment feedback in the middle stages to turn failed tool interactions into useful supervision.
4. Use fine-grained progress rewards to stabilize long-horizon learning.
5. Remove the extra environment assistance in the final stage to better match real evaluation conditions.

The paper also provides a pipeline figure showing the curriculum stages, augmented feedback, and the agent learning loop. This repository keeps the README text-only for compatibility with the current Hugging Face push restrictions on binary assets.

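
The staged recipe above can be pictured as a small schedule of per-stage switches. The sketch below is only an illustration of that shape: the `Stage` class, the stage names, the number of stages, and the `environment_feedback` helper are hypothetical and are not taken from the paper or from the training code of this checkpoint.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str                 # hypothetical label, not taken from the paper
    augmented_feedback: bool  # environment returns corrective hints
    progress_reward: bool     # turn-level reward instead of episode-only


# Hypothetical schedule: the number of stages and their contents are
# placeholders. Only the overall shape follows the list above: hints are
# enabled in the middle stages and removed in the final one.
SCHEDULE: List[Stage] = [
    Stage("easy_skills", augmented_feedback=False, progress_reward=True),
    Stage("multi_turn_with_hints", augmented_feedback=True, progress_reward=True),
    Stage("harder_multi_turn_with_hints", augmented_feedback=True, progress_reward=True),
    Stage("final_without_hints", augmented_feedback=False, progress_reward=True),
]


def environment_feedback(stage: Stage, raw_error: str, hint: str) -> str:
    """Append the corrective hint only when the stage enables augmentation."""
    if stage.augmented_feedback:
        return f"{raw_error} Hint: {hint}"
    return raw_error


for stage in SCHEDULE:
    message = environment_feedback(
        stage, "Tool call failed.", "call the login tool before posting."
    )
    print(f"{stage.name}: {message}")
```

Keeping the switches in data rather than in the training loop makes it easy to see that the final stage runs without hints, which is the condition the model faces at evaluation time.
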
## Training Setup and Evaluation

This checkpoint was **not** evaluated in the original paper. It is a follow-up model release that keeps the same training philosophy and core method, but uses a different concrete training setup.

- Training data used for this checkpoint: **400 BFCL V3 training instances**
- Evaluation setting: tested on **400 unseen BFCL V3 instances** that were not used for training

| Category | Correct | Total | Accuracy |
| --- | ---: | ---: | ---: |
| `multi_turn_base` | 69 | 100 | 69.00% |
| `multi_turn_long_context` | 65 | 100 | 65.00% |
| `multi_turn_miss_func` | 64 | 100 | 64.00% |
| `multi_turn_miss_param` | 56 | 100 | 56.00% |
| `OVERALL` | 254 | 400 | 63.50% |

These numbers should be understood as the evaluation results of **this released checkpoint**, rather than results reported in the original paper.

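
As a quick sanity check, the `OVERALL` row can be recomputed from the per-category counts in the table above. The snippet below is a small sketch that only uses those counts; the variable and function names are illustrative.

```python
# Correct / total counts copied from the table above.
results = {
    "multi_turn_base": (69, 100),
    "multi_turn_long_context": (65, 100),
    "multi_turn_miss_func": (64, 100),
    "multi_turn_miss_param": (56, 100),
}


def accuracy(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total


overall_correct = sum(correct for correct, _ in results.values())
overall_total = sum(total for _, total in results.values())

for name, (correct, total) in results.items():
    print(f"{name}: {accuracy(correct, total):.2f}%")
print(f"OVERALL: {overall_correct}/{overall_total} = "
      f"{accuracy(overall_correct, overall_total):.2f}%")
```

Because every category has the same number of instances, the overall accuracy equals the unweighted mean of the four category accuracies.
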
## Model Details

Unless otherwise noted, this checkpoint keeps the same underlying architecture as `Qwen3-4B-Instruct-2507`:

- Architecture: `Qwen3ForCausalLM`
- Parameters: 4.0B
- Non-embedding parameters: 3.6B
- Layers: 36
- Attention heads: 32 for Q and 8 for KV
- Native context length: 262,144

For the original architecture and upstream model information, please refer to:

- Qwen blog: https://qwenlm.github.io/blog/qwen3/
- Qwen GitHub: https://github.com/QwenLM/Qwen3
- Qwen documentation: https://qwen.readthedocs.io/en/latest/

## Quick Start

Use the model with the latest version of `transformers`. With `transformers<4.51.0`, you may encounter:

```text
KeyError: 'qwen3'
```

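
One way to catch this early is to compare the installed version against the `4.51.0` threshold before loading the model. The helper below is a simplified sketch: `parse_version` and `supports_qwen3` are illustrative names, and the parser ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
import re
from typing import Tuple

MIN_QWEN3_VERSION = (4, 51, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '4.51.0.dev0' -> (4, 51, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '4.51' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_qwen3(version: str) -> bool:
    """True if the given transformers version is at least 4.51.0."""
    return parse_version(version) >= MIN_QWEN3_VERSION


# Usage (requires transformers to be installed):
#   import transformers
#   if not supports_qwen3(transformers.__version__):
#       raise RuntimeError("Please upgrade transformers to >= 4.51.0")
```
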
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IcyFish/Qwen3-4B-EnvTuning"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Build the prompt with the model's chat template
messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)

# Keep only the newly generated tokens, then decode them
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
```

For serving:

```bash
python -m sglang.launch_server \
    --model-path IcyFish/Qwen3-4B-EnvTuning \
    --context-length 262144
```

```bash
vllm serve IcyFish/Qwen3-4B-EnvTuning --max-model-len 262144
```

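
Both servers expose an OpenAI-compatible HTTP API, so a running instance can be queried with a plain chat-completions request. The sketch below uses only the Python standard library. The base URL assumes a local vLLM server on its default port 8000 (adjust it for SGLang or a remote host), and `build_chat_request` and `send_chat_request` are illustrative helper names, not part of either project.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "IcyFish/Qwen3-4B-EnvTuning",
                       max_tokens: int = 512) -> dict:
    """Build the JSON body for an OpenAI-style chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict,
                      base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to a running server and return the reply text.

    The default base_url is an assumption about the local setup.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a server started with one of the commands above):
#   payload = build_chat_request("Give me a short introduction to LLMs.")
#   print(send_chat_request(payload))
```
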
If you encounter out-of-memory issues, consider reducing the effective context length, for example to `32768`.

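
A rough estimate of the key-value cache size shows why the context length matters so much for memory. The sketch below uses the layer and KV-head counts from Model Details. The head dimension of 128 and the 16-bit cache precision are assumptions not stated in this README, and real serving engines manage cache memory differently, so treat the numbers as an order-of-magnitude guide for a single full-length sequence.

```python
NUM_LAYERS = 36      # from Model Details
NUM_KV_HEADS = 8     # from Model Details
HEAD_DIM = 128       # assumption: not stated in this README
BYTES_PER_VALUE = 2  # assumption: 16-bit cache (bf16 / fp16)


def kv_cache_bytes(num_tokens: int) -> int:
    """Approximate KV-cache size in bytes for one sequence."""
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for context_length in (262_144, 32_768):
    gib = kv_cache_bytes(context_length) / 2**30
    print(f"{context_length:>7} tokens: {gib:.1f} GiB")
# 262144 tokens: 36.0 GiB
#   32768 tokens: 4.5 GiB
```

Under these assumptions, the cache for one full 262,144-token sequence is several times larger than the 16-bit model weights themselves, while a 32,768-token limit brings it down to a few GiB.
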
## Notes

- This repository releases a **derived checkpoint**, not the original upstream Qwen release.
- This checkpoint follows the **same method family** as the paper, but it is **not** one of the exact models reported in the paper's main experiments.
- The figures from the paper are referenced conceptually in the README, but local binary image assets are intentionally omitted to keep the repository easy to publish on Hugging Face.
- The BFCL V3 results reported above are model-specific numbers for this checkpoint and should not be confused with either upstream Qwen3 results or the original paper's reported models.

## License

This model is released under the same license as the upstream Qwen checkpoint (Apache-2.0). The full license text is available at:

- https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE

Please review the upstream license terms before downstream use.

## Citation

If you use this model, please consider citing both the Environment Tuning paper and the original Qwen3 technical report.

```bibtex
@article{lu2025dont,
  title={Don't Just Fine-tune the Agent, Tune the Environment},
  author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
  journal={arXiv preprint arXiv:2510.10197},
  year={2025},
  url={https://arxiv.org/abs/2510.10197}
}
```

```bibtex
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```