Sync from v0.13

2026-01-19 10:38:50 +08:00
parent b2ef04d792
commit 5aef6c175a
3714 changed files with 854317 additions and 89342 deletions
--- a/docs/training/rlhf.md
+++ b/docs/training/rlhf.md
@@ -0,0 +1,28 @@
+# Reinforcement Learning from Human Feedback
+
+Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors. vLLM can be used to generate the completions for RLHF.
+
+The following open-source RL libraries use vLLM for fast rollouts (sorted alphabetically and non-exhaustive):
+
+- [Cosmos-RL](https://github.com/nvidia-cosmos/cosmos-rl)
+- [ms-swift](https://github.com/modelscope/ms-swift/tree/main)
+- [NeMo-RL](https://github.com/NVIDIA-NeMo/RL)
+- [Open Instruct](https://github.com/allenai/open-instruct)
+- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
+- [PipelineRL](https://github.com/ServiceNow/PipelineRL)
+- [Prime-RL](https://github.com/PrimeIntellect-ai/prime-rl)
+- [SkyRL](https://github.com/NovaSky-AI/SkyRL)
+- [TRL](https://github.com/huggingface/trl)
+- [Unsloth](https://github.com/unslothai/unsloth)
+- [verl](https://github.com/volcengine/verl)
+
+See the following basic examples to get started if you don't want to use an existing library:
+
+- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
+- [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
+- [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
+
+See the following notebooks showing how to use vLLM for GRPO:
+
+- [Efficient Online Training with GRPO and vLLM in TRL](https://huggingface.co/learn/cookbook/grpo_vllm_online_training)
+- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)
--- a/docs/training/trl.md
+++ b/docs/training/trl.md
@@ -0,0 +1,54 @@
+# Transformers Reinforcement Learning
+
+[Transformers Reinforcement Learning](https://huggingface.co/docs/trl) (TRL) is a full-stack library that provides a set of tools to train transformer language models with methods like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), Reward Modeling, and more. The library is integrated with 🤗 transformers.
+
+Online methods such as GRPO or Online DPO require the model to generate completions. vLLM can be used to generate these completions!
+
+See the [vLLM integration guide](https://huggingface.co/docs/trl/main/en/vllm_integration) in the TRL documentation for more information.
+
+TRL currently supports the following online trainers with vLLM:
+
+- [GRPO](https://huggingface.co/docs/trl/main/en/grpo_trainer)
+- [Online DPO](https://huggingface.co/docs/trl/main/en/online_dpo_trainer)
+- [RLOO](https://huggingface.co/docs/trl/main/en/rloo_trainer)
+- [Nash-MD](https://huggingface.co/docs/trl/main/en/nash_md_trainer)
+- [XPO](https://huggingface.co/docs/trl/main/en/xpo_trainer)
+
+To enable vLLM in TRL, set the `use_vllm` flag in the trainer configuration to `True`.
+
+## Modes of Using vLLM During Training
+
+TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**. You can control how vLLM operates during training with the `vllm_mode` parameter.
+
+### Server mode
+
+In **server mode**, vLLM runs as an independent process on dedicated GPUs and communicates with the trainer through HTTP requests. This configuration is ideal when you have separate GPUs for inference, as it isolates generation workloads from training, ensuring stable performance and easier scaling.
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    ...,
+    use_vllm=True,
+    vllm_mode="server",  # default value, can be omitted
+)
+```
+
+### Colocate mode
+
+In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    ...,
+    use_vllm=True,
+    vllm_mode="colocate",
+)
+```
+
+Some trainers also support **vLLM sleep mode**, which offloads parameters and caches to CPU RAM during training, helping reduce GPU memory usage. Learn more in the [memory optimization docs](https://huggingface.co/docs/trl/main/en/reducing_memory_usage#vllm-sleep-mode).
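+
+As a minimal sketch, sleep mode could be enabled alongside colocate mode as shown here. This assumes that your installed TRL version exposes a `vllm_enable_sleep_mode` option on `GRPOConfig`; check the documentation of your trainer to confirm. The `output_dir` value is a hypothetical placeholder.
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    output_dir="grpo-output",  # hypothetical placeholder path
+    use_vllm=True,
+    vllm_mode="colocate",
+    # Assumption: this flag is available in your TRL version.
+    vllm_enable_sleep_mode=True,
+)
+```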
+
+!!! info
+    For detailed configuration options and flags, refer to the documentation of the specific trainer you are using.