Sync from v0.13

2026-01-19 10:38:50 +08:00
parent b2ef04d792
commit 5aef6c175a
3714 changed files with 854317 additions and 89342 deletions
--- a/docs/features/quantization/bnb.md
+++ b/docs/features/quantization/bnb.md
@@ -0,0 +1,56 @@
+# BitsAndBytes
+
+vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
+BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
+Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
+
+Below are the steps to utilize BitsAndBytes with vLLM.
+
+```bash
+pip install bitsandbytes>=0.46.1
+```
+
+vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
+
+You can find bitsandbytes quantized models on [Hugging Face](https://huggingface.co/models?search=bitsandbytes).
+And usually, these repositories have a config.json file that includes a quantization_config section.
+
+## Read quantized checkpoint
+
+For pre-quantized checkpoints, vLLM will try to infer the quantization method from the config file, so you don't need to explicitly specify the quantization argument.
+
+```python
+from vllm import LLM
+import torch
+# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
+model_id = "unsloth/tinyllama-bnb-4bit"
+llm = LLM(
+    model=model_id,
+    dtype=torch.bfloat16,
+    trust_remote_code=True,
+)
+```
+
+## Inflight quantization: load as 4bit quantization
+
+For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify the quantization argument.
+
+```python
+from vllm import LLM
+import torch
+model_id = "huggyllama/llama-7b"
+llm = LLM(
+    model=model_id,
+    dtype=torch.bfloat16,
+    trust_remote_code=True,
+    quantization="bitsandbytes",
+)
+```
+
+## OpenAI Compatible Server
+
+Append the following to your model arguments for 4bit inflight quantization:
+
+```bash
+--quantization bitsandbytes
+```