# Deterministic Inference

## Why Deterministic Inference Matters
Deterministic inference ensures consistent LLM outputs across runs, which is critical for:
- **Reinforcement Learning**: Consistent logprobs across runs reduce stochastic noise, making RL training more stable, reproducible, and debuggable.
- **Testing & Debugging**: Reproducible outputs enable reliable validation and regression testing.
- **Production**: Consistent behavior improves reliability and user experience.
Even with temperature=0, standard LLM inference can produce different outputs due to dynamic batching and varying reduction orders in GPU kernels.
## The Root Cause of Non-Determinism
The main source is varying batch sizes. Different batch sizes cause GPU kernels to split reduction operations differently, leading to different addition orders. Due to floating-point non-associativity ((a + b) + c ≠ a + (b + c)), this produces different results even for identical inputs.
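The effect is easy to reproduce outside a GPU kernel. The toy Python snippet below (illustration only, not SGLang code) shows that the same three numbers sum to different values depending on how the additions are grouped, which is exactly what happens when a kernel splits a reduction differently for different batch sizes:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e16, -1e16

print((a + b) + c)  # 0.0 -- 0.1 is absorbed when added to 1e16 first
print(a + (b + c))  # 0.1 -- the large values cancel first, so 0.1 survives
```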
## SGLang's Solution
Building on Thinking Machines Lab's batch-invariant operators, SGLang achieves fully deterministic inference while maintaining compatibility with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. The development roadmap for deterministic inference features can be found in this issue.
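As a rough intuition only (a simplified NumPy sketch, not SGLang's or Thinking Machines Lab's actual kernels), a batch-invariant reduction always splits each row into the same fixed-size chunks, so the summation order for any given row never depends on how many rows are in the batch:

```python
import numpy as np

def row_sums_batch_invariant(x: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Sum each row with a fixed chunking that is independent of batch size."""
    n = x.shape[-1]
    # Every row is reduced chunk-by-chunk in the same order, whether the batch
    # holds 1 row or 1,000 rows, so per-row results stay bit-identical.
    partials = [x[..., i:i + chunk].sum(axis=-1) for i in range(0, n, chunk)]
    return np.stack(partials, axis=-1).sum(axis=-1)
```

Real batch-invariant operators apply the same fixed-split idea to the matmul, normalization, and attention reductions that run on the GPU.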
## Supported Backends
Deterministic inference is only supported with the following three attention backends: FlashInfer, FlashAttention 3 (FA3), and Triton.
The following table shows feature compatibility for deterministic inference across different attention backends:
| Attention Backend | CUDA Graph | Chunked Prefill | Radix Cache | Non-greedy Sampling (Temp > 0) |
|---|---|---|---|---|
| FlashInfer | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes |
| FlashAttention 3 (FA3) | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Triton | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
## Usage

### Basic Usage
Enable deterministic inference by adding the `--enable-deterministic-inference` flag:

```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
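Once the server is running, a quick client-side sanity check (a minimal sketch, assuming the default endpoint `http://localhost:30000`) is to send the same greedy request twice and confirm the outputs match:

```python
import requests

payload = {
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0, "max_new_tokens": 32},  # greedy decoding
}

# With deterministic inference enabled, repeated greedy requests return identical text.
out1 = requests.post("http://localhost:30000/generate", json=payload).json()["text"]
out2 = requests.post("http://localhost:30000/generate", json=payload).json()["text"]
assert out1 == out2
```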
### Server Arguments

| Argument | Type / Default | Description |
|---|---|---|
| `--enable-deterministic-inference` | flag; default: disabled | Enable deterministic inference with batch-invariant operations |
| `--attention-backend` | string; default: `fa3` | Attention backend to use (`flashinfer`, `fa3`, or `triton`) |
### Example Configurations

#### Qwen3-8B

```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend flashinfer \
    --enable-deterministic-inference
```

#### Llama Models

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend fa3 \
    --enable-deterministic-inference
```

#### Qwen3-30B-A3B (MoE Model)

```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --attention-backend fa3 \
    --enable-deterministic-inference
```
## Deterministic Inference with Non-Greedy Sampling (Temperature > 0)
SGLang supports deterministic inference even with non-greedy sampling by using sampling seeds. This is particularly useful for reinforcement learning scenarios like GRPO (Group Relative Policy Optimization) where you need multiple diverse but reproducible responses.
### Default Behavior
By default, SGLang uses a sampling seed of 42 for reproducible sampling:
```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Tell me a joke",
        "sampling_params": {
            "temperature": 0.8,  # Non-greedy sampling
            "max_new_tokens": 128,
        },
    },
)

print(response.json())
# This will always produce the same response across runs
```
### Generating Multiple Reproducible Responses
To sample different responses from the same prompt while maintaining reproducibility (e.g., for GRPO training), provide different sampling seeds in your requests:
```python
import requests

# Prepare a list of sampling seeds for different responses
sampling_seeds = [42, 43, 44, 45, 46]

responses = []
for seed in sampling_seeds:
    response = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,  # Specify sampling seed
            },
        },
    )
    responses.append(response.json())

# Each seed will produce a different but reproducible response
# Using the same seed will always produce the same response
```
This approach ensures that:
- Different seeds produce diverse responses
- The same seed always produces the same response across different runs
- Results are reproducible for debugging and evaluation
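A small script can make these guarantees concrete (a sketch under the same assumptions as above: the server runs at `http://localhost:30000` with deterministic inference enabled):

```python
import requests

def generate(seed: int) -> str:
    # Send one seeded, non-greedy request and return the generated text.
    resp = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,
            },
        },
    )
    return resp.json()["text"]

print(generate(seed=43) == generate(seed=43))  # True: same seed, same text
print(generate(seed=43) == generate(seed=44))  # Typically False: different seeds diverge
```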
## Verification
Run deterministic tests to verify consistent outputs:
```bash
# Single test: same prompt, varying batch sizes
python3 -m sglang.test.test_deterministic --test-mode single --n-trials 50

# Prefix test: prompts with different prefix lengths
python3 -m sglang.test.test_deterministic --test-mode prefix --n-trials 50

# Radix cache consistency test: cached vs. uncached prefill
python3 -m sglang.test.test_deterministic --test-mode radix_cache
```
Expected result: all tests should report `Unique samples: 1` (perfectly deterministic).