Deterministic Inference

Why Deterministic Inference Matters

Deterministic inference ensures consistent LLM outputs across runs, which is critical for:

  • Reinforcement Learning: Ensures consistent logprobs across runs, reducing stochastic noise and making RL training more stable, reproducible, and debuggable.
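  • Testing & Debugging: Enables reproducible validation, since a given input yields the same output on every run.
  • Production: Improves reliability and user experience by returning consistent responses to identical requests.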

Even with temperature=0, standard LLM inference can produce different outputs due to dynamic batching and varying reduction orders in GPU kernels.

The Root Cause of Non-Determinism

The main source is varying batch sizes. Different batch sizes cause GPU kernels to split reduction operations differently, leading to different addition orders. Due to floating-point non-associativity ((a + b) + c ≠ a + (b + c)), this produces different results even for identical inputs.
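The effect can be reproduced without a GPU. The following is a minimal pure-Python sketch, not SGLang or kernel code: `sequential_sum` and `chunked_sum` are hypothetical helpers, and the chunk size stands in for the way a kernel might split a reduction when the batch size changes.

```python
def sequential_sum(values):
    """Add values left to right with an explicit loop (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then add the partial sums."""
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Regrouping three additions already changes the rounded result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# The same four inputs, reduced with two different chunk sizes.
values = [1e17, 1.0, -1e17, 1.0]
print(chunked_sum(values, 1))  # 1.0
print(chunked_sum(values, 2))  # 0.0
```

The inputs are identical in both calls; only the grouping of the additions differs. A batch-invariant operator avoids this by fixing the reduction order regardless of how many requests share the batch.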

SGLang's Solution

Building on Thinking Machines Lab's batch-invariant operators, SGLang achieves fully deterministic inference while maintaining compatibility with chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. The development roadmap for deterministic inference features can be found in this issue.

Supported Backends

Deterministic inference is only supported with the following three attention backends: FlashInfer, FlashAttention 3 (FA3), and Triton.

The following table shows feature compatibility for deterministic inference across different attention backends:

| Attention Backend | CUDA Graph | Chunked Prefill | Radix Cache | Non-greedy Sampling (Temp > 0) |
|---|---|---|---|---|
| FlashInfer | Yes | Yes | No | Yes |
| FlashAttention 3 (FA3) | Yes | Yes | Yes | Yes |
| Triton | Yes | Yes | Yes | Yes |

Usage

Basic Usage

Enable deterministic inference by adding the --enable-deterministic-inference flag:

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend fa3 \
    --enable-deterministic-inference

Server Arguments

| Argument | Type/Default | Description |
|---|---|---|
| --enable-deterministic-inference | flag; default: disabled | Enable deterministic inference with batch-invariant operations |
| --attention-backend | string; default: fa3 | Choose attention backend (flashinfer, fa3, or triton) |
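When the server is launched from a script, it can help to assemble the command programmatically and reject unsupported backends early. The sketch below is one possible approach: `build_launch_command` and `SUPPORTED_BACKENDS` are hypothetical names that are not part of SGLang, and the only flags used are the ones listed in the table above.

```python
import shlex

# Backends this document lists as supporting deterministic inference.
SUPPORTED_BACKENDS = ("flashinfer", "fa3", "triton")


def build_launch_command(model_path, attention_backend="fa3", deterministic=True):
    """Hypothetical helper: build the argv list for sglang.launch_server."""
    if deterministic and attention_backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"deterministic inference requires one of {SUPPORTED_BACKENDS}, "
            f"got {attention_backend!r}"
        )
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--attention-backend", attention_backend,
    ]
    if deterministic:
        argv.append("--enable-deterministic-inference")
    return argv


cmd = build_launch_command("Qwen/Qwen3-8B", "fa3")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd)`; `shlex.join` is used here only to print a copy-pasteable command line.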

Example Configurations

Qwen3-8B

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --attention-backend flashinfer \
    --enable-deterministic-inference

Llama Models

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend fa3 \
    --enable-deterministic-inference

Qwen3-30B-A3B (MoE Model)

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --attention-backend fa3 \
    --enable-deterministic-inference

Deterministic Inference with Non-Greedy Sampling (Temperature > 0)

SGLang supports deterministic inference even with non-greedy sampling by using sampling seeds. This is particularly useful for reinforcement learning scenarios like GRPO (Group Relative Policy Optimization) where you need multiple diverse but reproducible responses.

Default Behavior

By default, SGLang uses a sampling seed of 42 for reproducible sampling:

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Tell me a joke",
        "sampling_params": {
            "temperature": 0.8,  # Non-greedy sampling
            "max_new_tokens": 128,
        },
    },
)
print(response.json())
# With deterministic inference enabled and the default seed, this produces the same response on every run

Generating Multiple Reproducible Responses

To sample different responses from the same prompt while maintaining reproducibility (e.g., for GRPO training), provide different sampling seeds in your requests:

import requests

# Prepare a list of sampling seeds for different responses
sampling_seeds = [42, 43, 44, 45, 46]

responses = []
for seed in sampling_seeds:
    response = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "Tell me a joke",
            "sampling_params": {
                "temperature": 0.8,
                "max_new_tokens": 128,
                "sampling_seed": seed,  # Specify sampling seed
            },
        },
    )
    responses.append(response.json())

# Each seed will produce a different but reproducible response
# Using the same seed will always produce the same response

This approach ensures that:

  • Different seeds produce diverse responses
  • The same seed always produces the same response across different runs
  • Results are reproducible for debugging and evaluation
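These properties can be checked by repeating each seeded request and counting distinct outputs per seed. The sketch below is illustrative only: `build_payload` and `check_reproducible` are hypothetical helpers, and `fake_generate` is a stand-in for the HTTP call so the example runs without a server. Against a real deployment, replace it with a function that posts the payload to the `/generate` endpoint and returns the generated text.

```python
import hashlib


def build_payload(prompt, seed, temperature=0.8, max_new_tokens=128):
    """Request body in the same shape as the examples above."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "sampling_seed": seed,
        },
    }


def check_reproducible(generate, prompt, seeds, runs=2):
    """Call generate(payload) `runs` times per seed; return {seed: distinct output count}."""
    report = {}
    for seed in seeds:
        outputs = {generate(build_payload(prompt, seed)) for _ in range(runs)}
        report[seed] = len(outputs)
    return report


def fake_generate(payload):
    """Stand-in for the server: output depends only on the prompt and the seed."""
    key = f"{payload['text']}|{payload['sampling_params']['sampling_seed']}"
    return hashlib.sha256(key.encode()).hexdigest()[:8]


report = check_reproducible(fake_generate, "Tell me a joke", [42, 43, 44])
print(report)  # {42: 1, 43: 1, 44: 1}
```

A count of 1 for every seed means each seed reproduced its own output across runs; any count above 1 indicates non-determinism for that seed.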

Verification

Run deterministic tests to verify consistent outputs:

# Single test: same prompt, varying batch sizes
python3 -m sglang.test.test_deterministic --test-mode single --n-trials 50

# Prefix test: prompts with different prefix lengths
python3 -m sglang.test.test_deterministic --test-mode prefix --n-trials 50

# Radix Cache Consistency mode: test radix cache determinism (cached vs uncached prefill)
python3 -m sglang.test.test_deterministic --test-mode radix_cache

Expected result: All tests should show Unique samples: 1 (perfectly deterministic).