xc-llm-ascend/docs/source/user_guide/feature_guide/batch_invariance.md

# Batch Invariance

```{note}
Batch invariance is currently in beta. Some features are still under active development.
Track progress and planned improvements at <https://github.com/vllm-project/vllm-ascend/issues/5487>
```

This document shows how to enable batch invariance in vLLM-Ascend. Batch invariance ensures that the output of a model is deterministic and independent of the batch size or the order of requests in a batch.

## Motivation

Batch invariance is crucial for several use cases:

- **Framework debugging**: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.
- **Model debugging**: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.
- **Reinforcement Learning (RL)**: RL training often requires deterministic rollouts for reproducibility and stable training.
- **Large-scale inference systems**: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.

## Hardware Requirements

Batch invariance currently requires Ascend Atlas A2 inference products NPUs, because only the Atlas A2 inference products supports batch invariance with HCCL communication for now.
We will support other NPUs in the future.

## Software Requirements

Batch invariance requires a custom operator library for Atlas A2 inference products.
We will release the customed operator library in future versions.

## Enabling Batch Invariance

Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:

```bash
export VLLM_BATCH_INVARIANT=1
```

### Online Inference (Server Mode)

To start a vLLM server with batch invariance enabled:

```bash
VLLM_BATCH_INVARIANT=1 vllm serve Qwen/Qwen3-8B
```

Then use the OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# These requests will produce deterministic outputs
# regardless of batch size or order
response = client.completions.create(
    model="Qwen/Qwen3-8B",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.7,
    seed=42,
)

print(response.choices[0].text)
```

### Offline Inference

For offline batch inference with batch invariance:

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
    "Machine learning enables",
    "Deep learning models can",
]

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=100,
    seed=42,
)

llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
)

# Outputs will be deterministic regardless of batch size
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}\n")
```

## Tested Models

Batch invariance has been tested and verified on the following models:

- **Qwen3 (Dense)**: `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`
- **Qwen3 (MoE)**: `Qwen/Qwen3-30B-A3B`

Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm-ascend/issues/new/choose).

## Implementation Details

When batch invariance is enabled, vLLM:

1. Uses deterministic kernel implementations for attention and other operations
2. Ensures consistent numerical behavior across different batch sizes
3. Disables certain optimizations that may introduce non-determinism

```{note}
Enabling batch invariance may impact performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.
```

## Future Improvements

The batch invariance feature is under active development. Planned improvements include:

- Support for additional NPUs series
- Expanded model coverage
- Performance optimizations
- Additional testing and validation

For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm-ascend/issues/5487).
[Feature] Add docs of batch invariance and make some extra operators patch (#6910) ### What this PR does / why we need it? This PR add docs of batch invariance and make some extra operators according to validation result. please see https://github.com/vllm-project/vllm-ascend/issues/5487 to track progress. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> 2026-03-05 09:12:40 +08:00			`# Batch Invariance`

			```{note}
			`Batch invariance is currently in beta. Some features are still under active development.`
			`Track progress and planned improvements at <https://github.com/vllm-project/vllm-ascend/issues/5487>`
			```

			`This document shows how to enable batch invariance in vLLM-Ascend. Batch invariance ensures that the output of a model is deterministic and independent of the batch size or the order of requests in a batch.`

			`## Motivation`

			`Batch invariance is crucial for several use cases:`

			`- Framework debugging: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.`
			`- Model debugging: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.`
			`- Reinforcement Learning (RL): RL training often requires deterministic rollouts for reproducibility and stable training.`
			`- Large-scale inference systems: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.`

			`## Hardware Requirements`

[Doc] Sensitive word modification (#8303) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> 2026-04-17 16:30:00 +08:00			`Batch invariance currently requires Ascend Atlas A2 inference products NPUs, because only the Atlas A2 inference products supports batch invariance with HCCL communication for now.`
[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073) What this PR does / why we need it? This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow. Does this PR introduce any user-facing change? No, this PR contains documentation-only updates. How was this patch tested? The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced. --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> 2026-04-09 15:37:57 +08:00			`We will support other NPUs in the future.`
[Feature] Add docs of batch invariance and make some extra operators patch (#6910) ### What this PR does / why we need it? This PR add docs of batch invariance and make some extra operators according to validation result. please see https://github.com/vllm-project/vllm-ascend/issues/5487 to track progress. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> 2026-03-05 09:12:40 +08:00
			`## Software Requirements`

[Doc] Sensitive word modification (#8303) <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> 2026-04-17 16:30:00 +08:00			`Batch invariance requires a custom operator library for Atlas A2 inference products.`
[Feature] Add docs of batch invariance and make some extra operators patch (#6910) ### What this PR does / why we need it? This PR add docs of batch invariance and make some extra operators according to validation result. please see https://github.com/vllm-project/vllm-ascend/issues/5487 to track progress. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7 --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> 2026-03-05 09:12:40 +08:00			`We will release the customed operator library in future versions.`

			`## Enabling Batch Invariance`

			Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:

			```bash
			`export VLLM_BATCH_INVARIANT=1`
			```

			`### Online Inference (Server Mode)`

			`To start a vLLM server with batch invariance enabled:`

			```bash
			`VLLM_BATCH_INVARIANT=1 vllm serve Qwen/Qwen3-8B`
			```

			`Then use the OpenAI-compatible client:`

			```python
			`from openai import OpenAI`

			`client = OpenAI(`
			`api_key="EMPTY",`
			`base_url="http://localhost:8000/v1",`
			`)`

			`# These requests will produce deterministic outputs`
			`# regardless of batch size or order`
			`response = client.completions.create(`
			`model="Qwen/Qwen3-8B",`
			`prompt="The future of AI is",`
			`max_tokens=100,`
			`temperature=0.7,`
			`seed=42,`
			`)`

			`print(response.choices[0].text)`
			```

			`### Offline Inference`

			`For offline batch inference with batch invariance:`

			```python
			`import os`
			`os.environ["VLLM_BATCH_INVARIANT"] = "1"`

			`from vllm import LLM, SamplingParams`

			`prompts = [`
			`"The future of AI is",`
			`"Machine learning enables",`
			`"Deep learning models can",`
			`]`

			`sampling_params = SamplingParams(`
			`temperature=0.7,`
			`max_tokens=100,`
			`seed=42,`
			`)`

			`llm = LLM(`
			`model="Qwen/Qwen3-8B",`
			`tensor_parallel_size=1,`
			`)`

			`# Outputs will be deterministic regardless of batch size`
			`outputs = llm.generate(prompts, sampling_params)`

			`for output in outputs:`
			`prompt = output.prompt`
			`generated_text = output.outputs[0].text`
			`print(f"Prompt: {prompt!r}")`
			`print(f"Generated: {generated_text!r}\n")`
			```

			`## Tested Models`

			`Batch invariance has been tested and verified on the following models:`

			- Qwen3 (Dense): `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`
			- Qwen3 (MoE): `Qwen/Qwen3-30B-A3B`

			`Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm-ascend/issues/new/choose).`

			`## Implementation Details`

			`When batch invariance is enabled, vLLM:`

			`1. Uses deterministic kernel implementations for attention and other operations`
			`2. Ensures consistent numerical behavior across different batch sizes`
			`3. Disables certain optimizations that may introduce non-determinism`

			```{note}
			`Enabling batch invariance may impact performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.`
			```

			`## Future Improvements`

			`The batch invariance feature is under active development. Planned improvements include:`

			`- Support for additional NPUs series`
			`- Expanded model coverage`
			`- Performance optimizations`
			`- Additional testing and validation`

			`For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm-ascend/issues/5487).`