Dynamic batch is a technique that dynamically adjusts the chunk size at each inference iteration of the chunked prefill strategy according to the available resources and SLO targets, thereby improving effective throughput and decreasing TBT (time between tokens).
The feature is controlled by the `--SLO_limits_for_dynamic_batch` parameter.
1. Dynamic batch relies on an offline cost model, stored as a lookup table, to refine the token budget. The lookup table is a `.csv` file that must first be downloaded from [A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`.
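To illustrate how a cost-model lookup table can refine the token budget, here is a minimal sketch. The actual schema of `profile_table.csv` is not documented in this section, so the column names (`chunk_size`, `predicted_latency_ms`) and the selection rule below are assumptions: the scheduler is assumed to pick the largest chunk size whose predicted iteration latency still fits the SLO budget.

```python
import csv
import io

# Synthetic stand-in for profile_table.csv; the real file's schema may differ.
SYNTHETIC_TABLE = """chunk_size,predicted_latency_ms
128,18
256,31
512,58
1024,110
"""

def load_cost_model(fileobj):
    """Parse the lookup table into (chunk_size, predicted_latency_ms) pairs."""
    rows = csv.DictReader(fileobj)
    return [(int(r["chunk_size"]), float(r["predicted_latency_ms"])) for r in rows]

def pick_chunk_size(cost_model, slo_ms):
    """Largest chunk size whose predicted latency fits the SLO; fall back to the smallest."""
    fitting = [c for c, lat in cost_model if lat <= slo_ms]
    return max(fitting) if fitting else min(c for c, _ in cost_model)

model = load_cost_model(io.StringIO(SYNTHETIC_TABLE))
print(pick_chunk_size(model, slo_ms=60))   # 512: largest entry under a 60 ms budget
print(pick_chunk_size(model, slo_ms=10))   # 128: nothing fits, take the smallest
```

A tighter SLO budget forces smaller chunks (shorter iterations, lower TBT), while a relaxed budget allows larger chunks and higher throughput, which matches the trade-off the parameter controls.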
`--SLO_limits_for_dynamic_batch` is an integer tuning parameter for the dynamic batch feature; larger values relax the latency limit, leading to higher effective throughput. Select the value according to the specific model and service requirements.
- `--SLO_limits_for_dynamic_batch = -1`: default value; dynamic batching is disabled.
- `--SLO_limits_for_dynamic_batch = 0`: baseline value; dynamic batching is disabled, and the FCFS, decode-first chunked prefill strategy is used.
- `--SLO_limits_for_dynamic_batch > 0`: user-defined positive value; dynamic batching is enabled, and the FCFS, decode-first chunked prefill strategy is used.
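The three value ranges above can be sketched as a simple dispatch. This is illustrative only: the mode names below are made up for the example and are not part of the vllm-ascend API.

```python
def scheduler_mode(slo_limits: int) -> str:
    """Map an --SLO_limits_for_dynamic_batch value to a scheduling mode.

    The returned mode names are hypothetical labels, not vllm-ascend identifiers.
    """
    if slo_limits == -1:
        return "static"                      # default: dynamic batching off
    if slo_limits == 0:
        return "fcfs-decode-first"           # baseline: dynamic batching off
    if slo_limits > 0:
        return "dynamic-fcfs-decode-first"   # dynamic batching on
    raise ValueError("SLO_limits_for_dynamic_batch must be -1, 0, or a positive integer")

print(scheduler_mode(-1))   # static
print(scheduler_mode(0))    # fcfs-decode-first
print(scheduler_mode(50))   # dynamic-fcfs-decode-first
```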
So far, dynamic batch performs best on several dense models, including Qwen and Llama (8B to 32B), with `tensor_parallel_size=8`. Each model needs an appropriate `SLO_limits_for_dynamic_batch` value; the empirical value is generally `35`, `50`, or `75`, so some additional tests are needed to select the best value for a given deployment.
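Selecting the parameter amounts to a small sweep over the empirical candidates. The sketch below uses a synthetic stand-in for the benchmark step; in practice `benchmark_throughput` would replay representative traffic against a server launched with each candidate value and report effective tokens per second.

```python
# Empirical candidate values suggested above.
CANDIDATES = [35, 50, 75]

def benchmark_throughput(slo_limit: int) -> float:
    """Stand-in for a real load test; returns synthetic effective tokens/s."""
    synthetic_results = {35: 1800.0, 50: 2100.0, 75: 2050.0}
    return synthetic_results[slo_limit]

# Pick the candidate with the highest measured effective throughput.
best = max(CANDIDATES, key=benchmark_throughput)
print(best)   # 50 wins in this synthetic run
```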