What this PR does / why we need it?

This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow.

Does this PR introduce any user-facing change?

No, this PR contains documentation-only updates.

How was this patch tested?

The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Dynamic Batch
Dynamic batch is a technique that dynamically adjusts the chunk size during each inference iteration within the chunked-prefill strategy, according to the available resources and SLO targets, thereby improving effective throughput and decreasing time between tokens (TBT).

Dynamic batch is controlled by the value of --SLO_limits_for_dynamic_batch.

Notably, only the 910B3 is supported so far, and only with decode token counts below 2048. The improvements are especially noticeable on Qwen and Llama models. We are working on further improvements, and this feature will support more XPUs in the future.
Getting started
Prerequisites
- Dynamic batch currently depends on an offline cost model stored in a lookup table to refine the token budget. The lookup table is a `.csv` file, which should first be downloaded from A2-B3-BLK128.csv, renamed, and saved to the path `vllm_ascend/core/profile_table.csv`.
- pandas is needed to load the lookup table. If pandas is not installed, run `pip install pandas`. These setup steps are summarized in the sketch after this list.
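The prerequisites above boil down to two shell commands. This is a minimal sketch, assuming the lookup-table file has already been downloaded into the current directory and that vLLM Ascend is used from a source checkout:

```bash
# Install pandas, which is used to load the lookup table.
pip install pandas

# Rename the downloaded cost-model table and place it where dynamic batch expects it.
mv A2-B3-BLK128.csv vllm_ascend/core/profile_table.csv
```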
Tuning Parameters
--SLO_limits_for_dynamic_batch is the tuning parameter (integer type) for the dynamic batch feature. A larger value imposes a stronger constraint on the latency limit, leading to higher effective throughput. The parameter can be selected according to the specific model or service requirements. The value is interpreted as follows:
```
--SLO_limits_for_dynamic_batch = -1   # default value; dynamic batch disabled
--SLO_limits_for_dynamic_batch = 0    # baseline value; dynamic batch disabled, FCFS and decode-first chunked prefill are used
--SLO_limits_for_dynamic_batch > 0    # user-defined value; dynamic batch enabled with FCFS and decode-first chunked prefill
```
Supported Models
So far, dynamic batch performs better on several dense models, including Qwen and Llama (from 8B to 32B), with tensor_parallel_size=8. Each model needs a suitable SLO_limits_for_dynamic_batch value; empirically this is generally 35, 50, or 75, so some additional tests are needed to select the best value, as sketched below.
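One way to run those tests is to launch the server once per candidate value and benchmark each configuration. The sketch below is illustrative only: benchmark.sh is a hypothetical stand-in for whatever load generator you use, and waiting for server readiness is indicated only by a comment.

```bash
# Sweep the empirical candidate values for SLO_limits_for_dynamic_batch.
for SLO in 35 50 75; do
    vllm serve Qwen/Qwen2.5-14B-Instruct \
        --additional-config '{"SLO_limits_for_dynamic_batch":'${SLO}'}' \
        --tensor-parallel-size 8 \
        --port 12091 &
    SERVER_PID=$!
    # Wait until the server is ready before sending load (e.g., poll the /health endpoint).
    ./benchmark.sh localhost 12091   # hypothetical load generator; record throughput and TBT
    kill ${SERVER_PID}
    wait ${SERVER_PID} 2>/dev/null
done
```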
Usage
Dynamic batch is used in online inference. A fully executable example is as follows:
```bash
SLO_LIMIT=50
vllm serve Qwen/Qwen2.5-14B-Instruct \
    --additional-config '{"SLO_limits_for_dynamic_batch":'${SLO_LIMIT}'}' \
    --max-num-seqs 256 \
    --block-size 128 \
    --tensor-parallel-size 8 \
    --load-format dummy \
    --max-num-batched-tokens 1024 \
    --max-model-len 9000 \
    --host localhost \
    --port 12091 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code
```
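Once the server is up, requests go through the standard OpenAI-compatible API that vllm serve exposes. A minimal smoke test against the host and port above:

```bash
# Send a completion request to the server started by the example above.
curl http://localhost:12091/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 32
    }'
```

Note that the example serve command uses --load-format dummy, which loads random weights for quick testing; drop that flag to serve the real model weights.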