Dynamic batch is a technique that dynamically adjusts the chunk size at each inference iteration of the chunked prefill strategy according to the available resources and SLO targets, thereby improving effective throughput and decreasing TBT (time between tokens).
The feature is controlled by the `--SLO_limits_for_dynamic_batch` parameter.
1. Dynamic batch relies on an offline cost model, stored as a lookup table, to refine the token budget. The lookup table is a `.csv` file that must first be downloaded from [A2-B3-BLK128.csv](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`.
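To illustrate how a cost-model lookup table can refine the token budget, here is a minimal sketch. The actual schema of `profile_table.csv` is not documented in this section, so the column names (`chunk_size`, `predicted_latency_ms`) and the selection rule below are assumptions: the scheduler is assumed to pick the largest chunk size whose predicted iteration latency still fits the SLO budget.

```python
import csv
import io

# Synthetic stand-in for profile_table.csv; the real file's schema may differ.
SYNTHETIC_TABLE = """chunk_size,predicted_latency_ms
128,18
256,31
512,58
1024,110
"""

def load_cost_model(fileobj):
    """Parse the lookup table into (chunk_size, predicted_latency_ms) pairs."""
    rows = csv.DictReader(fileobj)
    return [(int(r["chunk_size"]), float(r["predicted_latency_ms"])) for r in rows]

def pick_chunk_size(cost_model, slo_ms):
    """Largest chunk size whose predicted latency fits the SLO; fall back to the smallest."""
    fitting = [c for c, lat in cost_model if lat <= slo_ms]
    return max(fitting) if fitting else min(c for c, _ in cost_model)

model = load_cost_model(io.StringIO(SYNTHETIC_TABLE))
print(pick_chunk_size(model, slo_ms=60))   # 512: largest entry under a 60 ms budget
print(pick_chunk_size(model, slo_ms=10))   # 128: nothing fits, take the smallest
```

A tighter SLO budget forces smaller chunks (shorter iterations, lower TBT), while a relaxed budget allows larger chunks and higher throughput, which matches the trade-off the parameter controls.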
`--SLO_limits_for_dynamic_batch` is an integer tuning parameter for the dynamic batch feature; larger values relax the latency limit, leading to higher effective throughput. Select the value according to the specific model and service requirements.
- `--SLO_limits_for_dynamic_batch = -1`: default value; dynamic batching is disabled.
- `--SLO_limits_for_dynamic_batch = 0`: baseline value; dynamic batching is disabled, and the FCFS, decode-first chunked prefill strategy is used.
- `--SLO_limits_for_dynamic_batch > 0`: user-defined positive value; dynamic batching is enabled, and the FCFS, decode-first chunked prefill strategy is used.
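The three value ranges above can be sketched as a simple dispatch. This is illustrative only: the mode names below are made up for the example and are not part of the vllm-ascend API.

```python
def scheduler_mode(slo_limits: int) -> str:
    """Map an --SLO_limits_for_dynamic_batch value to a scheduling mode.

    The returned mode names are hypothetical labels, not vllm-ascend identifiers.
    """
    if slo_limits == -1:
        return "static"                      # default: dynamic batching off
    if slo_limits == 0:
        return "fcfs-decode-first"           # baseline: dynamic batching off
    if slo_limits > 0:
        return "dynamic-fcfs-decode-first"   # dynamic batching on
    raise ValueError("SLO_limits_for_dynamic_batch must be -1, 0, or a positive integer")

print(scheduler_mode(-1))   # static
print(scheduler_mode(0))    # fcfs-decode-first
print(scheduler_mode(50))   # dynamic-fcfs-decode-first
```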
So far, dynamic batch performs best on several dense models, including Qwen and Llama (8B to 32B), with `tensor_parallel_size=8`. Each model needs an appropriate `SLO_limits_for_dynamic_batch` value; the empirical value is generally `35`, `50`, or `75`, so some additional tests are needed to select the best value for a given deployment.
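Selecting the parameter amounts to a small sweep over the empirical candidates. The sketch below uses a synthetic stand-in for the benchmark step; in practice `benchmark_throughput` would replay representative traffic against a server launched with each candidate value and report effective tokens per second.

```python
# Empirical candidate values suggested above.
CANDIDATES = [35, 50, 75]

def benchmark_throughput(slo_limit: int) -> float:
    """Stand-in for a real load test; returns synthetic effective tokens/s."""
    synthetic_results = {35: 1800.0, 50: 2100.0, 75: 2050.0}
    return synthetic_results[slo_limit]

# Pick the candidate with the highest measured effective throughput.
best = max(CANDIDATES, key=benchmark_throughput)
print(best)   # 50 wins in this synthetic run
```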