[main][Docs] Fix typos across documentation (#6728)

## Summary

Fix typos and improve grammar consistency across 50 documentation files.
 
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" → "determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded", "re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of `--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" → "Ascend")
- vLLM version: v0.15.0
- vLLM main: 9562912cea

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Cao Yi authored 2026-02-13 15:50:05 +08:00, committed by GitHub
parent b6bc3d2f9d
commit 6de207de88
50 changed files with 273 additions and 272 deletions


@@ -3,7 +3,7 @@
 Dynamic batch is a technique that dynamically adjusts the chunksize during each inference iteration within the chunked prefilling strategy according to the resources and SLO targets, thereby improving the effective throughput and decreasing the TBT.
 Dynamic batch is controlled by the value of the `--SLO_limits_for_dynamic_batch`.
-Notably, only 910 B3 is supported with decode token numbers scales below 2048 so far.
+Notably, only 910 B3 is supported with decode token number scales below 2048 so far.
 Especially, the improvements are quite obvious on Qwen, Llama models.
 We are working on further improvements and this feature will support more XPUs in the future.
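
As editorial context for the hunk above: the core idea is that each iteration picks the largest chunk (token budget) whose predicted latency still meets the SLO target. Below is a minimal, purely illustrative sketch of that selection logic; the function name, table shape, and candidate budgets are all hypothetical, not vllm-ascend's actual implementation.

```python
# Illustrative sketch only -- not the vllm-ascend implementation.
# Assumes a cost model mapping (batch_size, token_budget) -> predicted TBT in ms.
def pick_token_budget(cost_model: dict, batch_size: int, slo_ms: float,
                      candidates=(256, 512, 1024, 2048)) -> int:
    """Return the largest candidate budget whose predicted TBT meets the SLO."""
    best = candidates[0]
    for budget in candidates:
        predicted_ms = cost_model.get((batch_size, budget), float("inf"))
        if predicted_ms <= slo_ms:
            best = max(best, budget)  # a larger chunk still satisfies the SLO
    return best
```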
@@ -11,16 +11,16 @@ We are working on further improvements and this feature will support more XPUs i
 ### Prerequisites
-1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
+1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
-2. `Pandas` is needed to load the lookup table, in case `pandas` is not installed.
+2. `Pandas` is needed to load the lookup table, in case pandas is not installed.
 ```bash
 pip install pandas
 ```
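
For readers following the prerequisite above, loading the downloaded table with pandas is straightforward. A minimal sketch: the path comes from the step above, while the actual column schema is whatever the downloaded A2-B3-BLK128.csv defines, so the code only inspects it.

```python
import pandas as pd

# Path from the prerequisite step; the columns are defined by the downloaded
# cost-model file, so we only inspect the schema here.
profile_table = pd.read_csv("vllm_ascend/core/profile_table.csv")
print(profile_table.columns.tolist())  # inspect the cost-model schema
print(profile_table.head())
```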
-### Tuning Parameter
-`--SLO_limits_for_dynamic_batch` is the tuning parameters (integer type) for the dynamic batch feature, greater values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
+### Tuning Parameters
+`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature, larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
 ```python
 --SLO_limits_for_dynamic_batch =-1 # default value, dynamic batch disabled.
 ```
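
To make the corrected description concrete: -1 disables the feature, and larger non-negative integers tighten the latency constraint. A hedged validation sketch follows; `parse_slo_limit` is a hypothetical helper, not the actual CLI handling.

```python
def parse_slo_limit(value: int) -> int | None:
    """Interpret --SLO_limits_for_dynamic_batch; hypothetical helper."""
    if value == -1:
        return None   # default: dynamic batch disabled
    if value < 0:
        raise ValueError("expected -1 (disabled) or a non-negative integer")
    return value      # larger value -> stricter latency limit, higher effective throughput
```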
@@ -45,7 +45,7 @@ vllm serve Qwen/Qwen2.5-14B-Instruct\
 --tensor_parallel_size 8 \
 --load_format dummy \
 --max_num_batched_tokens 1024 \
---max_model_len 9000 \
+--max-model-len 9000 \
 --host localhost \
 --port 12091 \
 --gpu-memory-utilization 0.9 \
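
Once a server like the one in this hunk is running, it exposes vLLM's standard OpenAI-compatible API on the configured host and port. A quick smoke test is sketched below; note that `--load_format dummy` loads random weights, so the returned text is not meaningful, only the round trip is being checked.

```python
import requests

# Smoke-test the server started above; host/port match the serve command.
resp = requests.post(
    "http://localhost:12091/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "prompt": "Dynamic batch smoke test:",
        "max_tokens": 16,
    },
    timeout=30,
)
print(resp.json())
```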