[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main: bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
@@ -4,7 +4,7 @@
During LLM inference, each token requires nearly a thousand operator executions. When the host launches operators more slowly than the device executes them, inference becomes host-bound. In severe cases, the device sits idle for more than half of the time. To solve this problem, we use graph mode in LLM inference.
-```
+```shell
eager mode:
host: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |
@@ -38,11 +38,12 @@ But in reality, graph mode is not that simple.
Because a graph can only replay the ops captured earlier, without redoing tiling or checking the graph input, we need to keep the graph input consistent. However, the shape of the model input depends on the requests scheduled by the Scheduler, so we cannot guarantee that consistency.
Obviously, we can solve this problem by capturing the largest shape and padding every model input to it, but that introduces a lot of redundant computation and hurts performance. Instead, we can capture multiple graphs with different shapes and pad the model input up to the nearest captured shape, which greatly reduces redundant computation. When `max_num_batched_tokens` is very large, however, the number of graphs that need to be captured also becomes very large. Fortunately, when the input tensor's shape is large, the computing time is long anyway, and graph mode is not necessary in that case. So all we need to do is:
1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold;
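
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the vLLM Ascend implementation: `CAPTURE_SIZES` and `select_graph_size` are hypothetical names, the sizes are made-up values, and the threshold is assumed to equal the largest captured size.

```python
from bisect import bisect_left

# Hypothetical capture sizes (sorted ascending); real deployments configure their own.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def select_graph_size(num_scheduled_tokens, capture_sizes=CAPTURE_SIZES):
    """Return the captured size to pad the input to, or None for eager mode.

    Assumption: the threshold is the largest captured size, so anything
    above it falls back to eager execution.
    """
    threshold = capture_sizes[-1]
    if num_scheduled_tokens > threshold:
        return None  # too large for any captured graph: run eagerly
    # Smallest captured size that is >= num_scheduled_tokens.
    idx = bisect_left(capture_sizes, num_scheduled_tokens)
    return capture_sizes[idx]


def padding_needed(num_scheduled_tokens, capture_sizes=CAPTURE_SIZES):
    """Number of padding tokens added before replaying the chosen graph."""
    size = select_graph_size(num_scheduled_tokens, capture_sizes)
    return 0 if size is None else size - num_scheduled_tokens
```

With these assumed sizes, a batch of 5 scheduled tokens would be padded to the size-8 graph, while a batch of 300 tokens would skip graph replay and run in eager mode.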
-```
+```shell
| graph1 |
| graph2 |
| graph3 |