[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Clean up code
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main: 29c6fbe58c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -4,6 +4,7 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua
**Benchmark Coverage**: We measure offline e2e latency and throughput, as well as fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
## 1. Run docker container
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
@@ -29,6 +30,7 @@ docker run --rm \
```
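Once the container is up, it can help to confirm the NPU devices are visible inside it. The check below is an optional sanity step, not part of the original guide; it assumes the Ascend `npu-smi` tool is available in the image:

```bash
# Optional sanity check (run inside the container): list the Ascend NPU devices
npu-smi info
```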
## 2. Install dependencies
```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
@@ -37,11 +39,13 @@ pip install -r benchmarks/requirements-bench.txt
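After installing the requirements, a lightweight sanity check can confirm the installed packages resolve cleanly. `pip check` is a standard pip subcommand; this step is an optional addition, not part of the original guide:

```bash
# Optional: report any broken or conflicting package dependencies after installation
pip check
```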
## 3. (Optional) Prepare model weights
To speed up the run, we recommend downloading the model in advance:
```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```
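If you want the weights in a specific directory, the ModelScope CLI accepts a `--local_dir` flag in recent versions; treat the flag and the target path below as assumptions to adapt to your environment:

```bash
# Download the weights into a chosen local directory (path is illustrative)
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct --local_dir /root/models/Meta-Llama-3.1-8B-Instruct
```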
You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
```json
[
  {
@@ -59,11 +63,13 @@ You can also replace all model paths in the [json](https://github.com/vllm-proje
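If you prefer to rewrite the paths in bulk instead of editing each file, a plain text substitution works. The command below is a sketch rather than part of the original guide; the model id and local path are assumptions to adapt:

```bash
# Replace the remote model id with a local path in every benchmark test config
# (adjust the model id and the local path to match your environment)
sed -i 's#LLM-Research/Meta-Llama-3.1-8B-Instruct#/root/models/Meta-Llama-3.1-8B-Instruct#g' benchmarks/tests/*.json
```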
## 4. Run benchmark script
Run the benchmark script:
```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```
After about 10 minutes, the output is as shown below:
```bash
online serving:
qps 1:
@@ -173,6 +179,7 @@ Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```
The result JSON files are generated under `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
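For a quick look at one of the generated files, standard tooling is enough; the file name below is a placeholder, and `python3 -m json.tool` simply pretty-prints JSON:

```bash
# List the generated result files, then pretty-print one of them
ls benchmark/results
python3 -m json.tool benchmark/results/<result_file>.json   # replace <result_file> with an actual file name
```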
@@ -10,6 +10,7 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
```bash
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
@@ -36,4 +37,4 @@ VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_in
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```
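To keep these per-stage duration lines for later analysis, the same command can be piped through `tee` and filtered. This wrapper is a suggestion rather than part of the original guide, and the log file name is arbitrary:

```bash
# Save the full run log and extract only the per-stage duration lines
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 \
  python3 vllm-ascend/examples/offline_inference_npu.py 2>&1 | tee profile_run.log
grep "Profile execute duration" profile_run.log
```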