[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Clean up the code
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main: 29c6fbe58c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -13,6 +13,7 @@ But you can still set up dev env on Linux/Windows/macOS for linting and basic
tests with the following commands:

#### Run lint locally

```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
@@ -103,7 +104,6 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to the vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, feel free to submit a PR to improve the doc and help other developers.

:::{toctree}
:caption: Index
:maxdepth: 1

@@ -172,6 +172,7 @@ pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```

::::

::::{tab-item} Multi cards test
@@ -185,6 +186,7 @@ pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```

::::

:::::
@@ -218,10 +220,12 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
```

::::

::::{tab-item} Multi cards test
:sync: multi

```bash
cd /vllm-workspace/vllm-ascend/
# Run all the single card tests
@@ -233,6 +237,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_ba
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
```

::::

:::::

@@ -3,4 +3,4 @@
:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
:::

@@ -65,6 +65,7 @@ pip install gradio plotly evalscope
## 3. Run gsm8k accuracy test using EvalScope

You can use `evalscope eval` to run the gsm8k accuracy test:

```
evalscope eval \
--model Qwen/Qwen2.5-7B-Instruct \
@@ -98,6 +99,7 @@ pip install evalscope[perf] -U
### Basic usage

You can use `evalscope perf` to run a perf test:

```
evalscope perf \
--url "http://localhost:8000/v1/chat/completions" \
@@ -111,7 +113,7 @@ evalscope perf \

### Output results

After 1-2 minutes, the output is as shown below:

```shell
Benchmarking summary:

@@ -1,7 +1,7 @@
# Using lm-eval
This document guides you through accuracy testing with [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness).

## 1. Run docker container

You can run a docker container on a single NPU:

@@ -36,6 +36,7 @@ Install lm-eval in the container.
```bash
pip install lm-eval
```

Run the following command:

```

@@ -1,4 +1,4 @@
# Using OpenCompass
This document guides you through accuracy testing with [OpenCompass](https://github.com/open-compass/opencompass).

## 1. Online Serving
@@ -29,7 +29,9 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

If your service starts successfully, you will see output like the following:

```
INFO: Started server process [6873]
INFO: Waiting for application startup.
@@ -37,6 +39,7 @@ INFO: Application startup complete.
```

Once your server is started, you can query the model with input prompts in a new terminal:

```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \

@@ -50,6 +50,7 @@ Before writing a patch, following the principle above, we should patch the least
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:

```python
import vllm

@@ -59,8 +60,10 @@ Before writing a patch, following the principle above, we should patch the least

vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```

5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:

```
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -74,8 +77,8 @@ Before writing a patch, following the principle above, we should patch the least
# Future Plan:
# <Describe the future plan to remove the patch>
```

7. Add unit tests and E2E tests. Any newly added code in vLLM Ascend should come with both. You can find more details in the [test guide](../contribution/testing.md).

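The steps above rely on ordinary Python attribute assignment to swap a function at import time. Below is a minimal, self-contained sketch of that mechanism. The module `fake_upstream` and its functions are hypothetical stand-ins, not part of vLLM or vLLM Ascend. It also shows one reason import order can matter: a reference taken before the patch still points at the original function.

```python
import sys
import types

# Hypothetical stand-in for an upstream module (assumption for illustration;
# the real target in the steps above lives inside vLLM).
upstream = types.ModuleType("fake_upstream")


def _upstream_destroy():
    return "upstream cleanup"


upstream.destroy_model_parallel = _upstream_destroy
sys.modules["fake_upstream"] = upstream

import fake_upstream  # resolves to the stand-in registered above

# A caller that grabbed the function before the patch was applied.
early_reference = fake_upstream.destroy_model_parallel

# --- what a patch file would do ---
_original = fake_upstream.destroy_model_parallel


def patch_destroy_model_parallel():
    # Add custom behaviour, then fall back to the original implementation.
    return "patched cleanup, then " + _original()


fake_upstream.destroy_model_parallel = patch_destroy_model_parallel

# Lookups through the module now see the patch ...
late_result = fake_upstream.destroy_model_parallel()
# ... but the early reference still calls the unpatched function.
early_result = early_reference()
```

In this sketch `late_result` comes from the patched function while `early_result` comes from the original, which suggests why a patch should be imported before other code binds the target name.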
## Limitation
1. In the V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process, and the Worker process. By default, vLLM Ascend currently only supports patching code in the Main process and the Worker process. If you want to patch code that runs in the EngineCore process, you should patch the EngineCore process entirely during setup. The entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.

@@ -216,6 +216,7 @@ The first argument of `vllm.ModelRegistry.register_model()` indicates the unique
],
}
```

:::

## Step 3: Verification

@@ -4,6 +4,7 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua
**Benchmark Coverage**: We measure offline e2e latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

## 1. Run docker container

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
@@ -29,6 +30,7 @@ docker run --rm \
```

## 2. Install dependencies

```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
@@ -37,11 +39,13 @@ pip install -r benchmarks/requirements-bench.txt

## 3. (Optional) Prepare model weights
For faster running speed, we recommend downloading the model in advance:

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```

You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:

```bash
[
{
@@ -59,11 +63,13 @@ You can also replace all model paths in the [json](https://github.com/vllm-proje

## 4. Run benchmark script
Run the benchmark script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```

After about 10 minutes, the output is as shown below:

```bash
online serving:
qps 1:
@@ -173,6 +179,7 @@ Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```

The result JSON files are written to `benchmark/results`.
These files contain detailed benchmarking results for further analysis.

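As a starting point for that further analysis, here is a minimal sketch that loads every JSON file in a results directory and reports its top-level shape. The schema of the result files is not described here, so the code deliberately assumes nothing about their keys. The helper names `load_results` and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Return {file name: parsed JSON} for every *.json file in results_dir."""
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[path.name] = json.load(f)
    return results


def summarize(results):
    """Describe each file's top-level shape without assuming a schema."""
    summary = {}
    for name, data in results.items():
        if isinstance(data, dict):
            # For objects, list the available keys.
            summary[name] = sorted(data.keys())
        else:
            # For arrays or scalars, just report the Python type name.
            summary[name] = type(data).__name__
    return summary
```

For example, `summarize(load_results("benchmark/results"))` would list the keys found in each result file, assuming the directory exists relative to the current working directory.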
@@ -10,6 +10,7 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.

**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**

```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
@@ -36,4 +37,4 @@ VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_in
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms

```
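Log lines in this format can be aggregated with a short script. The sketch below parses lines shaped like the sample output above and averages each stage per phase. The regular expressions are inferred only from those sample lines, so treat them as an assumption that may need adjusting if the real log format differs. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches "Profile execute duration [Phase]:" and captures the rest of the line.
LINE_RE = re.compile(r"Profile execute duration \[(?P<phase>[^\]]+)\]:(?P<stages>.*)")
# Matches one "[stage name]:12.34ms" entry.
STAGE_RE = re.compile(r"\[(?P<stage>[^\]]+)\]:(?P<ms>\d+(?:\.\d+)?)ms")


def parse_line(line):
    """Return (phase, {stage: milliseconds}) or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    stages = {
        m.group("stage"): float(m.group("ms"))
        for m in STAGE_RE.finditer(match.group("stages"))
    }
    return match.group("phase"), stages


def mean_durations(lines):
    """Average every stage duration, grouped by phase."""
    samples = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip unrelated log lines
        phase, stages = parsed
        for stage, ms in stages.items():
            samples[phase][stage].append(ms)
    return {
        phase: {stage: sum(v) / len(v) for stage, v in stages.items()}
        for phase, stages in samples.items()
    }
```

Passing the lines of a captured log file to `mean_durations` yields a nested dictionary keyed by phase (for example `Decode`) and then by stage name.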