[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)

### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Clean up the code
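
The `__init__.py` check in item 2 can be pictured with the sketch below. It is only an illustration under stated assumptions, not the hook this PR wires into the lint system: the function name `find_missing_init` and the rule it enforces (every directory that holds `.py` files must also hold an `__init__.py`) are hypothetical.

```python
from pathlib import Path


def find_missing_init(root: Path) -> list[Path]:
    """Return directories under root that hold .py files but no __init__.py.

    Hypothetical helper for illustration; the real lint hook may apply
    different rules (for example, excluding scripts or test folders).
    """
    # Collect every directory that directly contains at least one .py file.
    python_dirs = sorted({path.parent for path in root.rglob("*.py")})
    # Keep only the directories that lack a package marker.
    return [d for d in python_dirs if not (d / "__init__.py").exists()]
```

A lint wrapper could call `find_missing_init(Path("vllm_ascend"))` and fail when the returned list is non-empty.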

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main: 29c6fbe58c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Li Wang
2025-07-25 22:16:10 +08:00
committed by GitHub
parent d629f0b2b5
commit bdfb065b5d
31 changed files with 215 additions and 64 deletions

@@ -43,11 +43,13 @@ Execute the following commands on each node in sequence. The results must all be
### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```
#### 2. Cross-Node PING Test
```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
@@ -95,6 +97,7 @@ Before launch the inference server, ensure some environment variables are set fo
Run the following scripts on the two nodes, respectively:
**node0**
```shell
#!/bin/sh
@@ -135,6 +138,7 @@ vllm serve /root/.cache/ds_v3 \
```
**node1**
```shell
#!/bin/sh
@@ -173,7 +177,7 @@ vllm serve /root/.cache/ds_v3 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
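
The `--additional-config` value above is a JSON object passed as a single shell argument, which is easy to break with mismatched quotes. As a minimal sketch, the argument can be generated rather than typed by hand; the variable names below are illustrative and not part of vLLM or vllm-ascend.

```python
import json
import shlex

# Same two settings as in the node scripts above.
additional_config = {
    "ascend_scheduler_config": {"enabled": True},
    "torchair_graph_config": {"enabled": True},
}

# Compact separators reproduce the whitespace-free JSON used in the command.
config_json = json.dumps(additional_config, separators=(",", ":"))

# shlex.quote wraps the JSON in single quotes so the shell passes it verbatim.
flag = "--additional-config " + shlex.quote(config_json)
print(flag)
```

The printed flag can be pasted into the `vllm serve` command in place of the hand-written one.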
The Deployment view looks like:
![alt text](../assets/multi_node_dp.png)
Once your server is started, you can query the model with input prompts:
@@ -191,6 +195,7 @@ curl http://{ node0 ip:8004 }/v1/completions \
## Run benchmarks
For details, please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
```shell
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
--dataset-name random --random-input-len 128 --random-output-len 128 \

@@ -71,6 +71,7 @@ curl http://localhost:8000/v1/completions \
"temperature": 0.6
}'
```
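
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server listens on `http://localhost:8000` as in the `curl` command; the helper names `build_completion_request` and `send`, and the model name and prompt in the usage note, are placeholders rather than values from this guide.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000",
                             temperature=0.6):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running server, `send(build_completion_request("Hello", "my-served-model"))` should return the parsed response, where `my-served-model` stands for whatever model name the server was started with.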
::::
::::{tab-item} v1/chat/completions
@@ -91,6 +92,7 @@ curl http://localhost:8000/v1/chat/completions \
"add_special_tokens" : true
}'
```
::::
:::::
@@ -170,9 +172,11 @@ if __name__ == "__main__":
del llm
clean_up()
```
::::
::::{tab-item} Eager Mode
```{code-block} python
:substitutions:
import gc
@@ -226,6 +230,7 @@ if __name__ == "__main__":
del llm
clean_up()
```
::::
:::::

@@ -30,7 +30,7 @@ docker run --rm \
## Install modelslim and convert model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
:::
@@ -55,6 +55,7 @@ python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --cal
## Verify the quantized model
The converted model files look like:
```bash
.
|-- config.json
@@ -72,11 +73,13 @@ Run the following script to start the vLLM server with quantized model:
:::{note}
The value "ascend" for the "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released. For now, you can cherry-pick this commit.
:::
```bash
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@@ -93,7 +96,7 @@ curl http://localhost:8000/v1/completions \
Run the following script to execute offline inference on multi-NPU with quantized model:
:::{note}
To enable quantization for Ascend, the quantization method must be "ascend".
:::
```python
@@ -131,4 +134,4 @@ for output in outputs:
del llm
clean_up()
```

@@ -80,6 +80,7 @@ curl http://localhost:8000/v1/completions \
"temperature": 0.6
}'
```
::::
::::{tab-item} Qwen/Qwen2.5-7B-Instruct
@@ -318,6 +319,7 @@ if __name__ == "__main__":
:::::
Run script:
```bash
python example.py
```

@@ -66,6 +66,7 @@ for output in outputs:
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
::::{tab-item} Eager Mode
@@ -92,6 +93,7 @@ for output in outputs:
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
:::::
@@ -131,6 +133,7 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240
```
::::
::::{tab-item} Eager Mode
@@ -156,6 +159,7 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```
::::
:::::

@@ -191,4 +191,4 @@ Logs of the vllm server:
INFO 03-12 11:16:50 logger.py:39] Received request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\nWhat is the text in the illustrate?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16353, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-12 11:16:50 engine.py:280] Added request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb.
INFO: 127.0.0.1:54004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```