[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Clean up the code
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
29c6fbe58c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -43,11 +43,13 @@ Execute the following commands on each node in sequence. The results must all be

### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```
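If you want the bare addresses (for example, to feed the ping test in the next step), the `ipaddr` lines can be trimmed in a few lines of Python; a minimal sketch on a made-up sample of the loop's output:

```python
# Parse hccn_tool-style "ipaddr:<ip>" lines into bare addresses.
# The sample text below is made up; real input comes from the loop above.
sample = """ipaddr:10.20.0.20
ipaddr:10.20.0.21"""

ips = [line.split(":", 1)[1] for line in sample.splitlines() if line.startswith("ipaddr")]
print(ips)  # ['10.20.0.20', '10.20.0.21']
```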

#### 2. Cross-Node PING Test

```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```
@@ -95,6 +97,7 @@ Before launch the inference server, ensure some environment variables are set fo
Run the following scripts on two nodes respectively

**node0**

```shell
#!/bin/sh

@@ -135,6 +138,7 @@ vllm serve /root/.cache/ds_v3 \
```

**node1**

```shell
#!/bin/sh

@@ -173,7 +177,7 @@ vllm serve /root/.cache/ds_v3 \
    --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```
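The value passed to `--additional-config` must be valid JSON; a quick way to sanity-check the string (standard library only) before launching the server:

```python
import json

# The exact string passed to --additional-config above;
# json.loads raises a ValueError if it is malformed.
additional_config = '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
cfg = json.loads(additional_config)
print(cfg["torchair_graph_config"]["enabled"])  # True
```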

The Deployment view looks like:

![alt text](https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/docs/source/assets/multi_node_dp.png)

Once your server is started, you can query the model with input prompts:

@@ -191,6 +195,7 @@ curl http://{ node0 ip:8004 }/v1/completions \

## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)

```shell
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
--dataset-name random --random-input-len 128 --random-output-len 128 \

@@ -71,6 +71,7 @@ curl http://localhost:8000/v1/completions \
    "temperature": 0.6
  }'
```
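The same completion request can be built from Python with only the standard library; a sketch (the request is constructed but not sent here), with hypothetical payload fields standing in for the ones in the curl example:

```python
import json
import urllib.request

# Mirror of the curl request above; the prompt and field values are
# illustrative placeholders, not taken from the server docs.
payload = json.dumps({
    "prompt": "The future of AI is",
    "max_tokens": 64,
    "temperature": 0.6,
}).encode()
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; here we just inspect the request.
print(req.get_method())  # POST
```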

::::

::::{tab-item} v1/chat/completions

@@ -91,6 +92,7 @@ curl http://localhost:8000/v1/chat/completions \
    "add_special_tokens": true
  }'
```

::::
:::::

@@ -170,9 +172,11 @@ if __name__ == "__main__":
    del llm
    clean_up()
```

::::

::::{tab-item} Eager Mode

```{code-block} python
:substitutions:

import gc

@@ -226,6 +230,7 @@ if __name__ == "__main__":
    del llm
    clean_up()
```

::::
:::::

@@ -30,7 +30,7 @@ docker run --rm \

## Install modelslim and convert model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded;
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
:::

@@ -55,6 +55,7 @@ python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --cal

## Verify the quantized model
The converted model files look like:

```bash
.
|-- config.json

@@ -72,11 +73,13 @@ Run the following script to start the vLLM server with quantized model:
:::{note}
The value "ascend" for the "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released; for now, you can cherry-pick that commit.
:::

```bash
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
```

Once your server is started, you can query the model with input prompts

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \

@@ -93,7 +96,7 @@ curl http://localhost:8000/v1/completions \

Run the following script to execute offline inference on multi-NPU with the quantized model:

:::{note}
To enable quantization on Ascend, the quantization method must be "ascend".
:::

```python

@@ -131,4 +134,4 @@ for output in outputs:

del llm
clean_up()
```

@@ -80,6 +80,7 @@ curl http://localhost:8000/v1/completions \
    "temperature": 0.6
  }'
```

::::

::::{tab-item} Qwen/Qwen2.5-7B-Instruct

@@ -318,6 +319,7 @@ if __name__ == "__main__":
:::::

Run script:

```bash
python example.py
```

@@ -66,6 +66,7 @@ for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

::::

::::{tab-item} Eager Mode

@@ -92,6 +93,7 @@ for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

::::
:::::

@@ -131,6 +133,7 @@ docker run --rm \
    -it $IMAGE \
    vllm serve Qwen/Qwen3-8B --max_model_len 26240
```

::::

::::{tab-item} Eager Mode

@@ -156,6 +159,7 @@ docker run --rm \
    -it $IMAGE \
    vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```

::::
:::::

@@ -191,4 +191,4 @@ Logs of the vllm server:
INFO 03-12 11:16:50 logger.py:39] Received request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\nWhat is the text in the illustrate?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16353, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-12 11:16:50 engine.py:280] Added request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb.
INFO:     127.0.0.1:54004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```