
# Single NPU (Qwen2.5-VL 7B)

## Run vllm-ascend on Single NPU

### Offline Inference on Single NPU

Run the Docker container:

```{code-block} bash
   :substitutions:

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

Set up environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. For more details, see the [CANN environment variable reference](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
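If the offline script is launched without an interactive shell, the same two variables can be set from Python instead. The block below is a minimal sketch, assuming the variables only need to be present in the process environment before `vllm` is imported; `apply_env_defaults` is a hypothetical helper name, not part of vLLM or vllm-ascend.

```python
import os

# Values copied from the shell commands above.
NPU_ENV_DEFAULTS = {
    "VLLM_USE_MODELSCOPE": "True",
    "PYTORCH_NPU_ALLOC_CONF": "max_split_size_mb:256",
}


def apply_env_defaults(env=os.environ):
    """Set each variable only if it is not already defined (hypothetical helper)."""
    for name, value in NPU_ENV_DEFAULTS.items():
        # setdefault keeps any value that was already exported in the shell.
        env.setdefault(name, value)
    return env


apply_env_defaults()
# Import vllm only after this point so the settings are in place at start-up.
```

Using `setdefault` means a value exported in the shell still takes precedence over the defaults in the script.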

Install the `qwen_vl_utils` package, then run the following script to execute offline inference on a single NPU:

```bash
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
```

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)
```

If the script runs successfully, the output looks similar to the following:

```bash
The image displays a logo consisting of two main elements: a stylized geometric design and a pair of text elements.

1. **Geometric Design**: On the left side of the image, there is a blue geometric design that appears to be made up of interconnected shapes. These shapes resemble a network or a complex polygonal structure, possibly hinting at a technological or interconnected theme. The design is monochromatic and uses only blue as its color, which could be indicative of a specific brand or company.

2. **Text Elements**: To the right of the geometric design, there are two lines of text. The first line reads "TONGYI" in a sans-serif font, with the "YI" part possibly being capitalized. The second line reads "Qwen" in a similar sans-serif font, but in a smaller size.

The overall design is modern and minimalist, with a clear contrast between the geometric and textual elements. The use of blue for the geometric design could suggest themes of technology, connectivity, or innovation, which are common associations with the color blue in branding. The simplicity of the design makes it easily recognizable and memorable.
```
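The `min_pixels` and `max_pixels` values in the offline script are written as pixel counts, which makes their effect on the prompt length hard to see. The sketch below converts them into an approximate number of visual tokens. It assumes that one visual token covers a 28 x 28 pixel area for this model family; that ratio is an assumption about the image preprocessing, not something this tutorial states, and `approx_visual_tokens` is a hypothetical helper.

```python
# Assumption: one visual token corresponds to a 28 x 28 pixel area.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def approx_visual_tokens(pixels: int) -> int:
    """Rough number of visual tokens for an image with the given pixel count."""
    return pixels // PIXELS_PER_VISUAL_TOKEN


min_pixels = 224 * 224        # lower bound used in the script above
max_pixels = 1280 * 28 * 28   # upper bound used in the script above

print(approx_visual_tokens(min_pixels))  # 64
print(approx_visual_tokens(max_pixels))  # 1280
```

Under that assumption, each image contributes at most about 1280 tokens, which is worth keeping in mind when choosing `max_model_len` and `limit_mm_per_prompt`.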

### Online Serving on Single NPU

Run the Docker container to start the vLLM server on a single NPU:

```{code-block} bash
   :substitutions:

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max_model_len 16384 \
--max-num-batched-tokens 16384 
```

:::{note}
Set the `--max_model_len` option to avoid a `ValueError` raised because the maximum sequence length of the Qwen2.5-VL-7B-Instruct model (128000) is larger than the number of tokens that can be stored in the KV cache. The limit depends on the HBM size of your NPU series, so choose a value that suits your hardware.
:::

If the service starts successfully, you will see output like the following:

```bash
INFO:     Started server process [2736]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustrate?"}
    ]}
    ]
    }'
```

If the query succeeds, the client receives a response like the following:

```bash
{"id":"chatcmpl-f04fb20e79bb40b39b8ed7fdf5bd613a","object":"chat.completion","created":1741749149,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The text in the illustration reads \"TONGYI Qwen.\"","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":74,"total_tokens":89,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null}
```
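The response is a single JSON object, so the generated text and the token counts can be extracted with the standard library. The sketch below parses a trimmed copy of the sample response above; the field names are taken from that sample, and the variable names are illustrative only.

```python
import json

# Trimmed copy of the sample response above (raw string keeps the \" escapes).
raw_response = r'''
{
  "id": "chatcmpl-f04fb20e79bb40b39b8ed7fdf5bd613a",
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The text in the illustration reads \"TONGYI Qwen.\""
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 74, "total_tokens": 89, "completion_tokens": 15}
}
'''

response = json.loads(raw_response)

# The generated text lives in the first choice's message.
answer = response["choices"][0]["message"]["content"]
usage = response["usage"]

print(answer)
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])
```

In the sample, `prompt_tokens` (74) plus `completion_tokens` (15) equals `total_tokens` (89).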

The vLLM server logs the request as follows:

```bash
INFO 03-12 11:16:50 logger.py:39] Received request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\nWhat is the text in the illustrate?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16353, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-12 11:16:50 engine.py:280] Added request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb.
INFO:     127.0.0.1:54004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
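The `curl` request shown earlier can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server from this section is listening on `localhost:8000`, and `build_payload` and `query_server` are hypothetical helper names rather than part of vLLM.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"
IMAGE_URL = "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"


def build_payload(question, image_url=IMAGE_URL):
    """Build the same request body as the curl example (hypothetical helper)."""
    return {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    }


def query_server(question, api_url=API_URL, timeout=120):
    """POST the payload and return the decoded JSON response (hypothetical helper)."""
    body = json.dumps(build_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `query_server("What is the text in the illustrate?")` should return a dictionary with the same shape as the client response shown earlier.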