
# Single NPU (Qwen2-Audio 7B)

## Run vllm-ascend on Single NPU

### Offline Inference on Single NPU

Run the Docker container:

```{code-block} bash
   :substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
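
Before launching the container, it can help to confirm that the device nodes passed via `--device` actually exist on the host. The following is a minimal sketch and not part of vllm-ascend: the helper name `missing_devices` is hypothetical, and the path list simply mirrors the `docker run` command above.

```python
import os

# Device nodes mirrored from the `docker run` command above.
REQUIRED_DEVICES = [
    "/dev/davinci0",
    "/dev/davinci_manager",
    "/dev/devmm_svm",
    "/dev/hisi_hdc",
]


def missing_devices(paths=REQUIRED_DEVICES):
    """Return the subset of `paths` that does not exist on this host."""
    return [p for p in paths if not os.path.exists(p)]


if __name__ == "__main__":
    missing = missing_devices()
    if missing:
        print("Missing device nodes:", ", ".join(missing))
    else:
        print("All expected device nodes are present.")
```

If any path is reported as missing, the corresponding `--device` flag will likely cause `docker run` to fail, so check the driver installation first.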

Set up environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
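
To double-check that the variable is set as intended inside the container, a small script can read it back. This is a sketch only: `parse_alloc_conf` is a hypothetical helper, and it assumes the value is a comma-separated list of `key:value` pairs, as in the `export` line above.

```python
import os


def parse_alloc_conf(value):
    """Parse a `key:value,key:value` string into a dict of strings.

    Assumption: options are comma-separated `key:value` pairs.
    """
    options = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition(":")
        options[key.strip()] = val.strip()
    return options


if __name__ == "__main__":
    conf = parse_alloc_conf(os.environ.get("PYTORCH_NPU_ALLOC_CONF", ""))
    print("max_split_size_mb =", conf.get("max_split_size_mb", "<not set>"))
```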

Install packages required for audio processing:

```bash
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install librosa soundfile
```
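
To confirm that both packages are importable before running the inference script, a quick check along these lines may be useful. It is a sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names=("librosa", "soundfile")):
    """Return the names of top-level packages that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Please install:", " ".join(missing))
    else:
        print("Audio dependencies are available.")
```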

Run the following script to execute offline inference on a single NPU:

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# If network issues prevent AudioAsset from fetching remote audio files, retry or check your network.
audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")]
question_per_audio_count = {
    1: "What is recited in the audio?",
    2: "What sport and what nursery rhyme are referenced?"
}


def prepare_inputs(audio_count: int):
    audio_in_prompt = "".join([
        f"Audio {idx+1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
        for idx in range(audio_count)
    ])
    question = question_per_audio_count[audio_count]
    prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
              "<|im_start|>user\n"
              f"{audio_in_prompt}{question}<|im_end|>\n"
              "<|im_start|>assistant\n")

    mm_data = {
        "audio":
        [asset.audio_and_sample_rate for asset in audio_assets[:audio_count]]
    }

    # Merge text prompt and audio data into inputs
    inputs = {"prompt": prompt, "multi_modal_data": mm_data}
    return inputs


def main(audio_count: int):
    # NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
    # devices with limited memory, so smaller values are set explicitly below.
    # Unless specified, these settings have been tested to work on a single L4 GPU.
    # `limit_mm_per_prompt`: the max number of items for each modality per prompt.
    llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
              max_model_len=4096,
              max_num_seqs=5,
              limit_mm_per_prompt={"audio": audio_count})

    inputs = prepare_inputs(audio_count)

    sampling_params = SamplingParams(temperature=0.2,
                                     max_tokens=64,
                                     stop_token_ids=None)

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    audio_count = 2
    main(audio_count)
```
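
To see exactly what text the model receives, the prompt construction from `prepare_inputs` can be run on its own, without loading the model or any audio. The following is a sketch: `build_prompt` is a hypothetical standalone helper that copies the template from the script above and takes the question as an argument.

```python
def build_prompt(question, audio_count):
    """Build the chat prompt with one audio placeholder per clip."""
    audio_in_prompt = "".join(
        f"Audio {idx + 1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
        for idx in range(audio_count)
    )
    return (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{audio_in_prompt}{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    print(build_prompt("What sport and what nursery rhyme are referenced?", 2))
```

Each `<|AUDIO|>` placeholder corresponds to one entry in the `"audio"` list of `multi_modal_data`, so the number of placeholders should match the number of audio clips passed in.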

If the script runs successfully, it prints output similar to the following:

```bash
The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'.
```

### Online Serving on Single NPU

Currently, vLLM's OpenAI-compatible server does not support audio inputs. See [<u>this issue</u>](https://github.com/vllm-project/vllm/issues/19977) for details.