`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::
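As a minimal sketch, the option can also be set from Python before the NPU runtime is initialized. This assumes the allocator reads its settings from the `PYTORCH_NPU_ALLOC_CONF` environment variable. The value `256` is only an illustrative starting point, and `set_max_split_size_mb` is a hypothetical helper, not part of vLLM or `torch_npu`.

```python
import os


def set_max_split_size_mb(size_mb: int) -> str:
    """Hypothetical helper: export the allocator option for this process.

    Assumption: the NPU caching allocator reads PYTORCH_NPU_ALLOC_CONF
    when it is first initialized, so call this before importing vllm.
    """
    if size_mb <= 0:
        raise ValueError("size_mb must be a positive integer")
    value = f"max_split_size_mb:{size_mb}"
    os.environ["PYTORCH_NPU_ALLOC_CONF"] = value
    return value


# 256 MB is an illustrative value; tune it for your workload.
set_max_split_size_mb(256)
```

Setting the variable in the shell before launching the script has the same effect. Doing it in Python keeps the setting next to the code that depends on it.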
Install packages required for audio processing:
```bash
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install librosa soundfile
```
Run the following script to execute offline inference on a single NPU:
```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.utils import FlexibleArgumentParser
# If network issues prevent AudioAsset from fetching remote audio files, retry or check your network.
If the script runs successfully, you should see output similar to the following:
```bash
The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'.
```
### Online Serving on Single NPU
Currently, the `chat_template` for `Qwen2-Audio` has an issue that prevents the audio placeholder from being inserted. See [<u>this issue</u>](https://github.com/vllm-project/vllm/issues/19977) for details.
As a workaround, use a custom chat template for online serving, as shown below:
```jinja
{% set audio_count = namespace(value=0) %}
{% for message in messages %}
{% if loop.first and message['role'] != 'system' %}
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
{% endif %}
<|im_start|>{{ message['role'] }}\n
{% if message['content'] is string %}
{{ message['content'] }}<|im_end|>\n
{% else %}
{% for content in message['content'] %}
{% if 'audio' in content or 'audio_url' in content or message['type'] == 'audio' or content['type'] == 'audio' %}
{% set audio_count.value = audio_count.value + 1 %}
{"type": "text", "text": "What is in this audio? How does it sound?"}
]}
],
"max_tokens": 100
}'
```
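The same request body can be assembled in Python, which may be easier to maintain than an inline JSON string. The sketch below only builds and prints the payload. It assumes the server accepts an `audio_url` content part in the OpenAI-compatible chat format. The model name and audio URL are placeholders, and `build_audio_chat_payload` is a hypothetical helper.

```python
import json


def build_audio_chat_payload(model: str, audio_url: str, question: str,
                             max_tokens: int = 100) -> dict:
    """Hypothetical helper: build a chat-completion body with one audio clip.

    Assumption: the server understands an `audio_url` content part
    alongside the usual `text` part.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


payload = build_audio_chat_payload(
    model="<served-model-name>",                 # placeholder
    audio_url="https://example.com/sample.wav",  # placeholder
    question="What is in this audio? How does it sound?",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can then be sent as the body of a POST request to the server's chat completions endpoint with any HTTP client.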
If the request succeeds, the client receives a response similar to the following:
```bash
{"id":"chatcmpl-31f5f698f6734a4297f6492a830edb3f","object":"chat.completion","created":1761097383,"model":"/root/.cache/modelscope/models/Qwen/Qwen2-Audio-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The audio contains a background of a crowd cheering, a ball bouncing, and an object being hit. A man speaks in English saying 'and the o one pitch on the way to edgar martinez swung on and lined out.' The speech has a happy mood.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":689,"total_tokens":743,"completion_tokens":54,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}