xc-llm-ascend/vllm_ascend/xlite
LVYANGGUO b1a78886a9 [xlite][Bugfix] Support mrope and deepstack features in xlite backend (#7295)
### What this PR does / why we need it?
This PR fixes a bug in the Xlite backend
(https://atomgit.com/openeuler/GVirt/issues/3).

This PR adds support for the mrope (multimodal rotary position embedding,
M-RoPE) and deepstack features in the xlite backend. These features are
required to run certain multimodal models that rely on them.

The main changes include:
- Updating `_build_model_config` to parse the mrope and deepstack
configurations from the model's `hf_config` (see the sketch after this list).
- Modifying `XliteWrapper.__call__` to handle `deepstack_input_embeds`
and mrope positions during the model forward pass.
- Replacing `ModelAttnMeta` with the newer `AttnMeta` to accommodate the
new metadata fields required by these features.
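
For orientation, here is a minimal, hypothetical Python sketch of the kind of plumbing described above: reading M-RoPE and deepstack settings out of `hf_config` and passing the new metadata through the wrapper's forward call. The class and attribute names used here (the `AttnMeta` fields, `deepstack_visual_indexes`, `use_mrope`, the `XliteWrapper` internals) are illustrative assumptions, not the actual vllm-ascend implementation.

```
# Illustrative sketch only -- not the real vllm_ascend/xlite code.
from dataclasses import dataclass
from typing import Any, Optional

import torch


@dataclass
class AttnMeta:
    # Hypothetical metadata fields the backend would need for these features.
    mrope_positions: Optional[torch.Tensor] = None         # (3, num_tokens) t/h/w positions
    deepstack_input_embeds: Optional[torch.Tensor] = None  # extra visual embeddings


def _build_model_config(hf_config: Any) -> dict:
    """Sketch: derive mrope / deepstack settings from the model's hf_config."""
    cfg: dict = {}

    # M-RoPE: multimodal models such as Qwen2-VL/Qwen3-VL expose the section
    # split via rope_scaling["mrope_section"].
    rope_scaling = getattr(hf_config, "rope_scaling", None) or {}
    if rope_scaling.get("mrope_section") is not None:
        cfg["use_mrope"] = True
        cfg["mrope_section"] = rope_scaling["mrope_section"]

    # Deepstack: assumed to be advertised by the vision config; the real
    # attribute name may differ per model.
    vision_cfg = getattr(hf_config, "vision_config", None)
    if vision_cfg is not None and getattr(vision_cfg, "deepstack_visual_indexes", None):
        cfg["use_deepstack"] = True

    return cfg


class XliteWrapper:
    """Toy wrapper showing how __call__ could plumb the new fields through."""

    def __init__(self, model):
        self.model = model

    def __call__(self, input_ids: torch.Tensor, attn_meta: AttnMeta, **kwargs):
        if attn_meta.mrope_positions is not None:
            # Use 3-D (t/h/w) rotary positions instead of the usual 1-D positions.
            kwargs["positions"] = attn_meta.mrope_positions
        if attn_meta.deepstack_input_embeds is not None:
            kwargs["deepstack_input_embeds"] = attn_meta.deepstack_input_embeds
        return self.model(input_ids, **kwargs)
```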

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Online server config:

```
python -m vllm.entrypoints.openai.api_server \
--model /mnt/nvme0n1/models/checkpoint-8200 \
--additional-config='{"xlite_graph_config": {"enabled": true}}'  \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 8192 \
--max-num-seqs=20 \
--block-size 128 \
--max-model-len 8192 \
--trust-remote-code \
--served-model-name Qwen3-VL-8B \
--host localhost \
--generation-config vllm \
--port 6777
```

Benchmark config:
```
vllm bench serve \
--max-concurrency ${maxconcurrency} \
--num-prompts ${num_prompts} \
--host ${HOST} \
--port ${PORT} \
--model ${MODEL_NAME} \
--dataset-name random \
--backend openai-chat \
--random-input-len 512 \
--random-output-len 512  \
--random-range-ratio 0.2 \
--temperature 0.6 \
--metric-percentiles "50,90,99" \
--tokenizer ${TOKENIZER_PATH} \
--endpoint /v1/chat/completions \
--ignore-eos
```

- vLLM version: v0.17.0
- vLLM main: 4034c3d32e

Signed-off-by: LVYANGGUO <lvyangguo@huawei.com>
Co-authored-by: LVYANGGUO <lvyangguo@huawei.com>
2026-03-16 17:05:52 +08:00