from vllm import ModelRegistry

import vllm_ascend.envs as envs_ascend

def register_model():
    """Register Ascend-specific model implementations with vLLM's ModelRegistry.

    Each call maps a HuggingFace architecture name to an Ascend (NPU)
    implementation using vLLM's lazy ``"module.path:ClassName"`` entry-point
    string, so the model class is only imported when that architecture is
    actually loaded.

    Must be called before vLLM config initialization so that architectures
    unknown to upstream vLLM (e.g. PanguProMoE) resolve correctly.
    """
    ModelRegistry.register_model(
        "Qwen2VLForConditionalGeneration",
        "vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration")

    ModelRegistry.register_model(
        "Qwen3VLMoeForConditionalGeneration",
        "vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration"
    )

    ModelRegistry.register_model(
        "Qwen3VLForConditionalGeneration",
        "vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration"
    )

    # Qwen2.5-VL has two Ascend implementations; the env flag selects between
    # the padding-optimized variant and the without-padding variant.
    if envs_ascend.USE_OPTIMIZED_MODEL:
        ModelRegistry.register_model(
            "Qwen2_5_VLForConditionalGeneration",
            "vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration"
        )
    else:
        ModelRegistry.register_model(
            "Qwen2_5_VLForConditionalGeneration",
            "vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen2_5_VLForConditionalGeneration_Without_Padding"
        )

    ModelRegistry.register_model(
        "DeepseekV32ForCausalLM",
        "vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM")

    # There is no PanguProMoEForCausalLM in vLLM, so we should register it
    # before vLLM config initialization to make sure the model can be loaded
    # correctly. This register step can be removed once vLLM supports
    # PanguProMoEForCausalLM.
    ModelRegistry.register_model(
        "PanguProMoEForCausalLM",
        "vllm_ascend.torchair.models.torchair_pangu_moe:PanguProMoEForCausalLM"
    )

    ModelRegistry.register_model(
        "Qwen3NextForCausalLM",
        "vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM")
|