xc-llm-ascend

Author	SHA1	Message	Date
wangxiyuan	de7649492d	[Refactor] cleanup converting_weight_acl_format_format (#2482 ) move maybe_converting_weight_acl_format_format to torchair module, it's only used with 310p+torchair - vLLM version: v0.10.1.1 - vLLM main: `49ab23b3cc` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-25 19:48:55 +08:00
Wang Yixuan	0f81e032f0	[1/N][refactor] torchair fused_moe refactor (#2438 ) ### What this PR does / why we need it? Move torchair related fused_moe section into torchair_fused_moe to make the code clear. Next step we'll remove all torchair related code outside of torchair_fused_moe . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.10.0 vLLM main: `08d5f7113a` - vLLM version: v0.10.1.1 - vLLM main: `170e8ea9ea` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-25 15:46:10 +08:00
Icey	f796e6280b	[CustomOp] Register RotaryEmbedding instead of overwrite forward (#2385 ) ### What this PR does / why we need it? Register RotaryEmbedding instead of overwrite forward ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.0 - vLLM main: `808d2e9aa0` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com>	2025-08-25 09:32:35 +08:00
weichen	950c4b219a	[main] refactor alltoallv in fused_moe (#2487 ) ### What this PR does / why we need it? Refactor all2all-related fused_experts (both quantized/unquantized) into TokenDispatcherWithAll2AllV, including dispatch & combine calculation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E & UT - vLLM version: v0.10.0 - vLLM main: `65197a5fb3` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-08-23 20:38:17 +08:00
linfeng-yuan	4af5b80606	[Scheduler] validate max_num_batched_tokens and max_model_len in AscendSchedulerConfig (#2434 ) ### What this PR does / why we need it? Add configuration check logic for the ascend scheduler: if chunked_prefill is disabled, max_num_batched_tokens must not be less than max_model_len, following vLLM. ### Does this PR introduce _any_ user-facing change? Users cannot set max_num_batched_tokens smaller than max_model_len with the ascend scheduler when chunked_prefill is disabled. ### How was this patch tested? CI and vllm serving passed - vLLM version: v0.10.0 - vLLM main: `f77a0802b7` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-23 19:39:44 +08:00
ZhaoJiangJiang	3629bc4431	feat: add mtp ut and fix some bugs (#2453 ) ### What this PR does / why we need it? Fix mtp mode ut ### Does this PR introduce _any_ user-facing change? Nothing ### How was this patch tested? This can be tested in the same way as a unit test. - vLLM version: v0.10.0 - vLLM main: `53415653ff` Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com> Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>	2025-08-22 17:09:08 +08:00
Mengqing Cao	60ac4fb576	[QuickFix] Skip failed ut to recover CI quickly (#2484 ) ### What this PR does / why we need it? Skip failed ut to recover CI quickly. Related ut: - `test_embed_models_correctness`: revert me when pooler is adapted to the latest vllm main - `test_check_and_update_config_enforce_eager_mode`: revert me when the occasional failure is fixed - vLLM version: v0.10.0 - vLLM main: `8896eb72eb` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-22 14:14:51 +08:00
Mengqing Cao	b0403f8d8a	[CI] fix ci (#2464 ) ### What this PR does / why we need it? 1. use action/checkout@v5 instead of v4 2. remove the dbo test case because there is an issue with it and it will be refactored later 3. make vllm-ascend compatible with vllm v0.10.1.1 and add CI for it 4. fix sampler api changes introduced by https://github.com/vllm-project/vllm/pull/22387 5. fix qwen3 moe config changes introduced by https://github.com/vllm-project/vllm/pull/20562 6. fix kvcache block changes introduced by https://github.com/vllm-project/vllm/pull/23262 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `0c6e40bbaa` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-22 07:30:48 +08:00
linfeng-yuan	0ca3f48c90	[2/N][refactor] torchair deepseek mla backend refactor (#2459 ) ### What this PR does / why we need it? This PR moves the current unified mla backend to the torchair folder and removes torchair-related code in attention/mla_v1.py (1.3k -> 0.9k). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Running eager mode with mla backend, and torchair mode with code before [2445](https://github.com/vllm-project/vllm-ascend/pull/2445) - vLLM version: v0.10.0 - vLLM main: `f571ff8eb6` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-21 14:02:30 +08:00
sherie	3fb80ee356	add mlp tp optimize (#2120 ) ### What this PR does / why we need it? For dense models, by not applying tensor parallelism (TP) to the attention module and applying TP only to the MLP module, the allreduce operations in the attention module can be eliminated, thereby reducing communication overhead. However, this approach increases memory usage, so the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMZE is used to control this optimization. - vLLM main: `b17109beea` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-08-21 09:22:07 +08:00
Wang Kunpeng	c40d4171bc	[main][quantization] Adapt to the new format of ds w4a8 weight (#2392 ) ### What this PR does / why we need it? The deepseek w4a8 weights we supported before were in the mindie format, which uses int8 to represent int4, so the weight size is similar to w8a8 and a few extra steps are needed for vllm-ascend to load it. Now the new weight format can be used directly: it packs two int4 values into one int8, so the weight size is reduced and no extra operations are needed to use it on vllm-ascend. Weights in the previous mindie format remain supported. The weight changes in the new version: 1. The weight is packed (two int4 values packed into one int8) 2. The bias required in the apply method is directly generated by modelslim ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` #### 1.How to get weights using Modelslim ##### Installation steps we can use the branch br_release_MindStudio_8.1.RC2_TR5_20260624 git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit.git cd msit/msmodelslim bash install.sh ##### Generate w4a8 weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md Execute the [pre-check](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96) chapter Reference command: python3 quant_deepseek_w4a8.py --model_path {Original weight path} 
--save_path {Generate weight path} ##### Adapt to vllm-ascend Modification in `config.json`：`"model_type":deepseekv2` is changed to `"model_type":deepseek_v3`; #### 2.How to run w4a8 ##### a.How to run eager mode export VLLM_ASCEND_MLA_PA=1 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager eg: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --max-num-seqs 128 --enforce-eager ##### b.How to run graph mode export HCCL_BUFFSIZE=1024 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' eg: python -m vllm.entrypoints.openai.api_server --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `103f1ec8d3` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-20 20:25:18 +08:00
wangxiyuan	eccfb715f6	[CI] Fix UT (#2452 ) Make UT CI happy - vLLM version: v0.10.0 - vLLM main: `d983769c41` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-08-20 16:26:07 +08:00
sherie	3f867ee708	refactor allgather/mc2-related fused_experts (#2369 ) ### What this PR does / why we need it? refactor allgather/mc2-related fused_experts - vLLM version: v0.10.0 - vLLM main: `de7b67a023` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-08-20 14:20:46 +08:00
Nicholas Tao	7bec1a9b9c	qwen3_moe/qwen25 support torchair graph (#2403 ) ### What this PR does / why we need it? Added support for the TorchAir graph mode in qwen3_moe and qwen2.5 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```bash llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=True, max_model_len=4096, max_num_seqs=16, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.4, additional_config={ "torchair_graph_config": { "enabled": True, "use_cached_graph": False, "graph_batch_sizes_init": False, "graph_batch_sizes": [16] }, "ascend_scheduler_config": { "enabled": True, "chunked_prefill_enabled":True, }, "refresh": True, }, ) ``` - vLLM version: v0.10.0 - vLLM main: `b87cb97a53` Signed-off-by: taoyuxiang <oui.nicholas.tao@gmail.com>	2025-08-20 11:23:50 +08:00
Mengqing Cao	1327f9be1c	Fix some ci issue and refactor modelrunner (#2445 ) ### What this PR does / why we need it? Fix some ci issue and refactor modelrunner ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `4d9c61993a` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: weiguihua2 <weiguihua2@huawei.com>	2025-08-20 09:01:04 +08:00
Shanshan Shen	83e0f41408	[3/N][Refactor] Move `torchair_attention` to `torchair` dir (#2017 ) ### What this PR does / why we need it? 1. Move `torchair_attention` to `torchair` dir. 2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend` to reduce duplicate methods. 3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce duplicate properties. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `0933f9d518` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-19 10:25:22 +08:00
linfeng-yuan	3fc31ee1cb	[1/N][refactor] torchair deepseek modeling refactor (#2384 ) ### What this PR does / why we need it? Move torchair related model arch into the torchair module to make the code clear. Next step we'll remove all torchair related code outside of the torchair module. ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `08d5f7113a` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-18 15:00:37 +08:00
Chao Lei	03ca2b26ca	[P/D] Mooncake Connector for v1 distributed (#1568 ) ### What this PR does / why we need it? This PR adopts Mooncake TransferEngine for KV cache registration and a pull_blocks style disaggregated prefill implementation. ### Does this PR introduce any user-facing change? No ### Dependencies 1. CANN dependencies: using Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see Mooncake [#502](https://github.com/kvcache-ai/Mooncake/pull/502) for details). 2. vllm-ascend: this PR depends on changes introduced by #950 (modifications to `model_runner_v1`) and #1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly. ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `1c859a1387` --------- Signed-off-by: leichao.lc <leichao139636@163.com> Co-authored-by: jianzs <zheng.shoujian@outlook.com> Co-authored-by: zzy-ContiLearn <1831242919@qq.com> Co-authored-by: fems14 <1804143737@qq.com> Co-authored-by: Dreamerleader <2270923832@qq.com> Co-authored-by: chris668899 <15105191595@126.com> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>	2025-08-18 14:30:07 +08:00
Mengqing Cao	61866b8ac6	[Quickfix] update CachedRequestState as NewRequestData changed (#2367 ) ### What this PR does / why we need it? 1. update `CachedRequestState` as `NewRequestData` changed in https://github.com/vllm-project/vllm/pull/22570 2. drop maintenance of vllm v0.10.0 in the branch main ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `92ff41abea` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-15 07:35:27 +08:00
Icey	c721ae6042	[CustomOp] Register RMSNorm instead of overwrite forward_oot (#2284 ) ### What this PR does / why we need it? Use the function CustomOp.register_oot to register the custom op ``` from vllm.model_executor.custom_op import CustomOp CustomOp.register_oot(_decorated_op_cls=AscendRMSNorm, name="RMSNorm") ``` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-14 17:18:30 +08:00
shiyuan680	e14f2ef669	refactor select_experts of moe module (#2150 ) ### What this PR does / why we need it? This PR refactors select_experts of the moe module. It merges the implementations of the quantized and non-quantized methods into a new class that is used in the same way as in vllm, e.g. ExpertsSelector.select_experts ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on qwen3-moe and all ut. - vLLM version: v0.10.0 - vLLM main: `e18859298d` Signed-off-by: yangcheng <yangcheng104@huawei.com> Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>	2025-08-14 11:50:53 +08:00
Shanshan Shen	103654ccd6	[Misc] Remove redundant imported `envs`, using `envs_ascend` instead (#2193 ) ### What this PR does / why we need it? Remove redundant imported `envs`, using `envs_ascend` instead. ```python import vllm.envs as envs_vllm import vllm_ascend.envs as envs_ascend ``` - vLLM version: v0.10.0 - vLLM main: `71683ca6f6` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-14 09:33:39 +08:00
Wang Kunpeng	dc585f148a	[main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198 ) ### What this PR does / why we need it? 1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to pure DP for shared experts, enabling more efficient execution. 2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by using ReduceScatter, made possible by pure DP sharding. 3.AllGather Postponed: Delayed to after QKV down projection to reduce synchronization impact during prefill. ### How was this patch tested? Adding ut case in `tests/ut/attention/test_mla_v1.py` #### How to run use parameter `--additional_config='{"enable_shared_expert_dp": true}'` ##### a.How to run eager mode eg: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true}' ##### b.How to run graph mode eg: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2025-08-12 14:12:12 +08:00
zhenghaojiang	eb43a475f4	[Feat] chunkprefill mla support torchair graph (#1772 ) chunkprefill mla only supports eager mode now, and we want to optimize it by supporting torchair graph. The idea is simple: when all requests are in the decode phase, use torchair graph to handle them; otherwise, for chunkprefill or prefill-only batches, use eager mode - vLLM version: v0.10.0 - vLLM main: `ebf7605b0d` Signed-off-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: haojiangzheng <justineric096@gmail.com>	2025-08-11 19:58:59 +08:00
whx	29aaba5f84	[Perf][MTP] Optimize reject sampler in greedy situation. (#2137 ) This PR ports the optimization in PR #2002 to main and makes it cleaner. - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-08-11 17:37:49 +08:00
yangqinghao-cmss	ee6f79c44a	Add ut for test_communicator.py (#2293 ) ### What this PR does / why we need it? Add ut for test_communicator.py - vLLM version: v0.10.0 - vLLM main: `e5ebeeba53` Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-08-09 08:26:04 +08:00
Mengqing Cao	ad1083761f	[CI][Quickfix] Fix AscendFusedMoE init error (#2268 ) ### What this PR does / why we need it? Fix AscendFusedMoE init error. Use `super().__init__()` instead of `super(FusedMoE, self).__init__()` to ensure the member variables of the base class can be accessed by the child class ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `766bc8162c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-08 10:20:23 +08:00
huangxialu	dceef080b1	[main] remove torch.cat and replace it by List[0] (#2153 ) ### What this PR does / why we need it? torch_npu.npu_grouped_matmul: https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md According to the document, when `split_item` is 2 or 3, `torch_npu.npu_grouped_matmul` will return a list which has one element. Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is unnecessary. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? ut and e2e covered: `tests/ut/ops/test_fused_ops.py`, `tests/e2e/singlecard/ops/test_fused_moe.py` performance: (qwen3 30B, 2k->20k) base: Total Token throughput (tok/s): 667.76 remove cat: Total Token throughput (tok/s): 680.82 - vLLM version: v0.10.0 - vLLM main: `fa00c5d75b` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-08-07 17:20:19 +08:00
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
xuyexiong	26fc36b0e0	[V1] MTP supports torchair (#2145 ) ### What this PR does / why we need it? Support MTP with： - [x] V0 Scheduler - [x] TorchAir - [x] Single DP - [x] Multi DP - [x] Disaggregate PD Known issues： - [ ] Not support V1 Scheduler (chunked prefill), will be supported in a few weeks - [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now, need to comment out the line 171-175 in file `vllm/vllm/v1/metrics/loggers.py` ``` if (len(self.engine_indexes) > 1 and vllm_config.speculative_config is not None): raise NotImplementedError("Prometheus metrics with Spec Decoding " "with >1 EngineCore per AsyncLLM is not " "supported yet.") ``` To start an online server with torchair enabled, here is an example: ``` python -m vllm.entrypoints.openai.api_server \ --model="/weights/DeepSeek-R1_w8a8/" \ --trust-remote-code \ --max-model-len 40000 \ --tensor-parallel-size 4 \ --data_parallel_size 4 \ --max-num-seqs 16 \ --no-enable-prefix-caching \ --enable_expert_parallel \ --served-model-name deepseekr1 \ --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \ --quantization ascend \ --host 0.0.0.0 \ --port 1234 \ --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \ --gpu_memory_utilization 0.9 ``` offline example with torchair enabled ``` from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=16, temperature=0) # Create an LLM. 
llm = LLM( model="/home/data/DeepSeek-R1_w8a8/", tensor_parallel_size=16, max_num_seqs=16, gpu_memory_utilization=0.9, distributed_executor_backend="mp", enable_expert_parallel=True, speculative_config={ "method": "deepseek_mtp", "num_speculative_tokens": 1, }, trust_remote_code=True, enforce_eager=False, max_model_len=2000, additional_config = { 'torchair_graph_config': { 'enabled': True, "graph_batch_sizes": [16], 'enable_multistream_shared_expert': False, }, "ascend_scheduler_config": { "enabled": True }, # 'expert_tensor_parallel_size': 16, } ) # Generate texts from the prompts. # llm.start_profile() outputs = llm.generate(prompts, sampling_params) # llm.stop_profile() for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.10.0 - vLLM main: `302962e806` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-08-06 19:37:43 +08:00
Wang Kunpeng	8a59367d0c	[main][Feature] Support deepseek w4a8 quantization (#2172 ) ### What this PR does / why we need it? Supports Deepseek-R1 w4a8 quantization. Since R1 w4a8 uses mixed quantization, only the MOE layer uses w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which includes the AscendW4A8DynamicFusedMoEMethod class. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and `tests/ut/quantization/test_quantizer.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC` to test deepseek w4a8_dynamic quantized model #### 1.How to get weights using Modelslim ##### Installation steps Use the branch master, the commit id is: 298e175d69b3b855111a1e09bbe2fcd12fdb4e24 git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim bash install.sh ##### The required transformers environment transformers>=4.48.2 ##### Generate w4a8 weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md Execute the [pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96) chapter Reference command：python3 quant_deepseek_w4a8.py --model_path {Original weight path} --save_path {Generate weight path} --mindie_format ##### Adapt to vllm-ascend Since mindie_format generates mindie format, some adaptation modifications are needed for vllm-ascend to use it: `quant_model_description_w8a8_dynamic.json` rename to 
`quant_model_description.json`, and add `"group_size": 256` Modification in `config.json`：`"model_type":deepseekv2` is changed to `"model_type":deepseek_v3`; `quantization_config` is removed; tips:The group_size and weights match. If the w4a8 weights are not generated using msmodelslim, you can check the group_size in quantization_config in config.json. #### 2.How to run w4a8 ##### a.How to run eager mode export VLLM_USE_V1=1 # v1 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager eg: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --max-num-seqs 128 --enforce-eager ##### b.How to run graph mode export VLLM_USE_V1=1 # v1 export HCCL_BUFFSIZE=1024 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' eg: python -m vllm.entrypoints.openai.api_server --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-06 10:17:44 +08:00
sherie	126cdfc92b	[Test] add rejection sampler ut (#2084 ) ### What this PR does / why we need it? add rejection sampler ut. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT passed - vLLM version: v0.10.0 - vLLM main: `586f286789` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-08-05 19:03:36 +08:00
Slightwind	f3b50c54e8	[main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (#2195 ) This PR significantly optimizes performance for quantized Mixture of Experts (MoE) layers by changing the order of quantization and communication operations. In the previous implementation, the `all2all` operation was performed on unquantized `hidden_states` (in FP16/BF16) before quantization, resulting in substantial communication overhead. By performing quantization on each EP rank first and then sending the much smaller quantized data, we reduce the communication volume by nearly 50%. Additionally, this PR includes a minor optimization to cast `int` inputs to `float` for the `argsort` operation, forcing it to run on a faster NPU core instead of the AICPU. These changes lead to a clear and significant performance gain in MoE quantization scenarios. - vLLM version: v0.10.0 - vLLM main: `7175817637` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-08-05 18:47:13 +08:00
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1, and cleanup infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837), this patch depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ install depending: ``` pip3 install pyyaml pip3 install setuptools ``` install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
wangxiyuan	36e450eb0f	[Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097 ) We recently added the disaggregated_prefill and ascend_forward_context features in `ba3dfbd59e` and `df0ec55162`. This PR fixes some nits introduced by them to make the code clear. 1. Drop `current_platform` usage. It can lead to an unknown circular import error in some cases 2. Update the `set_ascend_forward_context` function to make the logic clear, for example by removing V0 support in this function. 3. Remove the unused `self.local_rank_across_dp` in worker 4. Remove `soc_info.py` and use `get_ascend_soc_version` instead. - vLLM version: v0.10.0 - vLLM main: `02f82fe438` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 08:39:02 +08:00
Li Wang	ad366bf908	[Bugfix] Follow vLLM Qwen-Moe/VL and KV Connector change to fix broken CI (#2181 ) ### What this PR does / why we need it? This PR fixes broken CI: 1. Adapt to the `ee2eb6ecd8` changes. That commit fused the gate and up projections in the vision MLP, which can improve performance by removing one matrix multiplication. So this PR does the following: - Specify that the two linear layers are fused as `mlp.gate_up_proj` when loading the weights. - Use a SiluAndMul activation function. 2. Adapt to `aefeea0fde`: update ModelRunnerOutput parameters to follow its changes 3. Adapt to [vllm-commit](https://github.com/vllm-project/vllm/pull/20815/files#diff-3ffb829a39ab2b3e4706aa28f5e476815f36c3a87b98d6a66514ebedc8f3ffb4R354-R356) to fix qwen moe ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `fed5849d3f` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-04 21:37:50 +08:00
CaveNightingale	957c7f108d	[Bugfix][PD] Make multiple Ps and Ds work on a single machine (#2080 ) (cherry picked from commit 816375e0c1071d0696dfab1a1ce35674f9f37aa0) ### What this PR does / why we need it? Suppose that you want to start a prefiller instance with npus `2,3` only. So you start the instance with `ASCEND_RT_VISIBLE_DEVICES=2,3`. The current implementation will start two workers, whose ranks are `0` and `1` respectively. They will pick the first and second ip addresses of npus in the ranktable instead of the third and fourth ones. But they are actually using cards `2,3`, and therefore they cannot link with remote instances when they attempt to transfer the KVCache. Hence, at most 1 prefiller instance and at most 1 decoder instance can work on a single machine, since they always pick the first npu ip address in the ranktable currently. This pull request fixes the problem by picking, from the ranktable, the ips of only those devices that are in `ASCEND_RT_VISIBLE_DEVICES`. ### Does this PR introduce _any_ user-facing change? If the user uses a ranktable generated by `gen_ranktable.sh`, they should not face any change. ### How was this patch tested? Qwen-0.6B 1P 1D, dp=2, `ASCEND_RT_VISIBLE_DEVICES=2,3` for prefiller and `ASCEND_RT_VISIBLE_DEVICES=4,5` for decoder. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>	2025-08-04 17:22:18 +08:00
YuanCheng-coder	ddaded1537	Add ut for envs.py (#2131 ) What this PR does / why we need it? Test vllm_ascend/envs.py, which contains the environment variable definitions. Does this PR introduce any user-facing change? N/A How was this patch tested? CI passed with new added test. vLLM version: v0.10.0 vLLM main: `9532a6d563` - vLLM version: v0.10.0 - vLLM main: `b4e081cb15` --------- Signed-off-by: chengyuan <chengyuan27@huawei.com> Co-authored-by: chengyuan <chengyuan27@huawei.com>	2025-08-02 16:53:44 +08:00
weijinqian0	6e00aed4d5	[main][Feature]Moe alltoallv communication optimization for unquantized RL training scene (#2088 ) It comes from 0.9.1dev [0.9.1][Feature]Moe alltoallv communication optimization for unquantized RL training scene & alltoallv support dpo (#1547) - vLLM version: v0.10.0 - vLLM main: `97608dc276` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: curryliu <120010041@link.cuhk.edu.cn> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com> Signed-off-by: taoxudonghaha <justsheldon@163.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: TaoYu Chen <ctynb@qq.com> Co-authored-by: taoxudonghaha <justsheldon@163.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-08-02 09:49:10 +08:00
22dimensions	8cf97d8310	[Misc] Add extra checking to torchair_graph_config. (#1939 ) ### What this PR does / why we need it? cherry-pick #1675 to main This PR adds validation checking to torchair_graph_config for better reliability. Co-authored-by: whx-sjtu <2952154980@qq.com> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:24:11 +08:00
huangxialu	9c9a7cd90b	[main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112 ) backport of v0.9.1-dev: https://github.com/vllm-project/vllm-ascend/pull/1902 origin main npu_moe_gating_top_k_softmax: https://github.com/vllm-project/vllm-ascend/pull/1355 - vLLM version: v0.10.0 - vLLM main: `055bd3978e` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-31 21:05:56 +08:00
Ronald1995	e8660d7978	ut:add ut for qwen2_5_vl (#2143 ) ### What this PR does / why we need it? add ut for qwen2_5_vl ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-31 20:46:17 +08:00
daniel	db310c6ec9	add ut for device allocator/camem and multistream/layers (#2037 ) What this PR does / why we need it? Test device allocator/camem and multistream/layers, which contain resource allocation and stream ops. Does this PR introduce any user-facing change? N/A How was this patch tested? CI passed with new added test. - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2025-07-31 19:17:27 +08:00
CaranLic	7c90ba5fe8	[Test] add ut for decorator.py/deepseek_mtp.py (#2127 ) ### What this PR does / why we need it? add ut for decorator.py/deepseek_mtp.py ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new tests - vLLM version: v0.10.0 - vLLM main: `055bd3978e` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-07-31 15:21:15 +08:00
Joey Gao	6192bc95c0	[Bugfix] fix tensor not same device in qwen2_5_vl_without_padding (#2051 ) bugfix cherry-pick from v0.9.1-dev https://github.com/vllm-project/vllm-ascend/pull/2007 ### What this PR does / why we need it? Minimal reproducing code: ```python # test.py from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240) outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` ```bash export USE_OPTIMIZED_MODEL=0 python test.py ``` The exception is as follows: ``` [rank0]: File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward [rank0]: q = torch_npu.npu_rotary_mul(q, cos, sin) [rank0]: File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__ [rank0]: return self._op(*args, **(kwargs or {})) [rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul) ``` In `AscendQwen2_5_VisionAttention_Without_Padding`, `torch_npu.npu_rotary_mul(q, cos, sin)` is called with `cos`/`sin` on the CPU but `q` on the NPU, which raises the error. `qwen2_5_vl_without_padding.py` needs this bugfix because `AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in `qwen2_5_vl_without_padding.py` comes from vLLM, where `inv_freq` is created on the CPU. `40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)` ```python inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim)) ``` `qwen2_5_vl.py` does not need it, because `AscendQwen2_5_VisionRotaryEmbedding` in `qwen2_5_vl.py` reimplements the rotary embedding, so `inv_freq` is created on the device. 
```python inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim)) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: pjgao <gaopengju3@huawei.com> Co-authored-by: pjgao <gaopengju3@huawei.com>	2025-07-31 15:18:54 +08:00
Ronald1995	3386e09a40	ut:add ut for qwen2_vl.py (#2096 ) ### What this PR does / why we need it? add ut for qwen2_vl.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `555e7225bc` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-30 22:31:47 +08:00
Ruri	4fcca137a7	[main][Feature] Support Qwen3 W4A8 quantization (#2060 ) ### What this PR does / why we need it? Adding `W4A8_DYNAMIC` quantization support for linear. Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC` to test qwen3 w4a8_dynamic quantized model Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim` of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409` 1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim` ```shell git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409 bash install.sh ``` 2. Serve model using `vllm` ```shell VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \ --model vllm-ascend/Qwen3-8B-W4A8 \ --port 8000 \ --quantization ascend \ --tensor_parallel_size 2 \ --enforce-eager ``` - vLLM version: v0.10.0 - vLLM main: `4cd7fe6cea` --------- Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>	2025-07-30 14:57:14 +08:00
YuanCheng-coder	34dd24adf2	add ut for vocab_parallel_embedding (#2067 ) ### What this PR does / why we need it? Test vllm_ascend/ops/vocab_parallel_embedding.py, which contains the vocab parallel embedding forward. CI passed with new added test. vLLM version: v0.10.0 vLLM main: `2cc571199b` - vLLM version: v0.10.0 - vLLM main: `05cbbe20c5` Signed-off-by: chengyuan <chengyuan27@huawei.com> Co-authored-by: chengyuan <chengyuan27@huawei.com>	2025-07-30 14:35:45 +08:00
wangxiyuan	9b67c87b14	[Refactor]Refactor sampler (#2050 ) Refactor the Sampler implementation from the patch approach to inheriting from the vLLM Sampler interface. Next step: Make the op `TopKTopPSampler` in vLLM support the custom ops register mechanism - vLLM version: v0.10.0 - vLLM main: `61a6905ab0` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-30 08:47:22 +08:00
whx	98cadc2146	[Perf] Avoid performing index selection of sin/cos cache every layer (#1890 ) Optimize number of index selections of sin/cos cache. - vLLM version: v0.10.0 - vLLM main: `656c24f1b5` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-07-29 18:06:45 +08:00
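The idea behind #1890 can be sketched in plain Python: gather the cos rows for the current positions once per forward pass and hand the result to every layer, instead of repeating the same gather inside each layer. The class and function names below are hypothetical and the cache is a toy list of lists, so this only shows the hoisting pattern, not the actual implementation.

```python
import math


def build_cos_cache(max_pos, dim, theta=10000.0):
    """Precompute cos(position * inv_freq) for every position."""
    inv_freq = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    return [[math.cos(p * f) for f in inv_freq] for p in range(max_pos)]


class CountingCache:
    """Wraps a cache table and counts how often rows are gathered."""

    def __init__(self, table):
        self.table = table
        self.selects = 0

    def index_select(self, positions):
        self.selects += 1
        return [self.table[p] for p in positions]


def forward_per_layer(cache, positions, num_layers):
    """Baseline: every layer gathers the same rows again."""
    rows = None
    for _ in range(num_layers):
        rows = cache.index_select(positions)
    return rows


def forward_hoisted(cache, positions, num_layers):
    """Gather once, then let every layer reuse the result."""
    rows = cache.index_select(positions)
    for _ in range(num_layers):
        pass  # each layer would consume `rows` here
    return rows


table = build_cos_cache(max_pos=16, dim=8)
positions = [3, 7, 11]

baseline = CountingCache(table)
hoisted = CountingCache(table)
baseline_rows = forward_per_layer(baseline, positions, num_layers=4)
hoisted_rows = forward_hoisted(hoisted, positions, num_layers=4)
```

The gathered rows are identical in both variants because the positions do not change between layers within one forward pass; only the number of gathers differs.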

100 Commits