xc-llm-ascend

Author	SHA1	Message	Date
huangxialu	dceef080b1	[main] remove torch.cat and replace it by List[0] (#2153 ) ### What this PR does / why we need it? torch_npu.npu_grouped_matmul: https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md According to the document, when `split_item` is 2 or 3, `torch_npu.npu_grouped_matmul` will return a list which has one element. Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is unnecessary. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? ut and e2e covered: `tests/ut/ops/test_fused_ops.py`, `tests/e2e/singlecard/ops/test_fused_moe.py` performance: (qwen3 30B, 2k->20k) base: Total Token throughput (tok/s): 667.76 remove cat: Total Token throughput (tok/s): 680.82 - vLLM version: v0.10.0 - vLLM main: `fa00c5d75b` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-08-07 17:20:19 +08:00
Ronald1995	b2598c3271	enable mm allreduce test (#2192 ) ### What this PR does / why we need it? This PR is to add e2e test for using npu_mm_all_reduce_base fusion kernel. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `5d5d419ca6` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-08-07 17:19:23 +08:00
Mengqing Cao	4604882a3e	[ReleaseNote] Release note of v0.10.0rc1 (#2225 ) ### What this PR does / why we need it? Release note of v0.10.0rc1 - vLLM version: v0.10.0 - vLLM main: `8e8e0b6af1` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-07 14:46:49 +08:00
Yikun Jiang	58c8d4fdcd	Remove transformer pins for v0.9.1-dev (#2234 ) ### What this PR does / why we need it? Remove the transformers pins for v0.9.1-dev, because v0.9.1rc2 has already been released with the correct transformers version. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest CI passed - vLLM version: v0.10.0 - vLLM main: `7e6544c797` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-07 14:41:10 +08:00
zhangxinyuehfad	92eebc0c9b	[Doc] Update user guide for suported models (#2263 ) ### What this PR does / why we need it? Update the user guide for supported models. - vLLM version: v0.10.0 - vLLM main: `4be02a3776` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:39:51 +08:00
22dimensions	440d28a138	[Tutorial] Add qwen3 8b w4a8 tutorial (#2249 ) ### What this PR does / why we need it? Add a new single-NPU quantization tutorial that uses the latest Qwen3 model. - vLLM version: v0.10.0 - vLLM main: `8e8e0b6af1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-07 14:39:38 +08:00
zhangxinyuehfad	bcd0b532f5	[Doc] Update user guide for using lm-eval (#1325 ) ### What this PR does / why we need it? Update user guide for using lm-eval 1. add using lm-eval on online server 2. add using offline datasets - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:15:49 +08:00
zhangxinyuehfad	dbba3cabb0	[Doc] Update tutorials for single_npu_audio and single_npu_multimodal (#2252 ) ### What this PR does / why we need it? Update tutorials for single_npu_audio and single_npu_multimodal - vLLM version: v0.10.0 - vLLM main: `6b47ef24de` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:08:14 +08:00
Li Wang	205eff2b12	[Bugfix] Disable check vllm init temporary (#2250 ) ### What this PR does / why we need it? The vLLM source directory https://github.com/vllm-project/vllm/tree/main/vllm/attention/layers does not have an `__init__.py`, which breaks the Python source init check, so we skip the check for now. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `6b47ef24de` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-07 10:37:22 +08:00
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
Li Wang	57b9f02185	[Bugfix] Fix disaggregated pd error (#2242 ) ### What this PR does / why we need it? Fix `ascend_env has no attr VLLM_ASCEND_ENABLE_CHUNK_MC2`, remove useless lines - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:48:10 +08:00
xuyexiong	26fc36b0e0	[V1] MTP supports torchair (#2145 ) ### What this PR does / why we need it? Support MTP with： - [x] V0 Scheduler - [x] TorchAir - [x] Single DP - [x] Multi DP - [x] Disaggregate PD Known issues： - [ ] Not support V1 Scheduler (chunked prefill), will be supported in a few weeks - [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now, need to comment out the line 171-175 in file `vllm/vllm/v1/metrics/loggers.py` ``` if (len(self.engine_indexes) > 1 and vllm_config.speculative_config is not None): raise NotImplementedError("Prometheus metrics with Spec Decoding " "with >1 EngineCore per AsyncLLM is not " "supported yet.") ``` To start an online server with torchair enabled, here is an example: ``` python -m vllm.entrypoints.openai.api_server \ --model="/weights/DeepSeek-R1_w8a8/" \ --trust-remote-code \ --max-model-len 40000 \ --tensor-parallel-size 4 \ --data_parallel_size 4 \ --max-num-seqs 16 \ --no-enable-prefix-caching \ --enable_expert_parallel \ --served-model-name deepseekr1 \ --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \ --quantization ascend \ --host 0.0.0.0 \ --port 1234 \ --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \ --gpu_memory_utilization 0.9 ``` offline example with torchair enabled ``` from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=16, temperature=0) # Create an LLM. 
llm = LLM( model="/home/data/DeepSeek-R1_w8a8/", tensor_parallel_size=16, max_num_seqs=16, gpu_memory_utilization=0.9, distributed_executor_backend="mp", enable_expert_parallel=True, speculative_config={ "method": "deepseek_mtp", "num_speculative_tokens": 1, }, trust_remote_code=True, enforce_eager=False, max_model_len=2000, additional_config = { 'torchair_graph_config': { 'enabled': True, "graph_batch_sizes": [16], 'enable_multistream_shared_expert': False, }, "ascend_scheduler_config": { "enabled": True }, # 'expert_tensor_parallel_size': 16, } ) # Generate texts from the prompts. # llm.start_profile() outputs = llm.generate(prompts, sampling_params) # llm.stop_profile() for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.10.0 - vLLM main: `302962e806` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-08-06 19:37:43 +08:00
Li Wang	bf84f2dbfa	[Doc] Support kimi-k2-w8a8 (#2162 ) ### What this PR does / why we need it? In fact, the kimi-k2 model is similar to the deepseek model, and we only need to make a few changes to support it. what does this pr do: 1. Add kimi-k2-w8a8 deployment doc 2. Update quantization doc 3. Upgrade torchair support list ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:28:47 +08:00
huangxialu	875a86cbe9	ut: add example and e2e test for sleepmode in external_launcher (#2152 ) ### What this PR does / why we need it? This PR adds an e2e test case to make sure that sleep mode works in external_launcher. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `74333ae2f6` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-08-06 11:11:53 +08:00
Wang Kunpeng	8a59367d0c	[main][Feature] Support deepseek w4a8 quantization (#2172 ) ### What this PR does / why we need it? Supports Deepseek-R1 w4a8 quantization. Since R1 w4a8 uses mixed quantization, only the MOE layer uses w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which includes the AscendW4A8DynamicFusedMoEMethod class. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and `tests/ut/quantization/test_quantizer.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC` to test deepseek w4a8_dynamic quantized model #### 1.How to get weights using Modelslim ##### Installation steps Use the branch master, the commit id is: 298e175d69b3b855111a1e09bbe2fcd12fdb4e24 git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim bash install.sh ##### The required transformers environment transformers>=4.48.2 ##### Generate w4a8 weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md Execute the [pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96) chapter Reference command：python3 quant_deepseek_w4a8.py --model_path {Original weight path} --save_path {Generate weight path} --mindie_format ##### Adapt to vllm-ascend Since mindie_format generates mindie format, some adaptation modifications are needed for vllm-ascend to use it: `quant_model_description_w8a8_dynamic.json` rename to 
`quant_model_description.json`, and add `"group_size": 256` Modification in `config.json`：`"model_type":deepseekv2` is changed to `"model_type":deepseek_v3`; `quantization_config` is removed; tips:The group_size and weights match. If the w4a8 weights are not generated using msmodelslim, you can check the group_size in quantization_config in config.json. #### 2.How to run w4a8 ##### a.How to run eager mode export VLLM_USE_V1=1 # v1 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager eg: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --max-num-seqs 128 --enforce-eager ##### b.How to run graph mode export VLLM_USE_V1=1 # v1 export HCCL_BUFFSIZE=1024 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' eg: python -m vllm.entrypoints.openai.api_server --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-06 10:17:44 +08:00
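The "Adapt to vllm-ascend" steps in the commit above (rename the description file, add `group_size`, patch `config.json`) could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the function name `adapt_mindie_weights` is hypothetical, and it assumes `group_size` belongs at the top level of `quant_model_description.json`, which the commit message does not spell out.

```python
import json
from pathlib import Path


def adapt_mindie_weights(weights_dir: str, group_size: int = 256) -> None:
    """Apply the manual edits listed in the commit message to a weights directory.

    Hypothetical helper for illustration; not part of vllm-ascend or msmodelslim.
    """
    root = Path(weights_dir)

    # Step 1: rename the description file and record the group size.
    src = root / "quant_model_description_w8a8_dynamic.json"
    dst = root / "quant_model_description.json"
    description = json.loads(src.read_text(encoding="utf-8"))
    description["group_size"] = group_size  # assumption: top-level key
    dst.write_text(json.dumps(description, indent=2), encoding="utf-8")
    src.unlink()

    # Step 2: patch config.json (model_type rename, drop quantization_config).
    config_path = root / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if config.get("model_type") == "deepseekv2":
        config["model_type"] = "deepseek_v3"
    config.pop("quantization_config", None)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```

As the tip in the commit notes, the `group_size` value must match the weights, so the default of 256 only applies to weights generated with the msmodelslim command shown there.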
Ruri	e31b31f9c3	[main][Bugfix] Fix unable to load qwen3_moe quantized weights (#2219 ) ### What this PR does / why we need it? Fixes unable to load `qwen3_moe` quantized weights issue due to #1994 ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Add a `qwen3_moe` W8A8 quantized model in `tests/e2e/multicard/test_qwen3_moe.py` - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>	2025-08-06 09:08:36 +08:00
Yikun Jiang	54ace9e12b	Add release note for v0.9.1rc2 (#2188 ) ### What this PR does / why we need it? Add release note for v0.9.1rc2 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-06 09:04:46 +08:00
sherie	126cdfc92b	[Test] add rejection sampler ut (#2084 ) ### What this PR does / why we need it? add rejection sampler ut. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT passed - vLLM version: v0.10.0 - vLLM main: `586f286789` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-08-05 19:03:36 +08:00
Slightwind	f3b50c54e8	[main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (#2195 ) This PR significantly optimizes performance for quantized Mixture of Experts (MoE) layers by changing the order of quantization and communication operations. In the previous implementation, the `all2all` operation was performed on unquantized `hidden_states` (in FP16/BF16) before quantization, resulting in substantial communication overhead. By performing quantization on each EP rank first and then sending the much smaller quantized data, we reduce the communication volume by nearly 50%. Additionally, this PR includes a minor optimization to cast `int` inputs to `float` for the `argsort` operation, forcing it to run on a faster NPU core instead of the AICPU. These changes lead to a clear and significant performance gain in MoE quantization scenarios. - vLLM version: v0.10.0 - vLLM main: `7175817637` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-08-05 18:47:13 +08:00
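The "nearly 50%" figure in the commit above follows from simple byte counting, which the sketch below reproduces. It is an illustration only: the function name is hypothetical, and it assumes 2-byte (FP16/BF16) activations, 1-byte int8 quantized activations, and one 4-byte scale per token, none of which is stated beyond the dtype names in the commit message.

```python
def all2all_payload_bytes(
    num_tokens: int,
    hidden_size: int,
    quantize_first: bool,
    act_bytes: int = 2,    # FP16/BF16 activations
    scale_bytes: int = 4,  # assumption: one float32 scale per token
) -> int:
    """Bytes of hidden_states sent through all2all for one batch of tokens."""
    if quantize_first:
        # int8 payload (1 byte per element) plus the per-token scales.
        return num_tokens * hidden_size + num_tokens * scale_bytes
    # Unquantized payload: every element is sent in FP16/BF16.
    return num_tokens * hidden_size * act_bytes


before = all2all_payload_bytes(1024, 7168, quantize_first=False)
after = all2all_payload_bytes(1024, 7168, quantize_first=True)
print(f"before={before} after={after} ratio={after / before:.4f}")
```

The ratio stays slightly above one half because the scales still have to travel with the quantized data.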
wangxiyuan	292fb8f696	[1/N][Refactor] torchair model runner refactor (#2205 ) There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out the torchair-related logic. Following the workflow in #2203, this is the first PR. What this PR does: create the new torchair model runner; more functions will be added later. - vLLM version: v0.10.0 - vLLM main: `586f286789` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 18:43:04 +08:00
wangxiyuan	458ab2db12	[BugFix] Fix the bug that qwen3 moe doesn't work with aclgraph (#2183 ) What this PR does: 1. Move AscendSparseMoeBlock to the qwen3 model, since it is only used by that model. 2. Disable AscendSparseMoeBlock if aclgraph is enabled, because AscendSparseMoeBlock doesn't currently work with aclgraph. - vLLM version: v0.10.0 - vLLM main: `cdfd6871a5` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 17:42:52 +08:00
jinyuxin	583ad8f347	[main][refractor] Refractor forward metadata retrieval across DP nodes to reduce redundant padding. (#2062 ) Before refactoring cross-DP decoding metadata aggregation, clean up the token-padding logic. ### What this PR does: 1. First checks whether any DP instance is in the prefill phase. 2. If in the `decode` phase and `torchair_graph_enabled` is true, pads each DP instance's token count up to the global maximum. 3. If in the `prefill` phase, or in the decode phase with graph mode disabled, returns each DP instance's original token count without padding. This reordering removes the previous two-step padding/unpadding flow and ensures padding only occurs when strictly necessary. - vLLM version: v0.10.0 - vLLM main: `bd3db7f469` Signed-off-by: yx0716 <jinyx1007@foxmail.com> Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-05 17:03:36 +08:00
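The three-step decision described in the commit above can be sketched in plain Python. The function and argument names below are hypothetical and do not come from the vllm-ascend code; the real implementation exchanges this metadata across DP ranks with collective communication, which is replaced here by ordinary lists.

```python
from typing import List


def tokens_across_dp(
    num_tokens_per_dp: List[int],
    with_prefill_per_dp: List[bool],
    torchair_graph_enabled: bool,
) -> List[int]:
    """Return the token count each DP instance should run with."""
    # Step 1: check whether any DP instance is in the prefill phase.
    any_prefill = any(with_prefill_per_dp)

    # Step 2: pure decode with graph mode enabled, so pad to the global maximum.
    if not any_prefill and torchair_graph_enabled:
        max_tokens = max(num_tokens_per_dp)
        return [max_tokens] * len(num_tokens_per_dp)

    # Step 3: prefill, or decode without graph mode, so keep original counts.
    return list(num_tokens_per_dp)


print(tokens_across_dp([3, 7, 5], [False, False, False], True))
print(tokens_across_dp([3, 7, 5], [False, True, False], True))
```

Deciding once up front whether padding is needed is what removes the pad-then-unpad round trip mentioned in the commit message.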
xleoken	27c2b5c145	[Doc] Update pytorch version in README_zh doc (#2202 ) ### What this PR does / why we need it? Update pytorch version in README_zh doc. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local Test. - vLLM version: v0.10.0 - vLLM main: `bd3db7f469` Signed-off-by: xleoken <xleoken@163.com>	2025-08-05 11:13:49 +08:00
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1 and clean up the infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837). This patch also depends on: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ Install dependencies: ``` pip3 install pyyaml pip3 install setuptools ``` Install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
wangxiyuan	36e450eb0f	[Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097 ) We recently added the disaggregated_prefill and ascend_forward_context features in `ba3dfbd59e` and `df0ec55162`. This PR fixes some nits introduced by them to make the code clearer. 1. Drop `current_platform` usage, which can lead to circular import errors in some cases. 2. Update the `set_ascend_forward_context` function to make the logic clearer, for example by removing V0 support from this function. 3. Remove the unused `self.local_rank_across_dp` in the worker. 4. Remove `soc_info.py` and use `get_ascend_soc_version` instead. - vLLM version: v0.10.0 - vLLM main: `02f82fe438` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 08:39:02 +08:00
Li Wang	ad366bf908	[Bugfix] Follow vLLM Qwen-Moe/VL and KV Connector change to fix broken CI (#2181 ) ### What this PR does / why we need it? This PR fixes the broken CI: 1. Adapt to `ee2eb6ecd8`, which fused the gate and up projections in the vision MLP to improve performance by removing one matrix multiplication. This PR therefore does the following: - Specify that the two linear layers are fused as `mlp.gate_up_proj` when loading the weights. - Use a SiluAndMul activation function. 2. Adapt to `aefeea0fde` by updating the ModelRunnerOutput parameters to match its changes. 3. Adapt to [vllm-commit](https://github.com/vllm-project/vllm/pull/20815/files#diff-3ffb829a39ab2b3e4706aa28f5e476815f36c3a87b98d6a66514ebedc8f3ffb4R354-R356) to fix Qwen MoE. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `fed5849d3f` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-04 21:37:50 +08:00
hucong	e38fab011d	[Doc][PD] Restore the default configuration items in examples/disaggregate_prefill_v1/README.md (#2165 ) ### What this PR does / why we need it? - In the D node, the max-num-batched-tokens parameter can be set to a smaller value since the D node processes at most max-num-seqs batches concurrently. As the profile_run only needs to handle max-num-seqs sequences at a time, we can safely set max-num-batched-tokens equal to max-num-seqs. This optimization will help reduce activation memory consumption. - Restore the default configuration items for PD separation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `61dcc280fa` Signed-off-by: underfituu <hzhucong@163.com>	2025-08-04 20:30:53 +08:00
CaveNightingale	957c7f108d	[Bugfix][PD] Make multiple Ps and Ds work on a single machine (#2080 ) (cherry picked from commit 816375e0c1071d0696dfab1a1ce35674f9f37aa0) ### What this PR does / why we need it? Suppose that you want to start a prefiller instance with NPUs `2,3` only, so you start the instance with `ASCEND_RT_VISIBLE_DEVICES=2,3`. The current implementation starts two workers, whose ranks are `0` and `1` respectively, and they pick the first and second NPU IP addresses in the ranktable instead of the third and fourth ones. But they are actually using cards `2,3`, so they cannot connect to remote instances when they attempt to transfer the KVCache. Hence, at most one prefiller instance and at most one decoder instance can work on a single machine, since they always pick the first NPU IP addresses in the ranktable. This pull request fixes the problem by picking, from the ranktable, only the IPs of the devices listed in `ASCEND_RT_VISIBLE_DEVICES`. ### Does this PR introduce _any_ user-facing change? If the user uses a ranktable generated by `gen_ranktable.sh`, they should not see any change. ### How was this patch tested? Qwen-0.6B 1P 1D, dp=2, `ASCEND_RT_VISIBLE_DEVICES=2,3` for prefiller and `ASCEND_RT_VISIBLE_DEVICES=4,5` for decoder. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>	2025-08-04 17:22:18 +08:00
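The device-selection idea in the commit above can be illustrated as follows. This is a sketch under stated assumptions: the ranktable is modeled as a list of dicts with `device_id` and `device_ip` keys, which is a simplified, hypothetical layout rather than the actual file produced by `gen_ranktable.sh`, and `visible_device_ips` is not a function from the PR.

```python
import os
from typing import Dict, List, Optional


def visible_device_ips(
    ranktable_devices: List[Dict[str, str]],
    visible_devices: Optional[str] = None,
) -> List[str]:
    """Return device IPs in worker-rank order, restricted to the visible devices."""
    if visible_devices is None:
        visible_devices = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "")
    if not visible_devices.strip():
        # No restriction set: every device in the ranktable is usable.
        return [dev["device_ip"] for dev in ranktable_devices]

    ip_by_id = {str(dev["device_id"]): dev["device_ip"] for dev in ranktable_devices}
    wanted = [item.strip() for item in visible_devices.split(",") if item.strip()]
    # Worker rank r uses the r-th visible device, not the r-th ranktable entry.
    return [ip_by_id[device_id] for device_id in wanted]


# Hypothetical ranktable with four devices.
table = [{"device_id": str(i), "device_ip": f"10.0.0.{i + 1}"} for i in range(4)]
print(visible_device_ips(table, "2,3"))
```

With this mapping, worker rank `0` of an instance started with `ASCEND_RT_VISIBLE_DEVICES=2,3` gets the IP of card `2` instead of card `0`.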
yiz-liu	a9480d5f0a	[Fix] Adjust use_aclgraph logic (#2156 ) ### What this PR does / why we need it? Updates the FusedMoE method to determine whether to use ACL Graph based on the `torchair_graph_config` This is equivalent to #2154 on v0.9.1-dev. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-04 15:23:20 +08:00
liu	688350a3bb	[bugfixed] fix the bug when run the inference of quantized ds-w8a8-mtp (#2134 ) When running inference with ds-w8a8-mtp, the error 'ParamllelLMhead has no attribute 'params_dtype'' was reported. 1. Add a wrapper for vocab_parallel_embedding, which fixes the bug when running deepseek-w8a8-mtp. Signed-off-by: curryliu <120010041@link.cuhk.edu.cn> - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>	2025-08-04 15:16:42 +08:00
Pleaplusone	4b3a210c33	Implementation of simple load balance routing proxy server (#1953 ) (#2124 ) ### What this PR does / why we need it? This PR is a cherry-pick from v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1953 It introduces a new example load-balancing proxy server for disaggregated PD. Compared with the original round-robin toy_proxy, it supports a simple token- and kv_cache-aware load-balancing routing strategy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on a real workload and with unit tests. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-04 10:35:53 +08:00
Mengqing Cao	af04ee9e7a	[MoE][Dist] Fix Qwen MoE accuracy bug in DP scenario (#1856 ) ### What this PR does / why we need it? Fix the Qwen MoE accuracy bug in the DP scenario. The `FusedMoE` implementation in vLLM now uses `All2AllManager` to manage the different all2all algorithm branches. The default branch uses `Multicast` in the `dispatch` phase and `all_reduce` in the `combine` phase, neither of which is implemented in vLLM-Ascend. As a result, the default implementation in `base_communicator` is invoked, whose `dispatch` and `combine` operations are empty, which causes the accuracy issue. This PR is a temporary workaround; refactoring all2all in vLLM-Ascend would be a better long-term fix. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-04 10:24:18 +08:00
Pleaplusone	f939381c6f	[Bugfix] Adopt the new changes on disaggregated pd from vllm main branch (#2122 ) ### What this PR does / why we need it? vLLM's main branch merged https://github.com/vllm-project/vllm/pull/21072 and https://github.com/vllm-project/vllm/pull/21473 to support the Ray backend and to fix some rebase bugs from a previous change. Those changes break disaggregated PD in vllm-ascend in some scenarios. This PR adopts those changes so that `llmdatddist_c_mgr_connector` works on the newest vLLM main branch. ### Does this PR introduce _any_ user-facing change? No user-facing change. ### How was this patch tested? Relevant unit tests will be added to verify the functionality of those changes. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-04 10:08:58 +08:00
YuanCheng-coder	ddaded1537	Add ut for envs.py (#2131 ) What this PR does / why we need it? Test vllm_ascend/envs.py, which contains the environment variable definitions. Does this PR introduce any user-facing change? N/A How was this patch tested? CI passed with the newly added test. vLLM version: v0.10.0 vLLM main: `9532a6d563` - vLLM version: v0.10.0 - vLLM main: `b4e081cb15` --------- Signed-off-by: chengyuan <chengyuan27@huawei.com> Co-authored-by: chengyuan <chengyuan27@huawei.com>	2025-08-02 16:53:44 +08:00
xleoken	bea3d5bbb4	[Bug] Fix run bug in run_dp_server.sh (#2139 ) ### What this PR does / why we need it? For `Qwen2.5-0.5B-Instruct` model - the model's total number of attention heads (14) must be divisible by tensor parallel size. (4 -> 2) - the model does not support enable-expert-parallel ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local Test. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: xleoken <xleoken@163.com>	2025-08-02 16:52:12 +08:00
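The divisibility constraint behind the `4 -> 2` change in the commit above is easy to check by hand. The helper below is a hypothetical illustration, not a vLLM API; it simply lists which candidate tensor parallel sizes divide the attention head count evenly.

```python
from typing import Iterable, List


def valid_tp_sizes(num_attention_heads: int, candidates: Iterable[int] = (1, 2, 4, 8)) -> List[int]:
    """Return the candidate tensor parallel sizes that divide the head count evenly."""
    return [tp for tp in candidates if num_attention_heads % tp == 0]


# Qwen2.5-0.5B-Instruct has 14 attention heads according to the commit message.
print(valid_tp_sizes(14))
```

Since 14 is not divisible by 4, a tensor parallel size of 4 is rejected for this model, while 2 is accepted.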
yangqinghao-cmss	47f688a2f0	Change retrieving remote files to local retrieval. (#2141 ) ### What this PR does / why we need it? Using vllm's AudioAsset class to retrieve remote audio files(https://vllm-public-assets.s3.us-west-2.amazonaws.com) is not feasible in some cases; it is recommended to switch to local retrieval. ### How was this patch tested? vllm:main vllm:ascend:main results: ```bash Adding requests: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:04<00:00, 4.62s/it] Processed prompts: 100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████\| 1/1 [00:03<00:00, 3.01s/it, est. speed input: 79.03 toks/s, output: 6.31 toks/s] generated_text: The sport referenced is soccer, and the nursery rhyme is 'Hey Diddle Diddle'. ``` - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-08-02 16:51:22 +08:00
zhangxinyuehfad	e48f32ec59	[CI] Update image for 310p ci (#2155 ) ### What this PR does / why we need it? update the latest image for 310p ci test - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-02 16:46:02 +08:00
leo-pony	e467fe1b77	Add qwen-vl model and sampling feature UT for 310I series (#2168 ) ### What this PR does / why we need it? Add qwen-vl model and sampling feature UT for 310I series - vLLM version: v0.10.0 - vLLM main: `e0f63e4a35` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-02 11:26:12 +08:00
weijinqian0	6e00aed4d5	[main][Feature]Moe alltoallv communication optimization for unquantized RL training sence (#2088 ) It comes from 0.9.1dev [0.9.1][Feature]Moe alltoallv communication optimization for unquantized RL training sence & alltoallv support dpo (#1547) - vLLM version: v0.10.0 - vLLM main: `97608dc276` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: curryliu <120010041@link.cuhk.edu.cn> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com> Signed-off-by: taoxudonghaha <justsheldon@163.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: TaoYu Chen <ctynb@qq.com> Co-authored-by: taoxudonghaha <justsheldon@163.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-08-02 09:49:10 +08:00
leo-pony	f0c1f0c828	[Doc] Add qwen vl example in tutorials for 310I series (#2160 ) ### What this PR does / why we need it? Add qwen vl example in tutorials for 310I series. Model: Qwen2.5-VL-3B-Instruct Accuracy test result, dataset MMM-val: \| \| 910B3 \| 310P3 \| \| --- \| --- \| --- \| \|Summary\|0.455 \| 0.46 \| \|--art_and_design\| 0.558 \| 0.566 \| \|--business\| 0.373 \| 0.366 \| \|--health_and_medicine\|0.513 \| 0.52 \| \|--science\|0.333 \| 0.333 \| \|--tech_and_engineering\|0.362 \| 0.380 \| \|--humanities_and_social_science\|0.691 \| 0.691 \| Function test result: 1. On line: ![image](https://github.com/user-attachments/assets/d81bba61-df28-4676-a246-c5d094815ac7) ![image](https://github.com/user-attachments/assets/0be81628-9999-4ef2-93c1-898b3043e09e) 2. Offline: ![image](https://github.com/user-attachments/assets/603275c1-6ed6-4cfc-a6e2-7726156de087) - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-02 08:58:56 +08:00
22dimensions	8cf97d8310	[Misc] Add extra checking to torchair_graph_config. (#1939 ) ### What this PR does / why we need it? cherry-pick #1675 to main This PR adds validation checking to torchair_graph_config for better reliability. Co-authored-by: whx-sjtu <2952154980@qq.com> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:24:11 +08:00
Li Wang	2284289880	[MISC] Cherry pick #1291 from v0.9.1-dev (#1825 ) ### What this PR does / why we need it? Cherry-pick #1291 from v0.9.1-dev. This PR synchronizes whether `dbo` is enabled across all DP ranks. Specifically, it performs an allreduce across the DP ranks, and `dbo` is enabled only when every DP rank has `enable_dbo` set. Co-authored-by: shikang-hangzhou <459956190@qq.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-01 09:08:45 +08:00
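The "enabled only if every rank agrees" rule in the commit above amounts to a logical AND over per-rank flags. The sketch below simulates that reduction in a single process with a plain list; it is not the PR's code, the function name is hypothetical, and treating the allreduce as a MIN over 0/1 flags is an assumption about one common way to implement such a vote.

```python
from typing import List


def sync_enable_dbo(local_flags: List[bool]) -> bool:
    """Simulate an all-reduce (MIN) over each DP rank's local enable_dbo flag."""
    # With flags encoded as 0/1, the minimum is 1 only if every rank voted 1.
    reduced = min(int(flag) for flag in local_flags)
    return bool(reduced)


print(sync_enable_dbo([True, True, True, True]))
print(sync_enable_dbo([True, False, True, True]))
```

Every rank sees the same reduced value, so all ranks take the same code path even if only one of them could not enable `dbo` locally.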
22dimensions	9e65da990e	[Misc] Add warning for incompatible Ray backend with ACL Graph mode (#2132 ) ### What this PR does / why we need it? cherry-pick #1501 from 0.9.1-dev to main Currently, Ray is not compatible with ACL Graph, so we need to fall back to eager mode when using the Ray backend. co-authored: Yizhou Liu <liu_yizhou@outlook.com> - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:06:09 +08:00
yangqinghao-cmss	99fa0ac882	[BugFix] update the kv transfer config (#2121 ) ### What this PR does / why we need it? The functions KVTransferConfig.from_cli and AscendHcclConnector are missing in the latest vLLM version. To resolve this, I propose modifying the kv_connector to use LLMDataDistCMgrConnector, which depends on [PR #2079](https://github.com/vllm-project/vllm-ascend/pull/2079) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? vllm:main vllm-ascend:mian results: ```bash Adding requests: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 374.27it/s] Processed prompts: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 66.06it/s, est. speed input: 449.08 toks/s, output: 66.51 toks/s] Prefill node is finished. INFO 07-31 09:18:30 [model_runner_v1.py:2282] Graph capturing finished in 36 secs, took 0.21 GiB INFO 07-31 09:18:30 [core.py:201] init engine (profile, create kv cache, warmup model) took 52.49 seconds INFO 07-31 09:18:30 [factory.py:74] Creating v1 connector with name: LLMDataDistCMgrConnector and engine_id: 28c8ced8-575c-4f87-840a-48d04d0edf7e INFO 07-31 09:18:30 [platform.py:157] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode INFO 07-31 09:18:30 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 76 INFO 07-31 09:18:30 [utils.py:359] No adjustment needed for ACL graph batch sizes: Qwen2ForCausalLM model (layers: 24) with 67 sizes INFO 07-31 09:18:30 [llm.py:293] Supported_tasks: ['generate'] Waiting for prefill node to finish... 
Adding requests: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 709.70it/s] Processed prompts: 100%\|██████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:00<00:00, 16.23it/s, est. speed input: 109.70 toks/s, output: 260.01 toks/s] Prompt: 'Hello, how are you today?', Generated text: " I'm a computer program, so I don't have feelings. But I can" Prompt: 'Hi, what is your name?', Generated text: ' I am a computer programmer. I have a question about the programming language I am' Prompt: 'Tell me a very long story.', Generated text: ' I want to read it. I want to read it. I want to read' Prompt: 'what is your favourite book?', Generated text: " I'm sorry, but as an AI language model, I don't have personal" Cleanup prefill resources All process done ``` - vLLM version: v0.10.0 - vLLM main: `9cb497bfa3` Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-08-01 08:56:55 +08:00
Li Wang	968e6791d3	[Misc] Add data preprocess functions to qwen2.5_vl_without_padding (#2148 ) ### What this PR does / why we need it? Cherry-pick #1705 from v0.9.1-dev. Compared with qwen2_5_vl.py, qwen2_5_vl_without_padding.py is missing some functions. This PR adds them: - rot_pos_emb(self, grid_thw: torch.Tensor) - get_window_index(self, grid_thw) - _process_image_input(self, image_input) - _process_video_input(self, video_input) Co-authored-by: zheliuyu [15750543867@163.com](mailto:15750543867@163.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `207b750e19` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-01 08:54:02 +08:00
Li Wang	e3b3ffb875	[Misc] Disable quantization in mindie_turbo (#2147 ) ### What this PR does / why we need it? Cherry-pick #1749 from v0.9.1-dev. Because the vllm-ascend interface has changed so quickly, the quantization function in mindie_turbo is no longer needed and is discarded. Co-authored-by: zouyida [zouyida@huawei.com](mailto:zouyida@huawei.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `207b750e19` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-01 08:53:00 +08:00
leo-pony	c62f346f5d	Fixed 310p failure when using the sampler feature (#2151 ) ### What this PR does / why we need it? Fixed the 310P failure when using the sampler feature. Root cause: torch_npu.npu_top_k_top_p uses the operator aclnnApplyTopKTopP, which currently does not support 310P. The issue was first introduced in #1308. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `207b750e19` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-08-01 08:43:08 +08:00
Icey	86bdde1ca8	Enable pytest and yaml style accuracy test (#2073 ) ### What this PR does / why we need it? This PR enables pytest- and YAML-style accuracy tests. Users can now run the accuracy tests with: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \ --report_output ./benchmarks/accuracy/Qwen3-8B-Base.md pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1970 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-31 21:39:13 +08:00
huangxialu	9c9a7cd90b	[main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112 ) backport of v0.9.1-dev: https://github.com/vllm-project/vllm-ascend/pull/1902 origin main npu_moe_gating_top_k_softmax: https://github.com/vllm-project/vllm-ascend/pull/1355 - vLLM version: v0.10.0 - vLLM main: `055bd3978e` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-31 21:05:56 +08:00
Ronald1995	e8660d7978	ut:add ut for qwen2_5_vl (#2143 ) ### What this PR does / why we need it? add ut for qwen2_5_vl ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-31 20:46:17 +08:00
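
Note on #2132 (Ray backend with ACL Graph mode): the commit message says that Ray is not compatible with ACL Graph, so eager mode is used with the Ray backend and a warning is added. The snippet below is only a minimal sketch of that kind of fallback, not the code from the PR. The function name `resolve_enforce_eager`, its parameters, and the backend strings are hypothetical and chosen for illustration.

```python
import warnings


def resolve_enforce_eager(distributed_backend: str, enforce_eager: bool) -> bool:
    """Return the effective eager-mode flag.

    Hypothetical helper: the names here are illustrative assumptions,
    not identifiers taken from vllm-ascend.
    """
    if distributed_backend == "ray" and not enforce_eager:
        # Graph mode was requested, but the backend cannot support it:
        # warn the user and force eager execution instead.
        warnings.warn(
            "Ray is not compatible with ACL Graph mode; "
            "falling back to eager mode.",
            stacklevel=2,
        )
        return True
    # Any other combination keeps the user's original choice.
    return enforce_eager
```

Under these assumptions, `resolve_enforce_eager("ray", False)` warns and returns `True`, while any non-Ray backend returns the flag unchanged. Warning rather than raising keeps an existing Ray deployment running at the cost of graph-mode performance.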

673 Commits