xc-llm-ascend

Author	SHA1	Message	Date
huangxialu	dceef080b1	[main] remove torch.cat and replace it by List[0] (#2153 ) ### What this PR does / why we need it? torch_npu.npu_grouped_matmul: https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md According to the document, when `split_item` is 2 or 3, `torch_npu.npu_grouped_matmul` will return a list which has one element. Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is unnecessary. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? ut and e2e covered: `tests/ut/ops/test_fused_ops.py`, `tests/e2e/singlecard/ops/test_fused_moe.py` performance: (qwen3 30B, 2k->20k) base: Total Token throughput (tok/s): 667.76 remove cat: Total Token throughput (tok/s): 680.82 - vLLM version: v0.10.0 - vLLM main: `fa00c5d75b` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-08-07 17:20:19 +08:00
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
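The SP commit above replaces AllReduce with Reduce-Scatter followed by AllGather. The sketch below is a single-process simulation in plain Python, not the vllm-ascend implementation. It illustrates why the swap is safe for a per-token operation such as a norm: each rank processes only its own shard, and the gathered result matches the AllReduce path. All function names here are hypothetical, and each "rank" is just a list of floats.

```python
from typing import Callable, List

Vector = List[float]


def all_reduce(rank_inputs: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum of all inputs."""
    total = [sum(vals) for vals in zip(*rank_inputs)]
    return [list(total) for _ in rank_inputs]


def reduce_scatter(rank_inputs: List[Vector]) -> List[Vector]:
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    world = len(rank_inputs)
    total = [sum(vals) for vals in zip(*rank_inputs)]
    shard = len(total) // world  # assumption: length divides evenly
    return [total[r * shard:(r + 1) * shard] for r in range(world)]


def all_gather(rank_shards: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in rank_shards for x in shard]
    return [list(full) for _ in rank_shards]


def sp_pipeline(rank_inputs: List[Vector],
                per_token_op: Callable[[Vector], Vector]) -> List[Vector]:
    """Reduce-Scatter, apply the op on each local shard, then AllGather."""
    shards = [per_token_op(s) for s in reduce_scatter(rank_inputs)]
    return all_gather(shards)


if __name__ == "__main__":
    inputs = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    double = lambda s: [2 * x for x in s]  # stand-in for a per-token norm
    baseline = [double(v) for v in all_reduce(inputs)]
    print(sp_pipeline(inputs, double) == baseline)
```

In the sharded path each rank applies the operation to only `1 / world_size` of the tokens, which is where the computational benefit in the norm comes from.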
xuyexiong	26fc36b0e0	[V1] MTP supports torchair (#2145 ) ### What this PR does / why we need it? Support MTP with： - [x] V0 Scheduler - [x] TorchAir - [x] Single DP - [x] Multi DP - [x] Disaggregate PD Known issues： - [ ] Not support V1 Scheduler (chunked prefill), will be supported in a few weeks - [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now, need to comment out the line 171-175 in file `vllm/vllm/v1/metrics/loggers.py` ``` if (len(self.engine_indexes) > 1 and vllm_config.speculative_config is not None): raise NotImplementedError("Prometheus metrics with Spec Decoding " "with >1 EngineCore per AsyncLLM is not " "supported yet.") ``` To start an online server with torchair enabled, here is an example: ``` python -m vllm.entrypoints.openai.api_server \ --model="/weights/DeepSeek-R1_w8a8/" \ --trust-remote-code \ --max-model-len 40000 \ --tensor-parallel-size 4 \ --data_parallel_size 4 \ --max-num-seqs 16 \ --no-enable-prefix-caching \ --enable_expert_parallel \ --served-model-name deepseekr1 \ --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \ --quantization ascend \ --host 0.0.0.0 \ --port 1234 \ --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \ --gpu_memory_utilization 0.9 ``` offline example with torchair enabled ``` from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=16, temperature=0) # Create an LLM. 
llm = LLM( model="/home/data/DeepSeek-R1_w8a8/", tensor_parallel_size=16, max_num_seqs=16, gpu_memory_utilization=0.9, distributed_executor_backend="mp", enable_expert_parallel=True, speculative_config={ "method": "deepseek_mtp", "num_speculative_tokens": 1, }, trust_remote_code=True, enforce_eager=False, max_model_len=2000, additional_config = { 'torchair_graph_config': { 'enabled': True, "graph_batch_sizes": [16], 'enable_multistream_shared_expert': False, }, "ascend_scheduler_config": { "enabled": True }, # 'expert_tensor_parallel_size': 16, } ) # Generate texts from the prompts. # llm.start_profile() outputs = llm.generate(prompts, sampling_params) # llm.stop_profile() for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.10.0 - vLLM main: `302962e806` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-08-06 19:37:43 +08:00
Wang Kunpeng	8a59367d0c	[main][Feature] Support deepseek w4a8 quantization (#2172 ) ### What this PR does / why we need it? Supports Deepseek-R1 w4a8 quantization. Since R1 w4a8 uses mixed quantization, only the MOE layer uses w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which includes the AscendW4A8DynamicFusedMoEMethod class. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and `tests/ut/quantization/test_quantizer.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC` to test deepseek w4a8_dynamic quantized model #### 1.How to get weights using Modelslim ##### Installation steps Use the branch master, the commit id is: 298e175d69b3b855111a1e09bbe2fcd12fdb4e24 git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim bash install.sh ##### The required transformers environment transformers>=4.48.2 ##### Generate w4a8 weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md Execute the [pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96) chapter Reference command：python3 quant_deepseek_w4a8.py --model_path {Original weight path} --save_path {Generate weight path} --mindie_format ##### Adapt to vllm-ascend Since mindie_format generates mindie format, some adaptation modifications are needed for vllm-ascend to use it: `quant_model_description_w8a8_dynamic.json` rename to 
`quant_model_description.json`, and add `"group_size": 256` Modification in `config.json`：`"model_type":deepseekv2` is changed to `"model_type":deepseek_v3`; `quantization_config` is removed; tips:The group_size and weights match. If the w4a8 weights are not generated using msmodelslim, you can check the group_size in quantization_config in config.json. #### 2.How to run w4a8 ##### a.How to run eager mode export VLLM_USE_V1=1 # v1 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager eg: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --max-num-seqs 128 --enforce-eager ##### b.How to run graph mode export VLLM_USE_V1=1 # v1 export HCCL_BUFFSIZE=1024 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' eg: python -m vllm.entrypoints.openai.api_server --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-06 10:17:44 +08:00
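The "Adapt to vllm-ascend" steps in the w4a8 commit above are manual file edits. A minimal sketch of scripting them with the Python standard library follows. The helper name `adapt_mindie_weights` is hypothetical; the file names, the `group_size` value of 256, and the `model_type` change are taken from the commit message. The group size must match the generated weights.

```python
import json
from pathlib import Path


def adapt_mindie_weights(weight_dir: str, group_size: int = 256) -> None:
    """Apply the renames and config edits described in the commit message."""
    root = Path(weight_dir)

    # 1. Rename the quantization description and add group_size.
    src = root / "quant_model_description_w8a8_dynamic.json"
    dst = root / "quant_model_description.json"
    desc = json.loads(src.read_text(encoding="utf-8"))
    desc["group_size"] = group_size
    dst.write_text(json.dumps(desc, indent=2), encoding="utf-8")
    src.unlink()

    # 2. Patch config.json: fix model_type and drop quantization_config.
    cfg_path = root / "config.json"
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    if cfg.get("model_type") == "deepseekv2":
        cfg["model_type"] = "deepseek_v3"
    cfg.pop("quantization_config", None)
    cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
```

Usage would be `adapt_mindie_weights("/path/to/generated/weights")`, run once after msmodelslim has produced the weights. The script overwrites `config.json` in place, so keeping a backup copy is advisable.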
sherie	126cdfc92b	[Test] add rejection sampler ut (#2084 ) ### What this PR does / why we need it? add rejection sampler ut. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT passed - vLLM version: v0.10.0 - vLLM main: `586f286789` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-08-05 19:03:36 +08:00
Slightwind	f3b50c54e8	[main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (#2195 ) This PR significantly optimizes performance for quantized Mixture of Experts (MoE) layers by changing the order of quantization and communication operations. In the previous implementation, the `all2all` operation was performed on unquantized `hidden_states` (in FP16/BF16) before quantization, resulting in substantial communication overhead. By performing quantization on each EP rank first and then sending the much smaller quantized data, we reduce the communication volume by nearly 50%. Additionally, this PR includes a minor optimization to cast `int` inputs to `float` for the `argsort` operation, forcing it to run on a faster NPU core instead of the AICPU. These changes lead to a clear and significant performance gain in MoE quantization scenarios. - vLLM version: v0.10.0 - vLLM main: `7175817637` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-08-05 18:47:13 +08:00
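The "nearly 50%" figure in the commit above can be checked with simple arithmetic. The sketch below assumes that activations are quantized to int8 (1 byte per element) with one float32 scale per token, against 2 bytes per element for BF16/FP16. The commit does not specify the scale layout, the token count, or the hidden size, so those values are illustrative assumptions only.

```python
def all2all_bytes(num_tokens: int, hidden_size: int,
                  bytes_per_elem: int, scale_bytes_per_token: int = 0) -> int:
    """Payload size of hidden_states sent through all2all, in bytes."""
    return num_tokens * (hidden_size * bytes_per_elem + scale_bytes_per_token)


# Illustrative sizes (assumptions, not taken from the commit).
NUM_TOKENS = 4096
HIDDEN_SIZE = 7168

bf16_bytes = all2all_bytes(NUM_TOKENS, HIDDEN_SIZE, bytes_per_elem=2)
int8_bytes = all2all_bytes(NUM_TOKENS, HIDDEN_SIZE, bytes_per_elem=1,
                           scale_bytes_per_token=4)  # assumed float32 scale
reduction = 1 - int8_bytes / bf16_bytes

if __name__ == "__main__":
    print(f"BF16 payload: {bf16_bytes} bytes")
    print(f"int8 payload: {int8_bytes} bytes")
    print(f"reduction:    {reduction:.2%}")
```

Under these assumptions the per-token scale is the only thing that keeps the saving slightly below one half, which is consistent with the wording of the commit message.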
leo-pony	807f0895b2	Bump torch version to 2.7.1 (#1562 ) ### What this PR does / why we need it? Bump torch version to 2.7.1, and cleanup infer schema patch https://github.com/vllm-project/vllm-ascend/commit/857f489 (https://github.com/vllm-project/vllm-ascend/pull/837), this patch depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974 ### Does this PR introduce any user-facing change? No #### How was this patch tested? CI passed torch-npu 2.7.1rc1 install guide: https://gitee.com/ascend/pytorch/tree/v2.7.1/ install depending: ``` pip3 install pyyaml pip3 install setuptools ``` install torch-npu: Closes: https://github.com/vllm-project/vllm-ascend/issues/1866 Closes: https://github.com/vllm-project/vllm-ascend/issues/1390 - vLLM version: v0.10.0 - vLLM main: `9af654cc38` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-05 08:43:24 +08:00
wangxiyuan	36e450eb0f	[Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097 ) We recently added the disaggregated_prefill and ascend_forward_context features in `ba3dfbd59e` and `df0ec55162`. This PR fixes some nits they introduced to make the code clearer. 1. Drop `current_platform` usage; it can lead to an unknown circular import error in some cases. 2. Update the `set_ascend_forward_context` function to make the logic clearer, for example by removing V0 support from this function. 3. Remove the unused `self.local_rank_across_dp` in the worker. 4. Remove `soc_info.py` and use `get_ascend_soc_version` instead. - vLLM version: v0.10.0 - vLLM main: `02f82fe438` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-05 08:39:02 +08:00
Li Wang	ad366bf908	[Bugfix] Follow vLLM Qwen-Moe/VL and KV Connector change to fix broken CI (#2181 ) ### What this PR does / why we need it? This PR fixes the broken CI: 1. Adapt to `ee2eb6ecd8`, which fused the gate and up projections in the vision MLP; this improves performance by removing one matrix multiplication. To follow it, this PR does the following: - Specify that the two linear layers are fused as `mlp.gate_up_proj` when loading the weights. - Use a SiluAndMul activation function. 2. Adapt to `aefeea0fde` by updating the ModelRunnerOutput parameters to match its changes. 3. Adapt to [vllm-commit](https://github.com/vllm-project/vllm/pull/20815/files#diff-3ffb829a39ab2b3e4706aa28f5e476815f36c3a87b98d6a66514ebedc8f3ffb4R354-R356) to fix Qwen MoE. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `fed5849d3f` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-04 21:37:50 +08:00
CaveNightingale	957c7f108d	[Bugfix][PD] Make multiple Ps and Ds work on a single machine (#2080 ) (cherry picked from commit 816375e0c1071d0696dfab1a1ce35674f9f37aa0) ### What this PR does / why we need it? Suppose that you want to start a prefiller instance on NPUs `2,3` only, so you start the instance with `ASCEND_RT_VISIBLE_DEVICES=2,3`. The current implementation starts two workers, whose ranks are `0` and `1` respectively, and they pick the first and second NPU IP addresses in the ranktable instead of the third and fourth ones. But they are actually using cards `2,3`, and therefore they cannot connect to remote instances when they attempt to transfer the KVCache. Hence, at most 1 prefiller instance and at most 1 decoder instance can work on a single machine, since they always pick the first NPU IP addresses in the ranktable. This pull request fixes the problem by picking from the ranktable only the IPs of those devices that are listed in `ASCEND_RT_VISIBLE_DEVICES`. ### Does this PR introduce _any_ user-facing change? Users who use a ranktable generated by `gen_ranktable.sh` should not see any change. ### How was this patch tested? Qwen-0.6B 1P 1D, dp=2, `ASCEND_RT_VISIBLE_DEVICES=2,3` for prefiller and `ASCEND_RT_VISIBLE_DEVICES=4,5` for decoder. - vLLM version: v0.10.0 - vLLM main: `ad57f23f6a` Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>	2025-08-04 17:22:18 +08:00
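The fix described above amounts to mapping a worker's local rank to a physical device ID through `ASCEND_RT_VISIBLE_DEVICES` before looking up the IP in the ranktable. A minimal sketch of that lookup follows. The ranktable layout used here (a list of dicts with `device_id` and `device_ip` keys) and both function names are assumptions for illustration, not the actual vllm-ascend code.

```python
from typing import Dict, List, Optional


def visible_device_ids(env_value: Optional[str], total_devices: int) -> List[int]:
    """Physical device IDs visible to this process, in local-rank order."""
    if not env_value:
        # Variable unset: every device is visible and local rank == device ID.
        return list(range(total_devices))
    return [int(tok) for tok in env_value.split(",") if tok.strip()]


def device_ip_for_local_rank(ranktable_devices: List[Dict[str, str]],
                             env_value: Optional[str],
                             local_rank: int) -> str:
    """Pick the ranktable IP of the physical device behind a local rank."""
    ip_by_id = {int(d["device_id"]): d["device_ip"] for d in ranktable_devices}
    physical_id = visible_device_ids(env_value, len(ranktable_devices))[local_rank]
    return ip_by_id[physical_id]


if __name__ == "__main__":
    # Hypothetical 8-NPU ranktable fragment.
    table = [{"device_id": str(i), "device_ip": f"10.0.0.{i}"} for i in range(8)]
    # Prefiller started with ASCEND_RT_VISIBLE_DEVICES=2,3
    print(device_ip_for_local_rank(table, "2,3", 0))
    print(device_ip_for_local_rank(table, "2,3", 1))
```

In a real worker the `env_value` argument would come from `os.environ.get("ASCEND_RT_VISIBLE_DEVICES")`. It is passed explicitly here so the lookup can be exercised without touching the environment.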
YuanCheng-coder	ddaded1537	Add ut for envs.py (#2131 ) What this PR does / why we need it? Test vllm_ascend/envs.py, which contains the environment variable definitions. Does this PR introduce any user-facing change? N/A How was this patch tested? CI passed with the newly added test. vLLM version: v0.10.0 vLLM main: `9532a6d563` - vLLM version: v0.10.0 - vLLM main: `b4e081cb15` --------- Signed-off-by: chengyuan <chengyuan27@huawei.com> Co-authored-by: chengyuan <chengyuan27@huawei.com>	2025-08-02 16:53:44 +08:00
weijinqian0	6e00aed4d5	[main][Feature]Moe alltoallv communication optimization for unquantized RL training scenario (#2088 ) Ported from 0.9.1dev: [0.9.1][Feature]Moe alltoallv communication optimization for unquantized RL training scenario & alltoallv support dpo (#1547) - vLLM version: v0.10.0 - vLLM main: `97608dc276` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: curryliu <120010041@link.cuhk.edu.cn> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com> Signed-off-by: taoxudonghaha <justsheldon@163.com> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: curryliu <99582471+Irving11-BKN@users.noreply.github.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: TaoYu Chen <ctynb@qq.com> Co-authored-by: taoxudonghaha <justsheldon@163.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-08-02 09:49:10 +08:00
22dimensions	8cf97d8310	[Misc] Add extra checking to torchair_graph_config. (#1939 ) ### What this PR does / why we need it? cherry-pick #1675 to main This PR adds validation checking to torchair_graph_config for better reliability. Co-authored-by: whx-sjtu <2952154980@qq.com> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:24:11 +08:00
huangxialu	9c9a7cd90b	[main] adapt usage of npu_moe_gating_top_k_softmax and remove envs.SELECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112 ) backport of v0.9.1-dev: https://github.com/vllm-project/vllm-ascend/pull/1902 origin main npu_moe_gating_top_k_softmax: https://github.com/vllm-project/vllm-ascend/pull/1355 - vLLM version: v0.10.0 - vLLM main: `055bd3978e` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-31 21:05:56 +08:00
Ronald1995	e8660d7978	ut:add ut for qwen2_5_vl (#2143 ) ### What this PR does / why we need it? add ut for qwen2_5_vl ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-31 20:46:17 +08:00
daniel	db310c6ec9	add ut for device allocator/camem and multistream/layers (#2037 ) What this PR does / why we need it? Test device allocator/camem and multistream/layers, which contain resource allocation and stream ops. Does this PR introduce any user-facing change? N/A How was this patch tested? CI passed with the newly added test. - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 1024daniel <xxltju324@gmail.com>	2025-07-31 19:17:27 +08:00
CaranLic	7c90ba5fe8	[Test] add ut for decorator.py/deepseek_mtp.py (#2127 ) ### What this PR does / why we need it? add ut for decorator.py/deepseek_mtp.py ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new tests - vLLM version: v0.10.0 - vLLM main: `055bd3978e` --------- Signed-off-by: CaranLic <740821011@qq.com>	2025-07-31 15:21:15 +08:00
Joey Gao	6192bc95c0	[Bugfix] fix tensor not same device in qwen2_5_vl_without_padding (#2051 ) bugfix cherry-pick from v0.9.1-dev https://github.com/vllm-project/vllm-ascend/pull/2007 ### What this PR does / why we need it? Minimal reproducing code: ```python # test.py from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240) outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` ```bash export USE_OPTIMIZED_MODEL=0 python test.py ``` The exception is as follows: ``` [rank0]: File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward [rank0]: q = torch_npu.npu_rotary_mul(q, cos, sin) [rank0]: File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__ [rank0]: return self._op(args, (kwargs or {})) [rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul) ``` In `AscendQwen2_5_VisionAttention_Without_Padding`, the call `torch_npu.npu_rotary_mul(q, cos, sin)` receives `cos`/`sin` on the CPU while `q` is on the NPU, which raises the error. `qwen2_5_vl_without_padding.py` needs this bugfix because `AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in qwen2_5_vl_without_padding.py comes from vLLM, where `inv_freq` is created on the CPU. `40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)` ```python inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim)) ``` `qwen2_5_vl.py` does not need it, because `AscendQwen2_5_VisionRotaryEmbedding` in qwen2_5_vl.py redefines the rotary embedding so that `inv_freq` is created on the device. 
```python inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim)) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: pjgao <gaopengju3@huawei.com> Co-authored-by: pjgao <gaopengju3@huawei.com>	2025-07-31 15:18:54 +08:00
Ronald1995	3386e09a40	ut:add ut for qwen2_vl.py (#2096 ) ### What this PR does / why we need it? add ut for qwen2_vl.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `555e7225bc` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-30 22:31:47 +08:00
Ruri	4fcca137a7	[main][Feature] Support Qwen3 W4A8 quantization (#2060 ) ### What this PR does / why we need it? Adding `W4A8_DYNAMIC` quantization support for linear. Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC` to test qwen3 w4a8_dynamic quantized model Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim` of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409` 1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim` ```shell git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409 bash install.sh ``` 2. Serve model using `vllm` ```shell VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \ --model vllm-ascend/Qwen3-8B-W4A8 \ --port 8000 \ --quantization ascend \ --tensor_parallel_size 2 \ --enforce-eager ``` - vLLM version: v0.10.0 - vLLM main: `4cd7fe6cea` --------- Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>	2025-07-30 14:57:14 +08:00
YuanCheng-coder	34dd24adf2	add ut for vocab_parallel_embedding (#2067 ) ### What this PR does / why we need it? test vllm_ascend/ops/vocab_parallel_embedding.py contains vocab parallel embedding forward CI passed with new added test. vLLM version: v0.10.0 vLLM main: `2cc571199b` - vLLM version: v0.10.0 - vLLM main: `05cbbe20c5` Signed-off-by: chengyuan <chengyuan27@huawei.com> Co-authored-by: chengyuan <chengyuan27@huawei.com>	2025-07-30 14:35:45 +08:00
wangxiyuan	9b67c87b14	[Refactor]Refactor sampler (#2050 ) Refactor Sampler implementation from patch way to inherit from vLLM Sampler interface. Next step: Make the op `TopKTopPSampler` in vLLM support custom ops register mechanism - vLLM version: v0.10.0 - vLLM main: `61a6905ab0` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-30 08:47:22 +08:00
whx	98cadc2146	[Perf] Avoid performing index selection of sin/cos cache every layer (#1890 ) Optimize number of index selections of sin/cos cache. - vLLM version: v0.10.0 - vLLM main: `656c24f1b5` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-07-29 18:06:45 +08:00
wangxiyuan	0190b68f51	[Misc]Remove PD v0 code (#2047 ) Cleanup V0 disaggregated prefill code for V0 Engine. part of https://github.com/vllm-project/vllm-ascend/issues/1620 TODO: enable v1 e2e test. - vLLM version: v0.10.0 - vLLM main: `2cc571199b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-28 19:09:22 +08:00
huangxialu	1a25b0a2dd	[Test] add ut for qwen3_moe.py (#2055 ) ### What this PR does / why we need it? Add ut for qwen3_moe.py ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `18cc33dd60` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-07-28 17:37:13 +08:00
LeeWenquan	3ad582c9a9	[Test] Add ut for files in /attention (#1944 ) ### What this PR does / why we need it? Add ut for files in folder /attention ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `139a7f07bd` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-28 15:54:40 +08:00
Ronald1995	32a9c5f694	[Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926 ) ### What this PR does / why we need it? vLLM's RowParallelLinear forward function executes matmul and allreduce separately. This PR uses torch_npu.npu_mm_all_reduce_base to execute allreduce and matmul in a single fused kernel, which yields a 20% performance gain in eager mode. ### Does this PR introduce _any_ user-facing change? This PR introduces a new environment variable, `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE`, that controls whether the feature is enabled. ### How was this patch tested? The patch is covered by a new unit test file, `test_patch_linear.py`. - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-28 15:13:37 +08:00
zzzzwwjj	ba3dfbd59e	[main][refactor] Refactoring forward_context and model_runner_v1 (#1979 ) ### What this PR does / why we need it? A refactoring of forward_context and model_runner_v1, add some context which is necessary in model inference into forward_context, and refactor dummy_run logic, make it more reasonable. Some details for this PR: Add `ascend_forward_context`; Update mc2_v2 op, and support `active_mask` param; Update scripts in examples dir; refactor `dummy_run` logic; Add soc_version for A2 and A3; ### Does this PR introduce _any_ user-facing change? No change at user-facing. ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `57c22e57f9` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-07-28 14:06:20 +08:00
zhangxinyuehfad	d1c640841b	[Bugfix] Fix num_hidden_layers when Qwen2-Audio 7B (#1803 ) ### What this PR does / why we need it? Fix num_hidden_layers when Qwen2-Audio 7B and #1760 ： ``` INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode Traceback (most recent call last): File "/workspace/test1.py", line 58, in <module> main(audio_count) File "/workspace/test1.py", line 38, in main llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct", File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__ self.llm_engine = LLMEngine.from_engine_args( File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args vllm_config = engine_args.create_engine_config(usage_context) File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config config = VllmConfig( File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__ s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s) File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__ current_platform.check_and_update_config(self) File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config update_aclgraph_sizes(vllm_config) File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__ return super().__getattribute__(key) AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers' ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? 
Closes: https://github.com/vllm-project/vllm-ascend/issues/1780 https://github.com/vllm-project/vllm-ascend/issues/1760 https://github.com/vllm-project/vllm-ascend/issues/1276 https://github.com/vllm-project/vllm-ascend/issues/359 - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-07-26 20:13:00 +08:00
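The traceback above fails because `Qwen2AudioConfig` is a composite config with no top-level `num_hidden_layers`. One plausible way to make such a lookup robust is sketched below: fall back to a nested `text_config` when the attribute is missing. This is not necessarily how the PR fixed it; the helper name `get_num_hidden_layers` is hypothetical, and `SimpleNamespace` objects stand in for real Hugging Face configs.

```python
from types import SimpleNamespace
from typing import Any


def get_num_hidden_layers(hf_config: Any) -> int:
    """Read num_hidden_layers, falling back to a nested text_config."""
    if hasattr(hf_config, "num_hidden_layers"):
        return hf_config.num_hidden_layers
    # Composite (multimodal) configs may keep the language model settings
    # in a nested text_config object instead of at the top level.
    text_config = getattr(hf_config, "text_config", None)
    if text_config is not None and hasattr(text_config, "num_hidden_layers"):
        return text_config.num_hidden_layers
    raise AttributeError(
        f"{type(hf_config).__name__} has no num_hidden_layers, "
        "directly or via text_config")


if __name__ == "__main__":
    flat = SimpleNamespace(num_hidden_layers=28)
    nested = SimpleNamespace(text_config=SimpleNamespace(num_hidden_layers=32))
    print(get_num_hidden_layers(flat), get_num_hidden_layers(nested))
```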
Pleaplusone	df0ec55162	Disaggregate prefill for kv cache register style (#950 ) ### What this PR does / why we need it? This PR adopts `LLMDataDist` for KV cache registration and a `pull_blocks`-style disaggregated prefill implementation. The interface implementation mainly follows the design of NIXL PR https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953 . This PR can be tested with the following steps: - Generate the rank table for all machines. - Execute `toy_proxy.py` to launch the disaggregated prefill proxy server, specifying the prefill IP and port and the decode IP and port. - Run the prefill server and the decode server. - Send the request to the disaggregated prefill proxy. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Signed-off-by: liziyu179 <3475441767@qq.com> Signed-off-by: underfitc <hucong24@huawei.com> Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com> Co-authored-by: liziyu179 <3475441767@qq.com> Co-authored-by: underfitc <hucong24@huawei.com> Co-authored-by: zouyida2052 <zouyida@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com> Co-authored-by: underfituu <hzhucong@163.com>	2025-07-26 17:15:47 +08:00
Yikun Jiang	17a430f7b8	Upgrade vLLM to v0.10.0 (#1927 ) ### What this PR does / why we need it? - Upgrade to v0.10.0 - Drop v0.9.2 version compatibility - Add patch for `vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py` as workaround of `f3a683b7c9` for v0.10.0 and also add e2e test `test_models_prompt_logprobs` - Pin transformers<4.54.0 as workaround of https://github.com/vllm-project/vllm-ascend/issues/2034 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Test locally: `VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs` - CI passed - vLLM version: v0.9.2 - vLLM main: `7728dd77bb` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-26 15:43:29 +08:00
Ronald1995	e561a2c6ec	ut:add ut for qwen2_5_vl_without_padding.py (#1988 ) ### What this PR does / why we need it? this pr is to add ut for qwen2_5_vl_without_padding.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? this is only a ut test - vLLM version: v0.9.2 - vLLM main: `9c8b2c2a8a` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-07-25 14:12:44 +08:00
SunnyLee151064	ae560f7131	[Test] Add uts for files in /core (#1957 ) ### What this PR does / why we need it? Add uts for files in folder /core ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `5a19a6c670` --------- Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-25 09:48:19 +08:00
SunnyLee151064	ab7d5aca5d	[Test] Add ut for files in /multistream (#1947 ) ### What this PR does / why we need it? Add some uts for files in folder /multistream ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `b77c7d327f` Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-24 10:42:49 +08:00
SunnyLee151064	34571ea5ae	[Test] Add ut for files in /distributed (#1951 ) ### What this PR does / why we need it? Add some ut for files in folder /distributed ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: lwq <liwenquan5@huawei.com> Co-authored-by: lwq <liwenquan5@huawei.com>	2025-07-24 10:36:11 +08:00
Zac	2ffe051859	[Test]add ut for deepseek_v2. (#1964 ) What this PR does / why we need it? Add uts for deepseek_v2 Does this PR introduce any user-facing change? No How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `f3137cdd81` --------- Signed-off-by: 张帮政 <zhangbangzheng@huawei.com>	2025-07-24 10:27:50 +08:00
wangxiyuan	846555cdb5	[Misc] Clean up useless code in attention (#1933 ) Before refactoring the attention module, we can clean up some code to make the next step easier. What this PR does: 1. Remove the useless `common_prefix_len` for the attention builder. 2. Remove the useless `is_only_prefill` and `num_input_tokens` in attention metadata. 3. Remove `CommonAttentionMetadata` and use `query_start_loc` instead; `CommonAttentionMetadata` is over-designed and useless. 4. Update the attention backend input parameters to keep them the same as vLLM. 5. Rename attention names to the same style with the `ASCEND` prefix. - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-24 10:23:34 +08:00
shiyuan680	ac0bf133f4	add ut of fused_moe.py (#1930 ) ### What this PR does / why we need it? add unit test for fused_moe.py - vLLM version: v0.9.2 - vLLM main: `2dec7c1a5d` Signed-off-by: yangcheng <yangcheng104@huawei.com> Co-authored-by: yangcheng <yangcheng104@huawei.com>	2025-07-23 16:24:09 +08:00
weichen	ac773aca43	Add UT for Patches (#1766 ) ### What this PR does / why we need it? Add UT for patches in vLLM Ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Irrelevant - vLLM version: v0.9.2 - vLLM main: `107111a859` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-07-23 16:07:20 +08:00
Mengqing Cao	3aa3b46bfe	[V1][PP] Support pp with ray backend in V1 (#1800 ) ### What this PR does / why we need it? Support pipeline parallel with ray backend in V1Engine. Fixes #1751 ### Does this PR introduce _any_ user-facing change? Users could specify ray as distributed backend when inferencing with pp ### How was this patch tested? CI passed with new added test. - vLLM version: v0.9.2 - vLLM main: `32142b3c62` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-23 14:52:52 +08:00
JohnJan	ce4970eee0	[Test] Add unit test for schedule_config.py (#1590 ) What this PR does / why we need it? According to issue https://github.com/vllm-project/vllm-ascend/issues/1298 , this pull request adds unit test code for schedule_config.py. Does this PR introduce any user-facing change? No How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2`	2025-07-22 11:43:25 +08:00
Yikun Jiang	5f0b42e414	[FOLLOWUP] Use base test to avoid patch everywhere (#1634 ) ### What this PR does / why we need it? Use base test to avoid patch everywhere. Followup here: https://github.com/vllm-project/vllm-ascend/pull/1566 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut ci passed - vLLM version: v0.9.2 - vLLM main: `8d0a01a5f2` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-22 09:03:40 +08:00
wangxiyuan	7265dc090d	[2/4][Refactor] Refactor torchair utils (#1892 ) There is a lot of torchair-specific logic in common code, which makes the code hard to maintain. We will create a new torchair module to hold torchair-related logic. I plan to add 4 PRs. 1. Refactor worker 2. Refactor utils (this PR) - a simple change that moves all torchair-related util functions to the torchair module 3. Refactor model_runner 4. Refactor attention - vLLM version: v0.9.2 - vLLM main: `8188196a1c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-21 19:43:30 +08:00
wangxiyuan	af56ae3ed1	[1/4][Refactor] Refactor torchair worker (#1885 ) There is a lot of torchair-specific logic in common code, which makes the code hard to maintain. We will create a new torchair module to hold torchair-related logic. I plan to add 4 PRs. 1. Refactor worker (this PR) - create the torchair module and move torchair-related code in worker to the new module 2. Refactor utils 3. Refactor model_runner 4. Refactor attention - vLLM version: v0.9.2 - vLLM main: `8188196a1c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-21 11:50:46 +08:00
Mengqing Cao	8cfd257992	[Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681 ) ### What this PR does / why we need it? Remove ETP/EP maintained in branch main. We drop this as there are no relevant scenarios that use ETP now, and we may subsequently advocate implementing expert tensor parallelism in vLLM to support scenarios where experts need to be sliced. This is a part of #1422 backport. Fixes https://github.com/vllm-project/vllm-ascend/issues/1396 https://github.com/vllm-project/vllm-ascend/issues/1154 ### Does this PR introduce _any_ user-facing change? We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in vllm instead. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.9.2 - vLLM main: `fe8a2c544a` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-21 09:08:04 +08:00
wangxiyuan	a8b316ac5b	[CI] Make AttentionBackend interface compatible to fix broken CI (#1893 ) vLLM commit `752c6ade2e` removed `blocksparse_params` for attention backend. This PR does the same change to make CI happy. - vLLM version: v0.9.2 - vLLM main: `9499e26e2a` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-21 08:21:06 +08:00
lianyibo	53d2ea3789	[Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 (#1811 ) ### What this PR does / why we need it? maybe fixes [#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test Qwen3-32B tp=4 with: ```bash vllm serve --port 1234 Qwen/Qwen3-32B \ --served-model-name Qwen3-32B \ --tensor-parallel-size 4 \ --swap-space 16 \ --max-model-len 6000 \ --load-format dummy \ --disable-log-stats \ --disable-log-requests \ ``` Request batch_size=128 input/output token=1024 In 0.9.2rc1 ```text ===================================================== Total TPS with prefill(tokens/s) : 785.1395 Total TPS without prefill : 846.6809 Mean TPS with prefill : 6.1339 Mean TPS without prefill : 6.6147 ===================================================== Mean TTFT(ms) : 10307.8123 Max TTFT(ms) : 21423.0733 Min TTFT(ms) : 362.3602 ===================================================== Mean TPOT(ms) : 151.3051 Max TPOT(ms) : 159.4649 Min TPOT(ms) : 140.899 ===================================================== Total Time(s) : 175.6032 Request Throughput(requests/s) : 0.7289 ===================================================== ``` Apply this PR ```text ===================================================== Total TPS with prefill(tokens/s) : 811.0014 Total TPS without prefill : 876.4423 Mean TPS with prefill : 6.3359 Mean TPS without prefill : 6.8472 ===================================================== Mean TTFT(ms) : 10263.8382 Max TTFT(ms) : 21151.2547 Min TTFT(ms) : 375.9136 ===================================================== Mean TPOT(ms) : 146.1686 Max TPOT(ms) : 154.0957 Min TPOT(ms) : 136.8879 ===================================================== Total Time(s) : 169.8579 Request Throughput(requests/s) : 0.7536 ===================================================== ``` The TPOT performance gap between these two sets of data is about 3%. 
- vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` Signed-off-by: lianyibo <lianyibo1@kunlunit.com>	2025-07-18 23:09:54 +08:00
Mengqing Cao	574fe407eb	[1/N][CustomOp] Register activation customop instead of overwrite forward_oot (#1841 ) ### What this PR does / why we need it? We'll refactor `CustomOp` in vllm-ascend from this PR on. Use the function `CustomOp.register_oot` to register the custom op, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adaptation to the `CustomOp.register_oot` mechanism from vllm 0.9.2. As a further step, we can remove the inheritance from `QuickGELU` and write our own `QuickGELU` entirely. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-18 23:07:14 +08:00
xudongLi-cmss	33ef5dc813	add unit test for func wrapper (#1863 ) ### What this PR does / why we need it? test func wrapper file ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added test. - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>	2025-07-18 11:05:17 +08:00
Shanshan Shen	f96100fad5	[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805 ) ### What this PR does / why we need it? Remove V0 related codes of test, example, platform. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `235bfd5dfe` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-15 19:58:55 +08:00

73 Commits