xc-llm-ascend

Author	SHA1	Message	Date
zhangxinyuehfad	fae1c59a79	[Fix] Refactor and fix dist test to e2e full test (#3808 ) ### What this PR does / why we need it? Fix ci test on A3 1. delete lables 2. fix filter yaml file name 3. refactor dist test to e2e full test 4. skip test_models_distributed_Qwen3_MOE_TP2_WITH_EP & test_models_distributed_Qwen3_MOE_W8A8_WITH_EP because of https://github.com/vllm-project/vllm-ascend/issues/3895 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-11 10:36:05 +08:00
zhaomingyu13	7ffbe73d54	[main][Bugfix] Fix ngram precision issue and open e2e ngram test (#4090 ) ### What this PR does / why we need it? Fix ngram precision issue and open e2e ngram test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: Icey <1790571317@qq.com>	2025-11-11 09:06:24 +08:00
Levi	0a62e671fb	[Feat] flashcomm_v2 optim solution (#3232 ) ### What this PR does / why we need it? Supports generalized FlashComm2 optimization, which reduces communication overhead, decreases RmsNorm computation, and saves one AllGather step by replacing Allreduce operations in the Attention module with pre-AlltoAll and post-AllGather operations (used in combination with FlashComm1). This feature is enabled during the Prefill phase and is recommended to be used together with FlashComm1, delivering broad performance improvements, especially in long sequence scenarios with large tensor parallelism (TP) configurations. Benchmark tests show that under TP16DP1 configuration, it can improve the prefill performance of the DeepSeek model by 8% on top of FlashComm1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: zzhxx <2783294813@qq.com> Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: zzhxx <2783294813@qq.com>	2025-11-10 11:01:45 +08:00
wangx700	24d6314718	[Bugfix] fix sleepmode level2 e2e test (#4019 ) ### What this PR does / why we need it? enable sleepmode level2 e2e test and add the check logic to ensure the nz is not enabled. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? use e2e tests - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangx700 <wangxin700@huawei.com>	2025-11-08 14:11:55 +08:00
CodeCat	49d74785c4	[Test] Add new e2e test use deepseek-v2-lite in ge graph mode (#3937 ) ### What this PR does / why we need it? The current test cases lack end-to-end (e2e) testing for the deepseek-v2-lite network in ge graph mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>	2025-11-03 20:10:01 +08:00
Canlin Guo	f99762eb25	[E2E][MM] Add e2e tests for InternVL model (#3796 ) ### What this PR does / why we need it? As a validation for #3664, add end-to-end tests to monitor the InternVL model and ensure its continuous proper operation. This PR is only for single-card. So the models that have more parameters than 8B like 78B are needed to test using multi-cards. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? `pytest -sv tests/e2e/singlecard/multi-modal/test_internvl.py` - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-10-31 15:42:47 +08:00
lilinsiman	35a913cf1e	add new e2e tests case for aclgraph memory (#3879 ) ### What this PR does / why we need it? add new e2e tests case for aclgraph memory ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-31 09:16:52 +08:00
Song Zhixin	216fc0e8e4	[feature] Prompt Embeddings Support for v1 Engine (#3026 ) ### What this PR does / why we need it? this PR based on [19746](https://github.com/vllm-project/vllm/issues/19746), support Prompt Embeddings for v1 engine on NPU ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ```python python examples/prompt_embed_inference.py ``` - vLLM version: v0.11.0 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 --------- Signed-off-by: jesse <szxfml@gmail.com>	2025-10-30 17:15:57 +08:00
Icey	a7450db1bd	Upgrade to 0.11.1 newest vllm commit (#3762 ) ### What this PR does / why we need it? `c9461e05a4` Fix ```spec decode rejection sampler```, caused by https://github.com/vllm-project/vllm/pull/26060 Fix some ```import```, caused by https://github.com/vllm-project/vllm/pull/27374 Fix ```scheduler_config.send_delta_data```, caused by https://github.com/vllm-project/vllm-ascend/pull/3719 Fix ```init_with_cudagraph_sizes```, caused by https://github.com/vllm-project/vllm/pull/26016 Fix ```vl model```of replacing PatchEmbed's conv3d to linear layer, caused by https://github.com/vllm-project/vllm/pull/27418 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-10-28 14:55:03 +08:00
wangxiyuan	07c8d4547c	[CI] Skip ops test for e2e (#3665 ) ### What this PR does / why we need it? Skip ops test for e2e and will move it to nightly test in the following pr - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-25 09:37:30 +08:00
offline893	9b0baa1182	[BugFix] Check all expert maps when using muilty instance. (#3576 ) ### What this PR does / why we need it? Check all expert maps when using muilty instance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Qwen 235B in double A3. case1：master has expert map, slave has not expert map. case2: master has expert map, slave has error expert map. case3: master has expert map,slave has correct expert map. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-24 17:10:14 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is the step 1 of refactoring code to adapt with vllm main, and this pr aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. bunches of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import,caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` improt, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` improt, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured ouput break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
Anion	5f8b1699ae	[Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers (#3311 ) ### What this PR does / why we need it? Problem Description: The existing implementation for the w4a8-dynamic linear method only supports the old quantization format from msmodelslim. When attempting to load models quantized with the new version, vLLM encounters errors due to mismatched tensor shapes and unprocessed quantization parameters. Relavant issues: - https://github.com/vllm-project/vllm-ascend/issues/3192 - https://github.com/vllm-project/vllm-ascend/issues/3152 Proposed Changes: 1. Add support for w4a8 dynamic(new format) in AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod 2. Add unit tests and e2e tests for w4a8 dynamic new and old format models <details> <summary><b>details</b></summary> 1. Support for new w4a8-dynamic format: * Detects quantization format by reading the "version" field in quant_description to ensure backward compatibility. * Handles the new pre-packed weight format (`2x int4` in an `int8`), which has a halved dimension. It tells the vLLM loader how to unpack it using `_packed_dim` and `_packed_factor`. * Supports the new `scale_bias` parameter, setting its shape based on the layer type, as required by msmodelslim. For api consistency and future use, the `layer_type` parameter was also added to other quantization methods. * Updates the weight processing logic: new format weights are handled with `.view(torch.int32)` since they're pre-packed, while old ones are processed with `npu_convert_weight_to_int4pack`. 2. New unit and E2E tests: * Added unit tests that verify the logic for both the old and new formats. * Split the distributed E2E test to confirm that both old and new format models work correctly. </details> Theoretically, these changes will provide support for all common new version w4a8(dynamic) models from msmodelslim. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? I implement relevant unit tests and e2e tests and test the changes with following commands: ```bash # unit tests python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v # e2e tests pytest tests/e2e/singlecard/test_quantization.py -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s ``` I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic format: ``` vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 ``` All tests mentioned passed locally. NOTE: I use quantization model from my own repo in test_offline_inference_distributed.py. Here is the description: [Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary) (including quantization steps).This should be replaced by a model in vllm-ascend ci modelscope repo. Thanks for reading! - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Anionex <1005128408@qq.com>	2025-10-21 20:18:39 +08:00
lilinsiman	70bef33f13	add new accuracy test case for aclgraph (#3390 ) ### What this PR does / why we need it? Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-20 20:04:04 +08:00
xuyexiong	02c26dcfc7	[Feat] Supports Aclgraph for bge-m3 (#3171 ) ### What this PR does / why we need it? [Feat] Supports Aclgraph for bge-m3 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` pytest -s tests/e2e/singlecard/test_embedding.py pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py ``` to start an online server with bs 10, each batch's seq length=8192, we set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked: ``` vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}' ``` For bs10, each batch's seq length=8192, QPS is improved from 85 to 104, which is a 22% improvement, lots of host bound is reduced. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com> Co-authored-by: wangyongjun <1104133197@qq.com>	2025-10-14 23:07:45 +08:00
leo-pony	0d59a3c317	[CI] Make the test_pipeline_parallel run normally in full test (#3391 ) ### What this PR does / why we need it? Make the test_pipeline_parallel take effect in full test of CI. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-10-12 15:43:13 +08:00
huangxialu	e8c871ed0a	[Test] enable external launcher and add e2e test for sleep mode in level2 (#3344 ) ### What this PR does / why we need it? 1. Enable tests/e2e/multicard/test_external_launcher.py 2. Add e2e test for sleep mode in level2 ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: huangxialu <huangxialu1@huawei.com> Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>	2025-10-11 17:29:38 +08:00
wangxiyuan	ba19dd3183	Revert PTA upgrade PR (#3352 ) we notice that torch npu 0919 doesn't work. This PR revert related change which rely on 0919 version. Revert PR: #3295 #3205 #3102 Related: #3353 - vLLM version: v0.11.0	2025-10-10 14:09:53 +08:00
XiaoxinWang	579b7e5f21	add pagedattention to support FULL_DECODE_ONLY. (#3102 ) ### What this PR does / why we need it? Calculate in advance the workspace memory size needed for the PagedAttention operator to avoid deadlocks during resource cleanup. This PR requires torch_npu version 0920 or newer. ### How was this patch tested? - vLLM version: v0.11.0 --------- Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-10-10 08:50:33 +08:00
wangxiyuan	c73dd8fecb	[CI] Fix CI by addressing max_split_size_mb config (#3258 ) ### What this PR does / why we need it? Fix CI by addressing max_split_size_mb config ### Does this PR introduce _any_ user-facing change? No, test onyl ### How was this patch tested? Full CI passed, espcially eagle one - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-29 14:05:12 +08:00
yupeng	9caf6fbaf5	[Bugfix][LoRA] Fix LoRA bug after supporting Qwen3-Next (#3044 ) ### What this PR does / why we need it? LoRA e2e test uses ilama-3.2-1B model. It uses transformers.py model files. Its self-attention layer names end with "\.attn", not "\.self_attn". There are some other model attention layer names end with "*.attn", such as baichuan.py, bert.py. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_ilama_lora.py pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py - vLLM version: v0.10.2 - vLLM main: `17b4c6685c` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-09-26 11:12:45 +08:00
Yikun Jiang	693f547ccf	Refactor ci to reuse base workflow and re-enable ut coverage (#3064 ) ### What this PR does / why we need it? 1. Refactor ci to reuse base workflow and enable main 2 hours trigger job: - Extract e2e test in to _e2e_test.yaml - Reuse _e2e_test in light / full job - Enable main 2 hours trigger job 2. Rename e2e test to ascend test to make sure action display label 3. Re-enable ut coverage which was failed since `5bcb4c1528` and disable on `6d8bc38c7b` ### Does this PR introduce _any_ user-facing change? Only developer behavior changes: - Every job trigger full test with vllm release and hash - Run full job per 2 hours with vllm main - e2e light test (30 mins): `lint` (6mins) ---> ut (10mins) ---> `v0.10.2 + main / 4 jobs` (15mins) - e2e full test (1.5h): `ready label` ---> `v0.10.2 + main / 4 jobs`, about 1.5h - schedule test: 2hours ---> `v0.10.2 + main / 4 jobs`, about 1.5h ### How was this patch tested? CI passed - vLLM version: v0.10.2 - vLLM main: `c60e6137f0` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-21 13:27:08 +08:00

22 Commits