xc-llm-ascend

Author	SHA1	Message	Date
wangxiyuan	3b59d0ebe9	[Doc][Feature] Add vLLM Ascend development guidelines AGENTS.md (#6797) ### What this PR does / why we need it? This PR adds a new document, `AGENTS.md`, which provides detailed development guidelines for contributors to the vLLM Ascend project. These guidelines cover code style, testing, NPU-specific considerations, and the contribution process to ensure code quality and consistency. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update for developers. ### How was this patch tested? This is a documentation change and does not require testing. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-26 08:47:46 +08:00
SILONG ZENG	e2237819a9	[CI]Fixed the spell check function in `typos.toml` (#6753 ) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-14 11:57:26 +08:00
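The difference between a character class and an alternation group, which is the root of the fix above, can be sketched with Python's `re` module. This is only an illustration: the `.*` wrappers, the whole-identifier `fullmatch` semantics, the helper `is_ignored`, and the sample identifiers are assumptions made for the example, not taken from `typos.toml` or from how `typos` evaluates its patterns.

```python
import re

# Character class: matches ONE character out of U, E, 4, M, 3, |, u, e, m.
CHAR_CLASS = re.compile(r".*[UE4M3|ue4m3].*")
# Alternation group: matches the literal token "UE4M3" or "ue4m3".
GROUP = re.compile(r".*(UE4M3|ue4m3).*")


def is_ignored(pattern: re.Pattern, identifier: str) -> bool:
    """Return True if the whole identifier matches the ignore pattern."""
    return pattern.fullmatch(identifier) is not None


# Hypothetical identifiers, including a genuine misspelling ("recieve").
for word in ["recieve", "ue4m3_scale", "FLOAT_UE4M3", "block"]:
    print(word, is_ignored(CHAR_CLASS, word), is_ignored(GROUP, word))
```

With the character class, any identifier containing a single `e`, `u`, or `m` is skipped, so a real typo such as `recieve` would never be reported; the group form only skips identifiers that contain the full token.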
Li Wang	e35f304419	[CI] Auto partition for test cases (#6379) ### What this PR does / why we need it? This patch adds an auto-partition feature for tests. Before this PR, the e2e single-card test ran for 2h40min; with auto-partitioning, test cases are automatically allocated into the required n parts based on their estimated duration (greedy strategy) and run in parallel. As a result, the overall test duration drops to roughly 1/n of the original. ### Does this PR introduce _any_ user-facing change? Before: the e2e single-card test took 2h40min After: the e2e single-card test takes 1h13min ### How was this patch tested? ```shell python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 0 args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=0, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600) +----------------+--------------------+ \| Suite \| Partition \| \|----------------+--------------------\| \| e2e-singlecard \| 1/2 (0-based id=0) \| +----------------+--------------------+ ✅ Enabled 13 test(s) (est total 4020.0s): - tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py (est_time=1800) - tests/e2e/singlecard/test_aclgraph_accuracy.py (est_time=480) - tests/e2e/singlecard/test_guided_decoding.py (est_time=354) - tests/e2e/singlecard/test_batch_invariant.py (est_time=320) - tests/e2e/singlecard/pooling/test_embedding.py (est_time=270) - tests/e2e/singlecard/test_quantization.py (est_time=200) - tests/e2e/singlecard/test_llama32_lora.py (est_time=162) - tests/e2e/singlecard/test_cpu_offloading.py (est_time=132) - tests/e2e/singlecard/pooling/test_classification.py (est_time=120) - tests/e2e/singlecard/test_camem.py (est_time=77) - tests/e2e/singlecard/compile/test_norm_quant_fusion.py (est_time=70) - tests/e2e/singlecard/test_auto_fit_max_mode_len.py (est_time=25) - 
tests/e2e/singlecard/test_profile_execute_duration.py (est_time=10) (base) wangli@Mac-mini vllm-ascend % python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 1 args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=1, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600) +----------------+--------------------+ \| Suite \| Partition \| \|----------------+--------------------\| \| e2e-singlecard \| 2/2 (0-based id=1) \| +----------------+--------------------+ ✅ Enabled 13 test(s) (est total 4025.0s): - tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py (est_time=1500) - tests/e2e/singlecard/pooling/test_scoring.py (est_time=500) - tests/e2e/singlecard/test_aclgraph_batch_invariant.py (est_time=410) - tests/e2e/singlecard/test_vlm.py (est_time=354) - tests/e2e/singlecard/test_models.py (est_time=300) - tests/e2e/singlecard/test_multistream_overlap_shared_expert.py (est_time=200) - tests/e2e/singlecard/test_sampler.py (est_time=200) - tests/e2e/singlecard/test_async_scheduling.py (est_time=150) - tests/e2e/singlecard/test_aclgraph_mem.py (est_time=130) - tests/e2e/singlecard/test_ilama_lora.py (est_time=95) - tests/e2e/singlecard/test_completion_with_prompt_embeds.py (est_time=76) - tests/e2e/singlecard/test_qwen3_multi_loras.py (est_time=65) - tests/e2e/singlecard/test_xlite.py (est_time=45) ``` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-29 20:28:10 +08:00
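The greedy allocation described in this commit can be sketched as follows. This is a minimal illustration of one common greedy strategy (longest estimated time first, always into the currently lightest partition); it is an assumption about the general idea, not the actual logic of `run_suite.py`. The function name `partition_by_duration` and the shortened test names are hypothetical, while the estimated times are taken from the log above.

```python
import heapq


def partition_by_duration(est_times: dict[str, float], parts: int) -> list[list[str]]:
    """Greedily split tests into `parts` buckets with similar total duration."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # Min-heap of (accumulated seconds, bucket index): the lightest bucket pops first.
    heap = [(0.0, index) for index in range(parts)]
    buckets: list[list[str]] = [[] for _ in range(parts)]
    # Place the longest tests first; ties are broken by name for determinism.
    for name, seconds in sorted(est_times.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(heap)
        buckets[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return buckets


sample = {
    "test_v1_spec_decode.py": 1800,
    "test_mtp_eagle_correctness.py": 1500,
    "test_scoring.py": 500,
    "test_aclgraph_accuracy.py": 480,
    "test_aclgraph_batch_invariant.py": 410,
}
buckets = partition_by_duration(sample, parts=2)
for bucket_id, names in enumerate(buckets):
    print(bucket_id, sum(sample[n] for n in names), names)
```

Each partition would then be run as a separate parallel job selected by its id, which is how the wall-clock time approaches 1/n of the serial run when the estimates are accurate.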
Li Wang	c26ad78f86	[CI][lint] Add rule `codespell` back (#6236) ### What this PR does / why we need it? Some time after removing codespell, we found that `typos` fails to recognize certain misspelled words, so this PR adds codespell back. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 14:12:33 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986) ### What this PR does / why we need it? This patch aims to speed up the lint check. The main idea is to reduce unnecessary installation time. 1. Installing vllm is not necessary; appending the path of the vllm source tree to `PYTHONPATH` is sufficient. 2. Installing `requirements-dev.txt` is not necessary; we have a pre-built image `quay.io/ascend-ci/vllm-ascend:lint` with all the requirements installed in advance. NOTE: image builds are triggered by: 1) a daily scheduled build; 2) a build when requirements are modified; 3) a manual build. This keeps the dependencies in our image as up to date as possible. 3. `mypy` was separated from the `pre-commit` hook for performance reasons; we found that integrating `mypy` into the `pre-commit` hook resulted in poor performance. 4. Reduce the CPU core consumption from 16 to 8. ### Does this PR introduce _any_ user-facing change? The end-to-end lint time was reduced from 20 min per PR to 8 min per PR. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
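Point 1 above relies on the fact that a pure-Python source tree becomes importable once its parent directory is on `PYTHONPATH`, with no install step. The sketch below demonstrates this with a throwaway package; `fake_vllm`, the helper `import_from_source_tree`, and the temporary directory are hypothetical stand-ins, not the real vllm checkout or the CI workflow.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def import_from_source_tree(src_root: Path, module: str) -> str:
    """Import `module` in a fresh interpreter using only PYTHONPATH; return its file path."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    # Prepend the source root so it wins over any installed copy.
    env["PYTHONPATH"] = str(src_root) if not existing else str(src_root) + os.pathsep + existing
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}; print({module}.__file__)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a source checkout: a package directory that was never pip-installed.
    package_dir = Path(tmp) / "fake_vllm"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    location = import_from_source_tree(Path(tmp), "fake_vllm")
    print(location)
```

This is enough for static tools that only need to resolve imports, though it would not build any compiled extensions a real package might ship.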
SILONG ZENG	4811ba62e0	[Lint]Style: reformat markdown files via markdownlint (#5884 ) ### What this PR does / why we need it? reformat markdown files via markdownlint - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-15 09:06:01 +08:00
SILONG ZENG	78d5ce3e01	[Lint]Style: Convert `example` to `ruff format` (#5863 ) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-13 20:46:50 +08:00
SILONG ZENG	523e83016b	[Lint]Style: Convert `root`, `benchmarks`, `tools` and `docs` to `ruff format` (#5843) ### What this PR does / why we need it? This PR fixes linting issues in the root directory, benchmarks/, tools/ and docs/ to align with the project's Ruff configuration. This is part of a gradual effort to enable full linting coverage across the repository. The corresponding paths have been removed from the exclude list in pyproject.toml. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-13 15:29:34 +08:00
Mengqing Cao	8ed6f98a5a	[Build] Add installation script of fused_infer_attention_score kernel with flash decoding (#5402) ### What this PR does / why we need it? Add an installation script for the `fused_infer_attention_score` kernel with flash decoding. ### User-facing changes Users can install the `fused_infer_attention_score` kernel with the flash decoding feature by running `bash tools/install_flash_infer_attention_score_ops_a2.sh` or `bash tools/install_flash_infer_attention_score_ops_a3.sh` - vLLM version: release/v0.13.0 - vLLM main: `254f6b9867` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-27 02:01:06 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
Wang Yixuan	c68ddc11ce	[OPS] add bmm_transpose ops (#3990) ### What this PR does / why we need it? Add a new fused op to custom_op that combines `torch.bmm()` and transpose to achieve better performance. This op is used in mla_v1 to replace the separate bmm and transpose. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.2 --------- Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-01 09:09:51 +08:00
Chenxi Qian	554f16ae1f	[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 ) ### What this PR does / why we need it? This PR introduces support for adding custom CANN `aclnn` ops to `vllm-ascend`, allowing users to define and use their own custom operators. Key changes include: - Building and installing custom ops into the `vllm-ascend`-specified directory - Binding the `aclnn` op interface to the `torch.ops._C_ascend` module - Enabling invocation of these ops within `vllm-ascend` This PR includes a sample custom op: `aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from the CANN operator [`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md). Its input parameters `weight` and `weight_scale` now accept `list[torch.Tensor]` (i.e., `at::TensorList`). ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.11.2 --------- Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>	2025-11-28 18:06:39 +08:00
Nengjun Ma	f7d1f73b98	[CI] Remove unsupported python 3.9 format check (#4172) ### What this PR does / why we need it? - Fixes the lint test failure for Python 3.9, as Python 3.9 is no longer supported. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-11-13 16:47:24 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612) ### What this PR does / why we need it? This is step 1 of refactoring the code to adapt to vllm main; this PR is aligned with `17c540a993`. 1. Refactor deepseek to the latest code arch as of `17c540a993` 2. A number of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import, caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` import, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` import, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured output break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? 
### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
Chen Chen	bcc313e8f2	add mla_preprocess kernel (#3226 ) ### What this PR does / why we need it? - Adds the `mla_preprocess` custom kernel to provide an optimized pre-processing operator for Multi-head Latent Attention (MLA) on Ascend NPUs. - Wires the new kernel into the C++ extension pipeline so vLLM can invoke it directly, cutting Python-side tensor shuffling and memory copies that previously bottlenecked MLA compilation paths. ### Does this PR introduce any user-facing change? - No. The change only introduces a low-level kernel; public APIs and inference behavior remain unchanged. ### How was this patch tested? - Dedicated Ascend kernels are not covered by our CI yet, so no extra automated tests were added. Future MLA-focused regression runs will cover this path. - vLLM version: v0.11.0 Signed-off-by: Chen Chen <0109chenchen@gmail.com>	2025-10-12 07:39:45 +08:00
Li Wang	bdfb065b5d	[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011) ### What this PR does / why we need it? 1. Enable pymarkdown check 2. Enable python `__init__.py` check for vllm and vllm-ascend 3. Clean up code ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `29c6fbe58c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-25 22:16:10 +08:00
Li Wang	c7446438a9	[1/N][CI] Move linting system to pre-commits hooks (#1256) ### What this PR does / why we need it? Follow the vllm-project/vllm lint approach: https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml Enable pre-commit to avoid low-level errors as much as possible. This PR is one step of #1241. The purpose is to make the linting system clearer and more convenient. This step mainly enables the following checks: yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit, enforce-import-regex-instead-of-re. TODO: - clang-format (check for csrc with Google style) needs code cleanup, disabled for now - pymarkdown needs code cleanup, disabled for now - shellcheck needs code cleanup, disabled for now ### Does this PR introduce _any_ user-facing change? Only a developer UX change: https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally ``` pip install -r requirements-lint.txt && pre-commit install bash format.sh ``` ### How was this patch tested? CI passed with newly added and existing tests. Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) - vLLM version: v0.9.1 - vLLM main: `5358cce5ff` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-10 14:17:15 +08:00

17 Commits