xc-llm-ascend

Author	SHA1	Message	Date
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add a muls_add triton kernel with its related fusion pass. In addition, this PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with newly added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00
iiiklw	a0315f6697	[npugraph_ex]enable npugraph_ex by default (#6664 ) ### What this PR does / why we need it? This pull request enables the `npugraph_ex` backend by default to improve performance on Ascend NPUs, as proposed in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/6214). ### Does this PR introduce _any_ user-facing change? Yes. `npugraph_ex` is now enabled by default. Users can disable it by setting `enable: false` in the `npugraph_ex_config` section of the `additional_config`. ### How was this patch tested? CI passed. The changes are covered by existing and new E2E tests (`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`) that have been updated to reflect the new default behavior. The tests verify correctness and consistency with `npugraph_ex` enabled and disabled, as well as with the new static kernel option. Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com> Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>	2026-02-12 08:44:06 +08:00
ChenCangtao	6c30f8bf87	[Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775 ) ### What this PR does / why we need it? This is a part of https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762 1. Refactor the npugraph_ex config and change the default configuration of the static kernel; the new default value of static kernel is false 2. Support online inference with static kernel 3. Fix the issue where manually modifying FX graphs caused an abnormal model return type, and remove the related redundant code. ### Does this PR introduce _any_ user-facing change? Yes, the new config of npugraph_ex is as follows: ``` additional_config={ "npugraph_ex_config": { "enable": True, "enable_static_kernel": False } } ``` ### How was this patch tested? ``` vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \ --host 0.0.0.0 \ --port 8004 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --quantization ascend \ --seed 1024 \ --served-model-name deepseek_v3 \ --enable-expert-parallel \ --max-num-seqs 48 \ --max-model-len 40000 \ --async-scheduling \ --max-num-batched-tokens 9000 \ --trust-remote-code \ --no-enable-prefix-caching \ --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \ --gpu-memory-utilization 0.9 \ --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config \ '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}' ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-20 21:31:38 +08:00
zhangxinyuehfad	a5b099c73d	[Bugfix] Reset incompatible config (#6005 ) ### What this PR does / why we need it? This PR introduces compatibility fixes for running vLLM on Ascend NPU hardware. The changes ensure that GPU-specific parameters are automatically detected and reset to Ascend-compatible values with appropriate warnings logged. \| Module \| Parameter \| Default Value \| \|--------\|-----------\|---------------\| \| Model Config \| `disable_cascade_attn` \| `False` \| \| Parallel Config \| `all2all_backend` \| `"allgather_reducescatter"` \| \| Cache Config \| `cpu_kvcache_space_bytes` \| `None` \| \| MultiModal Config \| `mm_encoder_attn_backend` \| `None` \| \| Observability Config \| `enable_layerwise_nvtx_tracing` \| `False` \| \| Scheduler Config \| `max_num_partial_prefills` \| `1` \| \| Speculative Config \| `quantization` \| `None` \| \| KV Transfer Config \| `kv_buffer_size` \| `1e9` \| \| KV Transfer Config \| `enable_permute_local_kv` \| `False` \| \| Attention Config \| `use_prefill_decode_attention` \| `False` \| \| Attention Config \| `use_cudnn_prefill` \| `False` \| \| Attention Config \| `use_trtllm_ragged_deepseek_prefill` \| `False` \| \| Attention Config \| `use_trtllm_attention` \| `False` \| \| Attention Config \| `disable_flashinfer_prefill` \| `False` \| \| Attention Config \| `disable_flashinfer_q_quantization` \| `False` \| \| Attention Config \| `flash_attn_version` \| `None` \| \| Attention Config \| `backend` \| `None` \| \| Attention Config \| `flash_attn_max_num_splits_for_cuda_graph` \| `32` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 11:02:38 +08:00
aipaes	f58e110afe	[feat] switch for fusion ops gmmswigluquant (#5992 ) ### What this PR does / why we need it? Add an additional config parameter to control whether the gmmswigluquant fusion operator is enabled; it is enabled when set to True. When enabled with a small number of GPUs, the gmmswigluquant fused operator can cause some performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` #### Perf test model: GLM 4.6(w8a8) - single A3 node(ep16, tp16), async-scheduling, mtp, FULL_DECODE_ONLY - bs=1, input_lens=32000, output_lens=1024 Without this PR: TPOT 32.22 ms With this PR: TPOT 30.23 ms --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-01-19 13:19:25 +00:00
LI SHENGYONG	da958ee386	[EPLB]Eplb Config Renaming (#5533 ) ### What this PR does / why we need it? 1. Rename num_iterations_eplb_update to expert_heat_collection_interval. 2. Rename num_wait_worker_iterations to algorithm_execution_interval. 3. Rename init_redundancy_expert to num_redundant_experts because the variable with the same meaning in vLLM is named this way. 4. Delete gate_eplb because we don't need this feature. 5. Move eplb config into a dict in additional config. 6. Depends on pr5817 ### Does this PR introduce _any_ user-facing change? Before this PR: `--additional-config '{"dynamic_eplb":true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'` After this PR: `--additional-config '{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000, "algorithm_execution_interval":150,"num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'` ### How was this patch tested? #### test qwen3-235b eplb num_redundant_experts=16 without pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| with pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-15 10:26:44 +08:00
Jade Zheng	38570cfeb6	[Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072 ) By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in disagg-prefill scenario. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com> Co-authored-by: ghphotoframe <854746559@qq.com> Co-authored-by: alex101-ops <alex1015718386@gmail.com>	2025-12-31 14:24:04 +08:00
panchao-hub	8069442b41	enable npugraph_ex (#5120 ) ### What this PR does / why we need it? We will expose the enabling switch for npugraph_ex to better facilitate subsequent optimization. ### Does this PR introduce _any_ user-facing change? Previously, the enable_npugraph_ex switch would trigger an error; now we have removed the error reporting mechanism to better facilitate subsequent optimization efforts. Basic functionalities are available in CANN and torch_npu for Q3, while advanced optimizations will depend on the Q4 release. ### How was this patch tested? llm = LLM( model=model, enforce_eager=False, additional_config={ "enable_npugraph_ex": True }, compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16], }, ) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: p00465316 <panchao13@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-18 09:08:40 +08:00
Icey	18221c0e1d	[Fusion] normalize fusion naming and enable e2e test (#4693 ) ### What this PR does / why we need it? This PR standardizes the fusion naming, changing `enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e testing. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-11 17:53:43 +08:00
ChenCangtao	dd622aa6a6	[Feature] Support npugraph_ex backend (#4700 ) ### What this PR does / why we need it? We introduced the npugraph_ex backend through vllm's adaptor dispatch mechanism to accelerate aclgraph. This solution is based on torch.compile and uses torchair to optimize the fx.graph. The performance gains are mainly obtained from the static kernel. We conducted tests on Qwen3-30B and achieved over 5% performance optimization. ### Does this PR introduce _any_ user-facing change? Yes, we add a new switch named "enable_npugraph_ex" in additional_config, which defaults to False. We also add an example to show how to register a custom replacement pass ### More information about this PR This feature depends on the release of CANN and torch_npu in Q4. We tested it on a package that has not been publicly released yet and verified that the functionality works. This feature is still experimental at the moment; setting the config to true will directly raise an error. Merging into the main branch initially involves some preliminary commits to facilitate subsequent development and testing of the feature, as well as to avoid submitting an excessively large PR at once. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com> Co-authored-by: panchao-hub <315134829@qq.com> Co-authored-by: wbigat <wbigat@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 20:48:05 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
Icey	178ca1607e	Adopt inductor fusion and define quantization fusion pass (#4168 ) ### What this PR does / why we need it? The main goal of this PR is to alleviate the high maintenance burden from model duplication when we do model optimization. Some of our optimized models diverge a little from vllm's modeling but need to rewrite several parts of the original one, which brings a non-negligible maintenance burden to vllm-ascend. To solve that, we propose to leverage `torch.compile` and the `inductor pattern matcher` to automatically fuse the patterns we want to merge. For more details, refer to the RFC https://github.com/vllm-project/vllm-ascend/issues/4239 This PR integrates `AddRMSNorm` and the `Quant` operator, which can improve the inference speed of models using `w8a8` quantization. ### Does this PR introduce _any_ user-facing change? Yes, add new additional_config ### How was this patch tested? ```python def main(): prompts = [ "The president of the United States is Mr.", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="/root/.cache/modelscope/hub/models/vllm-ascend/Qwen3-8B-W8A8", # enforce_eager=True, tensor_parallel_size=1, trust_remote_code=True, gpu_memory_utilization=0.7, quantization="ascend", ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` ```text Prompt: 'The president of the United States is Mr.', Generated text: ' Trump. The president of the United States is Mr. Biden. Which of the following statements is correct? \n\nA. Mr. Trump is Mr. Biden. \nB. Mr. Trump is not Mr. Biden. \nC. The president of the United States is not Mr. Trump. \nD. The president of the United States is not Mr. Biden.\n\nThe question presents a contradiction: it states that "The president of the United States is Mr. Trump" and "The president of' ``` - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-04 10:29:48 +08:00
wangxiyuan	400af665e6	[CI] Drop ascend scheduler from test (#4613 ) Drop ascend scheduler from test - vLLM version: v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-02 13:18:17 +08:00
Mengqing Cao	517fd9272d	Revert "drop ascend scheduler" (#4580 ) Reverts vllm-project/vllm-ascend#4498 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-11-29 22:20:48 +08:00
wangxiyuan	f10acddb78	drop ascend scheduler (#4498 ) The Ascend scheduler was added earlier for the non-chunked-prefill case, because the NPU ops didn't work well with chunked prefill. Now that the ops work better with chunked prefill, it's time to remove the Ascend scheduler and use the vLLM default scheduler. - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 16:18:34 +08:00
zzhxxx	46a41b26d3	oproj TP support acl graph (#4073 ) ### What this PR does / why we need it? Reference #2167; oproj TP now supports ACL graph. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2025-11-11 19:39:06 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlapping with routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
1Fire4	1f6465c399	Add an option to enable frozen parameters (#2869 ) ### What this PR does / why we need it? Add an option to enable frozen parameters ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>	2025-09-17 12:00:44 +08:00
yiz-liu	88ca8a051c	[Feat][Graph] Support DeepSeek with ACL Graph (#2707 ) ### What this PR does / why we need it? In memory of #677 , a long overdue milestone. Now DeepSeek V3/R1 should be OK with ACL Graph. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Working on it. - vLLM version: v0.10.2 - vLLM main: `68dbde5dbb` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-16 17:50:17 +08:00
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces Oproj matrix tensor model parallel to reduce memory consumption. It only supports graph mode in the pure DP scenario. In a deepseek r1 w8a8 PD disaggregated Decode instance, using pure DP, with oproj_tensor_parallel_size = 8, TPOT increases by 1 ms while saving 5.8 GB of NPU memory per rank. We got the best performance with oproj_tensor_parallel_size=4, with no TPOT increase. performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| default value is None, once this value is set, the feature will be enabled, head num * head dim must be divisible by this value. \| example `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
linfeng-yuan	90a75a90a9	[bugfix] fix torchair runtime error caused by configuration mismatches and missing files (#2532 ) ### What this PR does / why we need it? This PR ports #2312 #2506 #2531 to main branch. The original implementation of torchair caching forces users to prepare everything, fix all the configuration and enable `use_cached_npu_graph`, which can cause problems that are confusing for users to understand and tackle. It is better to compile the graph twice instead of reusing the old kvcaches and cached torchair graph, and the extra time is acceptable. Additionally, this PR fixes a recompilation problem of torchair graph mode caused by the `running_in_graph` variable in `AscendMLATorchairImpl`. ### Does this PR introduce _any_ user-facing change? If users want to enable torchair.cache_compile with high compilation speed, it is recommended to enable both `use_cached_kv_cache_bytes` and `use_cached_graph` in `torchair_graph_config`. Without `use_cached_kv_cache_bytes`, we'll compile the torchair computation graph twice to avoid runtime errors caused by configuration mismatches (the second compilation will be much faster). Additionally, we've changed how the TORCHAIR_CACHE_HOME environment variable is used, adding a suffix directory to enhance safety and prevent accidental file deletion. ### How was this patch tested? CI and e2e vllm serving pass. - vLLM version: v0.10.1.1 - vLLM main: `70549c1245` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-03 17:56:12 +08:00
panchao-hub	ea53f9076e	support torchair mode (#2641 ) ### What this PR does / why we need it? support torchair mode ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `5438967fbc` Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>	2025-09-01 15:49:07 +08:00
lidenghui1110	600b08f754	[Feat]: Add custom lmhead tensor model parallel (#2309 ) ### What this PR does / why we need it? This PR introduces LMhead tensor model parallel to reduce memory consumption and improve TPOT. It supports both eager mode and graph mode. In a deepseek r1 w8a8 PD disaggregated Decode instance, using pure DP, with lmhead_tensor_parallel_size = 8, TPOT improves by 1 ms and 1.48 GB of NPU memory is saved per rank. performance data: <img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| lmhead_tensor_parallel_size \| Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces \| No \| int \| default value is None, once this value is set, the feature will be enabled, vocab_size must be divisible by this value. \| example `--additional_config={"lmhead_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `de533ab2a1` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zhangzihang <zzh_201018@outlook.com>	2025-08-29 11:41:21 +08:00
Nicholas Tao	7bec1a9b9c	qwen3_moe/qwen25 support torchair graph (#2403 ) ### What this PR does / why we need it? Added support for the TorchAir graph mode in qwen3_moe and qwen2.5 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=True, max_model_len=4096, max_num_seqs=16, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.4, additional_config={ "torchair_graph_config": { "enabled": True, "use_cached_graph": False, "graph_batch_sizes_init": False, "graph_batch_sizes": [16] }, "ascend_scheduler_config": { "enabled": True, "chunked_prefill_enabled":True, }, "refresh": True, }, ) ``` - vLLM version: v0.10.0 - vLLM main: `b87cb97a53` Signed-off-by: taoyuxiang <oui.nicholas.tao@gmail.com>	2025-08-20 11:23:50 +08:00
22dimensions	8cf97d8310	[Misc] Add extra checking to torchair_graph_config. (#1939 ) ### What this PR does / why we need it? cherry-pick #1675 to main This PR adds validation checking to torchair_graph_config for better reliability. Co-authored-by: whx-sjtu <2952154980@qq.com> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `2836dd73f1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-01 09:24:11 +08:00
Mengqing Cao	8cfd257992	[Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681 ) ### What this PR does / why we need it? Remove ETP/EP maintained in branch main. We drop this as there are no relevant scenarios that use ETP now, and we may subsequently advocate implementing expert tensor parallelism in vLLM to support scenarios where the experts need to be sliced. This is a part of #1422 backport. Fixes https://github.com/vllm-project/vllm-ascend/issues/1396 https://github.com/vllm-project/vllm-ascend/issues/1154 ### Does this PR introduce _any_ user-facing change? We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in vllm instead. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.9.2 - vLLM main: `fe8a2c544a` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-21 09:08:04 +08:00
wangxiyuan	787010a637	[Test] Remove VLLM_USE_V1 in example and tests (#1733 ) V1 is enabled by default, so there is no need to set it by hand now. This PR removes the useless setting in examples and tests - vLLM version: v0.9.2 - vLLM main: `9ad0a4588b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 12:49:57 +08:00
wangxiyuan	7bdada58eb	[Misc] Remove VLLM_USE_V1 usage in code (#1764 ) We plan to remove V0 code from this version. The first step is to delete v0 usage. Related: https://github.com/vllm-project/vllm-ascend/issues/1620 - vLLM version: v0.9.2 - vLLM main: `61e20828da` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 11:52:16 +08:00
Yikun Jiang	0c1d239df4	Add unit test local cpu guide and enable base testcase (#1566 ) ### What this PR does / why we need it? Use the base test and clean up all manual patch code - Clean up EPLB config to avoid tmp test file - Use BaseTest with global cache - Add license - Add a doc to set up unit tests in a local env ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-06 10:42:27 +08:00
wangxiyuan	343955c7ac	[CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625 ) The commit `78fe77534b` from vllm reverted the change for FusedMoEParallelConfig. This PR does the same to fix the CI error Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-04 17:54:33 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now we can also use it with pangu. Besides, we add a supported-model list to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
wangxiyuan	69b817ed65	[CI] Add unit test framework (#1201 ) This PR adds the unit test framework to enable UT for vLLM Ascend. Unit tests run on CPU machines. They run once the lint check has passed, the same as the e2e tests. For unit tests, this PR creates a new folder called `ut` under the `tests` module. The test files in `ut` should mirror the layout of the code in `vllm-ascend`. File names should start with the `test_` prefix. For example, in this PR `test_ascend_config.py` is added to test `ascend_config.py`. A new file `worker/test_worker_v1.py` is also added as a placeholder. This file should be the unit test for `vllm-ascend/worker/worker_v1.py`. Additionally, a new `fake_weight` folder is added; it contains the config.json from `facebook/opt-125m`, so that the tests do not always need to access Hugging Face. TODO: We should add all the unit test files one by one in the future. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-16 18:32:28 +08:00

32 Commits