xc-llm-ascend

Author	SHA1	Message	Date
zhangyiming	1c954ff264	[main2main] upgrade vllm to 0308 (#7213 ) ### What this PR does / why we need it? Update main2main to vllm 0308. breaks: * https://github.com/vllm-project/vllm/pull/30681 * https://github.com/vllm-project/vllm/pull/35552 remove self.cudagraph_batch_sizes * https://github.com/vllm-project/vllm/pull/35158 clear_metadata -> defer_finalize * https://github.com/vllm-project/vllm/pull/36006 remove CacheConfig.cpu_offload_gb * https://github.com/vllm-project/vllm/pull/35472 * https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder * https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens * https://github.com/vllm-project/vllm/pull/28053 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-03-18 09:24:43 +08:00
zxr2333	5645ca8392	[BugFix]A2 MOE method&& layerwise MTP bugfix && Mamba gdn_metadata bugfix (#7364 ) ### What this PR does / why we need it? This PR contains several bug fixes: 1. On A2, the number of experts per card cannot exceed 16 when using MC2. This PR fixes an error in the A2 MoE communication method selection that caused an incorrect method to be selected when the number of model experts exceeds 256. For example, when a 16-card A2 setup loads the PD-disaggregation D node with Qwen3.5 series models, the MC2 method was incorrectly chosen. 2. Fixes the issue where the layerwise connector sends the kv-cache of the MTP layer multiple times when `num_spec_tokens` > 1. Now the kv-cache is sent only on the first forward pass of the MTP layer. 3. Fixes the accuracy issue of Qwen3.5 when using MTP with PD disaggregation. The cause is that `num_decode_draft_tokens` does not account for the fact that `spec_tokens` do not yet exist during the first inference under PD disaggregation (`spec_tokens` are generated during the first inference), while `spec_tokens_padding` is still added by `recomputed_scheduler`. As a result, `gdn_metadata` incorrectly treats the step as a prefill of length 2. --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-17 23:03:45 +08:00
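The per-card limit described in item 1 of #7364 can be sketched as a simple guard. This is a minimal illustration, not the vllm-ascend selection logic: the names `MAX_MC2_EXPERTS_PER_CARD` and `can_use_mc2` are hypothetical, and the sketch assumes experts are split evenly across cards.

```python
# Hypothetical helper; the names here are illustrative, not vllm-ascend API.
MAX_MC2_EXPERTS_PER_CARD = 16  # limit stated in the PR description for A2


def can_use_mc2(num_experts: int, num_cards: int) -> bool:
    """Return True if MC2 is admissible under the per-card expert limit.

    Assumes experts are distributed evenly, so the busiest card holds
    ceil(num_experts / num_cards) experts.
    """
    experts_per_card = -(-num_experts // num_cards)  # ceiling division
    return experts_per_card <= MAX_MC2_EXPERTS_PER_CARD
```

With 16 cards, 256 experts gives exactly 16 per card and passes the guard, while anything above 256 does not, which matches the threshold named in the description.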
pichangping	3f39ac9c8d	[Feature]Supports DSv3.1 PD separation and C8 quantization (#7222 ) Co-authored-by: kunpengW-code <1289706727@qq.com> Co-authored-by: linsheng1 <1950916997@qq.com> ### What this PR does / why we need it? Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which aims to reduce the GPU memory usage of the KV cache and improve the inference throughput. Constraints: 1. Only the PD separation mode can be used and MooncakeLayerwiseConnector can be used to run the model. 2. Currently, only the activation value supports dynamic quantization, and the KV cache supports static quantization. C8 quantization with MTP is not supported. You can use ModelSlim for quantization. The quantization procedure is as follows: pip install transformers==4.48.2 git clone https://gitcode.com/Ascend/msmodelslim.git cd msmodelslim bash install.sh cd example/DeepSeek/ python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pichangping <1337510399@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-03-16 22:49:05 +08:00
drslark	a6f6e919e6	[main][bugfix] Fixed the problem that eagle3 will crash in FULL_DECODE_ONLY (#7290 ) ### What this PR does / why we need it? Two problems are solved in this PR. Both occur in `FULL_DECODE_ONLY` mode, where `num_tokens` must be padded to a value in `cudagraph_capture_sizes`. 1. The length of `seq_lens_list` in the drafter's `attn_metadata` is 1 shorter than expected, which raises a kernel exception and crashes vLLM. E.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is correctly padded to [4, 8, 12, 20], but `seq_lens_list` = [5742, 4700, 7996] is not padded. 2. Although the length of `seq_lens_list` in the target's `attn_metadata` is as expected in `FULL_DECODE_ONLY`, the data at the end of the list is corrupted. E.g., with `num_reqs` = 3 and `cudagraph_capture_sizes` = [20], `actual_seq_lengths_q` is correctly padded to [4, 8, 12, 20], but `seq_lens_list` = [5742, 4700, 7996, 5738] is corrupted at the end of the list. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-16 20:41:36 +08:00
rjg-lyh	4d443b9228	[bugfix] restore pr-7029 and fix patch error (#7294 ) ### What this PR does / why we need it? This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using the `lightning_indexer_quant` ops in the pd-mix stage. The original PR was reverted by #7288 because the patch did not work with the recompute scheduler. This PR also fixes the patching issue so that it works correctly with the recompute scheduler. ### Does this PR introduce _any_ user-facing change? Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to `"true"` in `additional_config`. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-16 15:39:42 +08:00
Mengqing Cao	0c299f79b9	Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029 )" (#7288 ) ### What this PR does / why we need it? This reverts commit `7ed9e9de69`, which introduces an issue that the patch doesn't work with recompute scheduler enabled. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-15 20:19:09 +08:00
Angazenn	ce5544bfc1	[Hybrid] support prefix cache for Qwen3.5/Next with `--mamba-cache-mode align` (#7103 ) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-03-15 09:44:09 +08:00
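Note 2 of #7103 (a `block_size` of 2048 means shorter prefixes are never cached) follows from block-granular caching. The sketch below illustrates the arithmetic only; `cacheable_prefix_tokens` is a hypothetical helper, and it assumes that only completely filled blocks are eligible for the prefix cache.

```python
def cacheable_prefix_tokens(prefix_len: int, block_size: int) -> int:
    """Number of leading tokens that fall into completely filled blocks.

    Assumption: the prefix cache stores whole blocks only, so a partial
    trailing block contributes nothing.
    """
    return (prefix_len // block_size) * block_size
```

Under that assumption, a 2047-token prefix yields 0 cacheable tokens at `block_size` 2048, and a 5000-token prefix yields 4096.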
Mengqing Cao	e7aa2c285c	[SpecDecode] Fix Draft model proposer (#7230 ) ### What this PR does / why we need it? This PR fixes the unified draft parallel feature. 1. In the draft model proposer, the target model can have more than one attention layer, so the assertion on the layer number is removed. 2. The block size should be obtained through `draft_attn_groups` instead of `attn_metadata_builder` after 0.17.0. 3. `attn_update_stack_num_spec_norm` should not be performed when unified draft parallel is enabled. ### How was this patch tested? Tests pass with `tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_parallel_drafting_acceptance`, which is already included in CI - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-14 18:26:37 +08:00
Mengqing Cao	986cd45397	[Version] Drop 0.16.0 support (#7153 ) ### What this PR does / why we need it? Drop 0.16.0 support in main - Fix eagle proposer break introduced by https://github.com/vllm-project/vllm/pull/34552. Mainly change to use the draft attention group to initialize the attention metadata builder. - Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes` error, which is a bug in vLLM v0.17.0, and fixed by a later pr https://github.com/vllm-project/vllm/pull/30515 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-13 16:14:15 +08:00
rjg-lyh	7ed9e9de69	[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029 ) ### What this PR does / why we need it? This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops in pd-mix stage mainly. Because the code for the current PD-disaggregated scenario is still under refactoring and cleanup, this PR prioritizes ensuring the C8 functionality in the pd-mix scenario. The next steps are planned in two parts: ① Once the optimized scatter operator is updated, we will replace the original operator to improve the performance of storing k_scale. ② Once the code logic for the PD-disaggregated scenario becomes stable, we will carry out more comprehensive validation and make appropriate adaptations. ③ Because enabling C8 currently introduces several new operators whose performance still needs improvement, performance may regress in some scenarios. Therefore, only after all the operators are fully ready can we ensure that this feature does not cause any performance degradation. At that point, we will enable this feature by default and remove the switch in `additional_config`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-13 14:47:42 +08:00
kx	df1ee8070d	[feat][spec decode]Unified draft parallel (#6766 ) ### What this PR does / why we need it? Implement a unified parallelized speculative decoding in VLLM Ascend，which can simultaneously support parallel speculative inference schemes such as Pard, P-Eagle, etc. refer to https://github.com/vllm-project/vllm-ascend/pull/6565 and https://github.com/vllm-project/vllm-ascend/pull/4078 ### How was this patch tested? run with parallel drafting script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' base script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 benchmark script: MAX_CONCURRENCY=1 NUM_PROMPTS=80 vllm bench serve --port 8811 \ --temperature 0 \ --model /model/Llama-3.1-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --dataset-path philschmid/mt-bench \ --num-prompts ${NUM_PROMPTS} \ --max-concurrency ${MAX_CONCURRENCY} \ --seed 1234 test results : base(without spec decode): TTFT 79.46ms TPOT 26.99ms output_tokens_throughput 36.75 tok/s this pr(with parallel drafting): TTFT 72.24ms TPOT 13.45ms output_tokens_throughput 72.98 tok/s per-position acceptance(from position 0 to 7): 79.48%、56.93%、40%、27.90%、19.79%、14.25%、10.57%、7.61%. 
---------------------------------------------------------------------- run on qwen3 model script ： export target=/model/Qwen3-1.7B export draft=/model/PARD-Qwen3-0.6B export CUDA_VISIBLE_DEVICES=1 export ASCEND_RT_VISIBLE_DEVICES=1 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' cc @NickJudyHvv - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: 01267596 <xiongkai123@cmbchina.com> Signed-off-by: kx <1670186653@qq.com> Signed-off-by: HF-001 <1670186653@qq.com> Co-authored-by: 01267596 <xiongkai123@cmbchina.com>	2026-03-13 14:07:35 +08:00
Qiu	c377e73933	Perf(PP): support PP with async scheduling. (#7136 ) ### What this PR does / why we need it? Follow up the PR https://github.com/vllm-project/vllm/pull/32618, this PR provides async scheduling support for PP in vllm-ascend. --- - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-13 10:27:23 +08:00
zxr2333	fe4cad24e9	[BugFix]fix qwen3.5 reshape_kvcache bug (#7209 ) ### What this PR does / why we need it? This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the Mamba cache for models like Qwen3.5. The previous implementation did not correctly handle cases where the KV cache tensors have different data types. This change ensures that slicing is performed based on byte offsets before reshaping the tensors, which correctly handles heterogeneous dtypes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-12 23:51:40 +08:00
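The idea in #7209, slicing a shared buffer by byte offsets before reinterpreting each slice with its own dtype, can be illustrated with the Python standard library. This is only a sketch of the technique, not the `reshape_kvcache_tensors` code: `split_by_byte_offsets` and its `layout` argument are hypothetical, and `struct` format characters stand in for tensor dtypes.

```python
import struct


def split_by_byte_offsets(raw: bytearray, layout):
    """Slice `raw` by byte offsets first, then reinterpret each slice.

    `layout` is a list of (struct_format_char, element_count) pairs,
    a hypothetical stand-in for per-tensor dtype and shape.
    """
    views = []
    offset = 0
    for fmt, count in layout:
        nbytes = struct.calcsize(fmt) * count  # size in bytes, not elements
        views.append(memoryview(raw)[offset:offset + nbytes].cast(fmt))
        offset += nbytes
    return views


# One raw buffer holding 4 float32 values followed by 4 int16 values.
raw = bytearray(4 * 4 + 4 * 2)
floats, shorts = split_by_byte_offsets(raw, [("f", 4), ("h", 4)])
shorts[0] = 7  # writes through to bytes 16..17 of the shared buffer
```

Computing offsets in elements of a single dtype would misplace the second region whenever the dtypes differ in width, which is the failure mode the PR describes.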
drslark	de93790d08	[main][bugfix] Fixed the problem of drafter crashed in FULL mode (#7158 ) ### What this PR does / why we need it? The merged graph of draft in `FULL` mode is broken now. This pr solves it. Also, `actual_seq_lengths_q` in `model_runner` is found redundant, so, it is removed. It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and https://github.com/vllm-project/vllm-ascend/pull/7148. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Test code is shown as below: ```python prompts = [ "1.Who are you?", "2. Who are you?", ] sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200) llm = LLM( model="/home/some-model/Meta-Llama-3.1-8B-Instruct", tensor_parallel_size=1, max_num_seqs=32, # enforce_eager=True, disable_log_stats=False, distributed_executor_backend="mp", gpu_memory_utilization=0.7, async_scheduling=True, speculative_config={ "enforce_eager": True, "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B", "disable_padded_drafter_batch": False, "method": "eagle3", "num_speculative_tokens": 3, }, compilation_config={ "cudagraph_mode": "FULL", "cudagraph_num_of_warmups": 1, }, max_model_len=4096, enable_prefix_caching=False, ) outputs = llm.generate(prompts, sampling_params) ``` The result before: ```text File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia graph_params.events[num_tokens].append(event) ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ KeyError: 132 ``` The result after: ```text -------------------------------------------------- total_num_output_tokens: 400 num_drafts: 242 num_draft_tokens: 726 num_accepted_tokens: 156 mean acceptance length: 1.64 -------------------------------------------------- acceptance at token 0: 0.42 acceptance at token 1: 0.16 acceptance at token 2: 0.07 ``` We also test `FULL_DECODE_ONLY` mode. 
The result is: ```text -------------------------------------------------- total_num_output_tokens: 400 num_drafts: 244 num_draft_tokens: 732 num_accepted_tokens: 155 mean acceptance length: 1.64 -------------------------------------------------- acceptance at token 0: 0.42 acceptance at token 1: 0.16 acceptance at token 2: 0.06 ``` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-12 18:38:50 +08:00
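The figures reported for #7158 are consistent with a mean acceptance length of one guaranteed token per draft step plus the accepted draft tokens per draft. That formula is an assumption about how the metric is derived, checked here only against the numbers in the PR description; `mean_acceptance_length` is a hypothetical helper name.

```python
def mean_acceptance_length(num_accepted_tokens: int, num_drafts: int) -> float:
    """One token per draft step plus the average accepted draft tokens.

    Assumed formula, verified only against the figures quoted in the PR.
    """
    return 1 + num_accepted_tokens / num_drafts


full_mode = mean_acceptance_length(156, 242)         # FULL mode run
full_decode_only = mean_acceptance_length(155, 244)  # FULL_DECODE_ONLY run
```

Both values round to the reported 1.64.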
无脸男	09d26754cd	[Bugfix] Fix the issue where no exception is thrown when graph capture fails. (#5644 ) ### What this PR does / why we need it? Fix the issue where no exception is thrown when graph capture fails. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: WithHades <244036962@qq.com>	2026-03-12 16:14:45 +08:00
XiaoxinWang	37d1bd8c50	fixed fia pad logic in graph mode. (#7144 ) ### What this PR does / why we need it? Related to vLLM PR #34043, which deletes the function `relax_for_mixed_batch_cudagraphs`, so `num_reqs` no longer equals the actual number of requests. Because the FIA operator requires that `query_start_loc[-1]` equal the total number of computed tokens, this deletion causes an IFA error. In full graph mode, set num_reqs_paded = num_reqs to fix the error. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2026-03-12 14:50:54 +08:00
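The invariant named in #7144, that `query_start_loc[-1]` equals the total token count, holds by construction when `query_start_loc` is a cumulative sum over the requests actually present. The sketch below shows only that invariant; `build_query_start_loc` is a hypothetical helper and does not reproduce the padding logic of the PR.

```python
from itertools import accumulate


def build_query_start_loc(num_scheduled_tokens):
    """Cumulative start offsets, with a leading 0 and the total at the end."""
    return [0, *accumulate(num_scheduled_tokens)]


tokens_per_request = [4, 4, 4]
query_start_loc = build_query_start_loc(tokens_per_request)
```

If the request count used to build the list differs from the real one, the last entry no longer matches the token total, which is the mismatch the description attributes the error to.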
lilinsiman	a5ea699e29	[eagle][cp] fix eagle_cp enable bug2 (#7079 ) ### What this PR does / why we need it? Fix acceptance and high-concurrency bug in eagle3 and cp enabled ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tests and ut - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-10 16:32:49 +08:00
yupeng	40f7d93f1a	[bugfix][LoRA] Fix the lora accuracy issue introduced by the upstream vLLM changed. (#6958 ) ### What this PR does / why we need it? Fix the LoRA e2e test accuracy issue that introduced by the upstream PR https://github.com/vllm-project/vllm/pull/32005 ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_llama32_lora.py - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: paulyu12 <507435917@qq.com> Signed-off-by: yupeng <507435917@qq.com>	2026-03-10 10:43:18 +08:00
王远	82fdd40d49	[Feat]Xlite Qwen3 MoE Support Data Parallel (#6715 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE data parallel in Xlite. For more details about Xlite, please refer to the following link:[https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md). online server config: ```shell port=$1 log=$2 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 export HCCL_BUFFSIZE=512 export HCCL_OP_EXPANSION_MODE="AIV" export OMP_PROC_BIND=false export VLLM_ASCEND_ENABLE_NZ=0 sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 ip=127.0.0.1 python -m vllm.entrypoints.openai.api_server \ --model /mnt/nvme1n1/wy/models/Qwen3-30B-A3B \ --tensor-parallel-size 2 \ --enable-expert-parallel \ --data-parallel-size 4 \ --gpu-memory-utilization 0.9 \ --max-num-batched-tokens 32768 \ --data-parallel-size-local 4 \ --max-num-seqs=200 \ --block-size 128 \ --max-model-len 6656 \ --trust-remote-code \ --disable-log-requests \ --served-model-name qwen \ --no-enable-prefix-caching \ --additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}, "enable_cpu_binding": true}' \ --compilation-config '{"cudagraph_capture_sizes":[1, 16, 32, 48, 64, 100, 150, 200], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --async-scheduling \ --host ${ip} \ --port ${port} > ${log} 2>&1 & ``` test_config: ```shell vllm bench serve \ --max-concurrency ${maxconcurrency} \ --num-prompts ${num_prompts} \ --host ${HOST} \ --port ${PORT} \ --model ${MODEL_NAME} \ --dataset-name random \ --backend openai-chat \ --random-input-len 512 \ --random-output-len 512 \ --random-range-ratio 0.2 \ --temperature 0.6 \ --metric-percentiles "50,90,99" \ --tokenizer ${TOKENIZER_PATH} \ --endpoint /v1/chat/completions \ --ignore-eos ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
- vLLM version: v0.16.0 - vLLM main: `c86cdcbcd2` Signed-off-by: uuzWY <Ethan.wangyuan@huawei.com> Co-authored-by: uuzWY <Ethan.wangyuan@huawei.com>	2026-03-09 17:53:35 +08:00
Cao Yi	cb4c7de856	[Perf] Optimize MTP execution by reordering state update operation (#6844 ) ## Summary - Move `_update_states_after_model_execute` call from after main model sampling to after draft model execution - This reordering reduces pipeline bubbles between main model and draft model execution - No accuracy impact - the state update operation is independent of draft token proposal ## Performance Impact Reduces idle time between main model and draft model execution stages, improving overall MTP (Multi-Token Prediction) performance. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-09 15:55:27 +08:00
zxr2333	d39d80830c	[KVCache]Qwen3.5 supports contiguous tensor hybrid-attn kv-cache (#6887 ) ### What this PR does / why we need it? Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid model, such as Qwen3Next and Qwen3.5. Due to the restrictions of Ascend operators, all KV tensors, conv tensors, and SSM tensors must be contiguous. Therefore, this PR uses the following solution to generate the KV cache: tensor1: [(kv_padding), conv , ...] tensor2: [k , ssm , ...] tensor3: [v , (mamba_padding), ...] Under this scheme, although some waste may occur, the tensors of all caches are guaranteed to be contiguous. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-09 15:28:40 +08:00
Cao Yi	aef9d4249d	[Perf] Avoid CPU sync in mrope_positions copy by using full tensor copy (#7014 ) ### What this PR does / why we need it? The index-select operation `mrope_positions.gpu[:, :total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU synchronization, which blocks subsequent operator dispatch and causes bubbles visible in Profiling. This PR changes to full tensor copy (`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync point. The trade-off is a negligible increase in memory usage since `mrope_positions.cpu` is a small tensor. Result: ~2-3% TPOT improvement with the profiling bubbles eliminated. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified via Profiling that the CPU sync bubble is eliminated and TPOT is reduced by 2-3%. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-09 14:46:37 +08:00
wangxiaoteng888	a3f4f6b10b	[P/D][Bugfix] Layerwise stacking MTP error. (#7036 ) ### What this PR does / why we need it? The community has added a cleaning mechanism for the metadata after the main model finishes running. The MTP layer should not clean the metadata, and a new condition has been added to avoid cleaning it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-03-09 10:55:43 +08:00
drslark	6a7115fa0d	[main][feature] Support quarot for eagle3 without embedding (#7038 ) ### What this PR does / why we need it? If an `eagle3` model without embed_tokens is used with a `quarot` target model, the acceptance rate drops. This PR fixes that. The related vLLM PR is https://github.com/vllm-project/vllm/pull/36225. - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-09 10:43:06 +08:00
lilinsiman	01d3515dcf	[eagle][cp][bugfix] Fix the bug in eagle and cp enabled (#6981 ) ### What this PR does / why we need it? When eagle and cp are enabled at the same time, there is an error in pcp_allgather due to hidden_states. This PR fixes this issue. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-06 20:49:49 +08:00
ZhaoJiangJiang	a51d6366b9	[Bugfix] Qwen3Next support FlashComm1 (#6830 ) ### What this PR does / why we need it? Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence Parallel (SP) and resolve precision problems in shared_out when both FlashComm1 is enabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com> Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>	2026-03-06 17:14:08 +08:00
Zetong Li	a2696006d1	[Refactor][EAGLE] 8/N delete mtp_proposer (re-pull) (#7033 ) ### What this PR does / why we need it? NOTE: This PR is re-pull of #7016 since ci mistakenly marked unfinished pr as having passed. This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and glm5, now it should be ok to remove mtp_proposer. The bug is actually about unnecessary slicing of `slot_mapping`. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-06 17:11:22 +08:00
wangxiyuan	16c3b0b822	Revert "[Refactor][EAGLE] 8/N delete mtp_proposer" (#7030 ) Reverts vllm-project/vllm-ascend#7016 It breaks E2E test - vLLM version: v0.16.0 - vLLM main: `4034c3d32e`	2026-03-06 11:24:05 +08:00
Zetong Li	a60e179c7f	[Refactor][EAGLE] 8/N delete mtp_proposer (#7016 ) ### What this PR does / why we need it? This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and glm5, now it should be ok to remove mtp_proposer. The bug is actually about unnecessary slicing of `slot_mapping`. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-03-06 09:10:57 +08:00
SILONG ZENG	bd571cf6d6	[Main2Main] Upgrade vLLM to 0303 (#6944 ) ### What this PR does / why we need it? break: - https://github.com/vllm-project/vllm/pull/34102 Disable_full param replaced with valid_modes/invalid_modes API - https://github.com/vllm-project/vllm/pull/35503 Now must return float compilation_time - https://github.com/vllm-project/vllm/pull/35564 New sequence_lengths param added - https://github.com/vllm-project/vllm/pull/33807 A check was performed (if runner_backend != "auto") - https://github.com/vllm-project/vllm/pull/34861 `BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to check process group state - https://github.com/vllm-project/vllm/pull/35274 Important change: - https://github.com/vllm-project/vllm/pull/28672 `matcher_utils` directly accesses `torch.ops._C.*` during the import phase. In the Ascend environment, some unregistered ops trigger `AttributeError`, causing e2e initialization failure. https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323 https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29 This PR adds temporary compatibility placeholders (rms_norm, fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant, silu_and_mul) to `vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to ensure no crashes during the import phase. Upstream repairs will be considered later. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: Claude Code <noreply@anthropic.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com>	2026-03-06 09:08:52 +08:00
Cao Yi	50441e4650	[BugFix][MTP] Fix prefill misclassified as decode when prompt tokens == num_spec_tokens + 1 (#6835 ) ## Problem When MTP is enabled, prefill requests with `prompt_tokens == num_spec_tokens + 1` are incorrectly classified as decode requests, causing accuracy issues. ## Root Cause The `uniform_decode` condition only checked: - `max_num_scheduled_tokens == uniform_decode_query_len` - `num_tokens == max_num_scheduled_tokens * num_reqs` This is insufficient because a prefill request with specific prompt length satisfies these conditions as well. ## Fix Add `is_all_decode` check to ensure all requests have `num_computed_tokens > 0` before classifying as uniform decode, since decode requests must have computed at least one token. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-03-05 17:33:10 +08:00
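The fix described in #6835 can be sketched as follows. This is a simplified illustration under stated assumptions, not the model runner code: `is_uniform_decode` is a hypothetical function, per-request state is passed as plain lists, and the batch is assumed to be non-empty.

```python
def is_uniform_decode(num_scheduled_tokens, num_computed_tokens,
                      uniform_decode_query_len):
    """Classify a non-empty batch as uniform decode.

    The first two conditions are the original check from the PR text;
    `is_all_decode` is the added guard.
    """
    num_reqs = len(num_scheduled_tokens)
    max_num_scheduled_tokens = max(num_scheduled_tokens)
    num_tokens = sum(num_scheduled_tokens)
    # A decode request must already have computed at least one token.
    is_all_decode = all(n > 0 for n in num_computed_tokens)
    return (max_num_scheduled_tokens == uniform_decode_query_len
            and num_tokens == max_num_scheduled_tokens * num_reqs
            and is_all_decode)


# num_spec_tokens = 3, so uniform_decode_query_len = 4.
prefill_of_len_4 = is_uniform_decode([4], [0], 4)          # prompt of 4 tokens
real_decode = is_uniform_decode([4, 4], [10, 20], 4)       # two decode requests
```

Without the `is_all_decode` term, the 4-token prefill would satisfy both original conditions and be treated as decode.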
LI SHENGYONG	5a3744c542	[EPLB] The profiling can collect the time required for adjusting the eplb. (#7001 ) ### What this PR does / why we need it? To analyze the overhead of the dynamic eplb adjustment framework in detail, we added the time consumption of the adjustment to the print information in profiling mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-03-05_11-42-28](https://github.com/user-attachments/assets/41c2b82a-5dfa-4e39-8b50-f4649deed30c) - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-05 16:10:57 +08:00
wangxiyuan	13777bf3f0	[Spec Decode]clean up spec decode interface (#6947 ) This pull request refactors the speculative decoding proposer interface to align with upstream vLLM, removing the local `Proposer` interface and renaming methods to `propose`. This is the first step. In the future we should remove the class register and just add few Ascend specified method once the arch in vLLM is ready. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-05 14:30:10 +08:00
zhaomingyu13	52d9086f64	[Bugfix] Fix the acceptance rates drop issue when applying eagle3 to QuaRot model (#6914 ) ### What this PR does / why we need it? When using a target model after rotational quantization, the acceptance rate decreases because the fc weight of the draft model has not undergone rotational quantization (issue: #6445). We fix this by applying rotation quantization to the fc weight of the draft model, in the same way as the main model, when loading the draft model. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-03-04 11:29:49 +08:00
weiguihua2	5b05b3a090	[feat]ds3.2 pcp support mtp and chunkprefill (#6917 ) ### What this PR does / why we need it? ds3.2 pcp supports the combination of MTP and chunkprefill features. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-03-03 19:03:50 +08:00
realliujiaxu	5e24b26a54	[Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883 ) ### What this PR does / why we need it? PR #5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-01 20:22:50 +08:00
Bai Yongbin	9d09488b4a	[Feat] support basic pcp&dcp for qwen3next (#6091 ) ### What this PR does / why we need it? This PR implements Context Parallelism (CP) support for the Qwen3-Next model, including PCP (Prefill Context Parallelism) and DCP (Decode Context Parallelism). - vLLM version: v0.15.0 - vLLM main: `f176443446` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Co-authored-by: SunnyLee219 <3294305115@qq.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-02-28 21:44:08 +08:00
starmountain1997	5ffae03156	[bugfix] fix capture shape in sp_eagle_fullgraph (#6846 ) ### What this PR does / why we need it? This was meant to be merged in #6536, but I accidentally restored a commit. You can find the relevant discussion [here](https://github.com/vllm-project/vllm-ascend/pull/6536#issuecomment-3882883471). Since `self.pass_config.enable_sp` is forcibly set to `False` in the [source code](`f176443446/vllm/config/compilation.py (L1066)`), this section will no longer verify whether the generated cudagraph shapes are multiples of both `uniform_decode_query_len` (`num_speculative_tokens + 1`) and `tensor_parallel_size`. This PR enables the `num_speculative_tokens + 1` and `tensor_parallel_size` check upfront. Therefore, it won't silently round up the `cudagraph_size` and throw a cryptic error for the user. A typical example of this cryptic error looks like: ``` ValueError: could not broadcast input array from shape (196,) into shape (14,) ``` ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? Have passed all test. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com> Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: lilinsiman <lilinsiman@gmail.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-28 17:30:02 +08:00
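The upfront check described in #6846 can be sketched as follows. This is a minimal illustration and not the code from the PR: the function name `validate_capture_sizes` and its arguments are hypothetical, and it assumes that "multiples of both" means divisible by `uniform_decode_query_len` and by `tensor_parallel_size`, which is the same as being divisible by their least common multiple.

```python
from math import lcm


def validate_capture_sizes(capture_sizes, num_speculative_tokens, tensor_parallel_size):
    """Hypothetical helper: fail early instead of silently rounding capture sizes."""
    # One decode step handles the accepted token plus every speculative token.
    uniform_decode_query_len = num_speculative_tokens + 1
    # A size divisible by both values is divisible by their least common multiple.
    required_multiple = lcm(uniform_decode_query_len, tensor_parallel_size)
    invalid = [size for size in capture_sizes if size % required_multiple != 0]
    if invalid:
        raise ValueError(
            f"capture sizes {invalid} must be multiples of {required_multiple} "
            f"(uniform_decode_query_len={uniform_decode_query_len}, "
            f"tensor_parallel_size={tensor_parallel_size})"
        )
    return required_multiple
```

Raising a descriptive error at configuration time is the design choice the PR message argues for: the user sees which sizes are wrong, rather than a broadcast shape mismatch later during graph capture.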
drslark	5666ce03f5	[bugfix] Fixed an accuracy problem of gdn layer in graph (#6822 ) ### What this PR does / why we need it? Outputs are random when a model with GDN attention runs in graph mode: ```python prompts = [ "1. Who are you?", ] sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32) sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5) llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, distributed_executor_backend="mp", gpu_memory_utilization=0.7, speculative_config={ "method": "qwen3_next_mtp", "num_speculative_tokens": 3, }, compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [8], }, max_model_len=4096, enable_prefix_caching=False) outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"{output.prompt_token_ids=}") print(f"{output.outputs[0].token_ids=}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Before applying this change, the output was: ```text output.prompt_token_ids=[16, 13, 10479, 525, 498, 30] output.outputs[0].token_ids=[3555, 323, 279, 1112, 279] Prompt: '1. Who are you?', Generated text: ' What and the... the' ``` After applying this change, the output is: ```text output.prompt_token_ids=[16, 13, 10479, 525, 498, 30] output.outputs[0].token_ids=[3555, 374, 697, 829, 30] Prompt: '1. Who are you?', Generated text: ' What is your name?' ``` Why does this change solve the problem? `query_start_loc` is now padded because of `fia`, but for `gdn-attention` the padded `query_start_loc` causes an accuracy problem. So we need an unpadded version of `query_start_loc`, named `gdn_query_start_loc`, for use in `gdn-attention`. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? As described above. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: drslark <slarksblood@qq.com>	2026-02-28 08:57:53 +08:00
lilinsiman	c13d90b766	[Refactor][EAGLE] 7/N Merged PCP and disable_padded interface (#6811 ) ### What this PR does / why we need it? [Refactor][EAGLE] 7/N Merged PCP and disable_padded interface into eagle_proposer.py This pull request refactors the speculative decoding mechanism by merging the Prefill Context Parallelism (PCP) and Multi-Token Prediction (MTP) functionality directly into eagle_proposer.py. The changes aim to improve the efficiency and correctness of distributed speculative decoding, in particular by enabling the Eagle feature to work with the disable_padded interface. This involves adjustments to attention metadata, input/output processing, and state management so that they operate correctly in parallel environments. 1. The PCP and MTP features are migrated to eagle_proposer.py 2. The Eagle and PCP features are integrated 3. The Eagle feature can use the disable_padded interface ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests and UT - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-02-27 16:06:56 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
realliujiaxu	5def28dcd3	[Feat]support sequence parallelism by pass for VL models (#5632 )	2026-02-27 08:27:41 +08:00
Li-Yongwen	2870f7c8ad	[Feat] Support routing replay (#6696 ) ### What this PR does / why we need it? [Feat] Support routing replay same as https://github.com/vllm-project/vllm-ascend/pull/6666 resubmit because of DOC failure ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: liyongwen <1310439159@qq.com> Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-26 10:22:47 +08:00
bowenli	e3927cc8f5	[Bugfix] fix bug for mtp (#6514 ) ### What this PR does / why we need it? fix(mtp): resolve MTP core bugs and enhance eager mode test cases 1. Resolved critical issues in eager mode MTP core execution logic; 2. Fixed functional bugs in the _update_states_after_model_execute function; 3. Updated and released test_mtp_qwen3_next.py to validate eager mode acceptance rate. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>	2026-02-25 17:50:57 +08:00
Canlin Guo	ad9d9569ea	[Bugfix] Add the missing parentheses to @torch.inference_mode (#6757 ) ### What this PR does / why we need it? This PR fixes a bug in `vllm_ascend/worker/model_runner_v1.py` where the `@torch.inference_mode` decorator was used without parentheses. Using the decorator without instantiation is deprecated and may not correctly disable gradient calculations, leading to performance degradation and increased memory usage during inference. This change adds the required parentheses to ensure `torch.inference_mode` is applied correctly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The change is a minor syntax correction. Existing CI tests should cover this. - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-02-25 14:37:53 +08:00
Shanshan Shen	957804df56	[Refactor][Bugfix] Use upstream `mem_utils` for profiling and correct non-torch memory recorded during profiling (#6625 ) ### What this PR does / why we need it? 1. Following https://github.com/vllm-project/vllm/pull/32322, use the `memory_profiling` context manager from vllm for profiling. 2. Fix the wrong non-torch memory value recorded during profiling, which is not its peak during inference. --- More details about point 2: After profiling, the non-torch memory value we recorded is lower than that in real inference. This is mainly because of the different memory management behaviour of `torch.cuda.empty_cache()` and `torch.npu.empty_cache()`. `torch.cuda.empty_cache()` only recycles the unused memory in the pytorch memory pool (i.e., memory managed by the pytorch caching allocator), with no effect on non-torch memory. `torch.npu.empty_cache()`, however, has a totally different memory management mechanism, i.e., it may call `aclrtSynchronize` and enable the Ascend runtime to free up non-torch memory. Thus, the non-torch memory value we recorded after `torch.npu.empty_cache()` is much lower than its peak during profiling. Resolution: We record the peak non-torch memory value (`non_torch_memory_before_empty_cache`) after profiling, but before `torch.npu.empty_cache()`. Then, we add the diff (`non_torch_memory_cleared_by_empty_cache = non_torch_memory_before_empty_cache - self.non_torch_memory`) to non-torch memory when calculating available KV cache memory, which leads to less KV cache memory (i.e., it is safer and avoids OOM issues). --- > [!NOTE] > This PR needs to wait for main2main aligning to latest vllm commit before merging. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? Before this PR, the non-torch memory we used to calculate available KV cache memory is 0.90 G, whereas its peak during real inference is 1.08 G, diff: 182.00 M. After this PR, we add this diff to non-torch memory after profiling and thus make the profiling results more accurate. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-02-25 14:28:08 +08:00
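The correction described in #6625 is simple arithmetic, sketched below. This is an illustration under stated assumptions and not the worker code itself: the function `available_kv_cache_memory` and the parameters `requested_memory` and `non_kv_cache_memory` are hypothetical names, all values are in bytes, and clamping the diff at zero is an added safety assumption. Only `non_torch_memory_before_empty_cache`, `non_torch_memory`, and `non_torch_memory_cleared_by_empty_cache` come from the PR message.

```python
GiB = 1024**3
MiB = 1024**2


def available_kv_cache_memory(
    requested_memory,
    non_kv_cache_memory,
    non_torch_memory_before_empty_cache,
    non_torch_memory,
):
    """Hypothetical sketch: shrink the KV cache budget by the non-torch memory
    that empty_cache() released after profiling (all values in bytes)."""
    # Non-torch memory that was in use at the profiling peak but freed afterwards.
    non_torch_memory_cleared_by_empty_cache = max(
        non_torch_memory_before_empty_cache - non_torch_memory, 0
    )
    # Count the cleared amount as used, so the KV cache budget is conservative.
    return (
        requested_memory
        - non_kv_cache_memory
        - non_torch_memory_cleared_by_empty_cache
    )


# Illustrative numbers only (not measurements from the PR).
budget = available_kv_cache_memory(60 * GiB, 20 * GiB, 1100 * MiB, 900 * MiB)
print(budget == 40 * GiB - 200 * MiB)
```

The effect is that the budget reflects the peak non-torch usage seen during profiling rather than the lower value left after the cache is emptied.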
LI SHENGYONG	ff29e029de	[EPLB][Bugfix] Bugfix for ineffective dynamic eplb (#6653 ) ### What this PR does / why we need it? #6043 deleted the forward_before phase of the dynamic eplb. Currently, the end-to-end precision is monitored in the UT, and the log is not printed in the key place. As a result, the eplb does not take effect and is not intercepted. 1. The forward_before function is added back. 2. Delete unnecessary logs and add key logs. 3. Warm-up of algorithm 3 is added. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-02-10_15-57-31](https://github.com/user-attachments/assets/03813e5f-3d19-42d8-8118-76223afe8298) #### The conversation is normal. Okay, the user is asking, \"What is deep learning?\" I need to explain this in a clear and concise way. Let me start by recalling what I know about deep learning. It's a subset of machine learning, right? So first, I should mention that it's part of machine learning, which itself is a branch of AI. Then, the key aspect of deep learning is the use of neural networks with multiple layers. These are called deep neural networks.\n\nWait, I should define neural networks first. Maybe start with the basics. A neural network is inspired by the human brain, with layers of nodes (neurons) that process data. But deep learning specifically refers to networks with many layers—hence \"deep.\" So the term \"deep\" comes from the number of layers. \n\nI should explain how deep learning works. It involves training these networks on large datasets, allowing them to automatically learn features from the data. Unlike traditional machine learning, where you might have to manually extract features, deep learning models can do this automatically. That's a key point. For example, in image recognition, a deep learning model can learn to detect edges, shapes, and then more complex patterns without human intervention.\n\nApplications are important too. 
The user might want to know where deep learning is used. Common examples include image and speech recognition, natural language processing, autonomous vehicles, and recommendation systems. Maybe mention specific technologies like self-driving cars using computer vision or virtual assistants like Siri or Alexa - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-24 14:43:04 +08:00
SILONG ZENG	e2237819a9	[CI]Fixed the spell check function in `typos.toml` (#6753 ) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-14 11:57:26 +08:00
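The difference between the character class and the group in #6753 can be reproduced with a short sketch. Several assumptions apply: Python's `re` module stands in for the regex engine that `typos` uses, the `\|` shown in the listing is taken to be an escaped `|`, the helper `is_ignored` is hypothetical, and the group is written in its balanced form `(UE4M3|ue4m3)`, since the fixed pattern as listed carries a stray `]`.

```python
import re

# Character class: matches ONE character out of U, E, 4, M, 3, |, u, e, m.
char_class_pattern = re.compile(r".[UE4M3|ue4m3].")
# Group with alternation: matches the whole token UE4M3 or ue4m3.
group_pattern = re.compile(r".(UE4M3|ue4m3).")


def is_ignored(pattern, identifier):
    """Hypothetical stand-in for the ignore check: True if the pattern matches anywhere."""
    return pattern.search(identifier) is not None


for word in ["recieve", "float8_ue4m3_t"]:
    print(word, is_ignored(char_class_pattern, word), is_ignored(group_pattern, word))
```

With the character class, a misspelling such as `recieve` is skipped by the spell check simply because it contains an `e` between two other characters; with the group, only identifiers that contain the full `ue4m3` or `UE4M3` token are ignored.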
Angazenn	c0c2eb614e	[Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450 ) ### What this PR does / why we need it? This PR extends original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as inputs. This fully aligns to vLLM RoPE api interface. Compared with earlier implementation for RoPE, the benefits are: 1. avoiding pre-computation of `cos` `sin` before model execution, which helps to remove redundant codes. 2. allowing eagle3 draft model to have different rope parameters with main model (see #6612 ). This help to recover accept rate && accuracy in that case. In addition, this kernel change only introduces very small performance degradation. Those `index_select` or `chunk` operations are now changed into simple memory access in triton kernel (For example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81). Highlights - RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling. - Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations. - Custom Operation Registration: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation. - Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
- vLLM version: v0.13.0 - vLLM main: `5326c89803` Additional test on Qwen3-235b accuracy: \| Aime2024 \| GSM8K \| Livecodebench \| \| -------- \| -------- \| -------- \| \| 83.33 \| 96.26 \| 70.23 \| --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-02-11 21:20:53 +08:00
LI SHENGYONG	34eecacace	[EPLB] Avoiding eplb's dependency on a specified model (#6528 ) ### What this PR does / why we need it? 1. Currently, eplb registers different attributes for different models, but these attributes are not actually used. Now, these attributes are directly deleted. 2. Add some log about eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Deepseek v3.1 chat Of course! Here is a comprehensive explanation of deep learning, broken down for clarity.\n\n### The Simple Analogy: A Child Learning to Recognize a Cat\n\nImagine teaching a child what a cat is. You don't give them a rulebook with instructions like \"has pointy ears, whiskers, and a tail.\" Instead, you show them many pictures, saying \"this is a cat\" or \"this is not a cat.\" The child's brain gradually learns to identify the complex patterns—the combination of shapes, colors, and textures—that define \"cat-ness.\"\n\nDeep learning is essentially this, but for computers. It's a method for teaching computers to learn from examples and recognize patterns directly from data (like images, sound, or text) without being explicitly programmed with rigid rules.\n\n---\n\n### The Technical Definition\n\nDeep Learning is a subfield of machine learning, which itself is a subfield of artificial intelligence (AI). It uses artificial neural networks with many layers (\"deep\" networks) to model and understand complex patterns in data.\n\nHere are the key concepts in that definition:\n\n1. Artificial Intelligence (AI): The broad science of making machines smart and capable of performing tasks that typically require human intelligence.\n2. Machine Learning (ML): A subset of AI that gives computers the ability to learn from data without being explicitly programmed for every single rule.\n3. 
Deep Learning (DL): A specific, powerful - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-10 15:58:44 +08:00


432 Commits