Commit Graph

2626 Commits

Author SHA1 Message Date
pppeng
a457d0f0e8 [doc] Upload doc for qwen3.5-27B and qwen3.5-397B-A17B on Ascend (#7313)
### What this PR does / why we need it?
Upload doc for qwen3.5-27B and qwen3.5-397B-A17B on Ascend
Based on vllm-ascend v0.17.0rc1.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: pppeng <zepengliu912@qq.com>
Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
2026-03-17 22:54:57 +08:00
asunxiao
a370dfa962 [bugfix]Enable dispatch_ffn_combine feature for qwen3.5 (#7066)
### What this PR does / why we need it?
Enable the dispatch_ffn_combine fusion operator for Qwen3.5 MoE.

Problem fixed: in the w8a8 quantization scenario, the Qwen3.5 model's config.json
lacks the quantize field. The previous logic strictly required
quant_type == "w8a8_dynamic" to honor VLLM_ASCEND_ENABLE_FUSED_MC2,
so the dispatch_ffn_combine fusion operator failed to activate
even when the environment variable was set.
This PR also enables the dispatch_ffn_combine fusion operator for BF16 scenarios.
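
For illustration, a minimal sketch of the relaxed gating described above, with a hypothetical helper name and env-var truthiness convention (the actual vllm-ascend code is not shown in this log):

```python
import os
from typing import Optional

def should_enable_fused_mc2(quant_type: Optional[str]) -> bool:
    """Hypothetical sketch: honor VLLM_ASCEND_ENABLE_FUSED_MC2 for both the
    w8a8_dynamic path and the unquantized BF16 path (quant_type is None),
    instead of strictly requiring quant_type == "w8a8_dynamic"."""
    if os.getenv("VLLM_ASCEND_ENABLE_FUSED_MC2", "0") != "1":  # truthy convention is an assumption
        return False
    return quant_type in ("w8a8_dynamic", None)
```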

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: asunxiao <asunxiao@qq.com>
2026-03-17 19:53:02 +08:00
aipaes
83ad14c74c [bugfix] fix unzip file path for fia operator (#7367)
### What this PR does / why we need it?
The decompression path of the FIA operator package was incorrect, and
unnecessary folders were created by the earlier change.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
2026-03-17 17:21:27 +08:00
rjg-lyh
7669963c27 [Perf] Optimize bias handling in AscendRMSNorm (#7226)
### What this PR does / why we need it?
This PR optimizes bias handling in `AscendRMSNorm` without changing the intended functional behavior.

In the current implementation, bias may be initialized for `AscendRMSNorm` based on configuration-level detection, even though some norm layers never actually load a bias weight. This can cause the inference path to enter the bias branch and execute an unnecessary `add_` operator.

To improve this, this PR introduces a loader-based flag to record whether the bias has actually been loaded. The bias addition is then executed only when the bias is truly present.

This optimization reduces redundant computation in inference and makes the bias application logic better aligned with the actual model weights.
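
A minimal sketch of the loader-based flag idea, with hypothetical class and attribute names (not the actual `AscendRMSNorm` code):

```python
import torch
from torch import nn

class RMSNormWithOptionalBias(nn.Module):
    """Sketch: apply the bias add only when a bias weight was actually loaded."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.bias_loaded = False  # flipped by the weight loader below

    def load_bias(self, loaded_bias: torch.Tensor) -> None:
        # The weight loader records that a real bias existed in the checkpoint.
        self.bias.data.copy_(loaded_bias)
        self.bias_loaded = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        x = x * self.weight
        if self.bias_loaded:  # skip the redundant add when no bias was loaded
            x = x + self.bias
        return x
```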

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-17 16:53:28 +08:00
lilinsiman
8f278fc101 [eagle3][pcp] fix bug for eagle3 and cp enable (#7309)
### What this PR does / why we need it?
This PR fixes the bug, introduced by the parallel speculative inference PR,
that occurs when eagle3 and CP are enabled together.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
e2e tests and unit tests

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-03-17 16:14:45 +08:00
lidenghui1110
4e62a2ae15 [Bugfix] fix TransposeKvCacheByBlock op error report in plog (#7235)
### What this PR does / why we need it?

As reported in issue #7201, some TransposeKvCacheByBlock-related ERRORs appear
in the plog when vLLM launches. Although they do not affect vLLM's operation,
such ERRORs are confusing during debugging; this PR fixes the problem as
suggested in the issue.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-03-17 10:08:32 +08:00
pichangping
3f39ac9c8d [Feature]Supports DSv3.1 PD separation and C8 quantization (#7222)
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>

### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints: 
1. Only the PD separation mode is supported, and MooncakeLayerwiseConnector can
be used to run the model.
2. Currently, only the activation values use dynamic quantization, while the KV
cache uses static quantization. C8 quantization with MTP is not supported. You
can use ModelSlim for quantization; the procedure is as follows:
```
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> \
  --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json \
  --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json \
  --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
```

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-03-16 22:49:05 +08:00
drslark
a6f6e919e6 [main][bugfix] Fixed the problem that eagle3 will crash in FULL_DECODE_ONLY (#7290)
### What this PR does / why we need it?
Two problems have been solved in this PR.
These problems occur in `FULL_DECODE_ONLY` mode, where `num_tokens` should be
padded to a value in `cudagraph_capture_sizes`.

1. We found the length of `seq_lens_list` in the drafter's `attn_metadata`
is 1 shorter than expected. It raises a kernel exception that makes vllm crash.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But
`seq_lens_list` = [5742, 4700, 7996], which is not padded.

2. Though the length of `seq_lens_list` in the target's `attn_metadata` is
as expected in `FULL_DECODE_ONLY`, some data are corrupted at
the end of the list.
e.g., `num_reqs` = 3, `cudagraph_capture_sizes` = [20],
`actual_seq_lengths_q` is padded well to [4, 8, 12, 20]. But
`seq_lens_list` = [5742, 4700, 7996, 5738], which is corrupted at the end
of the list.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-16 20:41:36 +08:00
LVYANGGUO
b1a78886a9 [xlite][Bugfix] Support mrope and deepstack features in xlite backend (#7295)
### What this PR does / why we need it?
This PR fixes a bug in the Xlite
backend (https://atomgit.com/openeuler/GVirt/issues/3).

This PR adds support for the mrope (multimodal RoPE) and deepstack features
in the xlite backend. These features are necessary for running certain
multimodal models that utilize them.

The main changes include:
- Updating `_build_model_config` to parse mrope and deepstack
configurations from the model's `hf_config`.
- Modifying `XliteWrapper.__call__` to handle `deepstack_input_embeds`
and mrope positions during the model forward pass.
- Replacing `ModelAttnMeta` with the newer `AttnMeta` to accommodate the
new metadata fields required by these features.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
online server config:

```
python -m vllm.entrypoints.openai.api_server \
--model /mnt/nvme0n1/models/checkpoint-8200 \
--additional-config='{"xlite_graph_config": {"enabled": true}}'  \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 8192 \
--max-num-seqs=20 \
--block-size 128 \
--max-model-len 8192 \
--trust-remote-code \
--served-model-name Qwen3-VL-8B \
--host localhost \
--generation-config vllm \
--port 6777
```

test_config:
```
vllm bench serve \
--max-concurrency ${maxconcurrency} \
--num-prompts ${num_prompts} \
--host ${HOST} \
--port ${PORT} \
--model ${MODEL_NAME} \
--dataset-name random \
--backend openai-chat \
--random-input-len 512 \
--random-output-len 512  \
--random-range-ratio 0.2 \
--temperature 0.6 \
--metric-percentiles "50,90,99" \
--tokenizer ${TOKENIZER_PATH} \
--endpoint /v1/chat/completions \
--ignore-eos
```

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: LVYANGGUO <lvyangguo@huawei.com>
Co-authored-by: LVYANGGUO <lvyangguo@huawei.com>
2026-03-16 17:05:52 +08:00
wangx700
22d0e1d3d7 [model_runner_v2]optimize the performance of the _topk_log_softmax_kernel (#7221)
### What this PR does / why we need it?
Optimize the performance of the Triton operator `_topk_log_softmax_kernel`
in model_runner_v2 to 1.04x H100, which is 7% of its original value (issue
https://github.com/vllm-project/vllm-ascend/issues/5208).

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangx700 <wangxin700@huawei.com>
2026-03-16 16:49:10 +08:00
rjg-lyh
4d443b9228 [bugfix] restore pr-7029 and fix patch error (#7294)
### What this PR does / why we need it?
This PR restores #7029, which adds W8A8C8 support for dsv3.2/glm5 using
the `lightning_indexer_quant` ops in the pd-mix stage.

The original PR was reverted by #7288 because the patch did not work
with the recompute scheduler.

This PR also fixes the patching issue so that it works correctly with
the recompute scheduler.

### Does this PR introduce _any_ user-facing change?
Yes. To enable LI C8, users need to set the `enable_sparse_c8` option to
`"true"` in `additional_config`, as sketched below.
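
A hedged usage sketch via the offline API, assuming `additional_config` is accepted by the `LLM` entrypoint the same way `--additional-config` is on the command line (model path and parallel size are placeholders):

```python
from vllm import LLM

# Sketch: turn on the lightning-indexer C8 path through additional_config.
llm = LLM(
    model="/path/to/dsv3.2-w8a8c8",   # placeholder weight path
    tensor_parallel_size=8,           # placeholder parallel size
    additional_config={"enable_sparse_c8": "true"},
)
```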

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-16 15:39:42 +08:00
zhaomingyu13
9320365dab [Test][Feature] Add e2e test for QuaRot model with eagle3 (#7128)
### What this PR does / why we need it?
Add an e2e test for the QuaRot model with eagle3 that runs both the QuaRot
model and the float model, and then compares their acceptance rates. The
QuaRot model was adapted for eagle3 in PRs #6914 and #7038.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-03-16 15:35:55 +08:00
LICO67373
71c21f76f5 [Refactor] Replace npu_ring_mla with FIA in MLA prefill (#5704)
### What this PR does / why we need it?

**Refactor: Replace npu_ring_mla with FIA in MLA prefill**

This PR refactors the MLA (Multi-head Latent Attention) prefill implementation
by replacing `npu_ring_mla` with `npu_fused_infer_attention_score` (FIA)
operator, unifying the attention backend with the standard attention
implementation.

**Key changes:**

1. **Core prefill refactoring (`mla_v1.py`)**
- Replace `npu_ring_mla` with `npu_fused_infer_attention_score` in
`_forward_prefill` and `_compute_prefill_context`
   - Use TND layout with `softmax_lse_flag=True` for prefill attention
- Use `npu_attention_update` to merge multiple chunk outputs with LSE
(Log-Sum-Exp)
- Change `attn_mask` from `get_final_mla_mask()` to
`get_splitfuse_attn_mask()` for FIA compatibility

2. **Data type handling**
- Add automatic float16 → bfloat16 conversion (FIA with TND layout only
supports bfloat16)
   - Convert output back to original dtype after FIA computation

3. **Metadata optimization**
   - Pre-calculate `actual_seq_lengths_q` in `AscendMLAPrefillMetadata`
- Pre-calculate `chunk_actual_seq_lengths_kv_list` in
`ChunkedContextMetadata`
- Move `torch.cumsum` operations from forward pass to metadata building
phase

4. **CP compatibility (`mla_cp.py`)**
- Add `_ring_mla_mask_builder` to get `npu_ring_mla`-compatible masks
for Context Parallel scenarios
- Add `chunk_actual_seq_lengths_kv_list` field to
`CPChunkedContextMetadata`

**Why we need it:**
- **Backend unification**: Aligns MLA prefill with standard attention
implementation (`attention_v1.py`)
- **Better chunked context support**: FIA + `npu_attention_update`
provides native LSE-based output merging (a torch sketch of this merge follows this list)
- **Future compatibility**: Prepares for eventual `npu_ring_mla` removal
across the codebase
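
As referenced above, a minimal torch sketch of an LSE-based merge of partial attention outputs from two KV chunks; this shows the math only, not the signature of `npu_attention_update`:

```python
import torch

def merge_chunk_outputs(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention outputs using their log-sum-exp statistics.

    out_*: [num_tokens, num_heads, head_dim] partial outputs per KV chunk
    lse_*: [num_tokens, num_heads] log-sum-exp of each chunk's softmax denominator
    Returns the merged output and LSE, as if attention had seen both chunks at once.
    """
    lse_max = torch.maximum(lse_a, lse_b)
    w_a = torch.exp(lse_a - lse_max)   # renormalization weights, numerically stable
    w_b = torch.exp(lse_b - lse_max)
    denom = w_a + w_b
    merged = (out_a * w_a.unsqueeze(-1) + out_b * w_b.unsqueeze(-1)) / denom.unsqueeze(-1)
    merged_lse = lse_max + torch.log(denom)
    return merged, merged_lse
```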

### Does this PR introduce _any_ user-facing change?

**No.** This is a pure refactoring with no functional changes - same
behavior, unified backend.

---
- Related issue: #5463 (item 7)
- vLLM version: v0.14.1

Signed-off-by: lico67373 <918688502@qq.com>
2026-03-16 10:33:09 +08:00
Mengqing Cao
e20f0b1a0d [ReleaseNote] Add release note for v0.17.0rc1 (#7240)
### What this PR does / why we need it?
This pull request adds the release notes for `v0.17.0rc1`. It also
updates version numbers across various documentation files, including
`README.md`, `README.zh.md`,
`docs/source/community/versioning_policy.md`, and `docs/source/conf.py`
to reflect the new release.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
2026-03-15 22:47:47 +08:00
pppeng
7e85f2ff97 [CI] Add test_qwen3_5.py (#7133)
### What this PR does / why we need it?
Add test_qwen3_5.py for base scenarios tp4 on Qwen3.5-27B and
Qwen3.5-35B-A3B.

- vLLM version: main
- vLLM main:
4034c3d32e
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 22:19:02 +08:00
Mengqing Cao
0c299f79b9 Revert "[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)" (#7288)
### What this PR does / why we need it?
This reverts commit 7ed9e9de69, which
introduced an issue where the patch does not work when the recompute scheduler
is enabled.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-15 20:19:09 +08:00
yupeng
29f195a91c [Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156)
### What this PR does / why we need it?
Fix the error reported while initializing the qwen3-reranker-0.6b model
with `--enable-lora`, and add a testcase to verify the fix.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-15 17:55:42 +08:00
Qiu
7daccf4b64 Perf(PP): support PP with async send/recv. (#7143)
### What this PR does / why we need it?
Following up on PR https://github.com/vllm-project/vllm/pull/33368, this
PR provides async send/recv support for PP in vllm-ascend.

---
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-15 09:45:09 +08:00
Angazenn
ce5544bfc1 [Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103)
### What this PR does / why we need it?
To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly
follows the design in
[#30877](https://github.com/vllm-project/vllm/pull/30877) and inherits
changes to functions which are overridden in vLLM-Ascend.

Note:
1. `--mamba-cache-mode align` together with PD disaggregation is not yet
supported in vLLM v0.17.0 (see
https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of the hybrid kv cache might result in a very
large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B
with `-tp 2`, the block_size is adjusted to 2048, which means that any
prefix shorter than 2048 will never be cached. Although this behavior is
consistent with vLLM, it still needs improvement in the future.
3. `--mamba-cache-mode align` requires copying mamba states during
forward steps. vLLM uses a Triton kernel to implement this, but the
original version runs into some bugs on Ascend hardware, so we patch in a
new Triton kernel to avoid them.

### Does this PR introduce _any_ user-facing change?
To use the mamba prefix cache, set `--enable-prefix-caching` and
`--mamba-cache-mode align`. Note that the mamba state copy function (see
[do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132))
does not provide a torch-native version, so it may cause trouble for users
who cannot use Triton.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-15 09:44:09 +08:00
bazingazhou233-hub
c69291eefc [Doc] Add USE_MODELSCOPE_HUB=0 to lm-eval guide (#7279)
## Summary
- Add `USE_MODELSCOPE_HUB=0` to both Online and Offline lm-eval sections
- Add explanatory notes about Docker containers launching with
`VLLM_USE_MODELSCOPE=True`

The Docker containers set `VLLM_USE_MODELSCOPE=True`, which causes
lm-eval to download datasets from ModelScope instead of HuggingFace,
resulting in "Repo not exists" errors. Setting `USE_MODELSCOPE_HUB=0`
disables this behavior.

Fixes #607
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
2026-03-14 22:41:02 +08:00
bazingazhou233-hub
9e6c547d98 [Doc] Replace deprecated full_cuda_graph with cudagraph_mode in Qwen2.5-Omni (#7286)
## Summary
- Replace `full_cuda_graph: 1` with `cudagraph_mode: FULL_DECODE_ONLY`
in both single-NPU and multi-NPU examples
- `full_cuda_graph` is deprecated and falls back to `NONE` on NPU

Fixes #4696
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>
2026-03-14 22:38:36 +08:00
NJX
bb506a1c99 [Doc][Installation] Clarify SOC_VERSION for CPU-only source builds (#7278)
### What this PR does / why we need it?
- Clarify that `SOC_VERSION` must be set when building from source in a
CPU-only environment where `npu-smi` is unavailable.
- Add concrete `SOC_VERSION` examples (A2/A3/300I/A5) and point users to
`Dockerfile*` defaults.
- Improve the `setup.py` error message so users get actionable guidance
when `SOC_VERSION` is missing (see the sketch below).

Fixes #6816.
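
For illustration only, a sketch of the kind of guard the improved error message corresponds to; this is hypothetical, not the actual `setup.py` code:

```python
import os
import shutil

def resolve_soc_version() -> str:
    """Hypothetical sketch: fall back to the SOC_VERSION env var on CPU-only hosts."""
    soc = os.getenv("SOC_VERSION")
    if soc:
        return soc
    if shutil.which("npu-smi") is None:
        raise RuntimeError(
            "npu-smi is unavailable (CPU-only build environment). "
            "Please set SOC_VERSION explicitly before building from source; "
            "see the Dockerfile* defaults for A2/A3/300I/A5 for example values."
        )
    return "detected-from-npu-smi"  # on NPU hosts, query npu-smi instead
```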

### Does this PR introduce _any_ user-facing change?
- Yes. Documentation is updated and the build-time error message is more
informative.

### How was this patch tested?
- (Local) Syntax check: `python -m compileall setup.py`.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: NJX-njx <3771829673@qq.com>
2026-03-14 22:38:25 +08:00
DreamerLeader
199df03524 [BugFix]Fix CI errors “ascend_transport.so: cannot open shared object file: No such file or directory” (#7242)
### What this PR does / why we need it?
Conditional Import for Mooncake: The import of
mooncake.engine.TransferEngine was moved into a try-except block within
the GlobalTE class's constructor. This ensures that mooncake is only
imported when needed and provides a clear error message with
installation instructions if it's missing.
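
A minimal sketch of the conditional-import pattern described above, with a hypothetical wrapper class standing in for `GlobalTE`:

```python
class GlobalTransferEngine:
    """Sketch: import mooncake lazily so hosts without it can still import this module."""

    def __init__(self) -> None:
        try:
            # Importing here, instead of at module top level, avoids the
            # "ascend_transport.so: cannot open shared object file" failure
            # on machines where mooncake is not installed.
            from mooncake.engine import TransferEngine
        except ImportError as e:
            raise ImportError(
                "mooncake is required for this KV transfer path; "
                "please install it before enabling the connector."
            ) from e
        self.engine = TransferEngine()  # constructor args omitted in this sketch
```
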
### Does this PR introduce _any_ user-facing change?
The error message "ascend_transport.so: cannot open shared object file:
No such file or directory" in the CI is fixed to ensure the normal
running of the CI.

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
2026-03-14 21:23:05 +08:00
Mengqing Cao
e7aa2c285c [SpecDecode] Fix Draft model proposer (#7230)
### What this PR does / why we need it?
This PR fixes the unified draft parallel feature.
1. In the draft model proposer, the target model can have more than one
attention layer, so the assertion on the layer number is removed.
2. We should get the block size through `draft_attn_groups` instead of
`attn_metadata_builder` after 0.17.0.
3. `attn_update_stack_num_spec_norm` shouldn't be done when unified
draft parallel is enabled.

### How was this patch tested?
Test pass with
`tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_parallel_drafting_acceptance`,
which is already included in CI

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-14 18:26:37 +08:00
Hexiang Wang
0ad52517a1 Revert "Refactor quantization layer name mapping to leverage vLLM built-in mappers" (#7237)
Reverts vllm-project/vllm-ascend#7050, which breaks kimi-k2.5 and
qwen-omni.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
2026-03-14 00:05:54 +08:00
Cao Yi
5ec610e832 [Feature][Quant] Reapply auto-detect quantization format and support remote model ID (#7111)
### What this PR does / why we need it?
Reapply the auto-detect quantization format feature (originally in
#6645, reverted in #6873) and extend it to support remote model
identifiers (e.g., `org/model-name`).

Changes:
- Reapply auto-detection of quantization method from model files
(`quant_model_description.json` for ModelSlim, `config.json` for
compressed-tensors)
- Add `get_model_file()` utility to handle file retrieval from both
local paths and remote repos (HuggingFace Hub / ModelScope)
- Update `detect_quantization_method()` to accept remote repo IDs with
optional `revision` parameter
- Update `maybe_update_config()` to work with remote model identifiers
- Add platform-level `auto_detect_quantization` support
- Add unit tests and e2e tests for both local and remote model ID
scenarios

Closes #6836

### Does this PR introduce _any_ user-facing change?

Yes. When `--quantization` is not explicitly specified, vllm-ascend will
now automatically detect the quantization format from the model files
for both local directories and remote model IDs.
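
A hedged sketch of the detection idea for a local model directory (illustrative only, not the actual `detect_quantization_method()` implementation; remote-repo file retrieval via `get_model_file()` is omitted):

```python
import json
import os
from typing import Optional

def sketch_detect_quantization_method(model_dir: str) -> Optional[str]:
    """Illustrative: infer the quantization format from files in a local model dir."""
    # ModelSlim-quantized checkpoints ship a quant_model_description.json.
    if os.path.exists(os.path.join(model_dir, "quant_model_description.json")):
        return "ascend"  # name of the Ascend/ModelSlim method is an assumption here

    # compressed-tensors checkpoints declare themselves in config.json.
    config_path = os.path.join(model_dir, "config.json")
    if os.path.exists(config_path):
        with open(config_path) as f:
            quant_cfg = json.load(f).get("quantization_config", {})
        if quant_cfg.get("quant_method") == "compressed-tensors":
            return "compressed-tensors"

    return None  # nothing detected; the caller keeps the user-specified value
```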

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2026-03-13 22:53:25 +08:00
Junyuan
6852a2e267 [feat] add LMCacheAscendConnector (#6882)
### What this PR does / why we need it?

LMCache-Ascend is LMCache's solution on the Ascend platform and one of
the KVCache pooling solutions for Ascend. We hope to integrate
LMCache-Ascend into the vLLM-Ascend community as one of the official
KVCache pooling solutions for vLLM-Ascend.

We added a new LMCacheAscendConnector in vLLM-Ascend and registered it.

### Does this PR introduce _any_ user-facing change?

Users can specify the kvconnector using `--kv-transfer-config`, allowing
them to freely choose which kvconnector to use, without any user-facing
change.

### How was this patch tested?

Test by specifying `--kv-transfer-config
'{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'`
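
For completeness, an offline-API sketch equivalent to the CLI flag above, assuming the `KVTransferConfig` fields mirror the JSON keys; LMCache-Ascend itself must be installed for the connector to load:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Sketch: select the Ascend LMCache connector for KV-cache pooling.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheAscendConnector",
        kv_role="kv_both",
    ),
)
```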

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: chloroethylene <jjysama@gmail.com>
2026-03-13 17:41:35 +08:00
Mengqing Cao
986cd45397 [Version] Drop 0.16.0 support (#7153)
### What this PR does / why we need it?
Drop 0.16.0 support in main
- Fix eagle proposer break introduced by
https://github.com/vllm-project/vllm/pull/34552. Mainly change to use
the draft attention group to initialize the attention metadata builder.
- Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes`
error, which is a bug in vLLM v0.17.0, and fixed by a later pr
https://github.com/vllm-project/vllm/pull/30515

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-13 16:14:15 +08:00
rjg-lyh
7ed9e9de69 [Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029)
### What this PR does / why we need it?
This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops
in pd-mix stage mainly.

Because the code for the current PD-disaggregated scenario is still
under refactoring and cleanup, this PR prioritizes ensuring the C8
functionality in the pd-mix scenario.

The next steps are planned as follows:
① Once the optimized scatter operator is updated, we will replace the
original operator to improve the performance of storing k_scale.
② Once the code logic for the PD-disaggregated scenario becomes stable,
we will carry out more comprehensive validation and make appropriate
adaptations.
③ Because enabling C8 currently introduces several new operators whose
performance still needs improvement, performance may regress in some
scenarios. Therefore, only after all the operators are fully ready can
we ensure that this feature does not cause any performance degradation.
At that point, we will enable this feature by default and remove the
switch in `additional_config`.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
2026-03-13 14:47:42 +08:00
kx
df1ee8070d [feat][spec decode]Unified draft parallel (#6766)
### What this PR does / why we need it?
Implement unified parallelized speculative decoding in vLLM Ascend,
which can simultaneously support parallel speculative inference
schemes such as PARD, P-Eagle, etc. Refer to
https://github.com/vllm-project/vllm-ascend/pull/6565 and
https://github.com/vllm-project/vllm-ascend/pull/4078.

### How was this patch tested?

run with parallel drafting script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
--speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method":
"draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'

base script:
export target=/model/Llama-3.1-8B-Instruct
export draft=/model/PARD-Llama-3.2-1B
export CUDA_VISIBLE_DEVICES=6
export ASCEND_RT_VISIBLE_DEVICES=6
vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811

benchmark script:
MAX_CONCURRENCY=1
NUM_PROMPTS=80
vllm bench serve --port 8811 \
    --temperature 0 \
    --model /model/Llama-3.1-8B-Instruct \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts ${NUM_PROMPTS} \
    --max-concurrency ${MAX_CONCURRENCY} \
    --seed 1234

test results :
base(without spec decode): TTFT 79.46ms TPOT 26.99ms
output_tokens_throughput 36.75 tok/s
this pr(with parallel drafting): TTFT 72.24ms TPOT 13.45ms
output_tokens_throughput 72.98 tok/s
per-position acceptance(from position 0 to 7):
79.48%、56.93%、40%、27.90%、19.79%、14.25%、10.57%、7.61%.

----------------------------------------------------------------------
run on qwen3 model script:
```
export target=/model/Qwen3-1.7B
export draft=/model/PARD-Qwen3-0.6B
export CUDA_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=1

vllm serve $target \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --no-enable-prefix-caching \
  --port 8811 \
  --speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}'
```

cc  @NickJudyHvv
- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Co-authored-by: 01267596 <xiongkai123@cmbchina.com>
2026-03-13 14:07:35 +08:00
pppeng
6ee7ffb98a Add Qwen3_5 to model list (#7130)
### What this PR does / why we need it?
This PR adds new models such as Qwen3.5-35B-A3B and Qwen3.5-27B to the model
list for testing.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
2026-03-13 11:42:28 +08:00
Qiu
c377e73933 Perf(PP): support PP with async scheduling. (#7136)
### What this PR does / why we need it?
Following up on PR https://github.com/vllm-project/vllm/pull/32618, this
PR provides async scheduling support for PP in vllm-ascend.
---
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-03-13 10:27:23 +08:00
Ronald
c980e68d40 [Feature] support aclgraph for model runner v2 (#7110)
### What this PR does / why we need it?
This PR aims to support aclgraph for model runner v2, please see RFC
#5208. The PR contains these modifications:
- adapt to newest commit of vllm main branch.
- supply a unified interface of extra forward context for both model
runner v1 and model runner v2.
- implement graph mode for main model. 

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-13 09:11:46 +08:00
Li Wang
1f71da80eb [CI] Fix server start failure when long weight loading (#7098)
### What this PR does / why we need it?
When loading large models (e.g., 163 shards), weight loading can exceed
the default 600s timeout, and engine startup then fails with the error:
```shell
TimeoutError: Timed out waiting for engines to send initial message on input socket.
```
We should increase `VLLM_ENGINE_READY_TIMEOUT_S` to avoid this.
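
For reference, a hedged offline example of raising the timeout before the engine starts; the 1800-second value is purely illustrative:

```python
import os

# Give slow weight loading (e.g. 163 shards) more headroom before the
# "Timed out waiting for engines..." error fires; 1800 is an example value.
os.environ["VLLM_ENGINE_READY_TIMEOUT_S"] = "1800"

from vllm import LLM  # imported after setting the env var, to be safe

llm = LLM(model="/path/to/large-model")  # placeholder path
```
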
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-13 08:52:56 +08:00
Li Wang
7fe0469e27 [CI][Misc] Use offline mode for model downloads (#7179)
### What this PR does / why we need it?
1. For all parts of the current test modules involving model downloads, add
the `local_files_only` parameter to specify offline mode; this ensures that CI
will not fail due to network instability.
2. Install modelscope from a fixed commit until its next release.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
 check if the env or arg `local_files_only` works
1) set the env:
```shell
export HF_HUB_OFFLINE=1
```
2) run the script
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()

model="Qwen/Qwen3-0.6B"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
it works well:
```shell
2026-03-06 06:40:12,546 - modelscope - WARNING - We can not confirm the cached file is for revision: master
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
{'architectures': ['Qwen3ForCausalLM'], 'attention_bias': False, 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'head_dim': 128, 'hidden_act': 'silu', 'hidden_size': 1024, 'initializer_range': 0.02, 'intermediate_size': 3072, 'max_position_embeddings': 40960, 'max_window_layers': 28, 'model_type': 'qwen3', 'num_attention_heads': 16, 'num_hidden_layers': 28, 'num_key_value_heads': 8, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000, 'sliding_window': None, 'tie_word_embeddings': True, 'torch_dtype': 'bfloat16', 'transformers_version': '4.51.0', 'use_cache': True, 'use_sliding_window': False, 'vocab_size': 151936, '_commit_hash': None}
```
3) test the model repo does not cached locally when the env
`HF_HUB_OFFLINE`==True
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()


model="FireRedTeam/FireRed-OCR"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
and the result is as expected:
```shell
  File "/workspace/demo.py", line 12, in <module>
    config_dict, _ = PretrainedConfig.get_config_dict(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 189, in patch_get_config_dict
    model_dir = get_model_dir(pretrained_model_name_or_path,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 164, in get_model_dir
    model_dir = snapshot_download(
                ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 137, in snapshot_download
    return _snapshot_download(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 283, in _snapshot_download
    raise ValueError(
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable look-ups and downloads online, set 'local_files_only' to False
```
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-13 08:52:24 +08:00
zxr2333
fe4cad24e9 [BugFix]fix qwen3.5 reshape_kvcache bug (#7209)
### What this PR does / why we need it?

This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the
Mamba cache for models like Qwen3.5. The previous implementation did not
correctly handle cases where the KV cache tensors have different data
types. This change ensures that slicing is performed based on byte
offsets before reshaping the tensors, which correctly handles
heterogeneous dtypes.
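
A small torch sketch of the byte-offset idea, with illustrative names: when one flat buffer backs tensors of different dtypes, slice boundaries must be computed in bytes (on a uint8 view) before each slice is reinterpreted and reshaped:

```python
import torch

def split_mixed_dtype_buffer(flat_u8: torch.Tensor, specs):
    """Illustrative: carve tensors of different dtypes out of one flat uint8 buffer.

    flat_u8: 1-D uint8 buffer holding all regions back to back
    specs:   list of (shape, dtype) pairs describing the consecutive regions
    """
    outputs, offset = [], 0
    for shape, dtype in specs:
        numel = 1
        for dim in shape:
            numel *= dim
        nbytes = numel * torch.tensor([], dtype=dtype).element_size()
        region = flat_u8[offset:offset + nbytes]         # slice by BYTE offset
        outputs.append(region.view(dtype).view(*shape))  # reinterpret dtype, then reshape
        offset += nbytes
    return outputs

buf = torch.zeros(2 * 4 * 4 + 3 * 2 * 2, dtype=torch.uint8)  # fp32[2,4] + fp16[3,2]
kv_a, kv_b = split_mixed_dtype_buffer(buf, [((2, 4), torch.float32),
                                            ((3, 2), torch.float16)])
```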

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By CI.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-12 23:51:40 +08:00
drizzlezyk
5fe7942bbd [CI] add action for issue labeler on issue open/edit (#7208)
### What this PR does / why we need it?
New workflow file `bot_issue_manage.yaml`:
- Automatically runs when issues are opened or edited
- Uses the official GitHub Issue Labeler action to categorize issues

Label configuration `issue-labeler.yml`:
- Defines regex patterns for model-specific labels (310p, GLM5, Qwen 3.5, DeepSeek, Kimi K2, Kimi K2.5)
- Enables automatic issue classification based on title/content matching

### Does this PR introduce _any_ user-facing change?
No. This PR only introduces internal GitHub Actions workflow and
configuration changes. There are no API, interface, or behavior changes
visible to end users. It purely improves the issue management process on
GitHub.

### How was this patch tested?
- GitHub Actions workflow syntax is valid and follows the official
GitHub documentation
- The issue labeler action (github/issue-labeler@v3.4) is a
well-maintained official GitHub action
- Configuration file follows the expected YAML format for the
issue-labeler action
- Regex patterns for model names have been verified for correct syntax

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: drizzlezyk <drizzlezyk@163.com>
2026-03-12 20:16:17 +08:00
wangbj127
0c659e91ed [MTP][Bugfix] Fix GLM5-W8A8 precision issues caused by rotary quant MTP weights (#7139)
### What this PR does / why we need it?
When the GLM5 target model uses rotary quant, the final hidden states passed
to MTP need an extra rotary applied.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>
2026-03-12 20:01:24 +08:00
drslark
de93790d08 [main][bugfix] Fixed the problem of drafter crashed in FULL mode (#7158)
### What this PR does / why we need it?

The merged graph of the drafter in `FULL` mode is currently broken.

This PR fixes it.

Also, `actual_seq_lengths_q` in `model_runner` was found to be redundant, so
it is removed.

It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and
https://github.com/vllm-project/vllm-ascend/pull/7148.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

Test code is shown as below:

```python
prompts = [
    "1.Who are you?",
    "2. Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
    model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    max_num_seqs=32,
    # enforce_eager=True,
    disable_log_stats=False,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    async_scheduling=True,

    speculative_config={
        "enforce_eager": True,
        "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
        "disable_padded_drafter_batch": False,
        "method": "eagle3",
        "num_speculative_tokens": 3,
    },
    
    compilation_config={
        "cudagraph_mode": "FULL",
        "cudagraph_num_of_warmups": 1,
    },

    max_model_len=4096, 
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
```

The result before:

```text
   File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia
     graph_params.events[num_tokens].append(event)
     ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
 KeyError: 132
```

The result after:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 242
num_draft_tokens: 726
num_accepted_tokens: 156
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.07
```

We also test `FULL_DECODE_ONLY` mode.

The result is:

```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 244
num_draft_tokens: 732
num_accepted_tokens: 155
mean acceptance length: 1.64
--------------------------------------------------
acceptance at token 0: 0.42
acceptance at token 1: 0.16
acceptance at token 2: 0.06
```

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: drslark <slarksblood@qq.com>
2026-03-12 18:38:50 +08:00
Li Wang
88c56e3bf2 [Misc] Fix main lint to make CI happy (#7204)
### What this PR does / why we need it?
Fix the lint failure caused by the merging of a previous PR.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 18:27:48 +08:00
Li Wang
0a171b5cdd [Test][BugFix] Fix dispatch_gmm_combine_decode test stability (#7097)
### What this PR does / why we need it?
This patch fixes the nightly failure:
1. Each case uses a copy of the global kwargs instead of a reference, to
prevent parameter pollution between test cases (see the sketch below).
2. Add weight initialization in the `eplb` + `w8a8_dynamic` scenario.
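
A tiny sketch of the first point, with hypothetical names: every parametrized case should mutate its own copy of the shared kwargs, never the shared dict itself:

```python
import copy

GLOBAL_KWARGS = {"dtype": "bfloat16", "quant": None}  # shared defaults (illustrative)

def build_case_kwargs(**overrides):
    # Deep-copy so one case's overrides cannot leak into the next case.
    kwargs = copy.deepcopy(GLOBAL_KWARGS)
    kwargs.update(overrides)
    return kwargs

case_a = build_case_kwargs(quant="w8a8_dynamic")
case_b = build_case_kwargs()            # still sees quant=None
assert GLOBAL_KWARGS["quant"] is None   # the shared defaults stay untouched
```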

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```python
pytest -sv tests/e2e/nightly/single_node/ops/multicard_ops_a3/test_dispatch_gmm_combine_decode.py
```

```shell
===================================================================== 3 passed, 4 warnings in 194.86s (0:03:14) ======================================================================
```
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 17:22:44 +08:00
Li Wang
d866e6b238 [Bugfix] Fixed permission issues with the automatic PR submission workflow (#7142)
### What this PR does / why we need it?
Auto submit a pull request via
https://github.com/vllm-ascend-ci/vllm-ascend, the workflow looks like:
1. get a new config.yaml via run e2e tests
2. push the changed `config.yaml` to a new branch of
https://github.com/vllm-ascend-ci/vllm-ascend
3. submit a pull request to vllm-ascend via gh cli
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-12 17:18:59 +08:00
Shaoxu Cheng
e5343d6eb3 [310P][Bugfix]: fix ngram graph replay accuracy error (#7134)
### What this PR does / why we need it?
On the 310P device, when running ACLGraph together with the n-gram
speculative decoding algorithm, both graph capture and graph replay
require `uniform_decode_query_len` and do not depend on
`attention_state`. This leads to a rather interesting and unexpected
issue on 310P: during decode-only, execution does **not** enter the
graph, while in the split-fuse state (that is, the chunked prefill
state), it instead enters graph execution directly.

The issue can be resolved by forcibly setting `uniform_decode_query_len`
to `1`, so that 310P captures only the decode-only graph, and replay is
then controlled through `attention_state`.

### Does this PR introduce _any_ user-facing change?
NO

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-12 17:08:08 +08:00
Ronald
bfd049aa2c [Lint] fix typos error in epd_load_balance_proxy_layerwise_server_example.py (#7199)
### What this PR does / why we need it?
This PR fixes a typo in two function names in the
`epd_load_balance_proxy_layerwise_server_example.py` example script. The
function names `aquire_aborted_pd_requests` and
`aquire_aborted_prefiller_requests` were misspelled and have been
corrected to `acquire_aborted_pd_requests` and
`acquire_aborted_prefiller_requests` respectively. This improves code
readability and correctness.

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-12 17:04:38 +08:00
tfhddd
21fea86b08 feat: [CI] Introduce uv to accelerate pip install (#7127)
### What this PR does / why we need it?
Integrates uv: Significantly accelerates pip install execution and
resolves concurrency issues caused by traditional pip caching
mechanisms.

Why pip install uc-manager is explicitly added:
This project depends on uc-manager. However, installing it via uv pip
install uc-manager currently fails due to a known issue. An issue has
already been filed with the upstream uv repository to address this.
Consequently, we explicitly invoke pip install uc-manager as a temporary
workaround to ensure the build succeeds.
https://github.com/ModelEngine-Group/unified-cache-management/issues/736

Why `UV_SYSTEM_PYTHON=1` is used:
No virtual environment has been created yet, so this configuration has the
same effect as directly using `pip install`.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: tfhddd <2272751277@qq.com>
2026-03-12 16:47:23 +08:00
shaopeng-666
592661e787 [Doc] EPD doc and load-balance proxy example (#6221)
Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2026-03-12 16:17:17 +08:00
无脸男
09d26754cd [Bugfix] Fix the issue where no exception is thrown when graph capture fails. (#5644)
### What this PR does / why we need it?

Fix the issue where no exception is thrown when graph capture fails.


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: WithHades <244036962@qq.com>
2026-03-12 16:14:45 +08:00
xleoken
77b43492ae improve the ttft when use mooncake (#6125)
### What this PR does / why we need it?
Improve the performance of mooncake by changing the log level from info to
debug.
### ENV
2P + 4D, EP

1. benchmark script
```
evalscope perf \
  --parallel 512 \
  --number 1024 \
  --model deepseek \
  --url http://localhost:9000/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 2 \
  --min-tokens 2 \
  --prefix-length 0 \
  --min-prompt-length 512 \
  --max-prompt-length 512 \
  --tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814  \
  --extra-args '{"ignore_eos": true}' \
  --rate 2
```

2. before patch
```
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  209.484  |
+-----------------------------------+-----------+
| Number of concurrency             |  512      |
+-----------------------------------+-----------+
| Request rate (req/s)              |    6      |
+-----------------------------------+-----------+
| Total requests                    | 1024      |
+-----------------------------------+-----------+
| Succeed requests                  | 1022      |
+-----------------------------------+-----------+
| Failed requests                   |    2      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |    9.7573 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2507.62   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    4.8786 |
+-----------------------------------+-----------+
| Average latency (s)               |    7.0561 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    5.7444 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.3117 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.3117 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request |    2      |
+-----------------------------------+-----------+
2026-01-22 14:56:32 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.6062  | 0.5113  |  0.5113  |    1.234    |     512      |       2       |     0.0888     |    22.8338    |
|     25%     |  0.7248  | 0.5639  |  0.5639  |   1.4114    |     512      |       2       |      0.2       |    51.3919    |
|     50%     |  0.9092  | 0.7748  |  0.7748  |   1.6767    |     512      |       2       |     1.1935     |   306.7171    |
|     66%     |  1.0745  | 1.0345  |  1.0345  |   3.1308    |     512      |       2       |     1.3395     |   344.2495    |
|     75%     |  7.0812  | 1.5389  |  1.5389  |   10.0016   |     512      |       2       |     1.417      |   364.1808    |
|     80%     | 10.6944  | 1.8552  |  1.8552  |   13.3717   |     512      |       2       |     1.4778     |   379.7911    |
|     90%     | 19.2342  | 2.4325  |  2.4326  |   22.5105   |     512      |       2       |     1.6208     |   416.5381    |
|     95%     | 24.4399  | 2.8289  |  2.8289  |   26.0329   |     512      |       2       |     1.7548     |   450.9942    |
|     98%     | 45.0941  | 3.4098  |  3.4098  |   45.6287   |     512      |       2       |     1.8193     |   467.5476    |
|     99%     | 46.2786  | 3.8492  |  3.8492  |   46.9282   |     512      |       2       |     1.8576     |   477.4157    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```

3. after patch
```
Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  191.613  |
+-----------------------------------+-----------+
| Number of concurrency             |  512      |
+-----------------------------------+-----------+
| Request rate (req/s)              |    6      |
+-----------------------------------+-----------+
| Total requests                    | 1024      |
+-----------------------------------+-----------+
| Succeed requests                  | 1024      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   10.6882 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2746.87   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    5.3441 |
+-----------------------------------+-----------+
| Average latency (s)               |    2.0407 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.7989 |
+-----------------------------------+-----------+
| Average time per output token (s) |    1.2419 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    1.2419 |
+-----------------------------------+-----------+
| Average input tokens per request  |  512      |
+-----------------------------------+-----------+
| Average output tokens per request |    2      |
+-----------------------------------+-----------+
2026-01-22 15:10:31 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.5727  | 0.5051  |  0.5051  |   1.1761    |     512      |       2       |     1.0368     |   266.4696    |
|     25%     |  0.6497  | 0.5324  |  0.5324  |   1.3159    |     512      |       2       |     1.1763     |   302.3184    |
|     50%     |  0.7767  | 0.6908  |  0.6908  |   1.4793    |     512      |       2       |     1.3521     |   347.4944    |
|     66%     |  0.8711  | 0.7912  |  0.7912  |   1.5916    |     512      |       2       |     1.4518     |   373.1092    |
|     75%     |  0.9125  | 0.8797  |  0.8797  |   1.7008    |     512      |       2       |     1.521      |   390.9018    |
|     80%     |  0.9381  | 0.9442  |  0.9442  |   1.7657    |     512      |       2       |     1.5749     |   404.7606    |
|     90%     |  0.994   | 1.0818  |  1.0818  |   1.9289    |     512      |       2       |     1.7006     |   437.0518    |
|     95%     |  1.0369  | 1.2454  |  1.2454  |   2.2154    |     512      |       2       |     1.7937     |   460.9731    |
|     98%     |  1.1237  | 18.8814 | 18.8814  |   19.4607   |     512      |       2       |     1.8755     |   482.0097    |
|     99%     |  1.6752  | 24.4406 | 24.4406  |   25.4734   |     512      |       2       |     1.907      |   490.0993    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```

---------

Signed-off-by: xleoken <xleoken@163.com>
2026-03-12 16:13:48 +08:00
Hexiang Wang
f244f3c4a9 [BugFix] Fix problem of extra processes on rank0 device (#7107)
### What this PR does / why we need it?
Currently, when tp>1, we have extra processes on the tp rank0 device, which
consume extra HBM memory. This is caused by running `import
torch_npu._inductor` before `set_device`, which triggers an extra
initialization of the device.
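
A hedged sketch of the import-ordering idea described above; the exact call sites in vllm-ascend are not shown here, so names and placement are assumptions:

```python
import torch

def init_npu_worker(local_rank: int) -> None:
    import torch_npu  # noqa: F401

    # Bind this worker to its own NPU *before* importing torch_npu._inductor,
    # so the deferred import does not initialize (and hold memory on) device 0.
    torch.npu.set_device(local_rank)
    import torch_npu._inductor  # noqa: F401  # safe now: device already selected
```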

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
All ci passed.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-12 15:59:03 +08:00
herizhen
e5024d0264 [doc] Add Ascend PyTorch Profiler section (#7117)
### What this PR does / why we need it?
add Ascend PyTorch Profiler section

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
- Documentation format checks
- Technical content validation
- Build verification
- Version compatibility
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: herizhen <1270637059@qq.com>
2026-03-12 15:51:00 +08:00