xc-llm-ascend

Author	SHA1	Message	Date
Hexiang Wang	0ad52517a1	Revert "Refactor quantization layer name mapping to leverage vLLM built-in mappers" (#7237 ) Reverts vllm-project/vllm-ascend#7050, which breaks kimi-k2.5 and qwen-omni. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e`	2026-03-14 00:05:54 +08:00
Cao Yi	5ec610e832	[Feature][Quant] Reapply auto-detect quantization format and support remote model ID (#7111 ) ### What this PR does / why we need it? Reapply the auto-detect quantization format feature (originally in #6645, reverted in #6873) and extend it to support remote model identifiers (e.g., `org/model-name`). Changes: - Reapply auto-detection of quantization method from model files (`quant_model_description.json` for ModelSlim, `config.json` for compressed-tensors) - Add `get_model_file()` utility to handle file retrieval from both local paths and remote repos (HuggingFace Hub / ModelScope) - Update `detect_quantization_method()` to accept remote repo IDs with optional `revision` parameter - Update `maybe_update_config()` to work with remote model identifiers - Add platform-level `auto_detect_quantization` support - Add unit tests and e2e tests for both local and remote model ID scenarios Closes #6836 ### Does this PR introduce _any_ user-facing change? Yes. When `--quantization` is not explicitly specified, vllm-ascend will now automatically detect the quantization format from the model files for both local directories and remote model IDs. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-03-13 22:53:25 +08:00
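The detection order described in the entry above can be sketched for a local model directory as follows. This is a minimal sketch, not the vllm-ascend implementation: the function name `sketch_detect_quant_format` and its return labels are hypothetical, and it assumes a compressed-tensors checkpoint declares `"quant_method": "compressed-tensors"` under `quantization_config` in `config.json`. Remote repo IDs are not handled here.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def sketch_detect_quant_format(model_dir: str) -> Optional[str]:
    """Guess the quantization format from files in a local model directory.

    Hypothetical helper for illustration only; labels are not vLLM values.
    """
    root = Path(model_dir)
    # ModelSlim checkpoints ship a quant_model_description.json file.
    if (root / "quant_model_description.json").is_file():
        return "modelslim"
    # Assumption: compressed-tensors is declared inside config.json.
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        quant_cfg = config.get("quantization_config") or {}
        if quant_cfg.get("quant_method") == "compressed-tensors":
            return "compressed-tensors"
    # Nothing recognized: the caller would fall back to unquantized loading.
    return None


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_cfg = {"quantization_config": {"quant_method": "compressed-tensors"}}
    (Path(demo_dir) / "config.json").write_text(json.dumps(demo_cfg), encoding="utf-8")
    demo_result = sketch_detect_quant_format(demo_dir)
```

Checking the ModelSlim description file first means a directory carrying both files resolves to ModelSlim. That ordering is a choice made for this sketch.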
Junyuan	6852a2e267	[feat] add LMCacheAscendConnector (#6882 ) ### What this PR does / why we need it? LMCache-Ascend is LMCache's solution on the Ascend platform and one of the KVCache pooling solutions for Ascend. We hope to integrate LMCache-Ascend into the vLLM-Ascend community as one of the official KVCache pooling solutions for vLLM-Ascend. We added a new LMCacheAscendConnector in vLLM-Ascend and registered it. ### Does this PR introduce _any_ user-facing change? Users can specify the kvconnector using `--kv-transfer-config`, allowing them to freely choose which kvconnector to use, without any user-facing change. ### How was this patch tested? Test by specifying `--kv-transfer-config '{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: chloroethylene <jjysama@gmail.com>	2026-03-13 17:41:35 +08:00
Mengqing Cao	986cd45397	[Version] Drop 0.16.0 support (#7153 ) ### What this PR does / why we need it? Drop 0.16.0 support in main. - Fix the eagle proposer breakage introduced by https://github.com/vllm-project/vllm/pull/34552, mainly by using the draft attention group to initialize the attention metadata builder. - Fix the "`ModelRunner` has no attribute `cudagraph_capture_sizes`" error, a bug in vLLM v0.17.0 that is fixed by a later PR, https://github.com/vllm-project/vllm/pull/30515 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-13 16:14:15 +08:00
rjg-lyh	7ed9e9de69	[Perf][1/N] w8a8c8 support in dsv3.2/glm5 (#7029 ) ### What this PR does / why we need it? This PR supports W8A8C8 in dsv3.2/glm5 with lightning_indexer_quant ops, mainly in the pd-mix stage. Because the code for the current PD-disaggregated scenario is still under refactoring and cleanup, this PR prioritizes ensuring the C8 functionality in the pd-mix scenario. Two next steps are planned: ① Once the optimized scatter operator is updated, we will replace the original operator to improve the performance of storing k_scale. ② Once the code logic for the PD-disaggregated scenario becomes stable, we will carry out more comprehensive validation and make appropriate adaptations. In addition, because enabling C8 currently introduces several new operators whose performance still needs improvement, performance may regress in some scenarios. Only after all the operators are fully ready can we ensure that this feature does not cause any performance degradation. At that point, we will enable this feature by default and remove the switch in `additional_config`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-13 14:47:42 +08:00
kx	df1ee8070d	[feat][spec decode]Unified draft parallel (#6766 ) ### What this PR does / why we need it? Implement unified parallelized speculative decoding in vLLM Ascend, which can simultaneously support parallel speculative inference schemes such as Pard, P-Eagle, etc. Refer to https://github.com/vllm-project/vllm-ascend/pull/6565 and https://github.com/vllm-project/vllm-ascend/pull/4078 ### How was this patch tested? Run with the parallel drafting script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Llama-3.2-1B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' base script: export target=/model/Llama-3.1-8B-Instruct export draft=/model/PARD-Llama-3.2-1B export CUDA_VISIBLE_DEVICES=6 export ASCEND_RT_VISIBLE_DEVICES=6 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 benchmark script: MAX_CONCURRENCY=1 NUM_PROMPTS=80 vllm bench serve --port 8811 \ --temperature 0 \ --model /model/Llama-3.1-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --dataset-path philschmid/mt-bench \ --num-prompts ${NUM_PROMPTS} \ --max-concurrency ${MAX_CONCURRENCY} \ --seed 1234 test results: base (without spec decode): TTFT 79.46ms, TPOT 26.99ms, output_tokens_throughput 36.75 tok/s; this PR (with parallel drafting): TTFT 72.24ms, TPOT 13.45ms, output_tokens_throughput 72.98 tok/s; per-position acceptance (from position 0 to 7): 79.48%, 56.93%, 40%, 27.90%, 19.79%, 14.25%, 10.57%, 7.61%. 
---------------------------------------------------------------------- Run on a qwen3 model with the script: export target=/model/Qwen3-1.7B export draft=/model/PARD-Qwen3-0.6B export CUDA_VISIBLE_DEVICES=1 export ASCEND_RT_VISIBLE_DEVICES=1 vllm serve $target \ --tensor-parallel-size 1 \ --max-model-len 4096 \ --no-enable-prefix-caching \ --port 8811 \ --speculative-config '{"model": "/model/PARD-Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 8, "parallel_drafting": true}' cc @NickJudyHvv - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: 01267596 <xiongkai123@cmbchina.com> Signed-off-by: kx <1670186653@qq.com> Signed-off-by: HF-001 <1670186653@qq.com> Co-authored-by: 01267596 <xiongkai123@cmbchina.com>	2026-03-13 14:07:35 +08:00
pppeng	6ee7ffb98a	Add Qwen3_5 to model list (#7130 ) ### What this PR does / why we need it? The pr aims to add new models like Qwen3.5-35B-A3B/Qwen3.5-27B to model list for testing. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>	2026-03-13 11:42:28 +08:00
Qiu	c377e73933	Perf(PP): support PP with async scheduling. (#7136 ) ### What this PR does / why we need it? Following up on https://github.com/vllm-project/vllm/pull/32618, this PR provides async scheduling support for PP in vllm-ascend. --- - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-13 10:27:23 +08:00
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
Li Wang	1f71da80eb	[CI] Fix server start failure when long weight loading (#7098 ) ### What this PR does / why we need it? When loading large models (e.g., 163 shards), weight loading can exceed the default 600s timeout, and engine startup then fails with the error: ```shell TimeoutError: Timed out waiting for engines to send initial message on input socket. ``` We should increase `VLLM_ENGINE_READY_TIMEOUT_S` to avoid it. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-13 08:52:56 +08:00
Li Wang	7fe0469e27	[CI][Misc] Use offline mode for model downloads (#7179 ) ### What this PR does / why we need it? 1. For all parts of the current test module involving model downloads, add the `local_files_only` parameter to specify offline mode; this ensures that CI will not fail due to network instability. 2. Install modelscope from a fixed commit until its next release. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Check whether the env or arg `local_files_only` works: 1) set the env: ```shell export HF_HUB_OFFLINE=1 ``` 2) run the script ```python from transformers import PretrainedConfig import huggingface_hub from modelscope.utils.hf_util import patch_hub patch_hub() model="Qwen/Qwen3-0.6B" kwargs = {} config_dict, _ = PretrainedConfig.get_config_dict( model, trust_remote_code=True, local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE, **kwargs, ) print(config_dict) ``` it works well: ```shell 2026-03-06 06:40:12,546 - modelscope - WARNING - We can not confirm the cached file is for revision: master The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored. 
{'architectures': ['Qwen3ForCausalLM'], 'attention_bias': False, 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'head_dim': 128, 'hidden_act': 'silu', 'hidden_size': 1024, 'initializer_range': 0.02, 'intermediate_size': 3072, 'max_position_embeddings': 40960, 'max_window_layers': 28, 'model_type': 'qwen3', 'num_attention_heads': 16, 'num_hidden_layers': 28, 'num_key_value_heads': 8, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000, 'sliding_window': None, 'tie_word_embeddings': True, 'torch_dtype': 'bfloat16', 'transformers_version': '4.51.0', 'use_cache': True, 'use_sliding_window': False, 'vocab_size': 151936, '_commit_hash': None} ``` 3) test a model repo that is not cached locally while the env `HF_HUB_OFFLINE` is set ```python from transformers import PretrainedConfig import huggingface_hub from modelscope.utils.hf_util import patch_hub patch_hub() model="FireRedTeam/FireRed-OCR" kwargs = {} config_dict, _ = PretrainedConfig.get_config_dict( model, trust_remote_code=True, local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE, **kwargs, ) print(config_dict) ``` and the result is as expected: ```shell File "/workspace/demo.py", line 12, in <module> config_dict, _ = PretrainedConfig.get_config_dict( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 189, in patch_get_config_dict model_dir = get_model_dir(pretrained_model_name_or_path, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 164, in get_model_dir model_dir = snapshot_download( ^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 137, in snapshot_download return _snapshot_download( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 283, in 
_snapshot_download raise ValueError( ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable look-ups and downloads online, set 'local_files_only' to False ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-13 08:52:24 +08:00
zxr2333	fe4cad24e9	[BugFix]fix qwen3.5 reshape_kvcache bug (#7209 ) ### What this PR does / why we need it? This PR fixes a bug in `reshape_kvcache_tensors` when reshaping the Mamba cache for models like Qwen3.5. The previous implementation did not correctly handle cases where the KV cache tensors have different data types. This change ensures that slicing is performed based on byte offsets before reshaping the tensors, which correctly handles heterogeneous dtypes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-12 23:51:40 +08:00
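The idea of slicing by byte offsets before reinterpreting each region with its own dtype can be shown with the standard library alone. This is a minimal sketch, not the `reshape_kvcache_tensors` code: the `carve` helper, the region names `conv_state` and `ssm_state`, and the chosen element formats are hypothetical, and `memoryview` stands in for device tensors.

```python
import struct


def carve(buffer, layout):
    """Split one raw byte buffer into typed, shaped views.

    layout is a list of (name, struct_format_char, shape) entries laid out
    back to back. Offsets are tracked in bytes, so regions with different
    element sizes cannot drift into each other.
    """
    raw = memoryview(buffer)
    views = {}
    offset = 0
    for name, fmt, shape in layout:
        count = 1
        for dim in shape:
            count *= dim
        nbytes = count * struct.calcsize(fmt)
        # Slice in bytes first, then reinterpret and reshape the slice.
        views[name] = raw[offset:offset + nbytes].cast(fmt, shape)
        offset += nbytes
    return views


# Hypothetical layout: a float32 region followed by a float64 region.
layout = [("conv_state", "f", (2, 3)), ("ssm_state", "d", (2,))]
buffer = bytearray(2 * 3 * 4 + 2 * 8)
struct.pack_into("6f", buffer, 0, 1, 2, 3, 4, 5, 6)
struct.pack_into("2d", buffer, 24, 0.5, 0.25)
views = carve(buffer, layout)
```

Computing offsets in elements of a single dtype would misplace the second region as soon as the element sizes differ, which is the failure mode the entry above describes.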
drizzlezyk	5fe7942bbd	[CI] add action for issue labeler on issue open/edit (#7208 ) ### What this PR does / why we need it? - New workflow file `bot_issue_manage.yaml`: automatically runs when issues are opened or edited, and uses the official GitHub Issue Labeler action to categorize issues. - Label configuration `issue-labeler.yml`: defines regex patterns for model-specific labels (310p, GLM5, Qwen 3.5, DeepSeek, Kimi K2, Kimi K2.5) and enables automatic issue classification based on title/content matching. ### Does this PR introduce _any_ user-facing change? No. This PR only introduces internal GitHub Actions workflow and configuration changes. There are no API, interface, or behavior changes visible to end users. It purely improves the issue management process on GitHub. ### How was this patch tested? - GitHub Actions workflow syntax is valid and follows the official GitHub documentation - The issue labeler action (github/issue-labeler@v3.4) is a well-maintained official GitHub action - Configuration file follows the expected YAML format for the issue-labeler action - Regex patterns for model names have been verified for correct syntax - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: drizzlezyk <drizzlezyk@163.com>	2026-03-12 20:16:17 +08:00
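Regex-based labeling of the kind described in the entry above can be approximated in a few lines. This is a sketch under assumptions: the patterns below are hypothetical stand-ins, since the actual expressions in `issue-labeler.yml` are not shown here, and `labels_for` is an illustrative helper rather than part of the GitHub action.

```python
import re

# Hypothetical label -> pattern table. The Kimi K2 pattern uses a negative
# lookahead so that "K2.5" does not also trigger the plain "K2" label.
LABEL_PATTERNS = {
    "310p": re.compile(r"\b310p\b", re.IGNORECASE),
    "GLM5": re.compile(r"\bglm[-_ ]?5\b", re.IGNORECASE),
    "Qwen 3.5": re.compile(r"\bqwen[-_ ]?3\.5\b", re.IGNORECASE),
    "DeepSeek": re.compile(r"\bdeepseek", re.IGNORECASE),
    "Kimi K2.5": re.compile(r"\bkimi[-_ ]?k2\.5\b", re.IGNORECASE),
    "Kimi K2": re.compile(r"\bkimi[-_ ]?k2(?!\.5)\b", re.IGNORECASE),
}


def labels_for(title, body=""):
    """Return the sorted labels whose pattern matches the title or body."""
    text = f"{title}\n{body}"
    return sorted(
        label for label, pattern in LABEL_PATTERNS.items() if pattern.search(text)
    )
```

For example, `labels_for("[Bug] Kimi K2.5 crashes on 310P")` yields the `310p` and `Kimi K2.5` labels but not `Kimi K2`.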
wangbj127	0c659e91ed	[MTP][Bugfix] Fix GLM5-W8A8 precision issues caused by rotary quant MTP weights (#7139 ) ### What this PR does / why we need it? When the GLM5 target model uses rotary quant, the final hidden states passed to MTP need an extra rotary applied. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Wangbingjie <wangbj1207@126.com> Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>	2026-03-12 20:01:24 +08:00
drslark	de93790d08	[main][bugfix] Fixed the problem of drafter crashed in FULL mode (#7158 ) ### What this PR does / why we need it? The merged graph of the drafter in `FULL` mode is currently broken; this PR fixes it. Also, `actual_seq_lengths_q` in `model_runner` was found to be redundant, so it is removed. It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 and https://github.com/vllm-project/vllm-ascend/pull/7148. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? The test code is shown below: ```python from vllm import LLM, SamplingParams prompts = [ "1.Who are you?", "2. Who are you?", ] sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200) llm = LLM( model="/home/some-model/Meta-Llama-3.1-8B-Instruct", tensor_parallel_size=1, max_num_seqs=32, # enforce_eager=True, disable_log_stats=False, distributed_executor_backend="mp", gpu_memory_utilization=0.7, async_scheduling=True, speculative_config={ "enforce_eager": True, "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B", "disable_padded_drafter_batch": False, "method": "eagle3", "num_speculative_tokens": 3, }, compilation_config={ "cudagraph_mode": "FULL", "cudagraph_num_of_warmups": 1, }, max_model_len=4096, enable_prefix_caching=False, ) outputs = llm.generate(prompts, sampling_params) ``` The result before: ```text File "/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py", line 575, in full_graph_fia graph_params.events[num_tokens].append(event) ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ KeyError: 132 ``` The result after: ```text -------------------------------------------------- total_num_output_tokens: 400 num_drafts: 242 num_draft_tokens: 726 num_accepted_tokens: 156 mean acceptance length: 1.64 -------------------------------------------------- acceptance at token 0: 0.42 acceptance at token 1: 0.16 acceptance at token 2: 0.07 ``` We also tested `FULL_DECODE_ONLY` mode. 
The result is: ```text -------------------------------------------------- total_num_output_tokens: 400 num_drafts: 244 num_draft_tokens: 732 num_accepted_tokens: 155 mean acceptance length: 1.64 -------------------------------------------------- acceptance at token 0: 0.42 acceptance at token 1: 0.16 acceptance at token 2: 0.06 ``` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-12 18:38:50 +08:00
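The reported mean acceptance length can be cross-checked against the draft counters. The sketch below assumes the metric is one token produced by the target model per draft step plus the accepted draft tokens per step. That formula is an assumption, although both result blocks above are consistent with it, and `mean_acceptance_length` is an illustrative helper name.

```python
def mean_acceptance_length(num_drafts: int, num_accepted_tokens: int) -> float:
    """Average tokens emitted per draft step: 1 target token + accepted drafts."""
    return 1 + num_accepted_tokens / num_drafts


# Counters copied from the FULL and FULL_DECODE_ONLY results above.
full = mean_acceptance_length(num_drafts=242, num_accepted_tokens=156)
full_decode_only = mean_acceptance_length(num_drafts=244, num_accepted_tokens=155)
print(f"{full:.2f} {full_decode_only:.2f}")  # prints "1.64 1.64"
```

Both modes round to 1.64, so under this assumption the two graph modes accept drafts at the same rate in this test.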
Li Wang	88c56e3bf2	[Misc] Fix main lint to make CI happy (#7204 ) ### What this PR does / why we need it? Fix the lint failure caused by merging a previous PR. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-12 18:27:48 +08:00
Li Wang	0a171b5cdd	[Test][BugFix] Fix dispatch_gmm_combine_decode test stability (#7097 ) ### What this PR does / why we need it? This patch fixes the nightly failure: 1. Each case uses a copy of the global kwargs instead of a reference, to prevent parameter pollution between test cases. 2. Add weight initialization in the `eplb` + `w8a8_dynamic` scenario. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ```shell pytest -sv tests/e2e/nightly/single_node/ops/multicard_ops_a3/test_dispatch_gmm_combine_decode.py ``` ```shell ===================================================================== 3 passed, 4 warnings in 194.86s (0:03:14) ====================================================================== ``` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-12 17:22:44 +08:00
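The parameter-pollution problem named in the entry above is easy to reproduce and avoid in isolation. This is a generic sketch, not the test file itself: `GLOBAL_KWARGS`, its keys, and `build_case_kwargs` are hypothetical names chosen for illustration.

```python
import copy

# Hypothetical shared defaults for a family of test cases.
GLOBAL_KWARGS = {"batch_size": 8, "quant": {"mode": "none"}}


def build_case_kwargs(**overrides):
    """Return a private deep copy of the defaults with per-case overrides."""
    kwargs = copy.deepcopy(GLOBAL_KWARGS)
    kwargs.update(overrides)
    return kwargs


# Case A mutates a nested value; the deep copy keeps that change local.
case_a = build_case_kwargs()
case_a["quant"]["mode"] = "w8a8_dynamic"

# Case B still sees the pristine defaults plus its own override.
case_b = build_case_kwargs(batch_size=16)
```

A shallow `dict(GLOBAL_KWARGS)` would not be enough here, because the nested `quant` dictionary would still be shared between cases.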
Li Wang	d866e6b238	[Bugfix] Fixed permission issues with the automatic PR submission workflow (#7142 ) ### What this PR does / why we need it? Automatically submit a pull request via https://github.com/vllm-ascend-ci/vllm-ascend. The workflow is: 1. get a new config.yaml by running e2e tests 2. push the changed `config.yaml` to a new branch of https://github.com/vllm-ascend-ci/vllm-ascend 3. submit a pull request to vllm-ascend via the gh CLI ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-12 17:18:59 +08:00
Shaoxu Cheng	e5343d6eb3	[310P][Bugfix]: fix ngram graph replay accuracy error (#7134 ) ### What this PR does / why we need it? On the 310P device, when running ACLGraph together with the n-gram speculative decoding algorithm, both graph capture and graph replay require `uniform_decode_query_len` and do not depend on `attention_state`. This leads to a rather interesting and unexpected issue on 310P: during decode-only, execution does not enter the graph, while in the split-fuse state (that is, the chunked prefill state), it instead enters graph execution directly. The issue can be resolved by forcibly setting `uniform_decode_query_len` to `1`, so that 310P captures only the decode-only graph, and replay is then controlled through `attention_state`. ### Does this PR introduce _any_ user-facing change? NO - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-12 17:08:08 +08:00
Ronald	bfd049aa2c	[Lint] fix typos error in epd_load_balance_proxy_layerwise_server_example.py (#7199 ) ### What this PR does / why we need it? This PR fixes a typo in two function names in the `epd_load_balance_proxy_layerwise_server_example.py` example script. The function names `aquire_aborted_pd_requests` and `aquire_aborted_prefiller_requests` were misspelled and have been corrected to `acquire_aborted_pd_requests` and `acquire_aborted_prefiller_requests` respectively. This improves code readability and correctness. Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-12 17:04:38 +08:00
tfhddd	21fea86b08	feat: [CI] Introduce uv to accelerate pip install (#7127 ) ### What this PR does / why we need it? Integrates uv: Significantly accelerates pip install execution and resolves concurrency issues caused by traditional pip caching mechanisms. Why `pip install uc-manager` is explicitly added: This project depends on uc-manager. However, installing it via `uv pip install uc-manager` currently fails due to a known issue. An issue has already been filed upstream to address this. Consequently, we explicitly invoke `pip install uc-manager` as a temporary workaround to ensure the build succeeds. https://github.com/ModelEngine-Group/unified-cache-management/issues/736 Why use `UV_SYSTEM_PYTHON: 1`: No virtual environment has been created yet, so this setting makes uv install into the system Python, which has the same effect as directly using `pip install`. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: tfhddd <2272751277@qq.com>	2026-03-12 16:47:23 +08:00
shaopeng-666	592661e787	[Doc] EPD doc and load-balance proxy example (#6221 ) Add EPD doc and load-balance proxy example - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2026-03-12 16:17:17 +08:00
无脸男	09d26754cd	[Bugfix] Fix the issue where no exception is thrown when graph capture fails. (#5644 ) ### What this PR does / why we need it? Fix the issue where no exception is thrown when graph capture fails. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: WithHades <244036962@qq.com>	2026-03-12 16:14:45 +08:00
xleoken	77b43492ae	improve the ttft when use mooncake (#6125 ) ### What this PR does / why we need it? improve performance of mooncake by change the log level from info to debug ### ENV 2P + 4D, EP 1. benchmark script ``` evalscope perf \ --parallel 512 \ --number 1024 \ --model deepseek \ --url http://localhost:9000/v1/chat/completions \ --api openai \ --dataset random \ --max-tokens 2 \ --min-tokens 2 \ --prefix-length 0 \ --min-prompt-length 512 \ --max-prompt-length 512 \ --tokenizer-path /tmp/DeepSeek-v3-0324-w8a8-0814 \ --extra-args '{"ignore_eos": true}' \ --rate 2 ``` 2. before patch ``` +-----------------------------------+-----------+ \| Key \| Value \| +===================================+===========+ \| Time taken for tests (s) \| 209.484 \| +-----------------------------------+-----------+ \| Number of concurrency \| 512 \| +-----------------------------------+-----------+ \| Request rate (req/s) \| 6 \| +-----------------------------------+-----------+ \| Total requests \| 1024 \| +-----------------------------------+-----------+ \| Succeed requests \| 1022 \| +-----------------------------------+-----------+ \| Failed requests \| 2 \| +-----------------------------------+-----------+ \| Output token throughput (tok/s) \| 9.7573 \| +-----------------------------------+-----------+ \| Total token throughput (tok/s) \| 2507.62 \| +-----------------------------------+-----------+ \| Request throughput (req/s) \| 4.8786 \| +-----------------------------------+-----------+ \| Average latency (s) \| 7.0561 \| +-----------------------------------+-----------+ \| Average time to first token (s) \| 5.7444 \| +-----------------------------------+-----------+ \| Average time per output token (s) \| 1.3117 \| +-----------------------------------+-----------+ \| Average inter-token latency (s) \| 1.3117 \| +-----------------------------------+-----------+ \| Average input tokens per request \| 512 \| +-----------------------------------+-----------+ \| Average 
output tokens per request \| 2 \| +-----------------------------------+-----------+ 2026-01-22 14:56:32 - evalscope - INFO: Percentile results: +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| Percentiles \| TTFT (s) \| ITL (s) \| TPOT (s) \| Latency (s) \| Input tokens \| Output tokens \| Output (tok/s) \| Total (tok/s) \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| 10% \| 0.6062 \| 0.5113 \| 0.5113 \| 1.234 \| 512 \| 2 \| 0.0888 \| 22.8338 \| \| 25% \| 0.7248 \| 0.5639 \| 0.5639 \| 1.4114 \| 512 \| 2 \| 0.2 \| 51.3919 \| \| 50% \| 0.9092 \| 0.7748 \| 0.7748 \| 1.6767 \| 512 \| 2 \| 1.1935 \| 306.7171 \| \| 66% \| 1.0745 \| 1.0345 \| 1.0345 \| 3.1308 \| 512 \| 2 \| 1.3395 \| 344.2495 \| \| 75% \| 7.0812 \| 1.5389 \| 1.5389 \| 10.0016 \| 512 \| 2 \| 1.417 \| 364.1808 \| \| 80% \| 10.6944 \| 1.8552 \| 1.8552 \| 13.3717 \| 512 \| 2 \| 1.4778 \| 379.7911 \| \| 90% \| 19.2342 \| 2.4325 \| 2.4326 \| 22.5105 \| 512 \| 2 \| 1.6208 \| 416.5381 \| \| 95% \| 24.4399 \| 2.8289 \| 2.8289 \| 26.0329 \| 512 \| 2 \| 1.7548 \| 450.9942 \| \| 98% \| 45.0941 \| 3.4098 \| 3.4098 \| 45.6287 \| 512 \| 2 \| 1.8193 \| 467.5476 \| \| 99% \| 46.2786 \| 3.8492 \| 3.8492 \| 46.9282 \| 512 \| 2 \| 1.8576 \| 477.4157 \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ ``` 3. 
after patch ``` Benchmarking summary: +-----------------------------------+-----------+ \| Key \| Value \| +===================================+===========+ \| Time taken for tests (s) \| 191.613 \| +-----------------------------------+-----------+ \| Number of concurrency \| 512 \| +-----------------------------------+-----------+ \| Request rate (req/s) \| 6 \| +-----------------------------------+-----------+ \| Total requests \| 1024 \| +-----------------------------------+-----------+ \| Succeed requests \| 1024 \| +-----------------------------------+-----------+ \| Failed requests \| 0 \| +-----------------------------------+-----------+ \| Output token throughput (tok/s) \| 10.6882 \| +-----------------------------------+-----------+ \| Total token throughput (tok/s) \| 2746.87 \| +-----------------------------------+-----------+ \| Request throughput (req/s) \| 5.3441 \| +-----------------------------------+-----------+ \| Average latency (s) \| 2.0407 \| +-----------------------------------+-----------+ \| Average time to first token (s) \| 0.7989 \| +-----------------------------------+-----------+ \| Average time per output token (s) \| 1.2419 \| +-----------------------------------+-----------+ \| Average inter-token latency (s) \| 1.2419 \| +-----------------------------------+-----------+ \| Average input tokens per request \| 512 \| +-----------------------------------+-----------+ \| Average output tokens per request \| 2 \| +-----------------------------------+-----------+ 2026-01-22 15:10:31 - evalscope - INFO: Percentile results: +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| Percentiles \| TTFT (s) \| ITL (s) \| TPOT (s) \| Latency (s) \| Input tokens \| Output tokens \| Output (tok/s) \| Total (tok/s) \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ \| 10% \| 0.5727 \| 0.5051 \| 
0.5051 \| 1.1761 \| 512 \| 2 \| 1.0368 \| 266.4696 \| \| 25% \| 0.6497 \| 0.5324 \| 0.5324 \| 1.3159 \| 512 \| 2 \| 1.1763 \| 302.3184 \| \| 50% \| 0.7767 \| 0.6908 \| 0.6908 \| 1.4793 \| 512 \| 2 \| 1.3521 \| 347.4944 \| \| 66% \| 0.8711 \| 0.7912 \| 0.7912 \| 1.5916 \| 512 \| 2 \| 1.4518 \| 373.1092 \| \| 75% \| 0.9125 \| 0.8797 \| 0.8797 \| 1.7008 \| 512 \| 2 \| 1.521 \| 390.9018 \| \| 80% \| 0.9381 \| 0.9442 \| 0.9442 \| 1.7657 \| 512 \| 2 \| 1.5749 \| 404.7606 \| \| 90% \| 0.994 \| 1.0818 \| 1.0818 \| 1.9289 \| 512 \| 2 \| 1.7006 \| 437.0518 \| \| 95% \| 1.0369 \| 1.2454 \| 1.2454 \| 2.2154 \| 512 \| 2 \| 1.7937 \| 460.9731 \| \| 98% \| 1.1237 \| 18.8814 \| 18.8814 \| 19.4607 \| 512 \| 2 \| 1.8755 \| 482.0097 \| \| 99% \| 1.6752 \| 24.4406 \| 24.4406 \| 25.4734 \| 512 \| 2 \| 1.907 \| 490.0993 \| +-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+ ``` --------- Signed-off-by: xleoken <xleoken@163.com>	2026-03-12 16:13:48 +08:00
Hexiang Wang	f244f3c4a9	[BugFix] Fix problem of extra processes on rank0 device (#7107 ) ### What this PR does / why we need it? Currently, when tp>1, there are extra processes on the tp rank0 device, which consume extra HBM memory. This is caused by `import torch_npu._inductor` running before set_device, which triggers an extra device initialization. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All CI passed. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-12 15:59:03 +08:00
herizhen	e5024d0264	[doc] Add Ascend PyTorch Profiler section (#7117 ) ### What this PR does / why we need it? Add an Ascend PyTorch Profiler section. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation format checks, technical content validation, build verification, and version compatibility. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: herizhen <1270637059@qq.com>	2026-03-12 15:51:00 +08:00
Mercykid-bash	132f3c5d0a	Support per-step heat collection and enhance FlashLB for multi-stage load balancing (#6477 ) # Feature: FlashLB algorithm ## Purpose This Pull Request enhances the EPLB (Expert Parallelism Load Balancing) system by introducing a novel load balancing algorithm: FlashLB. 1. The default algorithm adopts two separate sub-procedures to optimize expert replication and placement independently: a. Expert Replica Allotment Sub-procedure: Determines the number of replicas for all experts. At each step, it greedily adds one more replica to the expert with the highest per-replica load, aiming to minimize load skew at the expert replica granularity (Min Max Replica, MMR). b. Expert Replica Placement Sub-procedure: Distributes all replicas across devices. First, it sorts the generated replicas in descending order of hotness, then iteratively places the currently hottest replica onto the device with the lowest cumulative load and available slots. However, this simplistic combination of two separate procedures lacks synergy and often leads to sub-optimal load balancing. For example, in the simple scenario illustrated below: Given 8 logical experts with hotness values [600, 560, 120, 120, 20, 10, 10, 10], and 2 replicas allocated per device across 8 devices, the default EPLB algorithm results in a maximum per-device hotness of 232 (peak-average load ratio 1.28), while our proposed FlashLB algorithm reduces this value to 205 (peak-average load ratio 1.13). <figure><img src="https://github.com/user-attachments/assets/b9b10fab-651e-4524-9942-adbca8d044a4" width="90%"></figure> 2. The default algorithm simply aggregates hotness measurements across the entire profiling window. While this provides a coarse approximation of the hotness distribution, it fails to capture the time-phased variations and temporal correlations in expert hotness (both within and between experts) across iterations, phenomena that have been observed in real-world scenarios. 
Such single-point hotness estimation degrades the solution quality of the load balancing algorithm. 3. The default algorithm regularly recalculates updated expert placement results for all layers without discrimination. Considering that excessive expert updates can impact Service Level Objectives (SLOs), such full-scale redeployment leads to excessively high adjustment overhead, which negatively affects end-to-end performance. ## FlashLB Algorithm Principle ### 1. Joint Optimization of Replica Allotment and Placement FlashLB achieves joint optimization of replica allotment and placement through a novel tree search approach, combined with carefully designed efficient pruning and lightweight look-ahead estimation. We partition all experts into several subsets, and for each subset, hierarchically determine the optimal replica count and placement. Leveraging efficient pruning and lightweight look-ahead estimation, the process consistently aims to optimize the globally expected inter-device load balance degree (considering both deployed and unexplored experts) while ensuring sufficient computational efficiency. Additionally, precompilation techniques are employed for acceleration, delivering load balancing that is both high-quality and practically efficient. ### 2. Multi-Episode Enhancement Instead of performing full-duration averaging like the default algorithm, FlashLB partitions each profiling interval (e.g., 1024 iterations) into multiple consecutive smaller episodes (e.g., 16 iterations). This preserves hotness fluctuation and correlation information. It then constructs a multi-objective optimization problem to co-optimize these episodes simultaneously, enabling adaptability to interleaved hotness patterns and improving statistical robustness. ### 3. Layer-wise Cherry-Picking Redeployment To reduce the overhead of frequent expert redeployment, FlashLB introduces a cherry-picking redeployment scheme. 
During each algorithmic decision cycle, it tracks the load balance degree of all layers in real time and triggers expert placement updates only for those layers whose peak-average ratio exceeds a predefined threshold. This avoids unnecessary redeployment for stable layers, significantly reducing adjustment overhead and thereby improving end-to-end performance gains. ## Co-author: Co-authored-by: Skywalker-EP 173723846@qq.com This PR mainly introduces two key optimizations for load balancing scheduling: 1. Add per-step heat collection function: Support real-time collection of per-step heat information during model inference. This enables more fine-grained load balancing decisions by taking per-step heat as the optimization target, improving scheduling accuracy for dynamic and fluctuating workloads. 2. Update FlashLB algorithm: Upgrade the FlashLB scheduling logic to better adapt to multi-stage heat distribution scenarios. The improved algorithm can comprehensively perceive and utilize multi-stage heat characteristics, achieving more stable and efficient load balancing under complex expert deployment and dynamic traffic patterns. --------- Signed-off-by: Mercykid-bash <ruanche0218@gmail.com> Signed-off-by: xuzewei28 <xuzewei2@h-partners.com> Co-authored-by: xuzewei28 <xuzewei2@h-partners.com>	2026-03-12 15:49:09 +08:00
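The default two-step procedure that the FlashLB commit (#6477) describes as its baseline can be sketched as follows. This is a minimal illustration written from the prose description only, not the vllm-ascend implementation; the function names `allot_replicas` and `place_replicas` are hypothetical, and tie-breaking rules are an assumption. On the example from the commit message (8 experts, 8 devices, 2 slots per device), this sketch arrives at the same peak per-device hotness of 232 that the message quotes for the default algorithm.

```python
def allot_replicas(hotness, total_replicas):
    """Greedy allotment: repeatedly give one more replica to the expert
    whose per-replica load is currently the highest."""
    counts = [1] * len(hotness)  # every expert needs at least one replica
    for _ in range(total_replicas - len(hotness)):
        hottest = max(range(len(hotness)), key=lambda e: hotness[e] / counts[e])
        counts[hottest] += 1
    return counts


def place_replicas(hotness, counts, num_devices, slots_per_device):
    """Greedy placement: hottest replica first, onto the least-loaded
    device that still has a free slot."""
    replicas = sorted(
        ((hotness[e] / counts[e], e) for e in range(len(hotness)) for _ in range(counts[e])),
        reverse=True,
    )
    loads = [0.0] * num_devices
    placement = [[] for _ in range(num_devices)]
    for load, expert in replicas:
        free = [d for d in range(num_devices) if len(placement[d]) < slots_per_device]
        target = min(free, key=lambda d: loads[d])
        placement[target].append(expert)
        loads[target] += load
    return placement, loads


# Example values taken from the commit message.
hotness = [600, 560, 120, 120, 20, 10, 10, 10]
counts = allot_replicas(hotness, total_replicas=16)
placement, loads = place_replicas(hotness, counts, num_devices=8, slots_per_device=2)
peak = max(loads)
ratio = peak / (sum(loads) / len(loads))
print(counts, peak, round(ratio, 2))
```

Because the two steps run independently, the allotment never sees how replicas will be packed onto devices, which is the lack of synergy the commit message points to.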
Feng-xiaosuo	abe72d7cb9	Refactor quantization layer name mapping to leverage vLLM built-in mappers (#7050 ) …the quantization layer name ### What this PR does / why we need it? This PR modifies the loading logic for layer name prefixes in quantized models. The goal is to reduce or eliminate the need for point-to-point (hardcoded) modifications by leveraging the built-in mapper mechanism already provided in vLLM's model code. For models that do not yet have a corresponding mapper, the original point-to-point modification approach has been retained to ensure backward compatibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The changes were validated using an offline deployment script to launch and verify multiple multimodal models. Testing confirmed that the updated loading logic correctly handles layer name prefixes across different model architectures, with no regression in model initialization or inference behavior. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Matrix_K <zhangke144@huawei.com> Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com> Co-authored-by: Matrix_K <zhangke144@huawei.com>	2026-03-12 15:48:14 +08:00
drslark	fb0d6dd175	[main][bugfix] Fixed the problem of speculative decoding in FULL mode (#7148 ) ### What this PR does / why we need it? Fixed the error of speculative decoding in FULL mode when `num_spec + 1` not in `cudagraph_capture_sizes`. Now, we can run speculative decoding in FULL mode, but with drafter as eager. It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 . ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Test code is shown as below: ```python prompts = [ "1.Who are you?", "2. Who are you?", ] sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200) llm = LLM( model="/home/some-model/Meta-Llama-3.1-8B-Instruct", tensor_parallel_size=1, max_num_seqs=32, # enforce_eager=True, disable_log_stats=False, distributed_executor_backend="mp", gpu_memory_utilization=0.7, async_scheduling=True, speculative_config={ "enforce_eager": True, "model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B", "disable_padded_drafter_batch": False, "method": "eagle3", "num_speculative_tokens": 2, }, compilation_config={ "cudagraph_mode": "FULL", "cudagraph_num_of_warmups": 1, }, max_model_len=4096, enable_prefix_caching=False, ) outputs = llm.generate(prompts, sampling_params) ``` The result before: ```text File "/vllm-workspace/vllm/vllm/v1/cudagraph_dispatcher.py", line 140, in _create_padded_batch_descriptor assert num_tokens_padded % uniform_decode_query_len == 0 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError ``` The result after: ```text -------------------------------------------------- total_num_output_tokens: 400 num_drafts: 249 num_draft_tokens: 498 num_accepted_tokens: 149 mean acceptance length: 1.60 -------------------------------------------------- acceptance at token 0: 0.43 acceptance at token 1: 0.17 ``` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: drslark <slarksblood@qq.com>	2026-03-12 14:51:12 +08:00
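The assertion in the traceback of the speculative decoding fix (#7148), `num_tokens_padded % uniform_decode_query_len == 0`, can be illustrated with a small check. This sketch assumes that the uniform decode query length equals `num_spec + 1`, which matches the PR's wording but is not confirmed by the text shown here; the helper name and the capture sizes are hypothetical and are not vLLM defaults.

```python
def compatible_capture_sizes(capture_sizes, num_speculative_tokens):
    """Return the capture sizes that a uniform decode batch could be
    padded to without tripping the divisibility assertion."""
    # Assumption: each decode request contributes num_spec + 1 query tokens.
    uniform_decode_query_len = num_speculative_tokens + 1
    return [size for size in capture_sizes if size % uniform_decode_query_len == 0]


# Hypothetical capture sizes, with num_speculative_tokens=2 as in the test script.
print(compatible_capture_sizes([1, 2, 4, 8, 16], 2))
print(compatible_capture_sizes([3, 6, 8, 12], 2))
```

When the first list comes back empty, no padded size is a multiple of the query length, which is the kind of configuration in which the assertion above would be expected to fail.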
XiaoxinWang	37d1bd8c50	fixed fia pad logic in graph mode. (#7144 ) ### What this PR does / why we need it? Related to vLLM PR #34043, which deleted the function `relax_for_mixed_batch_cudagraphs`. As a result, num_reqs no longer equals the actual number of requests. Because the fia operator requires that query_start_loc[-1] equal the total number of computed tokens, deleting this function causes the ifa error. In full graph mode, set num_reqs_paded = num_reqs to fix the error. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2026-03-12 14:50:54 +08:00
MengLong Chen	bbffe58b63	[Doc] fix DSV3.1 PD configs (#7187 ) ### What this PR does / why we need it? Modify the `kv_port` and `engine_id` config of DeepSeek-V3.1/R1 in the 2P1D scenario - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2026-03-12 14:24:49 +08:00
Qiu	aa0143e55d	refactor: add a check before layer_sharding logging (#7186 ) ### What this PR does / why we need it? We should only display this log message when layer_sharding is enabled. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-03-12 11:56:04 +08:00
linfeng-yuan	5f3826b093	[Build] Add support for Ascend950 chip (#7151 ) ### What this PR does / why we need it? This PR adds support for the Ascend950 chip. This includes: - Updating build scripts (`CMakeLists.txt` and `setup.py`) to recognize the Ascend950 chip and set appropriate compilation flags. - Disabling a set of custom operators that are not yet supported on the Ascend950 hardware target. - Performing a codebase-wide refactoring of `pipe_barrier()` calls to the namespaced `AscendC::PipeBarrier<>()` for improved code consistency and adherence to the latest API standards. Ascend950DT e2e passed (Qwen3-32B-MXFP8) and CI passed - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-12 10:25:51 +08:00
meihanc	da01a74009	Revert "[CI] fix skiped e2e test when upgrade vllm version (#6654 )" (#7166 ) This reverts commit `f6db47f103`. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-03-11 23:03:15 +08:00
shiyuan680	3b6b3c4214	[MODELRUNNERV2]fix penality ops (#7013 ) ### What this PR does / why we need it? Fix the penalty ops for the new version, achieving a 10% performance improvement. ### How was this patch tested? pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shiyuan680 <917935075@qq.com>	2026-03-11 17:13:34 +08:00
yupeng	830f39dd70	[Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650 ) ### What this PR does / why we need it? Fix the issue #6143 . ### Does this PR introduce _any_ user-facing change? Allows starting the server with "--enable-lora && --fully-sharded-loras && --tensor_parallel_size 2". ### How was this patch tested? pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-11 15:43:15 +08:00
pz1116	a7f91fce71	[KV Pool]get_num_new_matched_tokens return 0 if token length < block_size (#7146 ) ### What this PR does / why we need it? Currently, we call lookup_client to look up token hits in the KV Pool. However, when the token length is smaller than the block size, the key is empty, so there is no point in querying the KV Pool backend because there can never be a hit. Hence, add an early return in `get_num_new_matched_tokens` when `token_len` < `block_size`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: fems14 <1804143737@qq.com>	2026-03-11 15:05:34 +08:00
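The early return described in the KV Pool commit (#7146) can be sketched as below. This is an illustration under assumptions: `FakeLookupClient`, its `lookup` method, and the standalone function `num_matched_tokens_sketch` are all hypothetical stand-ins, and the real `get_num_new_matched_tokens` in vllm-ascend has its own signature and lookup client.

```python
class FakeLookupClient:
    """Hypothetical stand-in for a KV Pool lookup client; counts calls."""

    def __init__(self, hit_tokens):
        self.hit_tokens = hit_tokens
        self.calls = 0

    def lookup(self, token_len):
        self.calls += 1
        return min(self.hit_tokens, token_len)


def num_matched_tokens_sketch(token_len, block_size, client):
    """Skip the backend lookup when the prompt is shorter than one block:
    no full block exists, so there is no key that could hit."""
    if token_len < block_size:
        return 0
    return client.lookup(token_len)


client = FakeLookupClient(hit_tokens=256)
short = num_matched_tokens_sketch(token_len=100, block_size=128, client=client)
calls_after_short = client.calls  # the backend was never contacted
long = num_matched_tokens_sketch(token_len=512, block_size=128, client=client)
print(short, calls_after_short, long, client.calls)
```

The point of the guard is to avoid a round trip to the pool backend for prompts that cannot produce a hit.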
Mengqing Cao	1a83c8e2f5	[CI] Build Image for v0.16.0rc1 (#7155 ) ### What this PR does / why we need it? Build Image for v0.16.0rc1 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-11 14:48:50 +08:00
SILONG ZENG	90aa048e60	[CI] Skip `test_mooncake_layerwise_connector.py` in `ut` (#7147 ) ### What this PR does / why we need it? The `test_mooncake_layerwise_connector.py` file in the `ut` test will be skipped for now and fixed later. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-11 11:46:29 +08:00
zxr2333	e16009b2cc	[BugFix]Fix recomputed scheduler bug (#7137 ) ### What this PR does / why we need it? Fix the wrong usage of `model_type`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-11 00:32:19 +08:00
SparrowMu	54668e73c5	[Model] Support Minimax-m2.5 on NPU (#7105 ) ### What this PR does / why we need it? Initial version to support minimax-m2.5 on vllm-ascend. This commit converts the original fp8 weights to quantized bf16 to support Minimax-m2.5 on NPU. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` ### Test Report Self tested precision summary, where the official precision score of AIME2025 is 86.3 <img width="426" height="84" alt="image" src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a" /> --------- Signed-off-by: limuyuan <limuyuan3@huawei.com> Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com> Co-authored-by: limuyuan <limuyuan3@huawei.com>	2026-03-11 00:12:02 +08:00
zxr2333	239683c7a6	[P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (#7022 ) ### What this PR does / why we need it? Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-10 23:59:20 +08:00
pppeng	0f289fa2a8	Add patch_qwen3_5 for triton ops fused_recurrent_gated_delta_rule (#7109 ) ### What this PR does / why we need it? The op `torch_npu.npu_recurrent_gated_delta_rule` currently does not support `ssm_state` inputs in float32 format, so we temporarily retain the Triton-based `_forward_core` implementation for Qwen3_5. --------- Signed-off-by: pppeng <zepengliu912@qq.com> Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>	2026-03-10 23:28:58 +08:00
Canlin Guo	a78a00e0b1	[Doc][ReleaseNote] Add release notes for v0.16.0rc1 (#7067 ) Add release notes for v0.16.0rc1 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Canlin Guo <961750412@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-10 22:45:05 +08:00
Li Wang	881c38d210	[Misc] Download on both hk and guiyang region (#7129 ) ### What this PR does / why we need it? Since the PVC files for Guiyang and Hong Kong are not shared, we need to trigger the download of both regions simultaneously when downloading the model to ensure that the models in all regions are synchronized. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-10 19:22:32 +08:00
shaopeng-666	6e8d3681ae	[bugfix] Fix w4a8 weight loading failure when EP is not enabled. (#7090 ) ### What this PR does / why we need it? This is a bug fix to resolve the issue where the MOE model fails to load quantized weights in w4a8 format when EP is not enabled. The parameters ["weight_scale_second", "weight_offset_second", "scale_bias"] shall be parsed in per-group mode, regardless of other conditions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2026-03-10 16:57:05 +08:00
lilinsiman	a5ea699e29	[eagle][cp] fix eagle_cp enable bug2 (#7079 ) ### What this PR does / why we need it? Fix acceptance and high-concurrency bugs when eagle3 and cp are both enabled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests and ut. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-03-10 16:32:49 +08:00
zhangxinyuehfad	67d40f23fd	[CI]Upgrade nightly multi-node-tests max-parallel to 2 (#7035 ) ### What this PR does / why we need it? 1. Increase nightly multi-node test max-parallel from 1 to 2, and fix resource conflicts that arise when tests run concurrently. 2. Fix parse-trigger job: Add an if condition so it only runs on schedule, workflow_dispatch, or PRs labeled nightly-test. 3. Adjust nightly schedule: Shift trigger time from 24:00 to 23:45 (UTC+8). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-03-10 16:25:51 +08:00
pu-zhe	5df450bca4	[Feat] [310p] Support w8a8sc quantization method (#7075 ) ### What this PR does / why we need it? New Quantization Method: Introduced support for the W8A8SC static linear quantization scheme specifically for 310P hardware, enabling more efficient model compression. Refactored the save_sharded_state_310.py to avoid multi-process issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8SC quant E2E test. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-10 16:13:20 +08:00
Frank Chen	14c71b19e1	[Doc][CPU binding] Add user/developer guide for CPU binding (#7045 ) ### What this PR does / why we need it? This PR adds comprehensive documentation for the CPU binding feature on Ascend NPUs. It includes: - A detailed developer guide (`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering the design, internal logic, allocation examples, and troubleshooting for the CPU binding mechanism. - A concise user guide (`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the core concepts, usage, and common issues for end-users. - An update to `additional_config.md` to use consistent terminology for binding strategies (`global-slicing` and `topo-affinity`). This documentation is needed to help both developers and users understand, use, and debug the CPU binding feature, which is critical for performance on ARM+Ascend platforms. ### Does this PR introduce _any_ user-facing change? No. This is a documentation-only update. ### How was this patch tested? The documentation has been reviewed for clarity and technical accuracy. The examples and descriptions align with the implementation in `vllm_ascend/cpu_binding.py`. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Signed-off-by: c00818886 <chenchuwei@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-03-10 15:59:31 +08:00

2602 Commits