xc-llm-ascend

Author	SHA1	Message	Date
meihanc	bff4fbfca5	upgrade to 0.18.0 (#7502 ) ### What this PR does / why we need it? 1. upgrade to 0.18.0 2. ensure kernel_block_sizes is int for Eagle drafter ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-03-21 16:05:38 +08:00
SILONG ZENG	a1f321a556	[Doc]Refresh model tutorial examples and serving commands (#7426 ) ### What this PR does / why we need it? Main updates include: - update model IDs and default model paths in serving / offline inference examples - adjust some command snippets and notes for better copy-paste usability - replace `SamplingParams` argument usage from `max_completion_tokens` to `max_tokens`（Offline inference currently does not support the "max_completion_tokens"） ``` bash Traceback (most recent call last): File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module> sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: Unexpected keyword argument 'max_completion_tokens' [ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception ``` - refresh Qwen3-Omni-30B-A3B-Thinking recommended environment variable ``` bash export HCCL_BUFFSIZE=512 export HCCL_OP_EXPANSION_MODE=AIV ``` ``` bash EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048, epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine = 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) + (maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB. 
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984] ``` - fix Qwen3-reranker example usage to match the current pooling runner interface and score output access ``` python model = LLM( model=model_name, task="score", # need fix hf_overrides={ "architectures": ["Qwen3ForSequenceClassification"], "classifier_from_token": ["no", "yes"], ``` ---> ``` python model = LLM( model=model_name, runner="pooling", hf_overrides={ "architectures": ["Qwen3ForSequenceClassification"], "classifier_from_token": ["no", "yes"], ``` - modify PaddleOCR-VL parameter `TASK_QUEUE_ENABLE` from `2` to `1` ``` bash (EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0. ``` These changes are needed because several documentation examples had drifted from the current runtime behavior and recommended invocation patterns, which could confuse users when following the tutorials directly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4497431df6` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-20 11:34:18 +08:00
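The `HCCL_BUFFSIZE` error quoted in #7426 prints its own sizing formula, so the requirement can be re-derived by hand. A minimal sketch, assuming the formula yields bytes and that the log's "MB" means 1024 × 1024 bytes (the log does not state either); the function name is hypothetical, the parameter names follow the log:

```python
def needed_hccl_buffsize_bytes(max_bs, token_need_size_dispatch, ep_world_size,
                               local_moe_expert_num, token_need_size_combine,
                               k, shared_expert_num):
    """Evaluate the NEEDED_HCCL_BUFFSIZE expression printed in the error log."""
    dispatch = (max_bs * token_need_size_dispatch
                * ep_world_size * local_moe_expert_num)
    combine = max_bs * token_need_size_combine * (k + shared_expert_num)
    return (dispatch + combine) * 2


# Values copied from the log line in the commit message.
needed = needed_hccl_buffsize_bytes(
    max_bs=256,
    token_need_size_dispatch=4608,
    ep_world_size=2,
    local_moe_expert_num=64,
    token_need_size_combine=4096,
    k=8,
    shared_expert_num=0,
)
needed_mib = needed / (1024 * 1024)  # assumed unit conversion
print(needed, needed_mib)
```

With the logged values the expression evaluates to 318,767,104 bytes, or 304 MiB under the assumed conversion. That is one below the 305MB the log reports, so the kernel presumably rounds up or adds a small margin. Either way it is well above the configured 200MB.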
Nengjun Ma	ee804ce23e	Main2main upgrade vllm to 0318 commit (#7412 ) ### What this PR does / why we need it? Upgrade the vLLM commit to 0318. Main content: some test cases failed because the NPU memory used by the previous test cases was not released in time. These test cases now start with a pre-operation that cleans up the NPU memory and waits (up to 50s by default) for the cleanup to complete. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-03-19 17:17:36 +08:00
aipaes	5e65062973	[doc] Fix issues in the GLM4.7 documentation (#7457 ) ### What this PR does / why we need it? Fix issues in the GLM4.7 documentation and add some missing explanations. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? document test - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-03-19 16:42:59 +08:00
pz1116	6fc190b44a	[Doc][KV Pool]Revision KV Pool User Guide [2/2] (#7456 ) ### What this PR does / why we need it? Revise the KV Pool user guide: 4. Revise the Memcache parameters for better clarity and add a notice that heterogeneous protocol settings are currently not supported (e.g. enabling `device_rdma` and `device_sdma` at the same time; an example scenario would be data transfer by Memcache across different super pods). 5. Modify the condition for Mooncakestore warmup: warmup is now needed only when `ASCEND_BUFFER_POOL` is enabled. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: Chao Lei <leichao139636@163.com>	2026-03-19 16:17:34 +08:00
wangxiyuan	8e0ebb470a	[Misc] Drop Prefetch MLP Env (#7357 ) ### What this PR does / why we need it? Remove deprecated environment variables related to MLP prefetching. ### Does this PR introduce _any_ user-facing change? Yes, the deprecated env vars can no longer be used. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-19 14:27:27 +08:00
pz1116	3effc4bc70	[Doc][KV Pool]Revision KV Pool User Guide (#7434 ) ### What this PR does / why we need it? Revise the KV Pool user guide: 1. Revise Mooncake environment variables and kvconnector extra configs. 2. Delete `use_ascend_direct` in kv connector extra config as it is deprecated 3. Delete `kv_buffer_device` and `kv_rank` in P2P mooncake config 4. Unifies default `max-model-len` and `max-num-batch-tokens` in examples given. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Co-authored-by: Chao Lei <leichao139636@163.com>	2026-03-19 10:13:13 +08:00
meihanc	ab9cd2e305	[CI]Add CI summary log (#7202 ) ### What this PR does / why we need it? This PR adds a new CI log summarizer, `ci_log_summary.py`, and wires it into unit-test and e2e workflows so failed jobs publish a structured failure summary to the GitHub step summary. Examples: - `python3 .github/workflows/scripts/ci_log_summary.py --log-file /tmp/unit-test.log --mode ut --step-name "Unit test"` - `python3 .github/workflows/scripts/ci_log_summary.py --run-id 23127187822 --format json` A maintenance note is added to `ci_utils.py` to clarify that the `START` / `PASSED` / `FAILED (exit code X)` log lines are parsed by `ci_log_summary.py`, so any future format changes must be coordinated with the corresponding summarizer regexes. 🤖 Generated with [Codex]<noreply@openai.com> - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: meihanc <jcccx.cmh@gmail.com> Co-authored-by: Codex <noreply@openai.com>	2026-03-19 09:32:06 +08:00
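The maintenance note in #7202 ties the `START` / `PASSED` / `FAILED (exit code X)` log lines to the summarizer's regexes. The sketch below shows that kind of coupling. The line layout (`<STATUS>: <test name>`), the regex, and the `summarize` helper are assumptions for illustration only, not the actual patterns in `ci_log_summary.py`:

```python
import re

# Hypothetical layout: "<STATUS>: <test name>",
# e.g. "FAILED (exit code 1): tests/ut/test_b.py".
STATUS_RE = re.compile(
    r"^(?P<status>START|PASSED|FAILED \(exit code (?P<code>-?\d+)\)): (?P<name>.+)$"
)


def summarize(log_text):
    """Collect (test name, exit code) pairs for every FAILED line."""
    failures = []
    for line in log_text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match and match.group("status").startswith("FAILED"):
            failures.append((match.group("name"), int(match.group("code"))))
    return failures


sample_log = """\
START: tests/ut/test_a.py
PASSED: tests/ut/test_a.py
START: tests/ut/test_b.py
FAILED (exit code 1): tests/ut/test_b.py
"""
failures = summarize(sample_log)
print(failures)
```

If the emitter changed `FAILED (exit code 1)` to some other wording, the regex would silently stop matching. That is the reason the note asks for format changes to be coordinated with the summarizer.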
Nengjun Ma	8b79d4de52	Main2main upgrade to vllm 0317 afternoon (#7409 ) ### What this PR does / why we need it? 1.fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](https://github.com/vllm-project/vllm/pull/35122) 2.fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](https://github.com/vllm-project/vllm/pull/36027) 3.fix "RuntimeError: device_allocator not a DeviceAllocator":[Replace memory related torch.cuda APIs"](https://github.com/vllm-project/vllm/pull/37031) 4.fix [Support multiple KV groups in OffloadingSpec ](https://github.com/vllm-project/vllm/pull/36610) removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor. 5.fix [Consolidate SupportsEagle](https://github.com/vllm-project/vllm/pull/36063) renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? E2E - vLLM version: v0.17.0 - vLLM main: `8a680463fa` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: Claude Code <noreply@anthropic.com>	2026-03-18 23:24:27 +08:00
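Item 5 of #7409 describes a rename plus a guard. A minimal sketch of how calling code might tolerate both the old and the new method name. The commit message does not say whether `supports_eagle3` is a method or a free function, so the sketch takes it as a callable argument; `eagle3_aux_layers` and the two stand-in model classes are hypothetical:

```python
def eagle3_aux_layers(model, supports_eagle3):
    """Return the Eagle3 aux hidden-state layers, or None if unsupported."""
    if not supports_eagle3(model):
        return None
    # Prefer the renamed method; fall back to the pre-rename name.
    getter = getattr(model, "get_eagle3_default_aux_hidden_state_layers", None)
    if getter is None:
        getter = model.get_eagle3_aux_hidden_state_layers
    return getter()


class NewStyleModel:  # stand-in for a model exposing the renamed method
    def get_eagle3_default_aux_hidden_state_layers(self):
        return (2, 16, 29)


class PlainModel:  # stand-in for a model without Eagle3 support
    pass


def has_eagle3(model):  # stand-in guard
    return hasattr(model, "get_eagle3_default_aux_hidden_state_layers") or hasattr(
        model, "get_eagle3_aux_hidden_state_layers"
    )


layers = eagle3_aux_layers(NewStyleModel(), has_eagle3)
no_layers = eagle3_aux_layers(PlainModel(), has_eagle3)
print(layers, no_layers)
```

The layer indices are placeholder values, not taken from any real model.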
SparrowMu	fb8e22ec00	[DOC] MiniMax-M2.5 model intro (#7296 ) ### What this PR does / why we need it? 1. Add nightly test on MiniMax-M2.5 with deployment method on A3 2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: limuyuan <limuyuan3@huawei.com> Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com> Co-authored-by: limuyuan <limuyuan3@huawei.com>	2026-03-18 20:14:36 +08:00
SILONG ZENG	adc57c5951	[release] Add GLM5 known issue for 2-node PD mixed deployment (#7436 ) ### What this PR does / why we need it? Documented an issue in the 2-node PD mixed deployment scenario where inference may hang when concurrency exceeds 8 (GLM5). Noted that the issue has been fixed in PRs #7235 and #7290. --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-18 10:03:18 +00:00
LoganJane	565868a2a6	[doc] add doc for Kimi-K2.5.md (#7371 ) ### What this PR does / why we need it? Upload doc for Kimi-K2.5 on Ascend. Based on vllm-ascend:v0.17.0rc1 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com> Signed-off-by: LoganJane <loganJane73@hotmail.com> Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>	2026-03-18 17:16:35 +08:00
liuhy1213-cell	58725b8b24	[doc] add Prefill-Decode Disaggregation doc for GLM5.md (#7300 ) ### What this PR does / why we need it? add Prefill-Decode Disaggregation doc for GLM5.md w8a8 65k-1.5k Concurrency: 80 prefixcache: 90% tps: 2054 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com> Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>	2026-03-18 17:00:31 +08:00
Nagisa125	6bc68c55d0	[doc] Refresh the documentation for DeepSeek-V3.2 (#7403 ) ### What this PR does / why we need it? Updated the DSV32 document. 1. Changed the PD separation boot mode to layerwise. 2. Changed max-num-batched-tokens to a multiple of the TP to avoid triggering a verification error. 3. Added a link to help users adjust the configuration. - vLLM version: v0.17.0 - vLLM main: `4497431df6` Signed-off-by: wyh145 <1987244901@qq.com>	2026-03-18 14:59:48 +08:00
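Item 2 of #7403 sets `max-num-batched-tokens` to a multiple of the TP size to avoid a verification error. A minimal sketch of rounding a chosen value up to the next multiple. The helper name and the example numbers are hypothetical, and the exact check performed by the runtime is not shown in the commit message:

```python
def round_up_to_multiple(value, multiple):
    """Smallest integer >= value that is divisible by multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


tp_size = 16  # hypothetical tensor-parallel size
aligned = round_up_to_multiple(4100, tp_size)
already_aligned = round_up_to_multiple(4096, tp_size)
print(aligned, already_aligned)
```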
zhangyiming	1c954ff264	[main2main] upgrade vllm to 0308 (#7213 ) ### What this PR does / why we need it? Update main2main to vllm 0308. breaks: * https://github.com/vllm-project/vllm/pull/30681 * https://github.com/vllm-project/vllm/pull/35552 remove self.cudagraph_batch_sizes * https://github.com/vllm-project/vllm/pull/35158 clear_metadata -> defer_finalize * https://github.com/vllm-project/vllm/pull/36006 remove CacheConfig.cpu_offload_gb * https://github.com/vllm-project/vllm/pull/35472 * https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder * https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens * https://github.com/vllm-project/vllm/pull/28053 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-03-18 09:24:43 +08:00
aipaes	3b3dd2a889	[doc] Refresh the documentation for GLM-4.7 (#7292 ) ### What this PR does / why we need it? Refresh the documentation for GLM4.7. --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-03-17 23:09:12 +08:00
pppeng	a457d0f0e8	[doc] Upload doc for qwen3.5-27B and qwen3.5-397B-A17B on Ascend (#7313 ) ### What this PR does / why we need it? Upload doc for qwen3.5-27B and qwen3.5-397B-A17B on Ascend. Based on vllm-ascend:v0.17.0rc1 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pppeng <zepengliu912@qq.com> Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>	2026-03-17 22:54:57 +08:00
Mengqing Cao	e20f0b1a0d	[ReleaseNote] Add release note for v0.17.0rc1 (#7240 ) ### What this PR does / why we need it? This pull request adds the release notes for `v0.17.0rc1`. It also updates version numbers across various documentation files, including `README.md`, `README.zh.md`, `docs/source/community/versioning_policy.md`, and `docs/source/conf.py` to reflect the new release. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e`	2026-03-15 22:47:47 +08:00
bazingazhou233-hub	c69291eefc	[Doc] Add USE_MODELSCOPE_HUB=0 to lm-eval guide (#7279 ) ## Summary - Add `USE_MODELSCOPE_HUB=0` to both Online and Offline lm-eval sections - Add explanatory notes about Docker containers launching with `VLLM_USE_MODELSCOPE=True` The Docker containers set `VLLM_USE_MODELSCOPE=True`, which causes lm-eval to download datasets from ModelScope instead of HuggingFace, resulting in "Repo not exists" errors. Setting `USE_MODELSCOPE_HUB=0` disables this behavior. Fixes #607 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com> Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>	2026-03-14 22:41:02 +08:00
bazingazhou233-hub	9e6c547d98	[Doc] Replace deprecated full_cuda_graph with cudagraph_mode in Qwen2.5-Omni (#7286 ) ## Summary - Replace `full_cuda_graph: 1` with `cudagraph_mode: FULL_DECODE_ONLY` in both single-NPU and multi-NPU examples - `full_cuda_graph` is deprecated and falls back to `NONE` on NPU Fixes #4696 - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com> Co-authored-by: bazingazhou233-hub <bazingazhou233-hub@users.noreply.github.com>	2026-03-14 22:38:36 +08:00
NJX	bb506a1c99	[Doc][Installation] Clarify SOC_VERSION for CPU-only source builds (#7278 ) ### What this PR does / why we need it? - Clarify that `SOC_VERSION` must be set when building from source in a CPU-only environment where `npu-smi` is unavailable. - Add concrete `SOC_VERSION` examples (A2/A3/300I/A5) and point users to `Dockerfile*` defaults. - Improve the `setup.py` error message so users get actionable guidance when `SOC_VERSION` is missing. Fixes #6816. ### Does this PR introduce _any_ user-facing change? - Yes. Documentation is updated and the build-time error message is more informative. ### How was this patch tested? - (Local) Syntax check: `python -m compileall setup.py`. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-14 22:38:25 +08:00
Junyuan	6852a2e267	[feat] add LMCacheAscendConnector (#6882 ) ### What this PR does / why we need it? LMCache-Ascend is LMCache's solution on the Ascend platform and one of the KVCache pooling solutions for Ascend. We hope to integrate LMCache-Ascend into the vLLM-Ascend community as one of the official KVCache pooling solutions for vLLM-Ascend. We added a new LMCacheAscendConnector in vLLM-Ascend and registered it. ### Does this PR introduce _any_ user-facing change? Users can specify the kvconnector using `--kv-transfer-config`, allowing them to freely choose which kvconnector to use, without any user-facing change. ### How was this patch tested? Test by specifying `--kv-transfer-config '{"kv_connector":"LMCacheAscendConnector","kv_role":"kv_both"}'` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: chloroethylene <jjysama@gmail.com>	2026-03-13 17:41:35 +08:00
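The test command in #6882 passes the connector choice as a JSON string. A minimal sketch of building that flag value programmatically instead of hand-quoting it. The two keys and their values come from the commit message; the variable names are illustrative:

```python
import json
import shlex

kv_transfer_config = {
    "kv_connector": "LMCacheAscendConnector",
    "kv_role": "kv_both",
}

# Compact separators reproduce the spacing used in the commit message.
flag_value = json.dumps(kv_transfer_config, separators=(",", ":"))
args = ["--kv-transfer-config", flag_value]

# shlex.join adds the shell quoting needed to paste the flag into a command line.
print(shlex.join(args))
```

Serializing with `json.dumps` avoids quoting mistakes when more keys are added to the config later.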
Mengqing Cao	986cd45397	[Version] Drop 0.16.0 support (#7153 ) ### What this PR does / why we need it? Drop 0.16.0 support in main - Fix eagle proposer break introduced by https://github.com/vllm-project/vllm/pull/34552. Mainly change to use the draft attention group to initialize the attention metadata builder. - Fix the `ModelRunner` has no attribute `cudagraph_capture_sizes` error, which is a bug in vLLM v0.17.0, and fixed by a later pr https://github.com/vllm-project/vllm/pull/30515 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-03-13 16:14:15 +08:00
shaopeng-666	592661e787	[Doc] EPD doc and load-balance proxy example (#6221 ) Add EPD doc and load-balance proxy example - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2026-03-12 16:17:17 +08:00
herizhen	e5024d0264	[doc] Add Ascend PyTorch Profiler section (#7117 ) ### What this PR does / why we need it? add Ascend PyTorch Profiler section ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Documentation Format Checks Technical Content Validation Build Verification Version Compatibility - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: herizhen <1270637059@qq.com>	2026-03-12 15:51:00 +08:00
MengLong Chen	bbffe58b63	[Doc] fix DSV3.1 PD configs (#7187 ) ### What this PR does / why we need it? Modify the `kv_port` and `engine_id` config of DeepSeek-V3.1/R1 in the 2P1D scenario - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2026-03-12 14:24:49 +08:00
Canlin Guo	a78a00e0b1	[Doc][ReleaseNote] Add release notes for v0.16.0rc1 (#7067 ) Add release notes for v0.16.0rc1 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Canlin Guo <961750412@qq.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2026-03-10 22:45:05 +08:00
Frank Chen	14c71b19e1	[Doc][CPU binding] Add user/developer guide for CPU binding (#7045 ) ### What this PR does / why we need it? This PR adds comprehensive documentation for the CPU binding feature on Ascend NPUs. It includes: - A detailed developer guide (`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering the design, internal logic, allocation examples, and troubleshooting for the CPU binding mechanism. - A concise user guide (`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the core concepts, usage, and common issues for end-users. - An update to `additional_config.md` to use consistent terminology for binding strategies (`global-slicing` and `topo-affinity`). This documentation is needed to help both developers and users understand, use, and debug the CPU binding feature, which is critical for performance on ARM+Ascend platforms. ### Does this PR introduce _any_ user-facing change? No. This is a documentation-only update. ### How was this patch tested? The documentation has been reviewed for clarity and technical accuracy. The examples and descriptions align with the implementation in `vllm_ascend/cpu_binding.py`. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Signed-off-by: c00818886 <chenchuwei@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-03-10 15:59:31 +08:00
NJX	bb7ed759d4	[Doc] Fix broken chunked-prefill URL in supported features (#6963 ) ## What this PR does / why we need it? Fixes the broken URL for chunked-prefill in the supported features documentation page. The chunked prefill documentation URL was moved from `performance/optimization.html` to `configuration/optimization.html` in upstream vLLM docs. This PR updates the link to point to the correct location. Before: https://docs.vllm.ai/en/stable/performance/optimization.html#chunked-prefill (404) After: https://docs.vllm.ai/en/stable/configuration/optimization.html#chunked-prefill (working) ## Does this PR introduce _any_ user-facing change? Yes - fixes a broken documentation link that users encounter when clicking 'Chunked Prefill' in the supported features page. ## How was this patch tested? - Verified the new URL resolves correctly - Documentation change only Closes #4217 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-10 10:10:07 +08:00
Yikun Jiang	326fd359aa	[Docs] add and publish llms.txt for LLM discovery (#6886 ) ### What this PR does / why we need it? - move llms.txt under docs/source and publish it at /llms.txt via html_extra_path - rewrite llms.txt to an LLM-friendly link index - use _sources markdown links and include missing entry points such as FAQs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2026-03-10 10:06:27 +08:00
ZKSU	bdad11e9a8	[doc] Update GLM4.x.md, add GLM4.x multi-node deploy tutorial (#6872 ) ### What this PR does / why we need it? This PR updates the GLM4.x documentation by adding a multi-node deployment tutorial, for example on 2 × Atlas 800 A2 (64G × 8). - What changed: Added instructions for deploying GLM-4.X models across multiple nodes, including environment variables and example commands. - Why needed: The previous tutorial stated that multi-node deployment on Atlas 800 A2 (64GB × 8) is not recommended, but some situations still require deploying GLM-4.7 on 2 × Atlas 800 A2 (64G × 8). We ran GLM-4.7 successfully on 2 nodes, so it seems time to update this part. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Verified that the new documentation renders correctly in Markdown format. - Tested the multi-node deployment steps on 2 × Atlas 800 A2 (64G × 8) to ensure the commands work as described. - Confirmed that existing GLM4.x documentation links and structure remain intact. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: ZKSU <zksu@outlook.com>	2026-03-10 10:01:53 +08:00
Shaoxu Cheng	ba1c82e758	[DOC] Add explanation of 310p special param: max-model-len (#7065 ) ### What this PR does / why we need it? This PR updates the documentation for running vLLM on Atlas 300I series (310p) hardware. It adds a warning to explicitly set `--max-model-len` to prevent potential Out-of-Memory (OOM) errors that can occur with the default configuration. The example commands and Python scripts for online and offline inference have been updated to: - Include `--max-model-len 4096` (or `max_model_len=4096`). - Remove the `compilation-config` parameter, which is no longer necessary for 310p devices. These changes ensure users have a clearer and more stable experience when using vLLM on Atlas 300I hardware. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ### How was this patch tested? The changes are to documentation and do not require testing. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-09 16:54:43 +08:00
wangxiyuan	482d39c1b0	[community] update contributor and refresh tool (#7072 ) ### What this PR does / why we need it? This PR refactors the `tools/collect_user_first_contribution.sh` script to improve how we track and update our contributors list. Key changes include: - Incremental Updates: The script can now perform incremental updates by storing and reading the last processed commit hash from `docs/source/community/contributors.md`. This is much more efficient than re-processing all commits every time. - Full Refresh Option: A `--full` flag is added to allow forcing a full recalculation of all contributors, useful for correcting errors or initial setup. - Improved Usage: Replaced positional arguments with command-line flags (`--repo`, `--file`, `--full`) for better usability and clarity. - Robust Contributor-ID detection: Improved logic to find a contributor's GitHub login, including a fallback to parse it from `noreply` email addresses. - In-place File Updates: The script now directly updates the `contributors.md` file with new contributors and correct numbering, automating the entire process. These changes make the process of maintaining the contributors list more automated, reliable, and efficient. ### Does this PR introduce _any_ user-facing change? No, this only changes a developer tool and does not affect the vLLM library's public API or behavior. ### How was this patch tested? The script can be tested locally by running it against the repository. For an incremental update: `GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh` For a full refresh: `GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh --full` - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-09 15:19:35 +08:00
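#7072 mentions a fallback that parses a contributor's GitHub login from `noreply` email addresses. The actual tool is a shell script whose implementation is not shown here; the sketch below only illustrates the idea, using the two noreply address shapes that appear in this log (`<id>+<login>@users.noreply.github.com` and `<login>@users.noreply.github.com`). The regex and `login_from_noreply` are hypothetical:

```python
import re

NOREPLY_RE = re.compile(
    r"^(?:\d+\+)?(?P<login>[A-Za-z0-9-]+)@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def login_from_noreply(email):
    """Return the GitHub login embedded in a noreply address, else None."""
    match = NOREPLY_RE.match(email.strip())
    return match.group("login") if match else None


with_id = login_from_noreply("52023119+SparrowMu@users.noreply.github.com")
without_id = login_from_noreply("bazingazhou233-hub@users.noreply.github.com")
not_noreply = login_from_noreply("zksu@outlook.com")
print(with_id, without_id, not_noreply)
```

Ordinary addresses return `None`, so a caller would still need another lookup path for them.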
pz1116	a7820d20f4	[Doc][KV Pool]Update Memcache local service config example: increase default world size to 256 and update description (#7025 ) ### What this PR does / why we need it? Update Memcache local service config example: increase default world size to 256 and update the description for better clarity. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>	2026-03-06 10:23:55 +08:00
LI SHENGYONG	ccd00798f3	[EPLB] Display the expert hotness comparison before and after eplb. (#6877 ) ### What this PR does / why we need it? To intuitively show the effect of the eplb algorithm, we print the expert heat before and after eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-02-28_17-23-42](https://github.com/user-attachments/assets/db1dadd1-cf96-44da-af34-57d41ccf412f) - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-06 09:53:29 +08:00
SILONG ZENG	bd571cf6d6	[Main2Main] Upgrade vLLM to 0303 (#6944 ) ### What this PR does / why we need it? break: - https://github.com/vllm-project/vllm/pull/34102 Disable_full param replaced with valid_modes/invalid_modes API - https://github.com/vllm-project/vllm/pull/35503 Now must return float compilation_time - https://github.com/vllm-project/vllm/pull/35564 New sequence_lengths param added - https://github.com/vllm-project/vllm/pull/33807 A check was performed (if runner_backend != "auto") - https://github.com/vllm-project/vllm/pull/34861 `BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to check process group state - https://github.com/vllm-project/vllm/pull/35274 Important change: - https://github.com/vllm-project/vllm/pull/28672 `matcher_utils` directly accesses `torch.ops._C.*` during the import phase. In the Ascend environment, some unregistered ops trigger `AttributeError`, causing e2e initialization failure. https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323 https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29 This PR adds temporary compatibility placeholders (rms_norm, fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant, silu_and_mul) to `vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to ensure no crashes during the import phase. Upstream repairs will be considered later. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: Claude Code <noreply@anthropic.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com>	2026-03-06 09:08:52 +08:00
fems14	ae394767d4	【main】ADXL/HIXL supports FabricMem Mode (#6806 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-03-05 21:04:11 +08:00
wangxiyuan	13777bf3f0	[Spec Decode]clean up spec decode interface (#6947 ) This pull request refactors the speculative decoding proposer interface to align with upstream vLLM, removing the local `Proposer` interface and renaming methods to `propose`. This is the first step. In the future we should remove the class register and add only a few Ascend-specific methods once the architecture in vLLM is ready. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-05 14:30:10 +08:00
Ronald	77e009d9fc	[Feature] Add docs of batch invariance and make some extra operators patch (#6910 ) ### What this PR does / why we need it? This PR adds docs for batch invariance and patches some extra operators according to the validation results. Please see https://github.com/vllm-project/vllm-ascend/issues/5487 to track progress. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-05 09:12:40 +08:00
NJX	c7fd7a25f7	[Doc][Misc] Fix msprobe_guide.md documentation issues (#6965 ) ## What this PR does / why we need it? Fixes several documentation issues in the msprobe debugging guide as reported in #6065: 1. Remove unnecessary `cat` heredoc wrapper: The example configuration section used a `cat <<'JSON'` bash wrapper around the JSON config. Simplified to a plain JSON code block. 2. Fix duplicate chapter numbering: Two sections were both numbered '2'. Renumbered sections sequentially (0-6). 3. Fix msprobe command: Changed `msprobe graph_visualize` to `msprobe -f pytorch graph` in section 5.2 Visualization. 4. Remove backward-related content: Since vllm is inference-only (no training), removed all backward pass references including backward tensor examples, parameter gradient examples, and backward descriptions from dump.json explanations. ## Does this PR introduce _any_ user-facing change? Documentation improvement only. No code changes. ## How was this patch tested? Manual review of the markdown file to verify all 4 issues from #6065 are addressed. Closes #6065 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-04 10:28:31 +08:00
zzzzwwjj	f19f7b1fe2	[doc] fix supported_models (#6930 ) ### What this PR does / why we need it? Add Experimental supported model/feature for supported_models.md ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2026-03-03 09:47:50 +08:00
Xiaoshuang Wang	f7a8befc20	[CI] Upgrade CANN to 8.5.1 (#6897 ) ### What this PR does / why we need it? [CI] Upgrade CANN to 8.5.1 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wxsIcey <1790571317@qq.com>	2026-03-03 09:02:42 +08:00
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add the muls_add triton kernel with its related fusion pass. In addition, this PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with the newly added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00
zyz111222	81fb7d5779	[Doc] add 310P3 guidance of PaddleOCR-VL (#6837 ) ### What this PR does / why we need it? add 310P3 guidance of PaddleOCR-VL model, refresh PaddleOCR-VL.md in the docs/source/tutorials/ ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? by CI - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: zouyizhou <zouyizhou@huawei.com>	2026-02-28 16:03:07 +08:00
wangxiyuan	3d563292f3	clean 0.15.0 support (#6852 ) Clean up vllm 0.15.0 related code - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-28 09:20:57 +08:00
wangxiyuan	9cd0d6c33d	[Doc][Misc] Update release notes for v0.15.0rc1 (#6859 ) ### What this PR does / why we need it? This PR updates the release notes for `v0.15.0rc1` to: - Mark the `310P MoE and W8A8 Support` feature as experimental. - Add a note for `Kimi-K2.5 Model Support` clarifying that it has known issues in vLLM 0.15.0 and requires manual patching to work correctly. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ### How was this patch tested? N/A (documentation change). - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-27 22:35:09 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
starmountain1997	80316c5824	[DOC] enable both flashcomm1 and cudagraph (#6807 ) ## What this PR does / why we need it? This PR updates the DeepSeek-V3.2 documentation to include the latest performance optimizations and configuration improvements. ### Changes - Enable FlashComm1: Added `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` environment variable across all deployment scenarios to enable FlashComm1 for improved communication performance - Layer Sharding: Added `--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}'` configuration to enable layer sharding for better memory distribution - CUDA Graph Optimization: Updated cudagraph capture sizes from `[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]` to `[8, 16, 24, 32, 40, 48]` - Speculative Decoding: Increased `num_speculative_tokens` from 2 to 3 - Documentation Links: Fixed request forwarding documentation to use proper GitHub repository links ## Does this PR introduce _any_ user-facing change? Yes, users can now follow the updated documentation to enable FlashComm1 and layer sharding for improved DeepSeek-V3.2 performance. ## How was this patch tested? Existing documentation examples have been validated to ensure configuration consistency across all deployment scenarios. --- - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-27 14:52:55 +08:00
wangxiyuan	3d43ed997e	add release note for 0.15.0rc1 (#6839 ) Add release note for 0.15.0rc1 - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-27 11:55:55 +08:00
wangxiyuan	a95c0b8b82	[Doc] fix the nit in docs (#6826 ) Refresh the doc, fix the nit in the docs - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-27 11:50:27 +08:00


625 Commits