xc-llm-ascend

Author	SHA1	Message	Date
Marck	17da96658f	[ModelLoader][Feature] Add rfork support for fast model loading (#7392 ) ### What this PR does / why we need it? Support a new load format: RFORK. For implementation details of this feature, please refer to #7441 ### Does this PR introduce _any_ user-facing change? Adds a new option for load-format: rfork, e.g. ```bash vllm serve /workspace/models/Qwen3-8B --load-format rfork ``` ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: Marck <1412354149@qq.com>	2026-03-25 16:40:30 +08:00
liziyu	568b6d0601	[P/D] Check wildcard address for layerwise connector (#7389 ) ### What this PR does / why we need it? Check for a wildcard address in the layerwise connector - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-03-24 15:50:06 +08:00
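The commit above does not show how the wildcard check is done. As a rough illustration only, and assuming "wildcard address" means an unspecified bind address such as `0.0.0.0` or `::`, a check could look like the sketch below. The function name `is_wildcard_address` is hypothetical and is not taken from the connector code.

```python
import ipaddress


def is_wildcard_address(host: str) -> bool:
    """Return True for unspecified ("bind to all interfaces") addresses.

    Hypothetical helper for illustration; not the connector's actual code.
    """
    candidate = host.strip("[]")  # tolerate bracketed IPv6 literals like "[::]"
    try:
        return ipaddress.ip_address(candidate).is_unspecified
    except ValueError:
        # Not an IP literal (for example a hostname), so not a wildcard.
        return False
```

A wildcard address is valid for listening but cannot be handed to a peer as a destination, which is presumably why a connector would want to detect it.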
Ronald	bfd049aa2c	[Lint] fix typos error in epd_load_balance_proxy_layerwise_server_example.py (#7199 ) ### What this PR does / why we need it? This PR fixes a typo in two function names in the `epd_load_balance_proxy_layerwise_server_example.py` example script. The function names `aquire_aborted_pd_requests` and `aquire_aborted_prefiller_requests` were misspelled and have been corrected to `acquire_aborted_pd_requests` and `acquire_aborted_prefiller_requests` respectively. This improves code readability and correctness. Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-12 17:04:38 +08:00
shaopeng-666	592661e787	[Doc] EPD doc and load-balance proxy example (#6221 ) Add EPD doc and load-balance proxy example - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2026-03-12 16:17:17 +08:00
pu-zhe	5df450bca4	[Feat] [310p] Support w8a8sc quantization method (#7075 ) ### What this PR does / why we need it? New Quantization Method: Introduced support for the W8A8SC static linear quantization scheme specifically for 310P hardware, enabling more efficient model compression. Refactored the save_sharded_state_310.py to avoid multi-process issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8SC quant E2E test. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-10 16:13:20 +08:00
ZRJ026	a398fa6a0b	[Bugfix]: correct streaming content-type in load balance proxy server (#6985 ) Set proper 'text/event-stream; charset=utf-8' media type for streaming requests instead of hardcoded 'application/json' ### What this PR does / why we need it? This PR fixes an issue in the disaggregated prefill proxy server where streaming requests (`"stream": true`) were always returned with a hardcoded `Content-Type: application/json`, even when the backend vLLM servers correctly returned Server-Sent Events (SSE) with `Content-Type: text/event-stream; charset=utf-8`. Specifically, the proxy used `StreamingResponse` with a fixed `media_type` of `application/json`, which caused FastAPI to override the response headers and break proper SSE semantics. As a result, clients (e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not reliably receive token-by-token streaming output. In addition, this incorrect response type causes compatibility issues with benchmarking and load-testing tools such as EvalScope. When streaming is enabled, these tools expect SSE-formatted responses to correctly parse token usage information. With the incorrect `application/json` content type, EvalScope fails to parse the response and reports errors similar to:`2025-12-15 09:27:56 - evalscope - ERROR: Failed to parse usage from response: list index out of range. Response: []` This PR updates the proxy to: - Detect whether the incoming request is a streaming request (`stream=true`) - Use `text/event-stream; charset=utf-8` for streaming responses - Preserve `application/json` for non-streaming responses This aligns the proxy behavior with native vLLM prefill/decoder servers and the OpenAI-compatible streaming API contract. Fixes incorrect streaming response headers that prevented proper real-time token delivery. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? 
This change was tested manually using a disaggregated prefill + decode setup with the proxy server. ### Test Steps 1. Start prefiller and decoder vLLM servers: ```bash vllm serve --host 0.0.0.0 --port 8001 ... vllm serve --host 0.0.0.0 --port 8002 ... ``` 2. Start the proxy server: ```bash python load_balance_proxy_server_example.py \ --host 127.0.0.1 --port 8000 \ --prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \ --decoder-hosts 127.0.0.1 --decoder-ports 8002 ``` 3. Send a streaming completion request through the proxy: ```bash curl -i -X POST http://127.0.0.1:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "test", "prompt": "hello", "max_tokens": 3, "stream": true }' ``` 4. Verify the following: - The response header is Content-Type: text/event-stream; charset=utf-8 - Tokens are streamed incrementally as SSE data: events - Non-streaming requests still return application/json No automated tests were added because this change affects an example proxy server and is limited to HTTP response headers. The behavior is directly verifiable using standard SSE-compatible clients. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: zrj026 <zhangrunjiang026@gmail.com> Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>	2026-03-10 10:11:35 +08:00
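The content-type selection described in the commit above can be sketched without any web framework. This is a minimal sketch, assuming the proxy has the raw JSON request body in hand; the function and constant names are hypothetical, and only the two media type strings come from the commit message.

```python
import json

SSE_MEDIA_TYPE = "text/event-stream; charset=utf-8"
JSON_MEDIA_TYPE = "application/json"


def select_media_type(raw_body: bytes) -> str:
    """Pick the response media type from the request's `stream` flag.

    Hypothetical helper; the real proxy may parse the request differently.
    """
    try:
        payload = json.loads(raw_body)
    except (ValueError, UnicodeDecodeError):
        # Unparseable body: fall back to the non-streaming default.
        return JSON_MEDIA_TYPE
    if isinstance(payload, dict) and payload.get("stream") is True:
        return SSE_MEDIA_TYPE
    return JSON_MEDIA_TYPE
```

The returned string would then be passed as the `media_type` of the streaming response instead of a hardcoded value.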
NJX	9b30d4e774	[Doc][Misc] Add metrics usage documentation and example (#6962 ) ## What this PR does / why we need it? This PR addresses issue #5027 where users find that `output.metrics` returns `None` when using the vLLM offline inference API. Root Cause: vLLM disables log stats by default (`disable_log_stats=True`), which causes `output.metrics` to be `None`. Changes: 1. Added a NOTE comment in `examples/offline_inference_npu.py` explaining how to enable metrics 2. Created a new example `examples/offline_inference_metrics.py` demonstrating how to access request-level metrics (`first_token_time`, `finished_time`, etc.) by setting `disable_log_stats=False` ## Does this PR introduce _any_ user-facing change? Yes - adds documentation and example code to help users understand how to access output metrics. ## How was this patch tested? - Documentation/example change only - Verified example code follows the same patterns as existing examples Closes #5027 - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: NJX-njx <3771829673@qq.com>	2026-03-10 10:09:50 +08:00
pu-zhe	5899438a86	[Feat][310p] 310P support w8a8s quantization and saving w8a8sc state (#6878 ) ### What this PR does / why we need it? This pull request introduces significant enhancements for 310P device support, primarily by enabling W8A8S quantization and facilitating the saving of models with W8A8SC state outputs. It provides an example script for saving sharded and compressed model states, implements the core W8A8S quantization method, and integrates metadata generation within the 310P worker to accurately describe the quantization types of saved parameters. These changes aim to improve efficiency and compatibility for quantized models on 310P hardware. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? W8A8S accuracy test and W8A8SC states save. <img width="886" height="184" alt="image" src="https://github.com/user-attachments/assets/e9bcac54-1f69-4d3a-a5b8-221a147ef99d" /> - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-03-02 20:09:15 +08:00
SILONG ZENG	e2237819a9	[CI]Fixed the spell check function in `typos.toml` (#6753 ) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-14 11:57:26 +08:00
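The difference between the character class and an alternation group is easy to check in isolation. The sketch below uses Python's `re` module purely as an illustration; `typos` uses its own regex engine, and unanchored search semantics are assumed here. It also assumes the intended replacement is the plain group `(UE4M3|ue4m3)`.

```python
import re

# Character class: matches ANY single one of U, E, 4, M, 3, |, u, e, m.
BROKEN = re.compile(r".[UE4M3|ue4m3].")
# Alternation group: matches the literal token UE4M3 or ue4m3.
FIXED = re.compile(r".(UE4M3|ue4m3).")


def is_ignored(pattern: re.Pattern, identifier: str) -> bool:
    """Return True if the ignore pattern matches somewhere in the identifier."""
    return pattern.search(identifier) is not None
```

With the character class, a misspelling such as `recieve` is silently skipped because it contains an `e` between two other characters; with the group, only identifiers that actually contain the `ue4m3` token are skipped.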
liziyu	e5f0e0eaf7	[P/D] layerwise connector support recompute scheduler (#5900 ) ### What this PR does / why we need it? layerwise connector support recompute scheduler. NOTE： Triggering recompute will invoke the tokenizer again, which may lead to precision fluctuations. [RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support https://github.com/vllm-project/vllm-ascend/issues/4842 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-02-07 15:24:42 +08:00
wangyu	c63b7a1188	[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301 ) ### What this PR does / why we need it? This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test by running ci - vLLM version: release/v0.12.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	2026-02-06 17:30:17 +08:00
LHXuuu	45a573cff1	[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W4A8 dynamic weight. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: menogrey <1299267905@qq.com>	2026-02-02 16:39:32 +08:00
Li Wang	43be004379	[Lint] Fix mypy issue to make CI happy (#6272 ) ### What this PR does / why we need it? The variables `self.prefiller_heap` and `self.decoder_heap` are used as `List[tuple[float, int, ServerState]]` but defined as `List[tuple[int, int, ServerState]]`, which causes mypy to fail, see https://github.com/vllm-project/vllm-ascend/actions/runs/21351411010/job/61448739554?pr=6265 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 17:54:00 +08:00
yuxinshan	7d119df2a9	[Feat] proxy delay to remove instances (#5934 ) ### What this PR does / why we need it? The proxy should remove instances only when it is not processing requests. Sometimes, however, faulty nodes must be isolated while a large number of requests are coming in. This PR therefore isolates faulty nodes by lowering their priority and deletes them once the proxy is no longer processing requests. ### Does this PR introduce _any_ user-facing change? For `examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`, when using `/instances/remove` API to delete the node from the proxy server: ```txt curl -X POST http://localhost:9000/instances/remove \ -H "Content-Type: application/json" \ -d '{ "type": "decode", "instances": "127.0.0.1:8201" }' ``` There are 2 situations: * 【New】When the proxy is processing requests, isolate the nodes and remove them when the proxy is free. ```txt {"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']} ``` * When the proxy is free, remove the nodes directly. ```txt {"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']} ``` ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: yuxinshan <syx_ctyg@126.com>	2026-01-26 16:29:45 +08:00
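The two removal situations in the commit above can be modelled with a small state holder. This is a simplified sketch, not the proxy's implementation: the class `InstancePool` and its methods are hypothetical, and "lowering priority" is reduced here to "skip the isolated node while another node is available".

```python
from dataclasses import dataclass, field


@dataclass
class InstancePool:
    """Hypothetical pool of instances with deferred removal."""

    active: list[str] = field(default_factory=list)
    isolated: set[str] = field(default_factory=set)
    in_flight: int = 0  # number of requests currently being processed

    def remove(self, instance: str) -> str:
        if instance not in self.active:
            return "unknown"
        if self.in_flight > 0:
            # Busy proxy: isolate now, delete once the proxy is idle.
            self.isolated.add(instance)
            return "isolated"
        self.active.remove(instance)
        return "removed"

    def routable(self) -> list[str]:
        # Prefer non-isolated instances; fall back to all if none remain.
        healthy = [i for i in self.active if i not in self.isolated]
        return healthy or list(self.active)

    def request_started(self) -> None:
        self.in_flight += 1

    def request_finished(self) -> None:
        self.in_flight -= 1
        if self.in_flight == 0:
            # Idle again: flush every pending removal.
            for instance in self.isolated:
                self.active.remove(instance)
            self.isolated.clear()
```

While a request is in flight, `remove("127.0.0.1:8201")` leaves the node in `active` but out of `routable()`, which mirrors the "isolated and waiting to be removed" response shown in the commit message.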
Li Wang	c26ad78f86	[CI][lint] Add rule `codespell` back (#6236 ) ### What this PR does / why we need it? After codespell had been removed for a while, we discovered that typos had problems correctly recognizing certain misspelled words, so I suggested adding it back. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 14:12:33 +08:00
SILONG ZENG	4811ba62e0	[Lint]Style: reformat markdown files via markdownlint (#5884 ) ### What this PR does / why we need it? reformat markdown files via markdownlint - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-15 09:06:01 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604 This PR refactors vllm_ascend/distributed, moving all kv_transfer-related code into a dedicated folder, as has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
wangxiyuan	e5c46bf169	[CI] Fix lint CI (#5880 ) Quick fix for lint CI - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-14 11:23:38 +08:00
LHXuuu	0415e694cd	[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W8A8 Int8 dynamic weight. 2. Specify W4A16 quantization configuration. Co-authored-by: menogrey 1299267905@qq.com Co-authored-by: kunpengW-code 1289706727@qq.com ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: menogrey <1299267905@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-01-14 09:17:26 +08:00
SILONG ZENG	78d5ce3e01	[Lint]Style: Convert `example` to `ruff format` (#5863 ) ### What this PR does / why we need it? This PR fixes linting issues in the `example/` to align with the project's Ruff configuration. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-13 20:46:50 +08:00
wangxiyuan	492173cf89	[Misc] Cleanup useless print and logger (#5220 ) 1. Remove useless print 2. use vLLM logger 3. change useless INFO to DEBUG level - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-22 11:28:26 +08:00
zxr2333	073a3a6e6c	[Doc][P/D] Fix MooncakeConnector's name (#5172 ) ### What this PR does / why we need it? vLLM community has integrated their MooncakeConnector. The original scripts will now find this MooncakeConnector instead of the one from vLLM-Ascend. All scripts that involve using the MooncakeConnector need to be modified to another name. ### Does this PR introduce _any_ user-facing change? Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector. ### How was this patch tested? By CI. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2025-12-18 22:29:19 +08:00
yuxinshan	b0376abd4c	[feat] proxy support elastic scaling (#5063 ) [RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool: https://github.com/vllm-project/vllm-ascend/issues/3380 ### What this PR does / why we need it? Support elastic scaling for P/D instances based on mooncake connector deployment. Support API routes * `/instances/add`: add prefill nodes or decode nodes to the list. * `/instances/remove`: remove prefill nodes or decode nodes from the list. Support functions * Support adding prefill nodes or decode nodes. - If a prefill or decode server is deployed after the proxy, the server can use the `/instances/add` API to join the proxy server. The prefill server or decode server sends a signal to the proxy server, and the proxy server will check the status of the node until the node is available. * Support removing prefill nodes or decode nodes: - Support using `/instances/remove` API to delete the node from the proxy server. ### Does this PR introduce _any_ user-facing change? For `examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`: Add 2 params. When adding nodes to the proxy, the proxy will wait for the nodes to start, retrying up to a configured number of times. 
\| name \| type \| default \| help \| \| ----- \| ---- \| ---- \| ---- \| \| max-waiting-retries \| int \| 3 \| Maximum number of retries for waiting nodes to be started \| \| waiting-retry-interval \| float \| 10 \| Check interval (seconds) for waiting nodes to be started \| For example: ```shell python load_balance_proxy_server_example.py \ --host 0.0.0.0 --port 9000 \ --prefiller-hosts 127.0.0.1 127.0.0.1 \ --prefiller-ports 8100 8101 \ --decoder-hosts 127.0.0.1 127.0.0.1 \ --decoder-ports 8200 8201 \ --max-waiting-retries 3 \ --waiting-retry-interval 10 ``` Add 2 API routes * Add instances: `instances/add` For example, add 2 prefiller instances: ```shell curl -X POST http://localhost:9000/instances/add \ -H "Content-Type: application/json" \ -d '{ "type": "prefill", "instances": ["127.0.0.1:8102", "127.0.0.1:8103"] }' ``` Response: ```shell {"message": "add prefill instances: ['127.0.0.1:8102', '127.0.0.1:8103'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102', '127.0.0.1:8103'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']} ``` If the node '127.0.0.1:8103' has not been started: ```shell {"message": "add prefill instances: ['127.0.0.1:8102']. Instances ['127.0.0.1:8103'] are waiting to be added.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101', '127.0.0.1:8102'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']} ``` * Remove instances: `instances/remove` For example, remove 1 decoder instance: ```shell curl -X POST http://localhost:9000/instances/remove \ -H "Content-Type: application/json" \ -d '{ "type": "decode", "instances": "127.0.0.1:8201" }' ``` Response: ```shell {"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']} ``` ### How was this patch tested? 
Run proxy and using `/instances/add` API to add nodes and `/instances/remove` API to remove nodes * vLLM version: v0.11.0.rc3 * vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0.rc3 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2025-12-18 14:29:53 +08:00
zhangxinyuehfad	18d2395f5e	[Bugfix] fix fastapi version (#5047 ) ### What this PR does / why we need it? fix fastapi version == 0.123.10(<0.124.0) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-16 15:58:27 +08:00
Li Wang	2497bbbaf6	[Misc] Update pooling example (#5002 ) ### What this PR does / why we need it? Since the param `task` has been deprecated, we should use the latest unified standard parameters for pooling models, which should be clearer - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-15 08:36:19 +08:00
wangxiyuan	42ceaf08a1	add release note for 0.12.0 (#4995 ) Add release note for v0.12.0rc1 Update deepseek3.2 tutorial doc - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-13 22:09:59 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
wangxiaoteng888	a77045f355	[P/D][main]Offline the llmdatadist connector related parts of the code and files. (#4780 ) ### What this PR does / why we need it? As support for the mooncake connector is now available, the llmdatadist connector is no longer being maintained, so the llmdatadist-related files need to be retired. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-12-09 22:36:43 +08:00
linfeng-yuan	56f01820e8	[Docs]fix the configuration conflicts in documentation (#4823 ) ### What this PR does / why we need it? Fix configuration error in our documentations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? NA. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-09 15:37:38 +08:00
lhp-deep	b230e7e987	[MOE]move weight transpose to wakeup for RL scenarios (#4626 ) ### What this PR does / why we need it? In reinforcement learning scenarios, the current inference applies a transpose operation to the weights. For a cleaner architecture, the weight transpose module was moved to wakeup. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: lhp-deep <liuhaopeng1@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-08 20:34:52 +08:00
wangxiyuan	0b65ac6c4b	remove useless patch (#4699 ) patach_config is useless now. Let's remove it - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-08 11:02:42 +08:00
Li Wang	283bc5c7ba	[Nightly] Optimize nightly CI (#4509 ) ### What this PR does / why we need it? 1. Optimize multi-node waiting logic 2. Remove the `tee` pipeline for logs, which will lead to hang issue ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-04 22:31:07 +08:00
wangxiyuan	cb33b09179	[Doc]clean up ascend scheduler config from doc (#4612 ) clean up ascend scheduler config from doc - vLLM version: v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-02 14:22:56 +08:00
Wang Kunpeng	a9c4b8604a	[main][bugfix] bugfix for qwen3 moe quantization (#4599 ) ### What this PR does / why we need it? Fix the issue where the qwen3 moe service cannot be started due to upgrading the vllm version Error info: AttributeError: 'AscendFusedMoE' object has no attribute 'use dp chunking' ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.11.2 --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-12-01 23:48:57 +08:00
Mengqing Cao	517fd9272d	Revert "drop ascend scheduler" (#4580 ) Reverts vllm-project/vllm-ascend#4498 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-11-29 22:20:48 +08:00
wangxiyuan	f10acddb78	drop ascend scheduler (#4498 ) Ascend scheduler was added for non chunk prefill case before, since that the npu ops didn't work well with chunked prefill. Now the ops with chunked prefill work better, it's time to remove the ascend scheduler to use vLLM default scheduler. - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 16:18:34 +08:00
LHXuuu	bdc66972db	[Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig in vllm. 2. Support CompressedTensorsW8A8 static weight. - weight: per-channel, int8, symmetric; activation: per-tensor, int8, symmetric. 4. Support CompressedTensorsW8A8Dynamic weight. - weight: per-channel, int8, symmetric; activation: per-token, int8, symmetric, dynamic. 5. Modify the override_quantization_method in AscendQuantConfig. Co-authored-by: taoqun110 taoqun@huawei.com Co-authored-by: chenxi-hh chen464822955@163.com - vLLM version: v0.11.2 --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: chenxi-hh <chen464822955@163.com> Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com> Co-authored-by: chenxi-hh <chen464822955@163.com> Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>	2025-11-28 14:09:39 +08:00
zzzzwwjj	136ea9ff56	[refact] unified soc_version code (#4359 ) ### What this PR does / why we need it? Currently, there are two paths to judge the chip type in code, `get_ascend_soc_version` use `get_soc_version` api in torch_npu, and `is_310p` `use _build_info.__soc_version__`, which generate when install. We need to unify the two paths. We need to unify these codes based on the following points: 1. We need to ensure consistency in chip type judgment between compiling and running states; 2. In compiling state, we need chip type to complete op's compilation, but in running state, we only need device type(910B/910_93/310P/910_95/etc) to make code branch judgement; 3. In compiling state, torch_npu may not have been installed yet, so we can't use torch_npu's api. Based on the above points, we have made the following changes: 1. When user set env `SOC_VERSION`, use it; when not set, query soc_version by `npu-smi`; 2. generate device_type based on soc_version when compiling, and write `__device_type__` instead of `__soc_version__` in `_build_info.py`; 3. In running state, use `__device_type__` to judge code branch. ### Does this PR introduce _any_ user-facing change? When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default, we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in the list `soc_to_device` in `setup.py`. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-11-26 14:28:55 +08:00
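The resolution order described in the commit above (use `SOC_VERSION` when set, otherwise query `npu-smi`, then map the result to a device type) can be sketched as follows. The mapping entries, the `ascend310p3` key, and the function signature are assumptions for illustration; the real `soc_to_device` table lives in `setup.py` and is not reproduced here.

```python
import os
from typing import Callable, Mapping, Optional

# Hypothetical excerpt of a soc_version -> device_type table.
SOC_TO_DEVICE = {
    "ascend910b1": "910B",
    "ascend310p3": "310P",
}


def resolve_device_type(
    env: Optional[Mapping[str, str]] = None,
    query_npu_smi: Optional[Callable[[], str]] = None,
) -> str:
    """Resolve the device type at build time.

    `query_npu_smi` stands in for whatever parses `npu-smi` output.
    """
    env = os.environ if env is None else env
    soc_version = env.get("SOC_VERSION")
    if not soc_version:
        if query_npu_smi is None:
            raise RuntimeError("SOC_VERSION is unset and no npu-smi query was given")
        soc_version = query_npu_smi()
    key = soc_version.lower()
    if key not in SOC_TO_DEVICE:
        raise ValueError(f"unsupported SOC_VERSION: {soc_version}")
    return SOC_TO_DEVICE[key]
```

Passing the environment and the query function as parameters keeps the sketch testable without an NPU; the value returned here corresponds to what the commit writes as `__device_type__` in `_build_info.py`.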
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377 ) There is a lot of hack code for v0.11.0, which makes the code hard to upgrade to a newer vLLM version. Since v0.11.0 will release soon. Let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
whx	a5554b6661	[Feat][Doc] Add a load_balance_dp_proxy in examples and external dp doc. (#4265 ) ### What this PR does / why we need it? This PR adds a load-balance dp proxy server which can be used in external DP scenario without Disaggregated-Prefill enabled. What's more, add a doc of external dp and load-balance dp proxy server. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? See the new doc. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-11-21 16:33:23 +08:00
liziyu	e98543267a	[bugfix] fix proxy when host ip using domain name (#4243 ) ### What this PR does / why we need it? fix proxy when host ip using domain name - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-11-18 16:30:51 +08:00
liziyu	a30261f779	[P/D] pd proxy support ipv6 (#4161 ) ### What this PR does / why we need it? pd proxy support ipv6, mooncake connector check whether the IPv6 address is used and notify the user. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-11-18 11:01:13 +08:00
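One detail that IPv6 and domain-name support both touch is how a host and port are joined into an endpoint string: IPv6 literals need brackets, while IPv4 addresses and hostnames do not. The helper below is a minimal sketch of that rule; `format_endpoint` is a hypothetical name and not the proxy's actual function.

```python
import ipaddress


def format_endpoint(host: str, port: int) -> str:
    """Join host and port, bracketing IPv6 literals as in URLs."""
    bare = host.strip("[]")  # accept both "::1" and "[::1]"
    try:
        is_ipv6 = ipaddress.ip_address(bare).version == 6
    except ValueError:
        is_ipv6 = False  # hostname or domain name, used as-is
    return f"[{bare}]:{port}" if is_ipv6 else f"{bare}:{port}"
```

Treating anything that fails to parse as an IP literal as a hostname keeps domain names working without a DNS lookup at formatting time.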
wangxiyuan	f811a24bf0	Remove VLLM_USE_V1 (#4086 ) Drop VLLM_USE_V1 usage. This env has been removed from vLLM already. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-11 15:43:39 +08:00
Canlin Guo	de49fb3deb	[Feature][Build] Upgrade the minimum version to 3.10 (#3926 ) ### What this PR does / why we need it? Closes #3728, #3657. The main branch is now aligned with the vllm `releases/v0.11.1` branch, which no longer supports `Python 3.9`. Check it [here](https://github.com/vllm-project/vllm/blob/releases/v0.11.1/pyproject.toml). ### Does this PR introduce _any_ user-facing change? The newest version of vllm-ascend doesn't support Python 3.9. ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-11-10 11:50:12 +08:00
zxr2333	1d81a289d0	[P/D][BugFix]Fix proxy format processing errors & Layerwise connector performance optimization (#4043 ) ### What this PR does / why we need it? 1. Fix proxy format processing errors. 2. Layer-wise connector performance optimization. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-11-08 18:44:06 +08:00
zxr2333	b206e831e9	[P/D]Make kv-transfer env variable take effect & Fix load-balance proxy (#3981 ) ### What this PR does / why we need it? Make kv-transfer env variable take effect and Fix load-balance proxy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2025-11-06 12:02:47 +08:00
pz1116	e0c23cb011	[docs] Add kv pool developer guide (#3752 ) ### What this PR does / why we need it? Add kv pool developer guide ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? vLLM version: v0.11.0rc3 vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com> Signed-off-by: pz1116 <zpbzpb123123@gmail.com>	2025-11-05 18:03:36 +08:00
wangxiyuan	cc2cd42ad3	Upgrade CANN to 8.3.rc1 (#3945 ) ### What this PR does / why we need it? This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version check logic. TODO: we notice that UT runs failed with CANN 8.3 image. So the base image for UT is still 8.2. We'll fix it later. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-03 20:21:07 +08:00
wangxiyuan	fcc9a0eaeb	Update torch-npu version to 2.7.1 (#3896 ) ### What this PR does / why we need it? Upgrade torch-npu to the official release version 2.7.1 - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-10-31 17:16:31 +08:00
wangxiaoteng888	a2b325ee00	[bugfix]cancel tokenize for layerwise_proxy (#3914 ) ### What this PR does / why we need it? cancel tokenize for layerwise_proxy ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? by ci - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-10-30 23:54:46 +08:00


132 Commits