Commit Graph

51 Commits

huangxialu
875a86cbe9 ut: add example and e2e test for sleepmode in external_launcher (#2152)
### What this PR does / why we need it?
This PR adds an e2e test case to make sure that sleep mode in external_launcher
works correctly.
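For context, a minimal sketch of the sleep-mode calls such a test exercises (assuming vLLM's sleep-mode API; in the real test this runs under the external_launcher setup, and the model name here is only a placeholder):

```python
from vllm import LLM, SamplingParams

# Hypothetical sketch; model name and sleep level are illustrative.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)

llm.sleep(level=1)   # release the kv cache and offload weights to host memory
llm.wake_up()        # restore the engine before generating again

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```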

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
not involved


- vLLM version: v0.10.0
- vLLM main:
74333ae2f6

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-06 11:11:53 +08:00
wangxiyuan
36e450eb0f [Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097)
We recently added the disaggregated_prefill and ascend_forward_context
features in
ba3dfbd59e
and
df0ec55162.
This PR fixes some nits introduced by them to make the code clearer:
1. Drop the `current_platform` usage. It can lead to obscure circular import
errors in some cases.
2. Update the `set_ascend_forward_context` function to make the logic clearer,
for example by removing V0 support from this function.
3. Remove the unused `self.local_rank_across_dp` in the worker.
4. Remove `soc_info.py` and use `get_ascend_soc_version` instead.
 

- vLLM version: v0.10.0
- vLLM main:
02f82fe438

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-05 08:39:02 +08:00
hucong
e38fab011d [Doc][PD] Restore the default configuration items in examples/disaggregate_prefill_v1/README.md (#2165)
### What this PR does / why we need it?
- On the D node, the max-num-batched-tokens parameter can be set to a
smaller value, since the D node processes at most max-num-seqs batches
concurrently. As the profile_run only needs to handle max-num-seqs
sequences at a time, we can safely set max-num-batched-tokens equal to
max-num-seqs. This optimization helps reduce activation memory
consumption (see the sketch below).
- Restore the default configuration items for PD separation.
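As an illustration only, the recommendation above expressed as engine arguments (a hedged sketch; the model and sizes are placeholders, and the actual launch commands are in the README):

```python
from vllm import LLM

# Decode (D) node sketch: profile_run only needs to cover max_num_seqs sequences,
# so max_num_batched_tokens can be set equal to max_num_seqs to cut activation memory.
decode_llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_num_seqs=16,
    max_num_batched_tokens=16,
)
```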
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main:
61dcc280fa

Signed-off-by: underfituu <hzhucong@163.com>
2025-08-04 20:30:53 +08:00
Pleaplusone
4b3a210c33 Implementation of simple load balance routing proxy server (#1953) (#2124)
### What this PR does / why we need it?
This PR is a cherry-pick of
https://github.com/vllm-project/vllm-ascend/pull/1953
from v0.9.1.

This PR introduces a new load-balancing proxy server example for
disaggregated PD. It supports a simple token- and kv_cache-aware
load-balancing routing strategy, compared with the original round-robin
toy_proxy.
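A small, self-contained sketch of the kind of token-aware routing decision the proxy makes (the data structures and names here are hypothetical, not the actual proxy code):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    url: str
    inflight_tokens: int = 0  # tokens currently being processed by this instance

def pick_least_loaded(instances: list[Instance]) -> Instance:
    # Token-aware routing: send the request to the least-loaded instance
    # instead of cycling round-robin.
    return min(instances, key=lambda inst: inst.inflight_tokens)

prefillers = [Instance("http://p0:8100"), Instance("http://p1:8100")]

def route(prompt_tokens: int) -> str:
    target = pick_least_loaded(prefillers)
    target.inflight_tokens += prompt_tokens  # decremented again when the request completes
    return target.url

print(route(512))  # -> http://p0:8100
print(route(128))  # -> http://p1:8100
```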

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested on real workloads and with unit tests.

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-04 10:35:53 +08:00
xleoken
bea3d5bbb4 [Bug] Fix run bug in run_dp_server.sh (#2139)
### What this PR does / why we need it?

For the `Qwen2.5-0.5B-Instruct` model:
- the model's total number of attention heads (14) must be divisible by the
tensor parallel size, so TP is reduced from 4 to 2;
- the model does not support enable-expert-parallel (see the sketch below).
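Expressed as engine arguments rather than the shell script, the corrected settings look roughly like this (a sketch only; the real fix is in run_dp_server.sh):

```python
from vllm import LLM

# Qwen2.5-0.5B-Instruct has 14 attention heads, so the tensor parallel size must
# divide 14: use 2 instead of 4, and leave expert parallelism disabled since the
# model has no MoE experts.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tensor_parallel_size=2,
    enable_expert_parallel=False,
)
```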

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: xleoken <xleoken@163.com>
2025-08-02 16:52:12 +08:00
yangqinghao-cmss
47f688a2f0 Change retrieving remote files to local retrieval. (#2141)
### What this PR does / why we need it?
Using vLLM's AudioAsset class to retrieve remote audio files
(https://vllm-public-assets.s3.us-west-2.amazonaws.com) is not feasible in
some cases; it is recommended to switch to local retrieval.
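A hedged sketch of the local-retrieval approach (the file path, model, and prompt are placeholders, and the `(audio, sample_rate)` tuple format for vLLM audio inputs is an assumption):

```python
import librosa
from vllm import LLM, SamplingParams

# Load the audio from a local file instead of fetching it from the remote S3 bucket.
audio, sample_rate = librosa.load("mary_had_lamb.ogg", sr=None)

llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct", max_model_len=4096)
outputs = llm.generate(
    {
        # The real example builds the model's audio chat prompt; plain text keeps the sketch short.
        "prompt": "What is recited in the audio?",
        "multi_modal_data": {"audio": (audio, sample_rate)},
    },
    SamplingParams(max_tokens=64),
)
print("generated_text:", outputs[0].outputs[0].text)
```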

### How was this patch tested?
vllm: main
vllm-ascend: main
results:
```bash
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.62s/it]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.01s/it, est. speed input: 79.03 toks/s, output: 6.31 toks/s]
generated_text: The sport referenced is soccer, and the nursery rhyme is 'Hey Diddle Diddle'.
```

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-02 16:51:22 +08:00
22dimensions
8cf97d8310 [Misc] Add extra checking to torchair_graph_config. (#1939)
### What this PR does / why we need it?

cherry-pick #1675  to main
This PR adds validation checking to torchair_graph_config for better
reliability.
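Purely as an illustration of what such checks look like (a hypothetical sketch; the option names are drawn from torchair_graph_config, but the real validation lives in vllm-ascend's config handling):

```python
# Hypothetical illustration of config validation, not the actual vllm-ascend code.
def validate_torchair_graph_config(cfg: dict) -> None:
    if not isinstance(cfg.get("enabled", False), bool):
        raise TypeError("torchair_graph_config.enabled must be a bool")

    batch_sizes = cfg.get("graph_batch_sizes", [])
    if not isinstance(batch_sizes, list) or not all(
        isinstance(b, int) and b > 0 for b in batch_sizes
    ):
        raise ValueError("graph_batch_sizes must be a list of positive ints")

    if cfg.get("graph_batch_sizes_init", False) and batch_sizes:
        raise ValueError(
            "graph_batch_sizes_init cannot be combined with explicit graph_batch_sizes"
        )

validate_torchair_graph_config({"enabled": True, "graph_batch_sizes": [1, 4, 8]})
```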

Co-authored-by: whx-sjtu <2952154980@qq.com>

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:24:11 +08:00
yangqinghao-cmss
99fa0ac882 [BugFix] update the kv transfer config (#2121)
### What this PR does / why we need it?
The functions KVTransferConfig.from_cli and AscendHcclConnector are
missing in the latest vLLM version. To resolve this, I propose modifying
the kv_connector to use LLMDataDistCMgrConnector, which depends on [PR
#2079](https://github.com/vllm-project/vllm-ascend/pull/2079)
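A hedged sketch of what the decode-side configuration looks like after the change (the constructor fields are assumptions based on vLLM's KVTransferConfig; connector registration details and the prefill side are omitted):

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Sketch only: replace the removed AscendHcclConnector / KVTransferConfig.from_cli
# usage with a directly constructed config pointing at LLMDataDistCMgrConnector.
ktc = KVTransferConfig(
    kv_connector="LLMDataDistCMgrConnector",
    kv_role="kv_consumer",  # "kv_producer" on the prefill node
)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", kv_transfer_config=ktc)
```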

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
vllm: main
vllm-ascend: main
results:
```bash
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 374.27it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 66.06it/s, est. speed input: 449.08 toks/s, output: 66.51 toks/s]
Prefill node is finished.
INFO 07-31 09:18:30 [model_runner_v1.py:2282] Graph capturing finished in 36 secs, took 0.21 GiB
INFO 07-31 09:18:30 [core.py:201] init engine (profile, create kv cache, warmup model) took 52.49 seconds
INFO 07-31 09:18:30 [factory.py:74] Creating v1 connector with name: LLMDataDistCMgrConnector and engine_id: 28c8ced8-575c-4f87-840a-48d04d0edf7e
INFO 07-31 09:18:30 [platform.py:157] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 07-31 09:18:30 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 76
INFO 07-31 09:18:30 [utils.py:359] No adjustment needed for ACL graph batch sizes: Qwen2ForCausalLM model (layers: 24) with 67 sizes
INFO 07-31 09:18:30 [llm.py:293] Supported_tasks: ['generate']
Waiting for prefill node to finish...
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 709.70it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.23it/s, est. speed input: 109.70 toks/s, output: 260.01 toks/s]
Prompt: 'Hello, how are you today?', Generated text: " I'm a computer program, so I don't have feelings. But I can"
Prompt: 'Hi, what is your name?', Generated text: ' I am a computer programmer. I have a question about the programming language I am'
Prompt: 'Tell me a very long story.', Generated text: ' I want to read it. I want to read it. I want to read'
Prompt: 'what is your favourite book?', Generated text: " I'm sorry, but as an AI language model, I don't have personal"
Cleanup prefill resources
All process done
```

- vLLM version: v0.10.0
- vLLM main:
9cb497bfa3

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-01 08:56:55 +08:00
Ronald1995
cb0a303080 ut:add e2e test for external launcher (#2091)
### What this PR does / why we need it?
This PR adds an e2e test case to make sure that initializing an LLM via the
external_launcher method works correctly.
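For context, a minimal sketch of the external_launcher initialization path that the test covers (run under torchrun; the model and world size are placeholders):

```python
# Launch with: torchrun --nproc-per-node=2 this_script.py
from vllm import LLM, SamplingParams

# With the external_launcher backend, vLLM reuses the process group created by
# torchrun instead of spawning its own worker processes.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```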

### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:37:42 +08:00
Li Wang
f60bb474f9 [CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065)
### What this PR does / why we need it?
Currently our workflow run time is about 3 hours in total, which
seriously affects the developer experience, so optimization is urgent. After
this PR, the running time of the full CI is expected to shorten to about
1h40min.

- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
a2480251ec

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-29 18:59:05 +08:00
wangxiyuan
0190b68f51 [Misc]Remove PD v0 code (#2047)
Clean up the V0 disaggregated prefill code for the V0 Engine.

part of https://github.com/vllm-project/vllm-ascend/issues/1620

TODO: enable v1 e2e test.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 19:09:22 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

This PR refactors forward_context and model_runner_v1: it adds some context
that is necessary for model inference into forward_context, and refactors the
dummy_run logic to make it more reasonable.
Some details for this PR:

Add `ascend_forward_context`;
Update the mc2_v2 op and support the `active_mask` param;
Update scripts in the examples dir;
Refactor the `dummy_run` logic;
Add soc_version for A2 and A3.

### Does this PR introduce _any_ user-facing change?

No user-facing change.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopts `LLMDataDist` for kv cache registration and a `pull_blocks`-style
disaggregated prefill implementation. The interface implementation
mainly follows the design of the NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be tested with the following steps:
- Generate the rank table for all machines.
- Execute `toy_proxy.py` to launch the disaggregated prefill proxy server,
specifying the prefill IP/port and the decode IP/port.
- Run the prefill server and the decode server.
- Send requests to the disaggregated prefill proxy.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Shanshan Shen
a66ef39bb6 [Misc][V0 Deprecation] Remove Redundant Offline Distributed Inference Example (#1899)
### What this PR does / why we need it?
The file `offline_distributed_inference_npu.py` is the same as
`offline_inference_npu_tp2.py`, thus we delete one of them.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8188196a1c

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-21 12:01:45 +08:00
Mengqing Cao
8cfd257992 [Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there are no
relevant scenarios for ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the experts need to be sliced.
This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We will no longer maintain ETP/EP in vllm-ascend; the TP/EP in vLLM is used
instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-21 09:08:04 +08:00
leo-pony
2ee90461d0 Fix e2e data parallel test: add resource release code (#1881)
### What this PR does / why we need it?
Fix the e2e data parallel test: add resource-release code and give the engines
more time to pause their processing loops before exiting.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
5895afd780

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-19 11:39:48 +08:00
Shanshan Shen
8a91e6e59c [Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
ca4eb82bcb

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 23:06:03 +08:00
Shanshan Shen
aeb5aa8b88 [Misc][V0 Deprecation] Add __main__ guard to all offline examples (#1837)
### What this PR does / why we need it?
Add `__main__` guard to all offline examples.
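The pattern applied is the standard guard, e.g. (the model name is only a placeholder):

```python
from vllm import LLM, SamplingParams

def main():
    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)

# The guard keeps the example from re-running when worker processes import the
# script (e.g. with the multiprocessing spawn start method).
if __name__ == "__main__":
    main()
```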

- vLLM version: v0.9.2
- vLLM main:
76b494444f

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 14:13:30 +08:00
Shanshan Shen
f96100fad5 [Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805)
### What this PR does / why we need it?
Remove V0 related codes of test, example, platform.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:58:55 +08:00
wangxiyuan
787010a637 [Test] Remove VLLM_USE_V1 in example and tests (#1733)
V1 is enabled by default, so there is no need to set it by hand now. This PR
removes the useless setting from examples and tests.

- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 12:49:57 +08:00
Yikun Jiang
eff4b5791c Recover offline_inference_npu.py to make doctest passed (#1756)
### What this PR does / why we need it?
Rename offline_inference_npu_v1.py to offline_inference_npu.py to
recover doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
a8593237c0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:36:35 +08:00
ttanzhiqiang
60519c71bd shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395)
### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts does not perform all_reduce
in the MLP; instead it waits until shared_experts + router_experts are
completed before doing a single all_reduce. In both prefill and decode, as long
as shared_experts + router_experts share one all_reduce, there is a benefit.
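Conceptually, the change can be sketched as follows (a simplified illustration, not the actual vllm-ascend kernel path): the shared-expert and routed-expert partial outputs are summed locally and reduced once, instead of issuing a separate all_reduce inside the MLP.

```python
import torch
import torch.distributed as dist

def merged_moe_output(shared_out: torch.Tensor, routed_out: torch.Tensor) -> torch.Tensor:
    """Simplified sketch: one all_reduce over the combined output instead of two."""
    combined = shared_out + routed_out  # local partial sums from shared + routed experts
    dist.all_reduce(combined)           # single TP all_reduce (assumes an initialized group)
    return combined
```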
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00
Mengqing Cao
b1c66b211f [CI] Fix lint in CI (#1712)
### What this PR does / why we need it?
Fix lint in CI
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 10:47:18 +08:00
xleoken
3ef45d0cc2 feat: Improve the offline_inference npu v0/v1 scripts (#1669)
### What this PR does / why we need it?

Improvements:
- Keep the same file name format as v1: `offline_inference_npu_v0.py`,
`offline_inference_npu_v1.py`
- Use `VLLM_USE_V1` = 0/1 explicitly in the Python scripts
- Fix some run errors in `offline_inference_npu_v1.py`, e.g.
`deepseekv3-lite-base-latest` does not exist on ModelScope or HF.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: xleoken <xleoken@163.com>
2025-07-09 17:03:53 +08:00
Zheng Wengang
9c886d0a1f [EPLB] support deepseek eplb strategy (#1196)
### What this PR does / why we need it?

This PR implements the DeepSeek Expert Parallel Load Balancing (EPLB)
strategy to optimize expert distribution in vllm-ascend. The
implementation:
- Adapts the expert-map format to work with vllm-ascend's architecture
- Provides the DeepSeek mechanism to balance expert workload across
devices

### Does this PR introduce _any_ user-facing change?

This PR adds a new script that allows users to:
- Generate expert map configurations based on workload analysis
- Optimize expert distribution for their specific use case

### How was this patch tested?

To use this feature:
1. First collect expert heat information during model execution
2. Run the provided script to generate the expert map configuration
3. Apply the generated configuration to your vllm-ascend deployment

User example:

```bash
# expert_load_view.pt:  dumped expert heat info file
python3 examples/eplb/eplb_strategy.py --exp_name 'deepseek_demo' \
    --input_path expert_load_view.pt  --output_path examples/eplb/results/demo \
    --num_nodes 4
```

---------

Signed-off-by: ZhengWG <zwg0606@gmail.com>
2025-07-07 17:22:08 +08:00
Yikun Jiang
6d7cb14a24 Fix lint in examples/offline_embed.py (#1618)
### What this PR does / why we need it?
Fix lint

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-03 21:40:29 +08:00
xleoken
e511ddd67d [Bug] Fix wrong modescope env set order (#1611)
### What this PR does / why we need it?
The `os.environ["VLLM_USE_MODELSCOPE"] = "True"` should be placed before
module imports

If it is not, the following error occurs:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/xleoken/projects/vllm-ascend/examples/offline_embed.py", line 48, in <module>
    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1018, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 910, in create_model_config
    return ModelConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 528, in __post_init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 321, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 266, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 491, in cached_files
    raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[ERROR] 2025-07-03-15:27:10 (PID:333665, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
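The corrected ordering, sketched (the model and task come from the traceback above):

```python
import os

# Must be set before importing vllm, otherwise the model is resolved against
# huggingface.co instead of ModelScope and the lookup fails without internet access.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM  # noqa: E402

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
```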

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local.

Signed-off-by: xleoken <xleoken@163.com>
2025-07-03 18:50:53 +08:00
Li Wang
5f8241c25c [V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
This PR changes as little existing code as possible to add support for the v1
pooling task. Note that I moved `vllm.v1.worker.gpu_input_batch` down into
vllm-ascend; considering the frequent changes in upstream interfaces, it is
moved here for decoupling.
### How was this patch tested?
CI passed with newly added and existing tests, and a simple test was
first conducted locally, adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, just like
below:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
Shanshan Shen
4e2daf5ab7 [Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
Li Wang
15df8be937 [Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
Mengqing Cao
52317f92cb [DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix dp

This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added an arg
`max_num_tokens_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
liziyu
6ed3f00427 [Doc] remove environment variable VLLM_ENABLE_MC2 (#1406)
### What this PR does / why we need it?
remove unused environment variable VLLM_ENABLE_MC2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-06-24 21:18:10 +08:00
zhuo97
f5404dc650 Fix the device error when using ray as vllm-acend backend (#884)
1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
2025-06-16 21:03:16 +08:00
wangyanhui-cmss
c6e2a5fb40 [fix] fix bug in 1p1d disaggregated_prefill example (#1184)
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by running `python find_device_ips.py` and the disaggregated_prefill
example.


Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-12 19:40:58 +08:00
ttanzhiqiang
980cd81466 etp best a2 (#1101)
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek-R1 with attention
(TP8/DP2) and MoE (ETP).

Relies on:
vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
plus https://github.com/vllm-project/vllm-ascend/pull/910, [Reduce
_npu_flash_attention mask to 128x128 for memory savings]
https://github.com/vllm-project/vllm-ascend/pull/1100, and [Reduce memory
usage by splitting tokens in fused_experts].


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-11 10:40:50 +08:00
zxdukki
87ebaef4e4 [perf]: support dual-batch overlap(dbo) for deepseek (#941)
### What this PR does / why we need it?
Based on the design of dual-batch overlap proposed by the DeepSeek team, and
also on the implementation of fused MoE in the vLLM project, we implement
multi-stream (also known as dual-batch) overlap for DeepSeek + MLA on
Ascend NPU. We split the model's input batch into two microbatches and
then overlap the comp/comm ops in attention and moe layers using two
streams to improve the performance. Our approach can be easily extended
when adding dispatch/combine communications for moe layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops,
separately. In our opinion, it is beneficial for arranging the order of
executing different ops and thus avoiding the contention of
computation/communication resources.

ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)

### Does this PR introduce _any_ user-facing change?
Adds an env variable `VLLM_ASCEND_ENABLE_DBO`. Users can enable DBO by
setting `VLLM_ASCEND_ENABLE_DBO=1`.
See /examples/offline_dualbatch_overlap_npu.py for more info.

### How was this patch tested?

This patch can be tested with vllm 0.9.0 using its online service with
benchmark tests. We have decoupled the DBO functionality from vLLM, and it
should be able to run without any modification to the vLLM code (though some
modifications would be better implemented in vLLM).



Any advice/discussion is welcome.

### Performance Benchmark

We have ran the benchmark_serving script of vllm to test the performance
after using dual-batch overlap.

`python -m vllm.entrypoints.openai.api_server \
 --model=DeepSeek-R1-W8A8 \
 --trust-remote-code \
 --distributed-executor-backend=mp \
 -tp=16 \
 --port 8006 \
 --max-num-seqs 390 \
 --max-model-len 32768 \
 --max-num-batched-tokens 65536 \
 --block-size 128 \
 --compilation_config 0 \
 --gpu-memory-utilization 0.90 \
 --disable-log-requests \
--additional-config
'{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'`

and run benchmark with the parameters of :
`--dataset-name random --random-input-len 4096 --random-output-len 1
--num-prompts 200 --max-concurrency 8 --request-rate 5
--metric-percentiles 90`

1. test with the version using allgather+allreduce in Ascend 910B (tp16
ep16 + deepseek r1 w8a8)

2. test with the version using alltoall: 

prefill qps: 0.90 -> 1.01
Mean TTFT:8226->7432ms

The overlap approach when using alltoall communication can be further
optimized by overlapping micro-batch1's moe comp with micro-batch2's
dispatch a2a comm

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-07 16:46:58 +08:00
Li Wang
11a7df4270 [ModelRunner] Support embedding inputs (#916)
### What this PR does / why we need it?
- Adds support for passing prompt_embeds to LLM.generate as
```python
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```python
llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples

### How was this patch tested?
CI passed with newly added and existing tests, and I have tested with the
example script in this PR; the output looks good:
```bash

[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]

[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the

Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs

Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon. 

The sun has a diameter of

------------------------------
```

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 20:21:13 +08:00
wangxiyuan
e1ab6d318e [Misc] Refactor additional_config (#1029)
More and more config options are being added to additional_config. This PR
provides a new AscendConfig to manage these config options in an easier way,
making the code cleaner and more readable.
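A hedged usage sketch (the option keys shown are illustrative only; the authoritative list is in the new additional_config doc):

```python
from vllm import LLM

# Options under additional_config are parsed by vllm-ascend's AscendConfig.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    additional_config={
        "torchair_graph_config": {"enabled": False},
        "ascend_scheduler_config": {"enabled": True},
    },
)
```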

 This PR also added the `additional_config` doc for users.

Added test_ascend_config.py to make sure the new AscendConfig works
as expected.

TODO: Add e2e test with torchair and deepseek once the CI resource is
available.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-05 16:28:01 +08:00
Mengqing Cao
6eddbd2521 [CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889)
Initialize PD Disaggregate UT

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 10:17:12 +08:00
Wan_Danfeng
5cf9ff18e9 [Performance]: Custom AscendC Kernel of Multi-Step Prepare Input (#814)
### What this PR does / why we need it?

- According to https://github.com/vllm-project/vllm-ascend/issues/807,
we submit this pull request for a custom AscendC kernel for multi-step
prepare input.
- Also, a bug we found in multi_step_runner.py when using multi-step on the V0
Engine is fixed.


### Does this PR introduce _any_ user-facing change?

no user-facing change


### How was this patch tested?
We add a unit test file and an offline inference file to test the custom
AscendC kernel. See test/ops/test_multi_step.py and
examples/offline_multi_step.py.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-05-20 09:31:30 +08:00
wangxiyuan
6193ba679b [CI] add codespell CI and fix format.sh (#827)
1. Fix format check error to make format.sh work
2. Add codespell check CI 
3. Add the missing required package for vllm-ascend.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-12 22:04:48 +08:00
whx
8b194ad12e [Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist (#694)
### What this PR does / why we need it?
- This PR proposes a P2P version of Disaggregated Prefill based on
llm_datadist which manages data transfer.

- This solution reconstructs the previous offline single-node Disaggregated
Prefill solution, and now supports multi-node and online serving.

- Currently this solution supports the 1P1D situation of DeepSeek hybrid
parallelism (P: TP+EP, D: DP+EP). Note that the xPyD situation is considered
in the solution design and will be supported soon within the v1 engine.

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
2025-05-01 22:31:36 +08:00
wangxiyuan
b917361ca5 [MISC] Clean up torch_npu (#688)
torch_npu 2.5.1 supports autoload now. This patch:
1. removes useless torch_npu imports;
2. replaces `torch_npu.npu` with `torch.npu`.
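A before/after sketch of item 2 (the call shown is only an example of the pattern; it assumes a machine with torch_npu installed so that the autoload hook registers the `npu` device module):

```python
import torch

# before (explicit import needed):
#   import torch_npu
#   torch_npu.npu.synchronize()

# after (torch_npu >= 2.5.1 autoloads, registering torch.npu):
torch.npu.synchronize()
```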

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-29 18:03:38 +08:00
wangxiyuan
0dae55a9a3 [MISC] fix format check error (#654)
This PR makes format.sh work as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-29 11:14:19 +08:00
zzzzwwjj
5c6d05a59e support deepseek quant & mix-parallel with graphmode (#585)
### What this PR does / why we need it?
1. support deepseek with w8a8 quant;
2. support deepseek with mix-parallel(multi-DP, EP+TP);
3. support deepseek with graphmode.
---------

Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
2025-04-23 16:23:25 +08:00
Pleaplusone
1a1f9a6d89 port deepseekv2 and mtp to main branch (#429)
### What this PR does / why we need it?
This PR ports all the deepseek graph mode code and mtp code from v0.7.3
to the main branch
---------

Signed-off-by: SidaoY <1024863041@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: mengwei805 <mengwei25@huawei.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: SidaoY <1024863041@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-04-19 17:38:18 +08:00
hfadzxy
9935d45728 [CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)
### What this PR does / why we need it?
Add model basic accuracy test(Qwen2.5-0.5B-Instruct)

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-17 14:59:56 +08:00
eeethenQ
44a8301424 [Feature] Add PD separation feature (#432)
### What this PR does / why we need it?
Adapt Disaggregated Prefill feature onto Ascend device

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

The test usage has been provided along with the PR, in
examples/offline_disaggregated_prefill_npu.py.
To run it, do this:
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```

---------

Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>
2025-04-15 15:11:35 +08:00
Mengqing Cao
c18fb09b55 [MISC] set default model to qwen in example (#87)
- Set default model to Qwen2.5-0.5B-Instruct in example
- Remove Ultravox 0.3 because it is not tested currently

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-18 17:09:59 +08:00
Mengqing Cao
7006835977 [attn] fix device of tensors in attention (#25)
### What this PR does / why we need it?
Fix device of tensors created in `AscendAttentionBackendImpl`.

When specifying a device other than card-0, a **device conflict** occurs
because the tensors (such as `attn_mask`) are put on card-0 by default.

This PR creates these tensors on the correct card corresponding to the
input.

### Does this PR introduce _any_ user-facing change?
Users can specify the device by local rank with this PR; a modification to
vLLM is also needed and will be linked to this PR once created.

### How was this patch tested?
This was tested locally with the following code. A test case will be added
when the modification in vLLM is completed.
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-10 19:20:29 +08:00