Commit Graph

51 Commits

huangxialu
875a86cbe9 ut: add example and e2e test for sleepmode in external_launcher (#2152)
### What this PR does / why we need it?
This PR adds an e2e test case to make sure that sleep mode in external_launcher
works correctly.
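For context, a minimal sketch of the sleep-mode calls such a test exercises (assuming vLLM's sleep-mode API; in the real test this runs under the external_launcher setup, and the model name here is only a placeholder):

```python
from vllm import LLM, SamplingParams

# Hypothetical sketch; model name and sleep level are illustrative.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)

llm.sleep(level=1)   # release the kv cache and offload weights to host memory
llm.wake_up()        # restore the engine before generating again

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```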

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
not involved


- vLLM version: v0.10.0
- vLLM main:
74333ae2f6

Signed-off-by: huangxialu <huangxialu1@huawei.com>
2025-08-06 11:11:53 +08:00
wangxiyuan
36e450eb0f [Misc] Nit fix for disaggregated_prefill and ascend_forward_context (#2097)
We recently added the disaggregated_prefill and ascend_forward_context
features in
ba3dfbd59e
and
df0ec55162.
This PR fixes some nits introduced by them to make the code clearer:
1. Drop the `current_platform` usage. It can lead to obscure circular import
errors in some cases.
2. Update the `set_ascend_forward_context` function to make the logic clearer,
for example by removing V0 support from this function.
3. Remove the unused `self.local_rank_across_dp` in the worker.
4. Remove `soc_info.py` and use `get_ascend_soc_version` instead.
 

- vLLM version: v0.10.0
- vLLM main:
02f82fe438

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-08-05 08:39:02 +08:00
hucong
e38fab011d [Doc][PD] Restore the default configuration items in examples/disaggregate_prefill_v1/README.md (#2165)
### What this PR does / why we need it?
- On the D node, the max-num-batched-tokens parameter can be set to a
smaller value, since the D node processes at most max-num-seqs batches
concurrently. As the profile_run only needs to handle max-num-seqs
sequences at a time, we can safely set max-num-batched-tokens equal to
max-num-seqs. This optimization helps reduce activation memory
consumption (see the sketch below).
- Restore the default configuration items for PD separation.
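As an illustration only, the recommendation above expressed as engine arguments (a hedged sketch; the model and sizes are placeholders, and the actual launch commands are in the README):

```python
from vllm import LLM

# Decode (D) node sketch: profile_run only needs to cover max_num_seqs sequences,
# so max_num_batched_tokens can be set equal to max_num_seqs to cut activation memory.
decode_llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_num_seqs=16,
    max_num_batched_tokens=16,
)
```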
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main:
61dcc280fa

Signed-off-by: underfituu <hzhucong@163.com>
2025-08-04 20:30:53 +08:00
Pleaplusone
4b3a210c33 Implementation of simple load balance routing proxy server (#1953) (#2124)
### What this PR does / why we need it?
This PR is a cherry-pick of
https://github.com/vllm-project/vllm-ascend/pull/1953
from v0.9.1.

This PR introduces a new load-balancing proxy server example for
disaggregated PD. It supports a simple token- and kv_cache-aware
load-balancing routing strategy, compared with the original round-robin
toy_proxy.
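A small, self-contained sketch of the kind of token-aware routing decision the proxy makes (the data structures and names here are hypothetical, not the actual proxy code):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    url: str
    inflight_tokens: int = 0  # tokens currently being processed by this instance

def pick_least_loaded(instances: list[Instance]) -> Instance:
    # Token-aware routing: send the request to the least-loaded instance
    # instead of cycling round-robin.
    return min(instances, key=lambda inst: inst.inflight_tokens)

prefillers = [Instance("http://p0:8100"), Instance("http://p1:8100")]

def route(prompt_tokens: int) -> str:
    target = pick_least_loaded(prefillers)
    target.inflight_tokens += prompt_tokens  # decremented again when the request completes
    return target.url

print(route(512))  # -> http://p0:8100
print(route(128))  # -> http://p1:8100
```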

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested on real workloads and with unit tests.

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-08-04 10:35:53 +08:00
xleoken
bea3d5bbb4 [Bug] Fix run bug in run_dp_server.sh (#2139)
### What this PR does / why we need it?

For the `Qwen2.5-0.5B-Instruct` model:
- the model's total number of attention heads (14) must be divisible by the
tensor parallel size, so TP is reduced from 4 to 2;
- the model does not support enable-expert-parallel (see the sketch below).
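Expressed as engine arguments rather than the shell script, the corrected settings look roughly like this (a sketch only; the real fix is in run_dp_server.sh):

```python
from vllm import LLM

# Qwen2.5-0.5B-Instruct has 14 attention heads, so the tensor parallel size must
# divide 14: use 2 instead of 4, and leave expert parallelism disabled since the
# model has no MoE experts.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tensor_parallel_size=2,
    enable_expert_parallel=False,
)
```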

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

Signed-off-by: xleoken <xleoken@163.com>
2025-08-02 16:52:12 +08:00
yangqinghao-cmss
47f688a2f0 Change retrieving remote files to local retrieval. (#2141)
### What this PR does / why we need it?
Using vLLM's AudioAsset class to retrieve remote audio files
(https://vllm-public-assets.s3.us-west-2.amazonaws.com) is not feasible in
some cases; it is recommended to switch to local retrieval.
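A hedged sketch of the local-retrieval approach (the file path, model, and prompt are placeholders, and the `(audio, sample_rate)` tuple format for vLLM audio inputs is an assumption):

```python
import librosa
from vllm import LLM, SamplingParams

# Load the audio from a local file instead of fetching it from the remote S3 bucket.
audio, sample_rate = librosa.load("mary_had_lamb.ogg", sr=None)

llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct", max_model_len=4096)
outputs = llm.generate(
    {
        # The real example builds the model's audio chat prompt; plain text keeps the sketch short.
        "prompt": "What is recited in the audio?",
        "multi_modal_data": {"audio": (audio, sample_rate)},
    },
    SamplingParams(max_tokens=64),
)
print("generated_text:", outputs[0].outputs[0].text)
```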

### How was this patch tested?
vllm: main
vllm-ascend: main
results:
```bash
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.62s/it]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.01s/it, est. speed input: 79.03 toks/s, output: 6.31 toks/s]
generated_text: The sport referenced is soccer, and the nursery rhyme is 'Hey Diddle Diddle'.
```

- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a

---------

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-02 16:51:22 +08:00
22dimensions
8cf97d8310 [Misc] Add extra checking to torchair_graph_config. (#1939)
### What this PR does / why we need it?

cherry-pick #1675  to main
This PR adds validation checking to torchair_graph_config for better
reliability.
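Purely as an illustration of what such checks look like (a hypothetical sketch; the option names are drawn from torchair_graph_config, but the real validation lives in vllm-ascend's config handling):

```python
# Hypothetical illustration of config validation, not the actual vllm-ascend code.
def validate_torchair_graph_config(cfg: dict) -> None:
    if not isinstance(cfg.get("enabled", False), bool):
        raise TypeError("torchair_graph_config.enabled must be a bool")

    batch_sizes = cfg.get("graph_batch_sizes", [])
    if not isinstance(batch_sizes, list) or not all(
        isinstance(b, int) and b > 0 for b in batch_sizes
    ):
        raise ValueError("graph_batch_sizes must be a list of positive ints")

    if cfg.get("graph_batch_sizes_init", False) and batch_sizes:
        raise ValueError(
            "graph_batch_sizes_init cannot be combined with explicit graph_batch_sizes"
        )

validate_torchair_graph_config({"enabled": True, "graph_batch_sizes": [1, 4, 8]})
```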

Co-authored-by: whx-sjtu <2952154980@qq.com>

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:24:11 +08:00
yangqinghao-cmss
99fa0ac882 [BugFix] update the kv transfer config (#2121)
### What this PR does / why we need it?
The functions KVTransferConfig.from_cli and AscendHcclConnector are
missing in the latest vLLM version. To resolve this, I propose modifying
the kv_connector to use LLMDataDistCMgrConnector, which depends on [PR
#2079](https://github.com/vllm-project/vllm-ascend/pull/2079)
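A hedged sketch of what the decode-side configuration looks like after the change (the constructor fields are assumptions based on vLLM's KVTransferConfig; connector registration details and the prefill side are omitted):

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Sketch only: replace the removed AscendHcclConnector / KVTransferConfig.from_cli
# usage with a directly constructed config pointing at LLMDataDistCMgrConnector.
ktc = KVTransferConfig(
    kv_connector="LLMDataDistCMgrConnector",
    kv_role="kv_consumer",  # "kv_producer" on the prefill node
)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", kv_transfer_config=ktc)
```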

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
vllm: main
vllm-ascend: main
results:
```bash
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 374.27it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 66.06it/s, est. speed input: 449.08 toks/s, output: 66.51 toks/s]
Prefill node is finished.
INFO 07-31 09:18:30 [model_runner_v1.py:2282] Graph capturing finished in 36 secs, took 0.21 GiB
INFO 07-31 09:18:30 [core.py:201] init engine (profile, create kv cache, warmup model) took 52.49 seconds
INFO 07-31 09:18:30 [factory.py:74] Creating v1 connector with name: LLMDataDistCMgrConnector and engine_id: 28c8ced8-575c-4f87-840a-48d04d0edf7e
INFO 07-31 09:18:30 [platform.py:157] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 07-31 09:18:30 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 76
INFO 07-31 09:18:30 [utils.py:359] No adjustment needed for ACL graph batch sizes: Qwen2ForCausalLM model (layers: 24) with 67 sizes
INFO 07-31 09:18:30 [llm.py:293] Supported_tasks: ['generate']
Waiting for prefill node to finish...
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 709.70it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.23it/s, est. speed input: 109.70 toks/s, output: 260.01 toks/s]
Prompt: 'Hello, how are you today?', Generated text: " I'm a computer program, so I don't have feelings. But I can"
Prompt: 'Hi, what is your name?', Generated text: ' I am a computer programmer. I have a question about the programming language I am'
Prompt: 'Tell me a very long story.', Generated text: ' I want to read it. I want to read it. I want to read'
Prompt: 'what is your favourite book?', Generated text: " I'm sorry, but as an AI language model, I don't have personal"
Cleanup prefill resources
All process done
```

- vLLM version: v0.10.0
- vLLM main:
9cb497bfa3

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-08-01 08:56:55 +08:00
Ronald1995
cb0a303080 ut:add e2e test for external launcher (#2091)
### What this PR does / why we need it?
This PR adds an e2e test case to make sure that initializing an LLM via the
external_launcher method works correctly.
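For context, a minimal sketch of the external_launcher initialization path that the test covers (run under torchrun; the model and world size are placeholders):

```python
# Launch with: torchrun --nproc-per-node=2 this_script.py
from vllm import LLM, SamplingParams

# With the external_launcher backend, vLLM reuses the process group created by
# torchrun instead of spawning its own worker processes.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```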

### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-31 20:37:42 +08:00
Li Wang
f60bb474f9 [CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (#2065)
### What this PR does / why we need it?
Currently our workflow run time is about 3 hours in total, which
seriously affects the developer experience, so optimization is urgent. After
this PR, the running time of the full CI is expected to shorten to about
1h40min.

- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.10.0
- vLLM main:
a2480251ec

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-29 18:59:05 +08:00
wangxiyuan
0190b68f51 [Misc]Remove PD v0 code (#2047)
Clean up the V0 disaggregated prefill code for the V0 Engine.

part of https://github.com/vllm-project/vllm-ascend/issues/1620

TODO: enable v1 e2e test.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 19:09:22 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

This PR refactors forward_context and model_runner_v1: it adds some context
that is necessary for model inference into forward_context, and refactors the
dummy_run logic to make it more reasonable.
Some details for this PR:

Add `ascend_forward_context`;
Update the mc2_v2 op and support the `active_mask` param;
Update scripts in the examples dir;
Refactor the `dummy_run` logic;
Add soc_version for A2 and A3.

### Does this PR introduce _any_ user-facing change?

No user-facing change.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopts `LLMDataDist` for kv cache registration and a `pull_blocks`-style
disaggregated prefill implementation. The interface implementation
mainly follows the design of the NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be tested with the following steps:
- Generate the rank table for all machines.
- Execute `toy_proxy.py` to launch the disaggregated prefill proxy server,
specifying the prefill IP/port and the decode IP/port.
- Run the prefill server and the decode server.
- Send requests to the disaggregated prefill proxy.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Shanshan Shen
a66ef39bb6 [Misc][V0 Deprecation] Remove Redundant Offline Distributed Inference Example (#1899)
### What this PR does / why we need it?
The file `offline_distributed_inference_npu.py` is the same as
`offline_inference_npu_tp2.py`, thus we delete one of them.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8188196a1c

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-21 12:01:45 +08:00
Mengqing Cao
8cfd257992 [Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there are no
relevant scenarios for ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the experts need to be sliced.
This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We will no longer maintain ETP/EP in vllm-ascend; the TP/EP in vLLM is used
instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-21 09:08:04 +08:00
leo-pony
2ee90461d0 Fix e2e data parallel test: add resource release code (#1881)
### What this PR does / why we need it?
Fix the e2e data parallel test: add resource-release code and give the engines
more time to pause their processing loops before exiting.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.9.2
- vLLM main:
5895afd780

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-07-19 11:39:48 +08:00
Shanshan Shen
8a91e6e59c [Misc][V0 Deprecation] Remove V0 Related Custom Ops (#1871)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
ca4eb82bcb

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-18 23:06:03 +08:00
Shanshan Shen
aeb5aa8b88 [Misc][V0 Deprecation] Add __main__ guard to all offline examples (#1837)
### What this PR does / why we need it?
Add `__main__` guard to all offline examples.
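The pattern applied is the standard guard, e.g. (the model name is only a placeholder):

```python
from vllm import LLM, SamplingParams

def main():
    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)

# The guard keeps the example from re-running when worker processes import the
# script (e.g. with the multiprocessing spawn start method).
if __name__ == "__main__":
    main()
```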

- vLLM version: v0.9.2
- vLLM main:
76b494444f

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-17 14:13:30 +08:00
Shanshan Shen
f96100fad5 [Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805)
### What this PR does / why we need it?
Remove V0 related codes of test, example, platform.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:58:55 +08:00
wangxiyuan
787010a637 [Test] Remove VLLM_USE_V1 in example and tests (#1733)
V1 is enabled by default, so there is no need to set it by hand now. This PR
removes the useless setting from examples and tests.

- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 12:49:57 +08:00
Yikun Jiang
eff4b5791c Recover offline_inference_npu.py to make doctest passed (#1756)
### What this PR does / why we need it?
Rename offline_inference_npu_v1.py to offline_inference_npu.py to
recover doctest

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

- vLLM version: v0.9.2
- vLLM main:
a8593237c0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-12 12:36:35 +08:00
ttanzhiqiang
60519c71bd shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395)
### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts does not perform all_reduce
in the MLP; instead it waits until shared_experts + router_experts are
completed before doing a single all_reduce. In both prefill and decode, as long
as shared_experts + router_experts share one all_reduce, there is a benefit.
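Conceptually, the change can be sketched as follows (a simplified illustration, not the actual vllm-ascend kernel path): the shared-expert and routed-expert partial outputs are summed locally and reduced once, instead of issuing a separate all_reduce inside the MLP.

```python
import torch
import torch.distributed as dist

def merged_moe_output(shared_out: torch.Tensor, routed_out: torch.Tensor) -> torch.Tensor:
    """Simplified sketch: one all_reduce over the combined output instead of two."""
    combined = shared_out + routed_out  # local partial sums from shared + routed experts
    dist.all_reduce(combined)           # single TP all_reduce (assumes an initialized group)
    return combined
```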
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00
Mengqing Cao
b1c66b211f [CI] Fix lint in CI (#1712)
### What this PR does / why we need it?
Fix lint in CI
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 10:47:18 +08:00
xleoken
3ef45d0cc2 feat: Improve the offline_inference npu v0/v1 scripts (#1669)
### What this PR does / why we need it?

Improvements:
- Keep the same file name format as v1: `offline_inference_npu_v0.py`,
`offline_inference_npu_v1.py`
- Use `VLLM_USE_V1` = 0/1 explicitly in the Python scripts
- Fix some run errors in `offline_inference_npu_v1.py`, e.g.
`deepseekv3-lite-base-latest` does not exist on ModelScope or HF.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: xleoken <xleoken@163.com>
2025-07-09 17:03:53 +08:00
Zheng Wengang
9c886d0a1f [EPLB] support deepseek eplb strategy (#1196)
### What this PR does / why we need it?

This PR implements the DeepSeek Expert Parallel Load Balancing (EPLB)
strategy to optimize expert distribution in vllm-ascend. The
implementation:
- Adapts the expert-map format to work with vllm-ascend's architecture
- Provides the DeepSeek mechanism to balance expert workload across
devices

### Does this PR introduce _any_ user-facing change?

This PR adds a new script that allows users to:
- Generate expert map configurations based on workload analysis
- Optimize expert distribution for their specific use case

### How was this patch tested?

To use this feature:
1. First collect expert heat information during model execution
2. Run the provided script to generate the expert map configuration
3. Apply the generated configuration to your vllm-ascend deployment

User example:

```bash
# expert_load_view.pt:  dumped expert heat info file
python3 examples/eplb/eplb_strategy.py --exp_name 'deepseek_demo' \
    --input_path expert_load_view.pt  --output_path examples/eplb/results/demo \
    --num_nodes 4
```

---------

Signed-off-by: ZhengWG <zwg0606@gmail.com>
2025-07-07 17:22:08 +08:00
Yikun Jiang
6d7cb14a24 Fix lint in examples/offline_embed.py (#1618)
### What this PR does / why we need it?
Fix lint

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-03 21:40:29 +08:00
xleoken
e511ddd67d [Bug] Fix wrong modescope env set order (#1611)
### What this PR does / why we need it?
The `os.environ["VLLM_USE_MODELSCOPE"] = "True"` should be placed before
module imports

If it is not, the following error occurs:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/xleoken/projects/vllm-ascend/examples/offline_embed.py", line 48, in <module>
    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1018, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 910, in create_model_config
    return ModelConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 528, in __post_init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 321, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 266, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 491, in cached_files
    raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[ERROR] 2025-07-03-15:27:10 (PID:333665, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
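The corrected ordering, sketched (the model and task come from the traceback above):

```python
import os

# Must be set before importing vllm, otherwise the model is resolved against
# huggingface.co instead of ModelScope and the lookup fails without internet access.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM  # noqa: E402

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
```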

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local.

Signed-off-by: xleoken <xleoken@163.com>
2025-07-03 18:50:53 +08:00
Li Wang
5f8241c25c [V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
This PR changes as little existing code as possible to add support for the v1
pooling task. Note that I moved `vllm.v1.worker.gpu_input_batch` down into
vllm-ascend; considering the frequent changes in upstream interfaces, it is
moved here for decoupling.
### How was this patch tested?
CI passed with newly added and existing tests, and a simple test was
first conducted locally, adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, just like
below:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
Shanshan Shen
4e2daf5ab7 [Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
Li Wang
15df8be937 [Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
Mengqing Cao
52317f92cb [DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix dp

This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added an arg
`max_num_tokens_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
liziyu
6ed3f00427 [Doc] remove environment variable VLLM_ENABLE_MC2 (#1406)
### What this PR does / why we need it?
remove unused environment variable VLLM_ENABLE_MC2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-06-24 21:18:10 +08:00
zhuo97
f5404dc650 Fix the device error when using ray as vllm-acend backend (#884)
1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
2025-06-16 21:03:16 +08:00
wangyanhui-cmss
c6e2a5fb40 [fix] fix bug in 1p1d disaggregated_prefill example (#1184)
### What this PR does / why we need it?
Fix a bug in the 1p1d disaggregated_prefill example.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by running `python find_device_ips.py` and the disaggregated_prefill
example.


Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-12 19:40:58 +08:00
ttanzhiqiang
980cd81466 etp best a2 (#1101)
### What this PR does / why we need it?
Best performance for single-machine, 16-card DeepSeek-R1 with attention
(TP8/DP2) and MoE (ETP).

Relies on:
vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
plus https://github.com/vllm-project/vllm-ascend/pull/910, [Reduce
_npu_flash_attention mask to 128x128 for memory savings]
https://github.com/vllm-project/vllm-ascend/pull/1100, and [Reduce memory
usage by splitting tokens in fused_experts].


---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-11 10:40:50 +08:00
zxdukki
87ebaef4e4 [perf]: support dual-batch overlap(dbo) for deepseek (#941)
### What this PR does / why we need it?
Based on the design of dual-batch overlap proposed by the DeepSeek team, and
also on the implementation of fused MoE in the vLLM project, we implement
multi-stream (also known as dual-batch) overlap for DeepSeek + MLA on
Ascend NPU. We split the model's input batch into two microbatches and
then overlap the comp/comm ops in attention and moe layers using two
streams to improve the performance. Our approach can be easily extended
when adding dispatch/combine communications for moe layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops,
separately. In our opinion, it is beneficial for arranging the order of
executing different ops and thus avoiding the contention of
computation/communication resources.

ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)

### Does this PR introduce _any_ user-facing change?
Adds an env variable `VLLM_ASCEND_ENABLE_DBO`. Users can enable DBO by
setting `VLLM_ASCEND_ENABLE_DBO=1`.
See /examples/offline_dualbatch_overlap_npu.py for more info.

### How was this patch tested?

This patch can be tested with vllm 0.9.0 using its online service with
benchmark tests. We have decoupled the DBO functionality from vLLM, and it
should be able to run without any modification to the vLLM code (though some
modifications would be better implemented in vLLM).



Any advice/discussion is welcome.

### Performance Benchmark

We have ran the benchmark_serving script of vllm to test the performance
after using dual-batch overlap.

`python -m vllm.entrypoints.openai.api_server \
 --model=DeepSeek-R1-W8A8 \
 --trust-remote-code \
 --distributed-executor-backend=mp \
 -tp=16 \
 --port 8006 \
 --max-num-seqs 390 \
 --max-model-len 32768 \
 --max-num-batched-tokens 65536 \
 --block-size 128 \
 --compilation_config 0 \
 --gpu-memory-utilization 0.90 \
 --disable-log-requests \
--additional-config
'{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'`

and run benchmark with the parameters of :
`--dataset-name random --random-input-len 4096 --random-output-len 1
--num-prompts 200 --max-concurrency 8 --request-rate 5
--metric-percentiles 90`

1. test with the version using allgather+allreduce in Ascend 910B (tp16
ep16 + deepseek r1 w8a8)

2. test with the version using alltoall: 

prefill qps: 0.90 -> 1.01
Mean TTFT:8226->7432ms

The overlap approach when using alltoall communication can be further
optimized by overlapping micro-batch1's moe comp with micro-batch2's
dispatch a2a comm

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-07 16:46:58 +08:00
Li Wang
11a7df4270 [ModelRunner] Support embedding inputs (#916)
### What this PR does / why we need it?
- Adds support for passing prompt_embeds to LLM.generate as
```python
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```python
llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples

### How was this patch tested?
CI passed with newly added and existing tests, and I have tested with the
example script in this PR; the output looks good:
```bash

[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]

[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the

Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs

Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon. 

The sun has a diameter of

------------------------------
```

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 20:21:13 +08:00
wangxiyuan
e1ab6d318e [Misc] Refactor additional_config (#1029)
More and more config options are being added to additional_config. This PR
provides a new AscendConfig to manage these config options in an easier way,
making the code cleaner and more readable.
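A hedged usage sketch (the option keys shown are illustrative only; the authoritative list is in the new additional_config doc):

```python
from vllm import LLM

# Options under additional_config are parsed by vllm-ascend's AscendConfig.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    additional_config={
        "torchair_graph_config": {"enabled": False},
        "ascend_scheduler_config": {"enabled": True},
    },
)
```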

 This PR also added the `additional_config` doc for users.

Added test_ascend_config.py to make sure the new AscendConfig works
as expected.

TODO: Add e2e test with torchair and deepseek once the CI resource is
available.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-05 16:28:01 +08:00
Mengqing Cao
6eddbd2521 [CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889)
Initialize PD Disaggregate UT

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 10:17:12 +08:00
Wan_Danfeng
5cf9ff18e9 [Performance]: Custom AscendC Kernel of Multi-Step Prepare Input (#814)
### What this PR does / why we need it?

- According to https://github.com/vllm-project/vllm-ascend/issues/807,
we submit this pull request for a custom AscendC kernel for multi-step
prepare input.
- Also, a bug we found in multi_step_runner.py when using multi-step on the V0
Engine is fixed.


### Does this PR introduce _any_ user-facing change?

no user-facing change


### How was this patch tested?
We add a unit test file and an offline inference file to test the custom
AscendC kernel. See test/ops/test_multi_step.py and
examples/offline_multi_step.py.

---------

Signed-off-by: wan_danfeng <wonderful199082@126.com>
2025-05-20 09:31:30 +08:00
wangxiyuan
6193ba679b [CI] add codespell CI and fix format.sh (#827)
1. Fix format check error to make format.sh work
2. Add codespell check CI 
3. Add the missing required package for vllm-ascend.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-12 22:04:48 +08:00
whx
8b194ad12e [Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist (#694)
### What this PR does / why we need it?
- This PR proposes a P2P version of Disaggregated Prefill based on
llm_datadist which manages data transfer.

- This solution reconstructs the previous offline single-node Disaggregated
Prefill solution, and now supports multi-node and online serving.

- Currently this solution supports the 1P1D situation of DeepSeek hybrid
parallelism (P: TP+EP, D: DP+EP). Note that the xPyD situation is considered
in the solution design and will be supported soon within the v1 engine.

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
2025-05-01 22:31:36 +08:00
wangxiyuan
b917361ca5 [MISC] Clean up torch_npu (#688)
torch_npu 2.5.1 supports autoload now. This patch:
1. removes useless torch_npu imports;
2. replaces `torch_npu.npu` with `torch.npu`.
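A before/after sketch of item 2 (the call shown is only an example of the pattern; it assumes a machine with torch_npu installed so that the autoload hook registers the `npu` device module):

```python
import torch

# before (explicit import needed):
#   import torch_npu
#   torch_npu.npu.synchronize()

# after (torch_npu >= 2.5.1 autoloads, registering torch.npu):
torch.npu.synchronize()
```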

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-29 18:03:38 +08:00
wangxiyuan
0dae55a9a3 [MISC] fix format check error (#654)
This PR makes format.sh work as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-29 11:14:19 +08:00
zzzzwwjj
5c6d05a59e support deepseek quant & mix-parallel with graphmode (#585)
### What this PR does / why we need it?
1. support deepseek with w8a8 quant;
2. support deepseek with mix-parallel(multi-DP, EP+TP);
3. support deepseek with graphmode.
---------

Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
2025-04-23 16:23:25 +08:00
Pleaplusone
1a1f9a6d89 port deepseekv2 and mtp to main branch (#429)
### What this PR does / why we need it?
This PR ports all the deepseek graph mode code and mtp code from v0.7.3
to the main branch
---------

Signed-off-by: SidaoY <1024863041@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: mengwei805 <mengwei25@huawei.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: SidaoY <1024863041@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-04-19 17:38:18 +08:00
hfadzxy
9935d45728 [CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)
### What this PR does / why we need it?
Add model basic accuracy test(Qwen2.5-0.5B-Instruct)

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-17 14:59:56 +08:00
eeethenQ
44a8301424 [Feature] Add PD separation feature (#432)
### What this PR does / why we need it?
Adapt Disaggregated Prefill feature onto Ascend device

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

The test usage has been provided along with the PR, in
examples/offline_disaggregated_prefill_npu.py.
To run it, do this:
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```

---------

Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>
2025-04-15 15:11:35 +08:00
Mengqing Cao
c18fb09b55 [MISC] set default model to qwen in example (#87)
- Set default model to Qwen2.5-0.5B-Instruct in example
- Remove Ultravox 0.3 because it is not tested currently

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-18 17:09:59 +08:00
Mengqing Cao
7006835977 [attn] fix device of tensors in attention (#25)
### What this PR does / why we need it?
Fix device of tensors created in `AscendAttentionBackendImpl`.

When specifying a device other than card-0, a **device conflict** occurs
because the tensors (such as `attn_mask`) are put on card-0 by default.

This PR creates these tensors on the correct card corresponding to the
input.

### Does this PR introduce _any_ user-facing change?
Users can specify the device by local rank with this PR; a modification to
vLLM is also needed and will be linked to this PR once created.

### How was this patch tested?
This was tested locally with the following code. A test case will be added
when the modification in vLLM is completed.
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-02-10 19:20:29 +08:00