Commit Graph

330 Commits

Author SHA1 Message Date
Li Wang
a2552e10e4 [Worker][V1] Support sleep mode for v1 (#1084)
### What this PR does / why we need it?
 Support sleep mode for v1

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 21:54:02 +08:00
wangxiyuan
0395ab30be [Doc] Add graph mode user doc (#1083)
Add graph mode user guide doc.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 21:14:34 +08:00
ApsarasX
9a4eb94ca9 [Misc] Adjust the default profiler configuration (#1097)
### What this PR does / why we need it?
When profiling, it is often necessary to disable the call stack to
reduce profiling overhead, and adjust the profiler_level to level1 to
obtain more detailed operator and communication information.

Therefore, it is recommended to modify the default profiling
configuration.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-06-06 20:25:59 +08:00
Shanshan Shen
5d0e9fd19a [Misc] Add ACL_OP_INIT_MODE env var and set default to 1 (#597)
### What this PR does / why we need it?
Fix a bug in torch 2.5.1 that raises a segmentation fault when
`pin_memory` is enabled while creating a tensor with `torch.tensor`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-06 20:22:51 +08:00
Li Wang
11a7df4270 [ModelRunner] Support embedding inputs (#916)
### What this PR does / why we need it?
- Adds support for passing prompt_embeds to LLM.generate as
```python
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```python
llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples

### How was this patch tested?
CI passed with newly added and existing tests.
I also tested with the example script in this PR, and the output
looks good:
```bash

[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]

[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the

Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs

Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon. 

The sun has a diameter of

------------------------------
```

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-06 20:21:13 +08:00
NeverRaR
c7f1c59911 feat: support compile multiple batch graph (#1085)
### What this PR does / why we need it?

Support compiling multiple batch graphs with different code objects to avoid
cache invalidation.

### How was this patch tested?

```
export VLLM_ENABLE_MC2=0
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --no-enforce-eager \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config": {"enabled": true,"use_cached_graph": true,"graph_batch_sizes": [8,16,24]},"ascend_scheduler_config": {"enabled":true,"chunked_prefill_enabled":false},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.95 &> run.log &
disown
```

Signed-off-by: boying <897013703@qq.com>
2025-06-06 20:17:51 +08:00
Mengqing Cao
c46632439a [Bugfix][DP] Add with_prefill_across_dp to AscendMetadata to fix dp (#1094)
### What this PR does / why we need it?
Add `with_prefill_across_dp` to AscendMetadata to fix dp

This PR fixes the bug introduced by #1012, which added the
`with_prefill_across_dp` argument when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-06 19:20:33 +08:00
hahazhky
0b12c2acf7 [Kernel] Remove cumsum in groupedmatmul (#987)
### What this PR does / why we need it?
Remove the cumsum operator in MOE to improve performance.

### How was this patch tested?
It should be tested on a case with the mc2 operator and graph mode enabled.

Signed-off-by: zhky <hahazhky@163.com>
Co-authored-by: 洪炜杰 <hongweijie1@huawei.com>
2025-06-06 19:17:27 +08:00
wangxiyuan
dab19d5dca [BugFix] Fix ascend config check (#1092)
Fix the ascend config check logic:
1. Refactor check_ascend_config to make it clear:
    1. torchair graph should not work with enforce_eager=True
    2. aclgraph should not work with torchair graph
2. Add refresh config for the rlhf case
3. Fix a typo in model runner
4. Change the expert_tensor_parallel_size default to 0 to keep it the same as
before
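
The two mutual-exclusion rules in item 1 can be sketched as a small validation function. This is only an illustration of the rules as stated, not the actual `check_ascend_config` code: the function name `check_graph_mode_flags` and its boolean parameters are hypothetical.

```python
def check_graph_mode_flags(torchair_graph_enabled: bool,
                           enforce_eager: bool,
                           aclgraph_enabled: bool) -> None:
    """Reject the flag combinations described as invalid above.

    Hypothetical helper: the name and signature are assumptions made
    for illustration and do not come from vllm-ascend.
    """
    # Rule 1.1: torchair graph mode and eager execution are exclusive.
    if torchair_graph_enabled and enforce_eager:
        raise ValueError(
            "torchair graph mode cannot be used with enforce_eager=True")
    # Rule 1.2: only one graph backend may be active at a time.
    if torchair_graph_enabled and aclgraph_enabled:
        raise ValueError(
            "aclgraph cannot be used together with torchair graph mode")
```

Raising early at config-check time keeps an invalid combination from surfacing later as a harder-to-diagnose runtime failure.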

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 18:54:37 +08:00
wangxiyuan
973f993a13 [Misc] fix initialize_kv_cache (#1102)
The KV cache manager has been changed by
f8a1a2d108

This PR adapts vllm-ascend to that change to make CI happy

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 16:46:23 +08:00
wangxiyuan
c94afd79ce [Doc] Update the description for env (#1079)
Add the description for env to make it clearer for users

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-06 09:48:43 +08:00
depeng1994
6b094a2bd4 [ModelRunner]Add profile execute duration observation (#1013)
### What this PR does / why we need it?
We need to **observe the time consumed in each stage of inference
(including pre-processing, model forward, etc.), without any performance
loss**.
Therefore, we use the event timestamp mechanism of the NPU to mark any
stage during the execution of the NPU device (this marking operation is
executed asynchronously, with no performance loss).
Additionally, we provide a blocking synchronization API
`pop_captured_sync` to be called at an appropriate time, to print the
time consumed in all observed stages.
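
The capture-then-pop pattern described above can be sketched as follows. This is a simplified stand-in and not the PR's implementation: it uses the host-side `time.perf_counter` instead of NPU event timestamps, so it does not reproduce the asynchronous, zero-overhead marking, and the class name `StageTimer` and its methods are hypothetical.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collect per-stage durations and report them on demand.

    Hypothetical sketch: host timers stand in for NPU events here.
    """

    def __init__(self):
        self._durations = []  # list of (tag, milliseconds)

    @contextmanager
    def capture(self, tag):
        # Mark the start of a stage; record its duration when it ends.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self._durations.append((tag, elapsed_ms))

    def pop_captured(self):
        # Return everything recorded so far and reset the buffer.
        captured, self._durations = self._durations, []
        return captured
```

A caller would wrap each stage in `with timer.capture("forward"):` and call `pop_captured()` once per step to print the collected timings.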

**Only 5 lines changed in model_runner_v1.py, all of them
`ProfileExecuteDuration()` calls. Nothing else was changed; the diff
shows more changes only because of alignment.**

### Does this PR introduce _any_ user-facing change?
Use the env `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature

### How was this patch tested?

Tested with a deepseek model. The output looks like this:
```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms

```

---------

Signed-off-by: depeng1994 <depengzhang@foxmail.com>
2025-06-06 09:29:34 +08:00
David9857
78431b3469 [perf]Support MOE Multi-stream in Deepseek (#947)
### What this PR does / why we need it?
Support MOE inner Multi-stream for Deepseek. 
This feature requires graph mode with mc2 enabled.

---------

Signed-off-by: David9857 <985700846@qq.com>
2025-06-05 23:39:38 +08:00
sherie
908a851a77 optimize the funtion of computing topk and topp in sampler. (#970)
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.

### Does this PR introduce _any_ user-facing change?
Added VLLM_ENABLE_TOPK_OPTIMZE config in sampler

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
2025-06-05 16:42:18 +08:00
wangxiyuan
e1ab6d318e [Misc] Refactor additional_config (#1029)
More and more config options are being added to additional_config. This PR
provides a new AscendConfig to manage these options in a simpler
way, making the code cleaner and more readable.

This PR also adds the `additional_config` doc for users.

Adds test_ascend_config.py to make sure the new AscendConfig works
as expected.

TODO: Add e2e test with torchair and deepseek once the CI resource is
available.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-05 16:28:01 +08:00
zhangxinyuehfad
7737aaa40f [CI] Add accuracy test for Qwen2.5-VL-3B-Instruct (#766)
### What this PR does / why we need it?
Add accuracy test for Qwen2.5-VL-3B-Instruct


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-05 15:09:20 +08:00
Li Wang
b4cb0eecb6 [CI] Hotfix on benchmark results path (#1076)
### What this PR does / why we need it?
Fix benchmark results path

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-05 12:53:46 +08:00
Yikun Jiang
fd136e6762 Add vLLM Ascend project governance docs (#1070)
### What this PR does / why we need it?
Add vLLM Ascend project governance and first contributors docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Closes: https://github.com/vllm-project/vllm-ascend/issues/828
Closes: https://github.com/vllm-project/vllm-ascend/issues/929

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-05 11:56:51 +08:00
Li Wang
31dd471574 [CI] Add workflow_dispatch and use main benchmarks directly (#1071)
### What this PR does / why we need it?

This is for benchmark iteration, which changes the benchmark
scripts while checking out each commit, so we need to ensure the benchmark
scripts are always available.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-05 10:29:30 +08:00
Yikun Jiang
9e855b70be Adjust concurrency group for each npu workflow (#1068)
### What this PR does / why we need it?
Adjust concurrency group for each npu workflow
- pd and benchmarks share static-08-01, so only one job can run on it
at a time
- for other jobs, each PR/schedule should have only 1 job running

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-05 09:17:04 +08:00
Mengqing Cao
afc4c0cd03 [Bugfix] Fix deepseek percision issue and add acc ci for it (#905)
### What this PR does / why we need it?
Fix deepseek precision issue on V0 and add acc ci for it
Fixes https://github.com/vllm-project/vllm-ascend/issues/1062
### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-04 20:26:44 +08:00
NeverRaR
da9acfca60 feat: support data parallel for deepseek (#1012)
### What this PR does / why we need it?
feat: support data parallel for deepseek

### Does this PR introduce _any_ user-facing change?
Yes, support dp for deepseek

### How was this patch tested?

```
export VLLM_ENABLE_MC2=0
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

nohup python -m vllm.entrypoints.openai.api_server \
    --model=/path/to/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --max-num-seqs 24 \
    --max-model-len 4096 \
    --max-num-batched-tokens 4096 \
    --block-size 128 \
    -O 0 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_batch_sizes":[24],"expert_tensor_parallel_size":16,"ascend_scheduler_config":{},"enable_graph_mode":true}' \
    --gpu-memory-utilization 0.95 &> run.log &
disown
```

Signed-off-by: boying <897013703@qq.com>
2025-06-04 18:31:41 +08:00
Li Wang
517811449e [CI] Re-enable sleep mode test and skip failure breaking CI (#990)
### What this PR does / why we need it?

- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream
[change](https://github.com/vllm-project/vllm/pull/18654)
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-04 16:24:16 +08:00
Li Wang
eb2701e0b2 [CI] Remove workflow_dispatch and change schedule time (#1056)
### What this PR does / why we need it?

- Remove workflow_dispatch
- Change schedule time to 2:00 UTC+8
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-04 01:19:20 +08:00
Li Wang
06fb5a8d81 [CI][Bugfix] Upgrade escli to v0.2.1 to fix benchmark deps (#1055)
### What this PR does / why we need it?

Update escli-tool to v0.2.1 to fix deps bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangli <858794774@qq.com>
2025-06-04 01:03:56 +08:00
Li Wang
76dacf3fa0 [CI][Benchmark] Optimize performance benchmark workflow (#1039)
### What this PR does / why we need it?

This is a follow-up patch to #1014, with some convenience optimizations:
- Set cached dataset path for speed
- Use pypi to install escli-tool
- Add a benchmark results conversion script to produce developer-friendly
results
- Patch `benchmark_dataset.py` to disable streaming load for
internet
- Add more trigger methods for different purposes: `pr` for debugging,
`schedule` for daily tests, `dispatch` and `pr-labled` for manual testing
of a single (current) commit
- Disable the latency test for `qwen-2.5-vl` (the script does not support
multi-modal yet)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-03 23:38:34 +08:00
wangxiyuan
543380ceae [CI] Add merge conflict label job (#1050)
Add a bot to label merge conflicts. It helps developers and maintainers
with code review and makes needed updates clear.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-03 17:32:31 +08:00
Yikun Jiang
f24375f318 Enable accuracy test for PR labeled with "*accuracy-test" (#1040)
### What this PR does / why we need it?
This PR enables the accuracy test for PRs labeled with "*accuracy-test" and
for workflow_dispatch.

Only one model test runs for each test type to reduce execution time.

- The dense test costs about `25mins` to complete (gsm8k 7mins, ~mmlu
3h24mins,~ cEval 18mins)
- The vl test costs about `40mins` to complete


In the future, we might consider enabling all test jobs as a nightly scheduled
job.

The main changes are:
- the dense/vl accuracy test will be triggered by labeling
`accuracy-test` and `ready-for-test`
- the dense accuracy test will be triggered by labeling
`dense-accuracy-test` and `ready-for-test`
- the vl accuracy test will be triggered by labeling `vl-accuracy-test`
and `ready-for-test`
- accuracy test will also be triggered by workflow_dispatch
- Support V1 and V0 for qwen and V0 for VL
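
The label rules above can be sketched as a small selection function. This is a hypothetical illustration of the rules as listed, not the workflow's actual YAML conditions: the function name `accuracy_tests_for_labels` is made up, and the `workflow_dispatch` trigger is not modelled.

```python
def accuracy_tests_for_labels(labels):
    """Return which accuracy test types ("dense", "vl") a PR would trigger.

    Hypothetical helper mirroring the label rules listed above.
    """
    labels = set(labels)
    # Every rule above also requires the `ready-for-test` label.
    if "ready-for-test" not in labels:
        return set()
    tests = set()
    if "accuracy-test" in labels:
        tests |= {"dense", "vl"}
    if "dense-accuracy-test" in labels:
        tests.add("dense")
    if "vl-accuracy-test" in labels:
        tests.add("vl")
    return tests
```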

For PR tests we also generate a summary in the test summary.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed with accuracy-test label
- Preview:
https://github.com/vllm-project/vllm-ascend/actions/runs/15407628722?pr=1040

Closes: https://github.com/vllm-project/vllm-ascend/pull/953

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2025-06-03 15:38:13 +08:00
Shanshan Shen
068c3a0167 [Bugfix] Add verification for quant_action.choices to avoid TypeError (#1046)
### What this PR does / why we need it?

When I run vllm-ascend, I get this error msg:

```bash
Traceback (most recent call last):
  File "/home/sss/software/miniconda3/envs/vllm-v1/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/main.py", line 50, in main
    cmd.subparser_init(subparsers).set_defaults(
  File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/cli/serve.py", line 101, in subparser_init
    serve_parser = make_arg_parser(serve_parser)
  File "/home/sss/github/vllm-project/vllm/vllm/entrypoints/openai/cli_args.py", line 254, in make_arg_parser
    parser = AsyncEngineArgs.add_cli_args(parser)
  File "/home/sss/github/vllm-project/vllm/vllm/engine/arg_utils.py", line 1582, in add_cli_args
    current_platform.pre_register_and_update(parser)
  File "/home/sss/github/vllm-project/vllm-ascend/vllm_ascend/platform.py", line 80, in pre_register_and_update
    if ASCEND_QUATIZATION_METHOD not in quant_action.choices:
TypeError: argument of type 'NoneType' is not iterable
[ERROR] 2025-06-03-02:53:42 (PID:6005, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```

This is because the `choices` attribute in `quant_action` can be `None`
and we don't check it.

```bash
# quant_action
_StoreAction(option_strings=['--quantization', '-q'], dest='quantization', nargs=None, const=None, default=None, type=<class 'str'>, choices=None, required=False, help='Method used to quantize the weights. If `None`, we first check the\n`quantization_config` attribute in the model config file. If that is\n`None`, we assume the model weights are not quantized and use `dtype` to\ndetermine the data type of the weights.', metavar=None)
```

Thus, I have added a check for `choices` to handle the scenario of
`choices=None`.
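
A minimal sketch of such a guard, using only the standard `argparse` module. The helper name `register_ascend_quantization` and the constant below are hypothetical and are not the names used in vllm-ascend; the sketch also relies on argparse's private `_option_string_actions` mapping to look up the action.

```python
import argparse

# Illustrative constant; not the identifier used in vllm-ascend.
ASCEND_QUANTIZATION_METHOD = "ascend"


def register_ascend_quantization(parser: argparse.ArgumentParser) -> None:
    """Add "ascend" to the --quantization choices, if choices are restricted."""
    quant_action = parser._option_string_actions.get("--quantization")
    if quant_action is None:
        return  # the option is not registered on this parser
    # `choices` is None when the option accepts any string, so the
    # membership test must be guarded to avoid the TypeError shown above.
    if (quant_action.choices is not None
            and ASCEND_QUANTIZATION_METHOD not in quant_action.choices):
        quant_action.choices = list(quant_action.choices) + [
            ASCEND_QUANTIZATION_METHOD
        ]
```

When `choices` is `None` nothing needs to be registered, because argparse already accepts any value for the option.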

### Does this PR introduce _any_ user-facing change?
yes, vllm server with ascend quantization works now.

### How was this patch tested?
by `vllm server --quantization ascend` command.

Related: https://github.com/vllm-project/vllm/issues/19004

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-03 11:44:45 +08:00
Shanshan Shen
93860574bb [ModelRunner][MultiModal] Remove legacy input mapper/processor from V0 (#951)
### What this PR does / why we need it?
Remove legacy input mapper/processor from V0.

Find more details at
https://github.com/vllm-project/vllm-ascend/issues/673 and
https://github.com/vllm-project/vllm/pull/15686.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
Launch online service:

```bash
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--max_model_len 32768 \
--max-num-batched-tokens 32768
```

Query the server:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustrate?"}
    ]}
    ]
    }'
```

Result:

```bash
{"id":"chatcmpl-619e70733ed148b3be3a0b6524ee0ef3","object":"chat.completion","created":1748226332,"model":"/home/sss/.cache/modelscope/hub/models/Qwen/Qwen2___5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The text in the illustration reads \"TONGYI Qwen.\"","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"pro
```

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-03 11:32:03 +08:00
NINGBENZHE
6ec64a3f96 [bugfix] some bugs maybe fail to run (#896)
### What this PR does / why we need it?
Solve the bug that the graph mode is the same as p and d, and some other
bugs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Follow the end-to-end test

Signed-off-by: ningbenzhe1 <ningbenzhe@huawei.com>
2025-06-03 11:07:33 +08:00
Yikun Jiang
92bc5576d8 Skip benchmarks/** in vllm ascend test (#1041)
### What this PR does / why we need it?
Skip benchmarks/** in vllm ascend test to reduce CI cost

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-01 19:01:26 +08:00
NeverRaR
507ae627ca feat: support compile torchair graph while warming up (#839)
### What this PR does / why we need it?
feat: support compile torchair graph while warming up

Signed-off-by: boying <897013703@qq.com>
2025-05-31 06:03:03 +08:00
Li Wang
d9fb027068 [CI] Add benchmark workflows (#1014)
### What this PR does / why we need it?

Add benchmark workflows

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run locally

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-05-30 22:42:44 +08:00
yiz-liu
5a1689fc64 [Fix] Fix update_aclgraph_sizes when running MoE models (#913)
### What this PR does / why we need it?
Fix update_aclgraph_sizes when running MoE models.

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-30 15:17:11 +08:00
XWFAlone
3442fbdb23 [1/N][UT][v1 MTP] add basic v1 mtp features (#890)
### What this PR does / why we need it?
add basic v1 mtp features
please merge it after
https://github.com/vllm-project/vllm-ascend/pull/874 and
https://github.com/vllm-project/vllm-ascend/pull/844.

### Does this PR introduce _any_ user-facing change?
Basic v1 MTP is now supported, limited for now to TP-only, eager mode, and
k=1.
We will continue to expand to more scenarios.

### How was this patch tested?
Tested locally

Signed-off-by: XWFAlone <xuewenfei2@huawei.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: JC-ut0 <xuyexiong@huawei.com>
2025-05-30 08:59:58 +08:00
wangxiyuan
5903547d09 [doc] add 0.7.3.post1 release note (#1008)
Add release note for 0.7.3.post1
Add the missing release note back for 0.7.3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-29 17:38:34 +08:00
22dimensions
c464c32b81 add doc for offline quantization inference (#1009)
add example for offline inference with quantized model

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-05-29 17:32:42 +08:00
zouyida2052
05a471001b bugfix for qwen2_5_vl (#805)
### What this PR does / why we need it?
The qwen2.5vl interface changed from column linear to qkv linear,
which broke our weight padding function, so we optimized the
split_qkv function to fix this bug.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
with CI

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-05-29 17:20:39 +08:00
Mengqing Cao
a93bed4535 [aclgraph] implentment NPUPiecewiseBackend to enable aclgraph (#836)
### What this PR does / why we need it?
1. Implement `NPUPiecewiseBackend` to enable aclgraph
2. Enable aclgraph by default in V1, but raise an error when running
deepseek and a warning when running models other than qwen

### How was this patch tested?
CI pass with the new ut

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 11:58:26 +08:00
Mengqing Cao
cc74b97f74 [Bugfix][V1] Fix deepseek with v1 (#958)
### What this PR does / why we need it?
Fix deepseek with v1. The error was introduced by
https://github.com/vllm-project/vllm-ascend/pull/945. This PR fixes
the block table of mla.

### How was this patch tested?
CI passed with newly added test.

Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-05-29 11:57:43 +08:00
ApsarasX
e3c7f71462 [Perf] Refactor tensor disposal logic to reduce memory usage (#966)
### What this PR does / why we need it?
1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580
https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory
by promptly deleting unnecessary tensors. For tensors passed from
upper-layer functions, I used a list container to transfer the parameter
and then popped the tensor from the list within the inner function to
achieve deletion. Recently, I discovered a better implementation in
sglang—the `dispose_tensor` function and I recommend adopting this
approach.
2. Dispose `hidden_states` and `residual` from the previous layer once
they're no longer used.
3. Avoid generating `self.inputs_embeds` in `ModelRunnerV1` in
non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model
under the conditions of `TP=16` and `max-model-len=32768`, we can save
1.3GB of npu memory.

**Reference**: https://github.com/sgl-project/sglang/pull/6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-05-29 11:48:26 +08:00
Mengqing Cao
6eddbd2521 [CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889)
Initialize PD Disaggreate UT

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 10:17:12 +08:00
wangxiyuan
f6e5decc10 [CI] upgrade to vllm 0.9.0 (#959)
Upgrade to vllm 0.9.0.
0.8.5 will not be supported any more.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 21:18:41 +08:00
wangxiyuan
e2a0c19cea [CI] Refactor CI (#952)
1. remove some useless test func and file
2. fix format.sh problem
3. enable full test for singlecard and multicard
4. move long-term tests to the long_term folder. These tests only run for
labeled PRs and in the daily test. They include: spec decode and accuracy tests

## After refactor:
There are 4 test modules
- `singlecard`: contains the test running on one NPU. It'll be run for
each PR and daily test.
- `multicard`: contains the test running on multi NPUs. It'll be run for
each PR and daily test.
- `long_term`: contains the tests that take a long time (currently `spec
decode` and `accuracy` tests). It'll be run for PRs labeled
`long-term-test` and in the daily test.
- `e2e`: contains the test for doc and pd feature. It'll be run for the
PR with `pd-test` labeled and daily test.

## Todo:
1. some tests are skipped; they should be fixed and re-enabled in the
future.
2. the pyhccl test for multicard doesn't work at all. It should be enabled
as well.
3. ensure long-term-test passes in the daily test.

### Known issue
Currently, the `ready` label is required to start the pd test or long-term test. And
when `long-term-test` or `pd-test` is added after the other one, the
previously labeled test will be re-run. So the labeled tests should be run
in the following steps:

1. decide which tests need to run, then add the label: `long-term-test` or
`pd-test` or both.
2. add the `ready-for-test` label; the tests will then run.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 06:31:35 +08:00
Angazenn
9f5ab59e30 [WIP][BugFix]Fix accuracy issues caused by wrong etp_size passed into FusedMoEParallelConfig when using vLLM 0.9.0 (#961)
### What this PR does / why we need it?
This PR fixes accuracy issues caused by the code that adapts to
`FusedMoEParallelConfig` in vLLM 0.9.0. The `tp_size` used to
split weights is passed incorrectly. The root cause is that the vLLM community
and vLLM-Ascend use different methods to decide whether to use
Expert Parallel.

vLLM:
vLLM uses a flag `enable_expert_parallel` to indicate whether to use EP
and uses the following code to decide `ep_size`:
```
        use_ep = (dp_size_ * tp_size_ > 1
                  and vllm_parallel_config.enable_expert_parallel)

        dp_size = dp_size_
        dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0
        tp_size, tp_rank = flatten_tp_across_dp(dp_rank)

        if not use_ep:
            return FusedMoEParallelConfig(tp_size=tp_size,
                                          tp_rank=tp_rank,
                                          dp_size=dp_size,
                                          dp_rank=dp_rank,
                                          ep_size=1,
                                          ep_rank=0,
                                          use_ep=False)
        # DP + EP / TP + EP / DP + TP + EP
        assert use_ep
        # In EP, each device owns a set of experts fully. There is no tensor
        # parallel update tp_size, tp_rank, ep_size and ep_rank to reflect that.
        ep_size = tp_size
        ep_rank = tp_rank
        return FusedMoEParallelConfig(tp_size=1,
                                      tp_rank=0,
                                      dp_size=dp_size,
                                      dp_rank=dp_rank,
                                      ep_size=ep_size,
                                      ep_rank=ep_rank,
                                      use_ep=True)
```

vLLM-Ascend:
vLLM-Ascend uses `etp` to specify Tensor Parallel in MoE.
```
            self.ep_size = get_ep_group().world_size
            self.tp_size = get_etp_group().world_size
            self.dp_size = (dp_size if dp_size is not None else
                            get_dp_group().world_size)
```

So there will be conflicts if we simply combine these pieces of code.
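
To make the mismatch concrete, here is a simplified, self-contained sketch of the two derivations. It reduces the quoted snippets to plain integer arithmetic; all names (`MoESizes`, `vllm_style_sizes`, `ascend_style_sizes`) are hypothetical, ranks are omitted, and it assumes that `flatten_tp_across_dp` yields a flattened TP size of `dp_size * tp_size`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoESizes:
    tp_size: int
    dp_size: int
    ep_size: int


def vllm_style_sizes(tp_size: int, dp_size: int,
                     enable_expert_parallel: bool) -> MoESizes:
    """Sizes as derived by the vLLM snippet above (ranks omitted)."""
    use_ep = dp_size * tp_size > 1 and enable_expert_parallel
    # Assumption: TP is flattened across DP, giving dp_size * tp_size.
    flat_tp = dp_size * tp_size
    if not use_ep:
        return MoESizes(tp_size=flat_tp, dp_size=dp_size, ep_size=1)
    # With EP, each device owns whole experts, so tp_size collapses to 1.
    return MoESizes(tp_size=1, dp_size=dp_size, ep_size=flat_tp)


def ascend_style_sizes(ep_world_size: int, etp_world_size: int,
                       dp_world_size: int) -> MoESizes:
    """Sizes read directly from the EP / ETP / DP group world sizes."""
    return MoESizes(tp_size=etp_world_size, dp_size=dp_world_size,
                    ep_size=ep_world_size)
```

For example, with `tp=8`, `dp=2`, and expert parallel enabled, the vLLM-style derivation gives `tp_size=1` and `ep_size=16`, while group sizes of `ep=1`, `etp=16` give `tp_size=16` and `ep_size=1`. Weights split with one `tp_size` and consumed with the other would not line up.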

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-27 15:16:17 +08:00
Shuqiao Li
01e3d59eae add workflow to build and release wheel (#775)
### What this PR does / why we need it?

This continues the work of #716.
This PR adds a workflow to build and release the wheel, and also to release
the source to PyPI.
Three conditions trigger the workflow:

1. PR to `main` and `*-dev`
2. push to `main` and `*-dev`
3. push tag with name of `v*`

Release to PyPI happens only under condition 3. Under conditions 1 and 2,
the workflow generates the .tar.gz, builds the .whl, and uploads both to
GitHub artifacts, but does not release.

Update:
A scheduled task also builds the .whl and uploads it to GitHub artifacts.
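
The trigger rules above can be summarised as a small decision function. This is only a sketch of the described behaviour, not the workflow itself (which is a GitHub Actions file). The function name `plan_release`, its arguments, and the return values `"publish"`, `"build"` and `"skip"` are all hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the three trigger conditions described above.
BRANCH_PATTERNS = ("main", "*-dev")
TAG_PATTERN = "v*"


def plan_release(event: str, ref_type: str, ref_name: str) -> str:
    """Return "publish", "build" or "skip" for a hypothetical CI event.

    event:    "push", "pull_request" or "schedule"
    ref_type: "branch" or "tag" (for pull requests, the target branch)
    ref_name: branch or tag name, e.g. "main" or "v0.9.0"
    """
    if event == "schedule":
        # Scheduled runs build and upload artifacts but never release.
        return "build"
    if event == "push" and ref_type == "tag" and fnmatchcase(
            ref_name, TAG_PATTERN):
        # Condition 3: only tag pushes are released to PyPI.
        return "publish"
    if event in ("push", "pull_request") and ref_type == "branch" and any(
            fnmatchcase(ref_name, pattern) for pattern in BRANCH_PATTERNS):
        # Conditions 1 and 2: build and upload artifacts only.
        return "build"
    return "skip"
```

For example, `plan_release("push", "tag", "v0.9.0")` selects the publish path, while a push to a `*-dev` branch only builds.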


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
All trigger conditions were tested on my fork.

---------

Signed-off-by: Shuqiao Li <celestialli@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-26 14:18:26 +08:00
Mengqing Cao
a0c3e9ba50 [Bugfix] Adjust inputbatch to be compatible with latest vllm (#945)
Adjust inputbatch to be compatible with the latest vllm, as the kvcache group
feature was reworked in https://github.com/vllm-project/vllm/pull/18593

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-26 10:33:28 +08:00
Angazenn
1f9fb869ad [BugFix] Fix accuracy bugs for unquantized deepseekv3 models (#897)
### What this PR does / why we need it?
This PR fixes two accuracy bugs introduced by PR #819 when running
deepseekv3 series models:
1. #819 adds `all_to_all` communication in the quantized case, but
removes `all_gather` and `reduce_scatter` in both the quantized and
unquantized cases. When running unquantized deepseekv3 models with
`ep_size == world_size`, the moe modules fail to communicate. Therefore,
this PR adds `all_to_all` communication to the unquantized case to
solve this accuracy issue.
2. Use `ep_size` rather than `dp_size` to decide whether to use
`all_to_all` in moe.
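
A minimal sketch of why the second fix matters, assuming the decision is a simple "greater than 1" check (the actual condition in the code is not shown here). The helper names are hypothetical.

```python
def needs_all_to_all_by_dp(dp_size: int) -> bool:
    # Old behaviour (assumed form): keyed on data-parallel size.
    return dp_size > 1


def needs_all_to_all_by_ep(ep_size: int) -> bool:
    # Fixed behaviour (assumed form): keyed on expert-parallel size.
    return ep_size > 1


# Hypothetical TP + EP deployment without data parallelism:
# experts are spread over 8 devices, but dp_size is 1.
dp_size, ep_size = 1, 8
old_choice = needs_all_to_all_by_dp(dp_size)
new_choice = needs_all_to_all_by_ep(ep_size)
```

In this configuration the `dp_size` check skips the exchange even though tokens must reach experts on other devices, while the `ep_size` check enables it.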

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added and existing tests.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-24 14:29:36 +08:00
yiz-liu
17f05b1089 [Feature] Add CustomQwen3MoeForCausalLM model (#925)
Tweak packed_modules_mapping to support W8A8 weights.


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-05-23 15:50:48 +08:00