1. Add `__init__.py` for vllm_ascend/compilation to make sure it's a
python module
2. Fix a model runner bug to stay consistent with vllm
3. Add release note for 0.9.0rc2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Make accuracy CI and report work
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually reviewed
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Fix incompatibility problem for non-EPLB scenarios in #1116
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with online serving and e2e CI.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
1. Update 0.9.0rc1 release date
2. Update feature and model support list
3. Add DP known issue to release note
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add EPLB expert map import capabilities
### Does this PR introduce _any_ user-facing change?
When importing an EPLB expert map, you need to pass the expert map file
via the vLLM `additional_config` argument.
### How was this patch tested?
1. You need to collect expert hotness and generate an expert placement
file based on the hotness and the EPLB algorithm, or you can directly
use an existing expert placement table.
2. When launching vLLM, enable EC2 and pass the configuration via the
command-line argument:
`--additional-config '{"expert_map_path": "/xxx/xxx/xx.json"}'`
Co-authored-by: songshanhu07 <1763685535@qq.com>
---------
Signed-off-by: songshanhu07 <1763685535@qq.com>
Signed-off-by: Yuxiao-Xu <664988918@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: songshanhu07 <1763685535@qq.com>
Co-authored-by: Xu Yuxiao <xuyuxiao2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Remove the `spec_decode.metrics` patch as this has been resolved in
https://github.com/vllm-project/vllm/pull/16983 (included in vllm
`v0.9.0`).
Before: "Returns a CUDA event recording when the copy is complete."
After: "Returns a device event (NPU Event for vllm-ascend) recording when the copy is complete."
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
With this PR, we can migrate to the native `data_parallel.py` in vllm
examples and remove the version in vllm-ascend.
At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable
difficulties; therefore, we must employ a temporary workaround and
manually specify the device.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Set the `ACL_OP_INIT_MODE` env var default to `0`, since vllm-ascend may
have problems in some scenarios when it is set to `1`.
Plus, the guide https://github.com/vllm-project/vllm-ascend/issues/734
has also been updated.
Signed-off-by: shen-shanshan <467638484@qq.com>
Add unpadded Qwen2.5-VL for the verl scenario.
When using vllm-ascend in the verl scenario, set `USE_OPTIMIZED_QWEN2_5_VL`
(default `1`) to `0` to use the unpadded Qwen2.5-VL and avoid errors.
This is cherry-picked from 0.7.3-dev
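A minimal sketch of the switch described above, assuming the env var must be set before vllm/vllm-ascend is imported; the model name is a placeholder.

```python
# Hedged sketch: opt out of the optimized (padded) Qwen2.5-VL for verl.
import os
os.environ["USE_OPTIMIZED_QWEN2_5_VL"] = "0"  # default is "1"

from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")  # placeholder model name
```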
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Fix typo of VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
vllm-ascend does not support multimodal models on vLLM v1 yet, so this PR
changes `model_runner_v1.py` to use the MRoPE feature (among other changes)
to add that support. It is still not perfect: the Ascend operator does not
support `window/full attn` to reduce Memcpy operations, so it would run out
of memory if the input embedding is too large. For that reason we can't use
`self._profile_multimodal()` for profiling, since it uses a big dummy
multimodal input (i.e. images).
Fixes: https://github.com/vllm-project/vllm-ascend/issues/514
### Does this PR introduce _any_ user-facing change?
No, this feature does not require any user-facing change.
### How was this patch tested?
I tested this offline on my 910B3 machine with my own fork, and it works
well.
---------
Signed-off-by: cty <ctynb@qq.com>
### What this PR does / why we need it?
Based on the dual-batch overlap design proposed by the DeepSeek team and
the fused MoE implementation in the vLLM project, we implement
multi-stream (also known as dual-batch) overlap for DeepSeek + MLA on
Ascend NPU. We split the input batch of the model into two microbatches
and then overlap the computation/communication ops in the attention and
MoE layers using two streams to improve performance. Our approach can be
easily extended when adding dispatch/combine communications for the MoE
layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops. In
our opinion, this is beneficial for arranging the order in which different
ops execute and thus avoiding contention for computation/communication
resources.
ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
### Does this PR introduce _any_ user-facing change?
Adds an env variable `VLLM_ASCEND_ENABLE_DBO`. Users can enable DBO by
setting `VLLM_ASCEND_ENABLE_DBO=1`.
See /examples/offline_dualbatch_overlap_npu.py for more info.
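A minimal sketch of enabling the feature, assuming the env var is read at engine start-up; the model and parallel settings are placeholders (the shipped example in /examples/offline_dualbatch_overlap_npu.py is the authoritative reference).

```python
# Hedged sketch: turn on dual-batch overlap before constructing the engine.
import os
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=16)  # placeholder settings
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=8))
```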
### How was this patch tested?
This patch can be tested with vllm 0.9.0 by running its online service
with benchmark tests. We have decoupled the DBO functionality from vllm,
so it should run without any modification to vllm's code (though some
modifications would be better implemented in vllm).
Any advice/discussion is welcome.
### Performance Benchmark
We ran vllm's benchmark_serving script to test the performance with
dual-batch overlap enabled.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model=DeepSeek-R1-W8A8 \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  -tp=16 \
  --port 8006 \
  --max-num-seqs 390 \
  --max-model-len 32768 \
  --max-num-batched-tokens 65536 \
  --block-size 128 \
  --compilation_config 0 \
  --gpu-memory-utilization 0.90 \
  --disable-log-requests \
  --additional-config '{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'
```
and ran the benchmark with the following parameters:
```bash
--dataset-name random --random-input-len 4096 --random-output-len 1 \
--num-prompts 200 --max-concurrency 8 --request-rate 5 \
--metric-percentiles 90
```
1. Tested the version using allgather+allreduce on Ascend 910B (tp16
ep16 + DeepSeek R1 W8A8).
2. Tested the version using alltoall:
   - prefill QPS: 0.90 -> 1.01
   - Mean TTFT: 8226 ms -> 7432 ms
The overlap approach with alltoall communication can be further
optimized by overlapping micro-batch 1's MoE computation with
micro-batch 2's dispatch all-to-all communication.
---------
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
### What this PR does / why we need it?
View optimization in torchair (enabled by default for a Transpose with
any axis of size 1) prevents the weight Transpose from being fused with
the subsequent GroupedMatmul, which decreases the performance of the MoE
layer when expert parallelism equals the total number of experts
(e.g. EP256 for DSKv3).
Add an option to solve this problem by disabling the optimization.
### Does this PR introduce _any_ user-facing change?
Controlled by
`additional_config.torchair_graph_config.enable_view_optimize`,
which defaults to `True`.
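A hedged sketch of turning the optimization off, assuming the nested `torchair_graph_config` dict is accepted through `additional_config` as described above; the model name is a placeholder.

```python
# Hedged sketch: disable the torchair view optimization via additional_config.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # placeholder model
    additional_config={
        "torchair_graph_config": {"enable_view_optimize": False},
    },
)
```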
### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
- Set default values to fix spec decode
- To avoid OOM, we need to run the test in a single process
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- CI passed, especially multicards CI
- For the spec decode test, long term CI passed
Closes: https://github.com/vllm-project/vllm-ascend/pull/1105
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
[CI] MoE alltoall communication optimization
The DeepSeek V3/R1 model has 256 routed experts. During parallel
inference, if the load on one EP rank is high, it slows down the overall
communication and computation time, so the unevenly distributed load
becomes a weakness of parallel inference. The data volume in the prefill
phase is large, and the inter-card communication/computation time is
closely tied to that data volume, so a small non-linear precision loss
can be traded for a near-linear performance improvement.
During parallel inference, communication implies global synchronization:
cards with low load finish their computation first and then wait for the
card with the highest load to finish. Therefore, if the load is
unbalanced, the most loaded card slows down the overall time. Significant
performance gains can be achieved by discarding a small number of tokens,
which is unacceptable in some precision-sensitive scenarios; however,
similar to quantization, this is a technique that accepts a tolerable
precision loss in some scenarios in exchange for performance. In
addition, the trade-off between performance and precision can be tuned by
configuring the proportion of discarded tokens.
We tested on A3 with batch size 8 (B), prompt length 3.5K tokens (S), and
the following parallel configuration: AttnDP=2, AttnTP=8, MoeTP=1, and
MoeEP=16. In this scenario, we got a 10%-15% performance gain.
Plus, in the next version we'll have an alltoallv MoE.
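To make the token-dropping idea concrete, here is a minimal, framework-agnostic sketch of capacity-based dropping before dispatch; it is illustrative only (not the vllm-ascend kernel), and the `capacity_factor` name and default are assumptions.

```python
# Hedged sketch: per-expert capacity truncation before all-to-all dispatch.
# Token assignments routed beyond an expert's capacity are dropped so the
# hottest EP rank no longer dominates the step time.
import torch

def keep_mask_after_capacity_drop(expert_ids: torch.Tensor, num_experts: int,
                                  capacity_factor: float = 1.2) -> torch.Tensor:
    """expert_ids: 1-D tensor with the expert chosen for each token assignment."""
    num_assignments = expert_ids.numel()
    capacity = max(1, int(capacity_factor * num_assignments / num_experts))
    keep = torch.zeros(num_assignments, dtype=torch.bool)
    for e in range(num_experts):
        slots = (expert_ids == e).nonzero(as_tuple=True)[0]
        keep[slots[:capacity]] = True  # assignments beyond capacity are dropped
    return keep

# Example: 16 assignments over 4 experts with a skewed (hot expert 0) distribution.
ids = torch.tensor([0] * 10 + [1, 1, 2, 2, 3, 3])
print(keep_mask_after_capacity_drop(ids, num_experts=4).sum())  # -> 10 kept
```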
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
When profiling, it is often necessary to disable the call stack to
reduce profiling overhead, and adjust the profiler_level to level1 to
obtain more detailed operator and communication information.
Therefore, it is recommended to modify the default profiling
configuration.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
Fix the bug in torch 2.5.1 that raises a segmentation fault when
`pin_memory` is enabled while creating a tensor using `torch.tensor`.
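An illustrative sketch of the failure mode and one possible workaround pattern (pin after creation); the workaround shape is an assumption rather than necessarily the fix this PR applies, and it requires an accelerator backend to be present.

```python
# Hedged sketch of the issue and a workaround (torch 2.5.1).
import torch

data = list(range(8))
# x = torch.tensor(data, pin_memory=True)  # can segfault on torch 2.5.1 in this scenario
x = torch.tensor(data).pin_memory()        # workaround sketch: create first, then pin
```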
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
- Adds support for passing prompt_embeds to LLM.generate as
```python
llm.generate({"prompt_embeds": input_embeds}, sampling_params)
```
or
```python
llm.generate(
[{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)
```
- Add `prompt_embeds` to examples
### How was this patch tested?
CI passed with new added/existing test.
I have also tested with the example script in this PR, and the output
looks good:
```bash
[Single Inference Output]
------------------------------
The capital of France is Paris. Paris is the largest city in France and is
------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3966.87it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.99it/s, est. speed input: 177.08 toks/s, output: 63.91 toks/s]
[Batch Inference Outputs]
------------------------------
Q1: Please tell me about the capital of France.
A1: The capital of France is Paris. It is located in the northern part of the
Q2: When is the day longest during the year?
A2: The day is longest during the year at the summer solstice. This typically occurs
Q3: Where is bigger, the moon or the sun?
A3: The sun is significantly bigger than the moon.
The sun has a diameter of
------------------------------
```
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add `with_prefill_across_dp` to AscendMetadata to fix DP.
This PR fixes the bug introduced by #1012, which added the arg
`with_prefill_across_dp` when dp_size > 1.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove the cumsum operator in MoE to improve performance.
### How was this patch tested?
It should be tested with the mc2 operator and graph mode enabled.
Signed-off-by: zhky <hahazhky@163.com>
Co-authored-by: 洪炜杰 <hongweijie1@huawei.com>
Fix the ascend config check logic:
1. Refactor `check_ascend_config` to make it clearer:
    - torchair graph mode should not work with `enforce_eager=True`
    - aclgraph should not work with torchair graph mode
2. Add config refresh for the RLHF case.
3. Fix a typo in the model runner.
4. Change the `expert_tensor_parallel_size` default to `0` to keep the
same behavior as before.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
The KV cache manager has been changed by
f8a1a2d108
This PR adapts the change in vllm-ascend to make CI happy.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
We need to **observe the time consumed in each stage of inference
(including pre-processing, model forward, etc.), without any performance
loss**.
Therefore, we use the event timestamp mechanism of the NPU to mark any
stage during the execution of the NPU device (this marking operation is
executed asynchronously, with no performance loss).
Additionally, we provide a blocking synchronization API
`pop_captured_sync` to be called at an appropriate time, to print the
time consumed in all observed stages.
**The model_runner_v1.py file only changed 5 lines, all of which are
`ProfileExecuteDuration()` calls; nothing else was changed, although the
diff shows more changes due to an alignment issue.**
### Does this PR introduce _any_ user-facing change?
Use the env var `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
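A hypothetical usage sketch built from the names mentioned above (`ProfileExecuteDuration`, `pop_captured_sync`, `VLLM_MODEL_EXECUTE_TIME_OBSERVE`); the import path and the capture context manager are assumptions rather than the verified API.

```python
# Hedged sketch: mark a stage asynchronously, then synchronize and print it.
import os
os.environ["VLLM_MODEL_EXECUTE_TIME_OBSERVE"] = "1"

import torch
from vllm_ascend.utils import ProfileExecuteDuration  # assumed import path

profiler = ProfileExecuteDuration()
with profiler.capture_async("forward"):                 # assumed capture API
    y = torch.randn(512, 512) @ torch.randn(512, 512)   # stand-in for a model forward
profiler.pop_captured_sync()                            # block and print observed stages
```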
### How was this patch tested?
Tested with the deepseek model. It prints like this:
```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```
---------
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
### What this PR does / why we need it?
Support multi-stream execution inside MoE for DeepSeek.
This feature requires graph mode with mc2 enabled.
---------
Signed-off-by: David9857 <985700846@qq.com>
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.
### Does this PR introduce _any_ user-facing change?
Added the VLLM_ENABLE_TOPK_OPTIMZE config to the sampler
### How was this patch tested?
pytest test_sampler.py
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
More and more config options are being added to additional_config. This
PR provides a new AscendConfig to manage these options in an easier way,
making the code cleaner and more readable.
This PR also adds the `additional_config` doc for users.
Added test_ascend_config.py to make sure the new AscendConfig works as
expected.
TODO: Add e2e test with torchair and deepseek once the CI resource is
available.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix benchmark results path
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This is for the benchmark iteration, which changes the benchmark scripts
while checking out each commit, so we need to ensure the benchmark
scripts are always available.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Adjust the concurrency group for each npu workflow:
- pd and benchmarks share static-08-01, so only one of those jobs can run
at a time
- for other jobs, each PR/schedule should have only one running job
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
- Remove workflow_dispatch
- Change schedule time to 2:00 UTC+8
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
CI passed
---------
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
### What this PR does / why we need it?
Update escli-tool to v0.2.1 to fix a dependency bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: wangli <858794774@qq.com>
### What this PR does / why we need it?
This is a post patch of #1014, with some convenience optimizations:
- Set a cached dataset path for speed
- Use PyPI to install escli-tool
- Add a benchmark results conversion script to produce developer-friendly
results
- Patch `benchmark_dataset.py` to disable streaming dataset loading over
the internet
- Add more trigger modes for different purposes: `pr` for debugging,
`schedule` for the daily test, `dispatch` and `pr-labeled` for manual
testing of a single (current) commit
- Disable the latency test for `qwen-2.5-vl` (this script does not
support multi-modal yet)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Add a bot to label merge conflicts; it helps developers and maintainers
keep code review and updates clear.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>