This PR added the unit test framework to enable ut for vLLM Ascend. Unit
test runs on CPU machines. It'll be ran once lint check is passed the
same as e2e test.
For unit test, this PR created a new folder called `ut` under `tests`
module. All the test file in `ut` should keep the same with the code in
`vllm-ascend`. The file name should be start with `test_` prefix. For
example, in this PR. the `test_ascend_config.py` is added for
`ascend_config.py` test.
A new fille `worker/test_worker_v1.py` is also added as the placeholder.
This file should be the unit test for `vllm-ascend/worker/worker_v1.py`.
Additional, a new `fake_weight` folder is added, it contains the
config.json from `facebook/opt-125m`, so that the test will not always
visit huggingface.
TODO:
We should add all the unit test file one by one in the future.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add ut for torchair graph mode on DeepSeekV3
### How was this patch tested?
CI passed with new added test.
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Contains on #1111 for completeness.
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where computation of shared experts will be overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, weights of
shared experts will be force to replicate across all cards, regardless
of any tensor parallelism configurations, to avoid AllReduce operations.
With the expected overlaping being:
```
| shared gate_up | shared act | | shared down |
| dispatch | routed gate_up, act, down | combine |
```
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
No.
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
Fix typo of VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Based on the design of dual-batch overlap proposed by Deepseek team and
also the implementation of fused moe in VLLM project, we implement the
multi-stream(also known as dual-batch) overlap for deepseek+mla on
Ascend NPU. We split the input batch of model into two microbatches and
then overlap the comp/comm ops in attention and moe layers using two
streams to improve the performance. Our approach can be easily extended
when adding dispatch/combine communications for moe layer.
Compared with the previously proposed
[draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use
one stream for computation ops and the other for communication ops,
separately. In out opinions, it is beneficial for arranging the order of
executing different ops and thus avoiding the contention of
computation/communication resources.
ref: [overlap for
llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in
sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
### Does this PR introduce _any_ user-facing change?
Adding an env variable "VLLM_ENABLE_DBO". Users can enable dbo by
setting "VLLM_ASCEND_ENABLE_DBO=1"
See /examples/offline_dualbatch_overlap_npu.py for more info.
### How was this patch tested?
This patch can be tested with vllm-0.9.0 using its online service with
benchmark tests. We have decoupled the func of dbo from vllm and it
should be able to run without any modification to the code of vllm(some
modifications is better to implement in vllm though).
Any advice/discussion is welcome.
### Performance Benchmark
We have ran the benchmark_serving script of vllm to test the performance
after using dual-batch overlap.
`python -m vllm.entrypoints.openai.api_server \
--model=DeepSeek-R1-W8A8 \
--trust-remote-code \
--distributed-executor-backend=mp \
-tp=16 \
--port 8006 \
--max-num-seqs 390 \
--max-model-len 32768 \
--max-num-batched-tokens 65536 \
--block-size 128 \
--compilation_config 0 \
--gpu-memory-utilization 0.90 \
--disable-log-requests \
--additional-config
'{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'`
and run benchmark with the parameters of :
`--dataset-name random --random-input-len 4096 --random-output-len 1
--num-prompts 200 --max-concurrency 8 --request-rate 5
--metric-percentiles 90`
1. test with the version using allgather+allreduce in Ascend 910B (tp16
ep16 + deepseek r1 w8a8)
2. test with the version using alltoall:
prefill qps: 0.90 -> 1.01
Mean TTFT:8226->7432ms
The overlap approach when using alltoall communication can be further
optimized by overlapping micro-batch1's moe comp with micro-batch2's
dispatch a2a comm
---------
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.
### Does this PR introduce _any_ user-facing change?
Added VLLM_ENABLE_TOPK_OPTIMZE config in sampler
### How was this patch tested?
pytest test_sampler.py
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
More and more config options are added to additional_config. This PR
provide a new AscendConfig to manage these config options by an easier
way to make code cleaner and readable.
This PR also added the `additional_config` doc for users.
Added the test_ascend_config.py to make sure the new AscendConfig works
as expect.
TODO: Add e2e test with torchair and deepseek once the CI resource is
available.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
1. Implentment `NPUPiecewiseBackend` to enable aclgraph
2. Eable aclgraph by default in V1, but raise error when running
deepseek and raise warning when running models except for qwen
### How was this patch tested?
CI pass with the new ut
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Fix deepseek with v1, this error is introdeced by
https://github.com/vllm-project/vllm-ascend/pull/945. and this pr fix
the block table of mla
### How was this patch tested?
CI passed with new addedtest.
Signed-off-by: Mengqing Cao <cmq0113@163.com>
1. remove some useless test func and file
2. fix format.sh problem
3. enable full test for singlecard and multicard
4. move long term test to long_term folder. For this kind of test, it
only runs by labeled and daily test. Include: spec decode、accuracy test
## After refactor:
There are 4 test modules
- `singlecard`: contains the test running on one NPU. It'll be run for
each PR and daily test.
- `multicard`: contains the test running on multi NPUs. It'll be run for
each PR and daily test.
- `long_term`: contains the test that cost much time(Now include `spec
decode` and `accuracy` test). It'll be run for the PR with
`long-term-test` labeled and daily test.
- `e2e`: contains the test for doc and pd feature. It'll be run for the
PR with `pd-test` labeled and daily test.
## Todo:
1. some test are skipped, they should be fixed and reenabled in the
future.
2. pyhccl test for multicard doesn't work at all. It should be enabled
as well.
3. ensure long-term-test pass by daily test.
### Know issue
Now, `ready` labels is required to start pd test or long term test. And
when `long-term-test` or `pd-test` is labeled after another one, the old
labeled test will be re-run again. So the labeled test should be ran in
the following step:
1. decide which test need run, then label it. `long-term-test` or
`pd-test` or both.
2. add `ready-for-test` label, then the test will be ran.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add V1Engine LoRA support.
Add LoRA e2e test on single card and multiple cards.
### Does this PR introduce _any_ user-facing change?
support lora for V1
### How was this patch tested?
CI passed with new added test
---------
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: paulyu <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: jesse <szxfml@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
This PR add new function of : aclgraph_batch_size can dynamic adjust to
different model; before this PR, the aclgraph_batch_sizes given from
vllm to vllm-ascend always too large, and that may result in ERROR while
running on different, with the information: "The resources are
insufficient".
Now, with this PR, the code can dynamic adjust aclgraph_batch_sizes
depend on the model hidden_layer_nums and parallel config, for example:
a. for Qwen2.5-7B, the aclgraph_batch_size length is 33 total;
b. for Qwen2.5-72B, the aclgraph_batch_size length is 11 total;
Signed-off-by: chris668899 <15105191595@126.com>
### What this PR does / why we need it?
Enforce eager mode in the V1 engine ahead of the upcoming CANN and
torch_npu releases.
### Does this PR introduce _any_ user-facing change?
After this change, users will no longer need to manually set
enforce_eager=True.
### How was this patch tested?
Test it with regular offline inference examples.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
This PR supports the access of vllm-acend to the piecewise_graph feature
provided by the v1 engine.
1. register unifiled_ascend_attention_with_output for piecewise_graph to
split graph.
2. support NPUGraph to accelerate kernel launch.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
support npugraph to default, Users can disenable the npugraph feature by
configuring enforce_eager.
This has corresponding requirements for the versions of torch_npu and
CANN, and they need to support graph capture.
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
it turn to default
---------
Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This support speculative decoding in Ascend, including speculating with
a draft model、by matching n-grams in the prompt、using MLP speculators
and using EAGLE based draft models.
Backport: https://github.com/vllm-project/vllm-ascend/pull/423
spec decode MultiStepWorker support TP1DraftModelRunner fully, support
run the draft_model_runner with multi-step prepare on the NPU directly
and support draft_model_runner use MLA.
1. before this pr, `MultiStepWorker` would not step into the branch
using NPU prepare, but only into the branch using CPU prepare (`line 52`
of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has
`no effect` on the `correct operation` of speculative decoding and the
performance of the two branches is basically the same as of the current
version, I support entering this branch in this PR. In general, there
are two main changes in `patch_multi_step_worker.py`: first, the
`is_cuda_like()` check is removed and the `TP1DraftModelRunner`
rewritten in vllm_ascend is used; second, the
`supports_gpu_multi_step()` function is made to return true on NPU
devices when outer Multi_step_worker could work correct.
3. before this pr, `TP1DraftModelRunner` only supports Attention on NPU,
but not MLA. The relevant adaptation is in
`vllm_ascend/worker/draft_model_runner.py`. Although I don’t know why
the `input_positions` of `model_input.attn_metadata` in vllm-ascend
needs to be added in `execute_model`, it is done in `model_runner.py`,
so I also made corresponding changes. Otherwise, when atten_backend is
MLA, it will prompt that input_positions cannot be found.
4. I commented out two lines in `draft_model_runner.py` in `line118` to
support the scenario of K>1.
```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```
I added comments. In the future, when vllm-ascend supports lora feature,
the changes here can be restored.
TODO:
- [ ] revert the patch when the related issues are addressed in vllm
### How was this patch tested?
CI passed with new added test.
- e2e test for medusa proposer:
tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer:
tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer:
tests/singlecard/spec_decode/e2e/test_ngram_correctness.py
Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
### What this PR does / why we need it?
This PR enable custom ops build by default.
### Does this PR introduce _any_ user-facing change?
Yes, users now install vllm-ascend from source will trigger custom ops
build step.
### How was this patch tested?
By image build and e2e CI
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>