Commit Graph

27 Commits

Author SHA1 Message Date
Shanshan Shen
103654ccd6 [Misc] Remove redundant imported envs, using envs_ascend instead (#2193)
### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.

```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```

- vLLM version: v0.10.0
- vLLM main:
71683ca6f6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-14 09:33:39 +08:00
Ronald1995
b2598c3271 enable mm allreduce test (#2192)
### What this PR does / why we need it?
This PR is to add e2e test for using npu_mm_all_reduce_base fusion
kernel.
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
5d5d419ca6

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-08-07 17:19:23 +08:00
leo-pony
807f0895b2 Bump torch version to 2.7.1 (#1562)
### What this PR does / why we need it?
Bump torch version to 2.7.1, and cleanup infer schema patch
https://github.com/vllm-project/vllm-ascend/commit/857f489
(https://github.com/vllm-project/vllm-ascend/pull/837), this patch
depends on also: https://github.com/vllm-project/vllm-ascend/pull/1974

### Does this PR introduce any user-facing change?
No

#### How was this patch tested?
CI passed

torch-npu 2.7.1rc1 install guide:
https://gitee.com/ascend/pytorch/tree/v2.7.1/
install depending:
```
pip3 install pyyaml
pip3 install setuptools
```
install torch-npu:

Closes: https://github.com/vllm-project/vllm-ascend/issues/1866
Closes: https://github.com/vllm-project/vllm-ascend/issues/1390


- vLLM version: v0.10.0
- vLLM main:
9af654cc38

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-08-05 08:43:24 +08:00
wangxiyuan
9b67c87b14 [Refactor]Refactor sampler (#2050)
Refactor Sampler implementation from patch way to inherit from vLLM
Sampler interface.

Next step: Make the op `TopKTopPSampler` in vLLM support custom ops
register mechanism

- vLLM version: v0.10.0
- vLLM main:
61a6905ab0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-30 08:47:22 +08:00
Ronald1995
32a9c5f694 [Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926)
### What this PR does / why we need it?
it'll execute allreduce and malmul seperately in vllm RowParallelLinear
forward funcion, this function use torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a fused kernel way. this will gain a 20%
performance
promotion in eager mode.
### Does this PR introduce _any_ user-facing change?
this PR introduce a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether enable the feature or not.

### How was this patch tested?
the patch is tested by adding a new test file `test_patch_linear.py` to
guard the ut


- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-28 15:13:37 +08:00
Shanshan Shen
f9e2e9bb31 [Misc][V0 Deprecation] Remove Draft Model Runner Used for V0 Spec Decode (#1810)
### What this PR does / why we need it?
Remove draft model runner used for V0 spec decode.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
34cda778a0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 10:51:23 +08:00
Shanshan Shen
a929699e98 [Misc][V0 Deprecation] Remove multi-step worker (#1809)
### What this PR does / why we need it?
Remove multi-step worker

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:48:47 +08:00
Pr0Wh1teGivee
d13fb0766e [Perf] add patch to optimize apply_topk_topp (#1732)
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
### Does this PR introduce _any_ user-facing change?
Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature
### How was this patch tested?
e2e & ut

















- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-11 15:32:02 +08:00
wangxiyuan
a45dfde283 [CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy

1.
c1909e7e8c
changed moeConfig init way
2.
48fb076cbc
changed input batch logic.

This PR address these change to vllm-ascend.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
Pr0Wh1teGivee
2fda60464c [Perf] Use fused ops npu_top_k_top_p (#1308)
### What this PR does / why we need it?
Use fused ops torch_npu.npu_top_k_top_p(logits, p, k) when p and k are
not None, otherwise fallback to the original one. The replacement will
take place automatically when `VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1` .

This patch are using `npu_top_k_top_p` which required
torch_npu>=2.5.1.post1.dev20250619

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested by DeepSeek R1 and UT passed

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-06-25 20:59:06 +08:00
yuancaoyaoHW
e7efc7e7e7 [BugFix] Remove not using patch_eagle.py for CI. (#1385)
### What this PR does / why we need it?
This PR aims to address a long-standing **CI bug** and remove unused
code. The specific changes include:

1. **Fixing CI Bug**: Resolves the root cause of CI test failures or
instability. This often stems from incorrect environment configurations,
dependency version conflicts, or flawed test script logic. This fix
ensures the reliability and consistency of the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer utilized by the project. This file was likely legacy
code, experimental code, or its functionality has since been replaced by
other modules. Its removal helps reduce codebase complexity, improves
maintainability, and prevents potential confusion.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.

### How was this patch tested?
CI passed. Specifically:

1. **Existing CI Pipelines Passed**: After fixing the CI bug, all
existing CI tests and pipelines were verified to run correctly and pass
successfully.
2. **Code Cleanup Verified**: Following the removal of `patch_eagle.py`,
it was ensured that any related functional modules (if applicable)
continue to work as expected, without introducing new regressions. This
was typically verified by running the project's main test suite.

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-25 20:36:05 +08:00
wemaster
339d6894f6 [CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) wanted
to fix this problem. Unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why is there a problem
when ngram is not found when pr1109 is merged? A: The newly introduced
problem will only appear when tp>1, and the use cases on CI are all tp=1
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including eagle speculative UTs, which made CI
unable to take care of the eagle function. I added
it(`test_eagle_correctness.py`) back in this PR
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed this problem. It was because vllm's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. i found
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vllm, so i remove it in this pr.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Shanshan Shen
d2f87ed9cc [Patch] Remove spec_decode.metrics patch (#1016)
### What this PR does / why we need it?
Remove `spec_decode.metrics` patch as this has been resolved in
https://github.com/vllm-project/vllm/pull/16983 (include in vllm
`v0.9.0`).

Returns a CUDA event recording when the copy is complete **--after
modified-->** Returns a device event (NPU Event for vllm-ascend)
recording when the copy is complete.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-09 15:05:11 +08:00
linfeng-yuan
b80a484864 Fix typo of VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE (#1112)
### What this PR does / why we need it?
Fix typo of VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI passed

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-06-07 19:45:33 +08:00
Yikun Jiang
8d00775fce [SpecDecode][CI] Set default values to fix spec decode and fix multicard CI (#1109)
### What this PR does / why we need it?
- Set default values to fix spec decode
- To avoid oom, we need to run the test in a single process

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed, espcecially multicards CI
- For spec decode test, long term CI passed

Closes: https://github.com/vllm-project/vllm-ascend/pull/1105

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
2025-06-07 11:23:30 +08:00
sherie
908a851a77 optimize the funtion of computing topk and topp in sampler. (#970)
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.

### Does this PR introduce _any_ user-facing change?
Added VLLM_ENABLE_TOPK_OPTIMZE config in sampler

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
2025-06-05 16:42:18 +08:00
jiangpeng
df58fb80ee Spec decode support for V1 Engine (#874)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->
Make spec decode support for V1 Engine
- Currently, Ascend does not support the triton kernel. PyTorch is used
to rewrite the `rejection_sampler.py` triton kernel. However, PyTorch is
not as good as Triton. Therefore, ascend c is used to implement the
function in the future.
- Currently, spec decode supports only the ngram algorithm. The eagle
algorithm needs to be further adapted.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
Not change user facing.

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
test by `tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py` and
`tests/sample/test_rejection_sampler.py`, test base function of
rejection sampler and e2e function of spec decode.

Signed-off-by: ponix-j <657511300@qq.com>
2025-05-23 14:25:46 +08:00
Angazenn
1e67089bc9 [BugFix]add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant (#819)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
1. This PR introduces native `all_to_all` communication operator to fix
`allgather` bugs when dp_size > 1. Besides, it adds a naive
implementation of force-load-balance when doing profile runs.
2. The operator `npu_dequant_swiglu_quant` only supports input
hidden_states with dtype `torch.int32`. This tensor occupies space of
`global_bs * seq_len * topk * hidden_size`, which might be very large as
`ep_size` grows. Therefore we need to disable this operator and use
original `swiglu` && `quantize`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By performing offline inference:

![image](https://github.com/user-attachments/assets/e003d5dc-0753-41ae-9303-e87f73ac6828)

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-15 09:19:55 +08:00
wangxiyuan
857f489cbf [CI] Patch torch.library.infer_schema for torch 2.5 backward compatibility (#837)
Patch torch.library.infer_schema for torch 2.5 backward compatibility

- Introduced a new module `patch_utils` under
`vllm_ascend/patch/worker/patch_common/`.
- Added a function `ascend_direct_register_custom_op` to handle custom
operator registration with backward compatibility for PyTorch < 2.7
(such as torch 2.5.1).
- Implemented type conversion logic for annotations to ensure
compatibility across different PyTorch versions.
- Registered the function `ascend_direct_register_custom_op` to
`utils.direct_register_custom_op`.

- Updated `__init__.py` to include `patch_utils` as the first import.
- Ensured `patch_utils` is available for use in other patch files and
skipped isort checks for `patch_utils` import.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-14 09:20:55 +08:00
wangxiyuan
f8350569e6 [CI] upgrade vllm to 0.8.5 (#715)
1. Upgrade vllm to 0.8.5
2. Drop 0.8.4 support
3. Keep doc to 0.8.4rc2 until we release 0.8.5

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-30 09:15:50 +08:00
wemaster
54c0e63df7 [MTP] follow custom deepseek modeling changes to support graph mode (#636)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?

As custom deepseek modeling do some changes to support graph mode in
https://github.com/vllm-project/vllm-ascend/pull/585, so i follow it to
change custom deepseek_mtp modeling.

And some modifications for k>1 were not carried over by the
https://github.com/vllm-project/vllm-ascend/pull/429, now i add it.

In order to better take care of the MTP feature in the vllm-ascend
repository, I added cases related to graph mode(torchair), but i skip it
since torchair can not correctly clean up memory in vllmrunner.

Also i add some case for MTP quantization weights, but test weight is
not ready, so i skip it and i will open it when test quant weights is
ready.

https://github.com/vllm-project/vllm-ascend/pull/648 did not completely
fix the sample
change(https://github.com/vllm-project/vllm-ascend/issues/660) issue, I
added the relevant changes.

### Does this PR introduce _any_ user-facing change?
now, u can use following method to use mtp in deepseek v3/r1 float or
quant weights with eager mode.
```python
llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    enforce_eager=True,
    trust_remote_code=True,
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

or use mtp in deepseek v3/r1 float or quant weights with graph
mode(torchair)
```python
llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    additional_config={
        'enable_graph_mode': True,
    },
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

add notes:
1. now, we support k>1, so u can set num_speculative_tokens > 1 if there
is sufficient redundant computing power;
2. MTP is not supported in V1, we will support it when vLLM does it in
https://github.com/vllm-project/vllm/issues/13500.
3. if u run MTP failed by `segmentation fault`, u can follow v0.7.3
patch https://github.com/vllm-project/vllm-ascend/pull/236 file
`vllm_ascend/patch/patch_metrics.py` method
`__npu_async_metrics_collector_init__`

### How was this patch tested?
local tested passed and test by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-04-28 21:18:53 +08:00
Mengqing Cao
ba3d8aae94 [Model][MiniCPM] support MiniCPM (#645)
### What this PR does / why we need it?
This pr support minicpm in branch main. see
https://github.com/vllm-project/vllm-ascend/pull/164


### How was this patch tested?
test locally with minicpm

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-27 11:27:24 +08:00
Shanshan Shen
4a0ce3660e [Misc] Remove some parts of metrics patch (#603)
### What this PR does / why we need it?
Remove some parts of metrics patch, since the `cuda` hard code has been
fixed by https://github.com/vllm-project/vllm/pull/14411.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-22 18:45:21 +08:00
wangxiyuan
538a69c145 [Patch] format patch module to make it more clear (#601)
Format patch module to make it more clear. 
Add the patch doc description, the new patch must follow this guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-22 14:13:00 +08:00
wemaster
0ae9ee0f8a [BUGFIX] main-sd-bugfix && [UT] add mtp UT (#593)
### What this PR does / why we need it?
The pr will fix some bug about spec decode / MTP
The pr add a mtp e2e UT `test_mtp_correctness.py`

**vllm_ascend/attention/attention.py**
1. add support `self.attn_mask_cache` only has 1 element to cover scene
in which both spec docode and chunked prefill are enabled.

**vllm_ascend/distributed/parallel_state.py**
1. remove 2 assert because spec decode worker would use init_worker
twice

**vllm_ascend/models/deepseek_mtp.py**
1. remove unused params;
2. add support w8a8 in `CustomDeepSeekMTP`

**vllm_ascend/quantization/quant_config.py**
1. use `AscendUnquantizedFusedMoEMethod` instead of
`UnquantizedFusedMoEMethod`

**other**
1. replace `from vllm.logger import init_logger` to `from vllm.logger
import logger` all of the vllm-ascend project



### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-04-21 19:25:51 +08:00
Mengqing Cao
6ee7f5cf71 [SpecDecode] Add spec decode support (#500)
### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This support speculative decoding in Ascend, including speculating with
a draft model、by matching n-grams in the prompt、using MLP speculators
and using EAGLE based draft models.

Backport: https://github.com/vllm-project/vllm-ascend/pull/423
spec decode MultiStepWorker support TP1DraftModelRunner fully, support
run the draft_model_runner with multi-step prepare on the NPU directly
and support draft_model_runner use MLA.

1. before this pr, `MultiStepWorker` would not step into the branch
using NPU prepare, but only into the branch using CPU prepare (`line 52`
of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has
`no effect` on the `correct operation` of speculative decoding and the
performance of the two branches is basically the same as of the current
version, I support entering this branch in this PR. In general, there
are two main changes in `patch_multi_step_worker.py`: first, the
`is_cuda_like()` check is removed and the `TP1DraftModelRunner`
rewritten in vllm_ascend is used; second, the
`supports_gpu_multi_step()` function is made to return true on NPU
devices when outer Multi_step_worker could work correct.

3. before this pr, `TP1DraftModelRunner` only supports Attention on NPU,
but not MLA. The relevant adaptation is in
`vllm_ascend/worker/draft_model_runner.py`. Although I don’t know why
the `input_positions` of `model_input.attn_metadata` in vllm-ascend
needs to be added in `execute_model`, it is done in `model_runner.py`,
so I also made corresponding changes. Otherwise, when atten_backend is
MLA, it will prompt that input_positions cannot be found.

4. I commented out two lines in `draft_model_runner.py` in `line118` to
support the scenario of K>1.
  ```
  # lora_mapping=model_input.lora_mapping,
  # lora_requests=model_input.lora_requests,
  ```
I added comments. In the future, when vllm-ascend supports lora feature,
the changes here can be restored.

TODO:
- [ ] revert the patch when the related issues are addressed in vllm

### How was this patch tested?
CI passed with new added test.
- e2e test for medusa proposer:
tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer:
tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer:
tests/singlecard/spec_decode/e2e/test_ngram_correctness.py

Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
2025-04-17 20:16:32 +08:00
wangxiyuan
bbe7ccd366 [MISC] Add patch module (#526)
This PR added patch module for vllm
1. platform patch: the patch will be registered when load the platform
2. worker patch: the patch will be registered when worker is started.

The detail is:
1. patch_common: patch for main and 0.8.4 version
4. patch_main: patch for main verison
5. patch_0_8_4: patch for 0.8.4 version
2025-04-16 09:28:58 +08:00