14 Commits

Author SHA1 Message Date
linfeng-yuan
695e5c9ebc [0.11.0][ops] npu_top_k_top_p supports k and p only (#4153)
### What this PR does / why we need it?
With CANN 8.3 and corresponding PTA 2.7.1, `npu_top_k_top_p` supports
passing only k (1<=k<=1024) and p separately.
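
To illustrate what "k only" or "p only" means, here is a minimal pure-Python sketch of top-k/top-p filtering where either argument may be omitted. It is not the `npu_top_k_top_p` kernel; the function name `apply_top_k_top_p` and the boundary convention for top-p (keep tokens until the cumulative probability reaches `p`) are assumptions made for illustration.

```python
import math


def softmax(logits):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_top_k_top_p(logits, k=None, p=None):
    """Mask logits outside the top-k set and/or the top-p nucleus with -inf.

    Either k or p may be None, in which case that filter is skipped.
    """
    out = list(logits)
    if k is not None:
        # Keep the k largest logits.
        order = sorted(range(len(out)), key=lambda i: out[i], reverse=True)
        keep = set(order[:k])
        out = [x if i in keep else -math.inf for i, x in enumerate(out)]
    if p is not None:
        # Keep the smallest prefix of tokens (by probability) whose mass reaches p.
        probs = softmax(out)
        order = sorted(range(len(out)), key=lambda i: probs[i], reverse=True)
        keep, cumulative = set(), 0.0
        for i in order:
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= p:
                break
        out = [x if i in keep else -math.inf for i, x in enumerate(out)]
    return out
```

Calling `apply_top_k_top_p(logits, k=2)` applies only the top-k filter, and `apply_top_k_top_p(logits, p=0.9)` applies only the top-p filter.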

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E performance test with only `top_k` and only `p`, separately. This PR
improves TPOT by 0.2 ms with `batch_size=16`.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-12-09 15:45:40 +08:00
wangxiyuan
f12f76d7ba Drop 0.10.2 (#3284)
Drop v0.10.2 support; vLLM 0.11.0rc3 is supported now.
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 10:28:38 +08:00
LeeWenquan
69cc99d004 Add restriction conditions to the ApplyTopPTopK operator (#3254)
### What this PR does / why we need it?
Add a restriction condition to the ApplyTopPTopK operator: 1 <= K <= 1024
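
The PR text does not show how the restriction is checked. One plausible sketch, assuming a hypothetical helper `fused_top_k_supported` that is consulted before the fused operator is chosen:

```python
MAX_FUSED_TOP_K = 1024  # upper bound stated in the commit message


def fused_top_k_supported(k_values):
    """Return True if every per-request k lies within [1, 1024].

    Hypothetical helper: when it returns False, a caller would be expected
    to fall back to a non-fused top-k/top-p implementation.
    """
    return all(1 <= k <= MAX_FUSED_TOP_K for k in k_values)
```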
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-29 14:04:58 +08:00
Yikun Jiang
b8b68b3dfe [CI] Upgrade vLLM to 20250920 (c60e613) and address config break (#3067)
### What this PR does / why we need it?
Bump main to
c60e6137f0

- Updated imports in `vllm.config` to
`vllm.config.model` (aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252
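
A minimal sketch of what string-based mode handling can look like. The function `select_logprobs_output` is hypothetical, and the mode names `"raw_logits"` and `"raw_logprobs"` are assumed to match upstream vLLM; only two modes are shown.

```python
import math


def log_softmax(logits):
    # Stable log-softmax: x - logsumexp(logits).
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def select_logprobs_output(logits, logprobs_mode):
    """Pick what to report as logprobs by comparing plain strings.

    Assumption: the mode is passed around as a string rather than an
    enum member, so no enum import is needed at the call site.
    """
    if logprobs_mode == "raw_logits":
        return list(logits)
    if logprobs_mode == "raw_logprobs":
        return log_softmax(logits)
    raise ValueError(f"unsupported logprobs_mode: {logprobs_mode!r}")
```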

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-21 09:49:17 +08:00
CaranLic
168ad600b5 [main] add pd transfer for ascend scheduler (#2753)
### What this PR does / why we need it?
For offline scenarios, adjust the scheduling process to prioritize the
prefill phase of all requests, then process the decode phase of all
requests.
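
The ordering described above can be sketched with a toy scheduler. This is an illustration only: `schedule_prefill_first` is a hypothetical function, and the real scheduler also has to deal with token budgets, KV-cache blocks, and repeated decode iterations.

```python
from collections import deque


def schedule_prefill_first(requests, max_num_seqs, decode_max_num_seqs):
    """Return a list of (phase, batch) steps: every prefill batch comes
    before any decode batch."""
    steps = []
    waiting = deque(requests)
    prefilled = []
    # Phase 1: drain the waiting queue with prefill batches.
    while waiting:
        size = min(max_num_seqs, len(waiting))
        batch = [waiting.popleft() for _ in range(size)]
        steps.append(("prefill", batch))
        prefilled.extend(batch)
    # Phase 2: only now start decoding, in batches of decode_max_num_seqs.
    for start in range(0, len(prefilled), decode_max_num_seqs):
        steps.append(("decode", prefilled[start:start + decode_max_num_seqs]))
    return steps
```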

### How was this patch tested?

```
max_num_seqs=24,
additional_config={
    "ascend_scheduler_config":{
        "enabled": True,
        "enable_pd_transfer": True,
        "decode_max_num_seqs": 24,
        "enable_chunked_prefill": False
    }
},
```
| input | output | num prompts | max_num_seqs | dp | tp | scheduler | tps |
| ------ | ------ | ---------- | ---------------- | ---- | ---- | ---------------- | --------------- |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | v1 | 234.06 |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | pd transfer | 239.59(+2.4%) |
| dapo-math-17K | 2K | 384 | 24 | 4 | 1 | v1 | 222.85 |
| dapo-math-17K | 2K | 384 | 24 | 4 | 1 | pd transfer | 225.81(+1.3%) |
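
The percentages in the table are relative throughput gains over the `v1` scheduler row with the same `dp`. A short check of the arithmetic (the helper name `relative_gain_percent` is only for illustration):

```python
def relative_gain_percent(baseline, candidate):
    """Relative improvement of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


print(round(relative_gain_percent(234.06, 239.59), 1))  # dp=2 -> 2.4
print(round(relative_gain_percent(222.85, 225.81), 1))  # dp=4 -> 1.3
```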


- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

---------

Signed-off-by: CaranLic <740821011@qq.com>
2025-09-10 08:46:39 +08:00
Mengqing Cao
edf1f600ad [CI] Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1 (#2840)
### What this PR does / why we need it?
Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1

### Does this PR introduce _any_ user-facing change?
The main branch of vllm-ascend will no longer be compatible with vllm
v0.10.1 and v0.10.1.1.

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-10 08:43:10 +08:00
Yikun Jiang
175f6bc445 Support v0.10.1 (#2584)
### What this PR does / why we need it?
This patch also supports v0.10.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- test 0.10.1: https://github.com/vllm-project/vllm-ascend/pull/2583
- vLLM version: v0.10.1.1
- vLLM main:
321938e9ac

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-08-28 18:47:53 +08:00
Mengqing Cao
b0403f8d8a [CI] fix ci (#2464)
### What this PR does / why we need it?
1. use actions/checkout@v5 instead of v4
2. remove the dbo test case because it has an issue and will be
refactored later
3. make vllm-ascend compatible with vllm v0.10.1.1 and add CI for it
4. fix sampler API changes introduced by
https://github.com/vllm-project/vllm/pull/22387
5. fix qwen3 moe config changes introduced by
https://github.com/vllm-project/vllm/pull/20562
6. fix kvcache block changes introduced by
https://github.com/vllm-project/vllm/pull/23262

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.10.0
- vLLM main:
0c6e40bbaa

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-22 07:30:48 +08:00
whx
29aaba5f84 [Perf][MTP] Optimize reject sampler in greedy situation. (#2137)
This PR ports the optimization from PR #2002 to main and makes it cleaner.
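
The commit message does not describe the optimization itself. As background, the greedy case of rejection sampling needs no random draws: a draft token is accepted exactly when it equals the target model's argmax at that position. A minimal sketch of that rule, with the hypothetical function `greedy_verify`:

```python
def greedy_verify(draft_tokens, target_argmax):
    """Verify draft tokens against the target model's greedy choices.

    draft_tokens:  n proposed token ids.
    target_argmax: n + 1 greedy token ids from the target model, one per
                   draft position plus one bonus position.
    Returns the tokens to emit for this step.
    """
    output = []
    for draft, target in zip(draft_tokens, target_argmax):
        # The target's token is always emitted at this position.
        output.append(target)
        if draft != target:
            # First mismatch: everything after it is discarded.
            return output
    # All drafts accepted: the bonus token comes for free.
    output.append(target_argmax[len(draft_tokens)])
    return output
```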

- vLLM version: v0.10.0
- vLLM main:
afa5b7ca0b

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-08-11 17:37:49 +08:00
xuyexiong
26fc36b0e0 [V1] MTP supports torchair (#2145)
### What this PR does / why we need it?
Support MTP with:

- [x] V0 Scheduler
- [x] TorchAir
- [x] Single DP
- [x] Multi DP
- [x] Disaggregate PD

Known issues:
- [ ] The V1 Scheduler (chunked prefill) is not supported yet; support is
planned in a few weeks
- [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now, so
lines 171-175 in the file `vllm/vllm/v1/metrics/loggers.py` need to be
commented out:
```
if (len(self.engine_indexes) > 1
        and vllm_config.speculative_config is not None):
    raise NotImplementedError("Prometheus metrics with Spec Decoding "
                              "with >1 EngineCore per AsyncLLM is not "
                              "supported yet.")
```

To start an online server with torchair enabled, here is an example:
```
python -m vllm.entrypoints.openai.api_server \
 --model="/weights/DeepSeek-R1_w8a8/" \
 --trust-remote-code \
 --max-model-len 40000 \
 --tensor-parallel-size 4 \
 --data_parallel_size 4 \
 --max-num-seqs 16 \
 --no-enable-prefix-caching \
 --enable_expert_parallel \
 --served-model-name deepseekr1 \
 --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
 --quantization ascend \
 --host 0.0.0.0 \
 --port 1234 \
 --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \
 --gpu_memory_utilization 0.9 
``` 

Offline example with torchair enabled:
```
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=16, temperature=0)
# Create an LLM.
llm = LLM(
    model="/home/data/DeepSeek-R1_w8a8/",
    tensor_parallel_size=16,
    max_num_seqs=16,
    gpu_memory_utilization=0.9,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    enforce_eager=False,
    max_model_len=2000,
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "graph_batch_sizes": [16],
            "enable_multistream_shared_expert": False,
        },
        "ascend_scheduler_config": {
            "enabled": True,
        },
        # "expert_tensor_parallel_size": 16,
    }
)

# Generate texts from the prompts.
# llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
# llm.stop_profile()
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

- vLLM version: v0.10.0
- vLLM main:
302962e806

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-08-06 19:37:43 +08:00
leo-pony
c62f346f5d Fixed 310p failure when using the sampler feature (#2151)
### What this PR does / why we need it?
Fixed 310p failure when using the sampler feature.
The root cause is: torch_npu.npu_top_k_top_p uses the operator
aclnnApplyTopKTopP, but aclnnApplyTopKTopP currently does not support
310P.
The first PR with this issue is #1308.
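
A fix of this kind typically comes down to a device check before the fused operator is chosen. The sketch below is an assumption about the shape of such a dispatch, not the actual patch; `choose_top_k_top_p_impl` and its return labels are hypothetical.

```python
def choose_top_k_top_p_impl(is_310p, k=None, p=None):
    """Decide which top-k/top-p path to take.

    Returns "none" when no filtering is requested, "native" for the
    generic fallback, and "fused" for the fused NPU operator.
    """
    if k is None and p is None:
        return "none"
    if is_310p:
        # The fused operator is stated above to be unsupported on 310P.
        return "native"
    return "fused"
```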

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.10.0
- vLLM main:
207b750e19

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-08-01 08:43:08 +08:00
wangxiyuan
9b67c87b14 [Refactor]Refactor sampler (#2050)
Refactor the Sampler implementation from the patch-based approach to
inheriting from the vLLM Sampler interface.
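
The difference between the two approaches can be shown with toy classes. `BaseSampler` and `DeviceSampler` are hypothetical stand-ins, not vLLM or vllm-ascend classes; the point is that a subclass overrides one hook and leaves the base class untouched, whereas patching would rewrite the base class for every user.

```python
class BaseSampler:
    """Stand-in for an upstream sampler class (hypothetical)."""

    def apply_filters(self, logits):
        # Default behaviour: no filtering.
        return list(logits)

    def sample(self, logits):
        filtered = self.apply_filters(logits)
        # Greedy pick: index of the largest remaining logit.
        return max(range(len(filtered)), key=lambda i: filtered[i])


class DeviceSampler(BaseSampler):
    """Device-specific sampler that overrides one hook instead of patching."""

    def __init__(self, banned_token_id):
        self.banned_token_id = banned_token_id

    def apply_filters(self, logits):
        # Hypothetical device-specific step: mask a single token id.
        filtered = super().apply_filters(logits)
        filtered[self.banned_token_id] = float("-inf")
        return filtered
```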

Next step: make the op `TopKTopPSampler` in vLLM support the custom op
registration mechanism

- vLLM version: v0.10.0
- vLLM main:
61a6905ab0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-30 08:47:22 +08:00
wangxiyuan
34cfdf5520 [Misc] Fix logger bug (#2024)
1. Remove useless logger
2. Fix logger bug, same problem as
https://github.com/vllm-project/vllm-ascend/pull/515

- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 15:59:09 +08:00
jiangpeng
df58fb80ee Spec decode support for V1 Engine (#874)
### What this PR does / why we need it?
Add spec decode support for the V1 Engine
- Currently, Ascend does not support the Triton kernel, so the
`rejection_sampler.py` Triton kernel is rewritten in PyTorch. The PyTorch
version does not perform as well as Triton, so the function will be
implemented in Ascend C in the future.
- Currently, spec decode supports only the ngram algorithm. The eagle
algorithm needs further adaptation.
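
For readers unfamiliar with ngram speculation, the idea is to look for an earlier occurrence of the last `n` tokens in the context and propose the tokens that followed it. A minimal sketch, assuming a hypothetical function `ngram_propose` that picks the most recent earlier match (the real proposer may choose differently):

```python
def ngram_propose(token_ids, n, num_speculative_tokens):
    """Propose draft tokens by matching the trailing n-gram in the context.

    Returns up to num_speculative_tokens tokens that followed the most
    recent earlier occurrence of the last n tokens, or [] if none exists.
    """
    if n <= 0 or len(token_ids) < n + 1:
        return []
    suffix = token_ids[-n:]
    # Scan backwards, skipping the suffix's own position.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[start:start + n] == suffix:
            end = start + n + num_speculative_tokens
            return token_ids[start + n:end]
    return []
```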
### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Tested with `tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py` and
`tests/sample/test_rejection_sampler.py`, which cover the basic functions
of the rejection sampler and the e2e behavior of spec decode.

Signed-off-by: ponix-j <657511300@qq.com>
2025-05-23 14:25:46 +08:00