### What this PR does / why we need it?
Update max_tokens and prompt in qwen3 online doc
Before:
```
"'max_tokens' or 'max_completion_tokens' is too large: 4096. This model's maximum context length is 4096 tokens and your request has 18 input tokens (4096 > 4096 - 18). None"
```
After:
```
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct",
"messages": [
{"role": "user", "content": "Who are you?"}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32
}'
.{"id":"chatcmpl-8ddbd65c9ddc405397219a6792feb9a0","object":"chat.completion","created":1757985049,"model":"/root/.cache/modelscope/hub/models/Qwen-SGlang/Qwen3-Next-80B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to assist you in generating various","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":44,"completion_tokens":32,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Manually test on my local env
- CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
On main, AscendScheduler does not support Multimodels, becuse of lacking
of scheduled_encoder_inputs which is need on multimodels inference
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: main@93e28e6862669e3b5cf47cea9f782a65ec47e155
- vLLM version: v0.10.2rc2
- vLLM main:
15b8fef453
---------
Signed-off-by: fan2956 <zhoufan53@huawei.com>
Co-authored-by: zhoufan2956 <zhoufan2956@163.com>
### What this PR does / why we need it?
Bump vLLM version to v0.10.2
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.2rc3
- vLLM main:
15b8fef453
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This reverts commit 339fceb89c.
### Does this PR introduce _any_ user-facing change?
Yes, use 8.2rc1 image by default
### How was this patch tested?
CI passed
- vLLM version: v0.10.2rc2
- vLLM main:
cfa3234a5b
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR is used to adapt the hostname format for Mooncake when using
adxl. When Mooncake uses adxl, it is necessary to set
```USE_ASCEND_DIRECT``` to True in the file
```/Mooncake/mooncake-common/common.cmake``` during compilation. The
mooncake_connector obtains this config by calling
```vllm_config.kv_transfer_config.get_from_extra_config```, determines
whether Mooncake is using adxl, and selects the corresponding hostname
format.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: main
- vLLM main:
d21a36f5f9
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
Enable push trigger for image job
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Followup on https://github.com/vllm-project/vllm-ascend/pull/2864
- vLLM version: v0.10.2rc2
- vLLM main:
89e08d6d18
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Upgrade CANN version to 8.3.rc1.alpha001
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2rc2
- vLLM main:
89e08d6d18
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
**Background:**
There are two principles about operator registration in PyTorch
- The same namespace can be only registered once by `TORCH_LIBRARY`
- The operator signatures can be only registered once by `def`
Considering that all custom operators defined in the current repo are
only used by Ascend, instead of defining a common operator schema by
vLLM, all accelerators then follow this operator schema and complete the
implementation based on their respective hardware, which is conducive to
functional abstraction.
Therefore, we can rename the operator registration namespace to an
Ascend-specific namespace(**_C_ascend**).
Related ISSUE: https://github.com/vllm-project/vllm-ascend/issues/2742
- vLLM version: main
- vLLM main:
f592b3174b
Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
### What this PR does / why we need it?
Upgrade vLLM version to 0.10.2rc2
### Does this PR introduce _any_ user-facing change?
Yes, image will use 0.10.2rc2 vLLM
### How was this patch tested?
- vLLM version: main
- vLLM main:
f17c075884
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR enforces the forcible disabling of the chunked prefill feature
in Non-MLA models, as the performance of operators supporting this
functionality is currently suboptimal. Unless the user has enabled
chunked prefill in the ascend_scheduler_config, we would allow this
feature.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
Related: https://github.com/vllm-project/vllm-ascend/pull/2659
- vLLM version: main
- vLLM main:
d21a36f5f9
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
This PR fixed the bug in register_custom_ops without forward_context. We
set try-except to consider this situation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
7920de0a2a
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
- Enable label-based image test and use free runner to run lint
- soft revert
26f388ba08
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: main
- vLLM main:
404c85ca72
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
`torch_npu.npu_apply_rotary_pos_emb` only support head_size and
rotary_dim equal 128. Error occurs when running GLM
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: main
- vLLM main:
404c85ca72
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
modelslim will generate self.bias for rms norm in quantization, since
RMSNorm in vllm has no this parameter, so its nesscesary
to create a AscendQuantRmsNorm.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested by deepseek-v3.1-w8a8
<img width="2496" height="592" alt="image"
src="https://github.com/user-attachments/assets/004c6e76-3d7a-4a1f-b59f-a14304012663"
/>
- vLLM version: main
- vLLM main:
d6249d0699
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
This PR sets the default format of GMM w2_weight in w8a8_dynamic to be
NZ to improve performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: main
- vLLM main:
e40827280b
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
1. Move ops/comm_utils to ops/moe/comm_utils
2. Move distributed/tensor_parallel/gather_from_sequence_parallel_region
to ops/moe/comm_utils
3. Delete distributed/tensor_parallel
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut
- vLLM version: main
- vLLM main:
a1213fae5f
---------
Signed-off-by: wuweiqiang24 <1005334931@qq.com>
Signed-off-by: wuweiqiang24 <wuweiqiang11@huawei.com>
### What this PR does / why we need it?
This PR prefetchs the weight of mlp layers in Qwen Dense Models to
optimize the performance in Decode phase mainly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
a1213fae5f
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Shuming19 <313093131@qq.com>
### What this PR does / why we need it?
[Feat]support dynamic quantization in allgather
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: main
- vLLM main:
5931b7e5d9
Signed-off-by: withHades <244036962@qq.com>
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
support qwen25 vl w8a8 quantization
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
62f66be1f7
---------
Signed-off-by: lijiaojiao <lijiaojiao990304@163.com>
Co-authored-by: lijiaojiao <lijiaojiao990304@163.com>
This PR is based on top of
[#23569](https://github.com/vllm-project/vllm/pull/23569) and
[#24219](https://github.com/vllm-project/vllm/pull/24219).
### What this PR does / why we need it?
This PR allows the model runner to function asynchronously when using
async scheduling. This allows full overlap of the cpu operations
(including prepare_inputs) and the model forward pass. This diff is
functional and does not support speculative decoding, PP, or guided
decoding.
Expected speedup is 5-10% over the current async scheduling.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
server
```
python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B\
--trust-remote-code --enforce-eager \
--distributed-executor-backend=mp \
-tp=4 \
--port 8006 \
--max-model-len 32000 \
--block-size 128 \
--gpu-memory-utilization 0.99
```
client
```
python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \
--dataset-name random --random-input-len 2048 --random-output-len 2048 \
--ignore-eos\
--num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \
--metric-percentiles 90 --base-url http://localhost:8006 --save-result \
--result-dir $PROFILER_DIR
```
benchmark test based on Qwen3-32B TPOT result:
||forward async| scheduler async |sync|
|-|-|-|-|
|avg|41.73|41.86|44.20|
|improve0|0.3%|0|0|
|improve1|5.58%|0|0|
benchmark test based on Qwen2___5-VL-7B-Instruct TPOT result:
||forward async|sync|
|-|-|-|
|avg|23.22|29.16|
|improve|20.3%|0|
- vLLM version: main
- vLLM main:
e93f4cc9e3
Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com>
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com>
Co-authored-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
fix ascend fused moe spelling error
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
0ae43dbf8c
- vLLM version: main
- vLLM main:
fcc0a3130a
Signed-off-by: zhaozixin <zhaozixin1@huawei.com>
Co-authored-by: zhaozixin <zhaozixin1@huawei.com>
### What this PR does / why we need it?
Fix qwen torchair attention PrefillCacheHit
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
vLLM version: v0.10.1.1
vLLM main:
e599e2c65e
- vLLM version: main
- vLLM main:
0b9a612fa3
Signed-off-by: zhaozixin <zhaozixin1@huawei.com>
Co-authored-by: zhaozixin <zhaozixin1@huawei.com>
### What this PR does / why we need it?
Currently, when executing to the Linear layer of the model in
vLLM-Ascend, the weights input format is ND in unquantized case and
skipped ascend case, which is slower than FRACTAL_NZ.
This PR supplements the execution logic for Linear layer. When
VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and CANN version is 8.3, the weights
of the Linear layer will be converted to FRACTAL_NZ, in both unquantized
case and skipped ascend case.
- vLLM version: main
- vLLM main:
267c80d31f
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
### What this PR does / why we need it?
LLMdatadist connector adapt the distributed KV aggregation for the main
branch. Change the P node from returning "finish sending" only when TP0
responds to returning "finish sending" as soon as each NPU receives it.
The D node will send a finish receive signal to the corresponding tp
rank of the P node.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
gsm8k test
2*A3 1P 1D
P: dp2 tp8 D:dp 4 tp4
P: dp2 tp8 D:dp 2 tp8
- vLLM version: main
- vLLM main:
cc99baf14d
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix UTs on register customop and warm up model
### How was this patch tested?
CI passed with existing test.
Co-authored-by: Icey <1790571317@qq.com>
- vLLM version: main
- vLLM main:
cc99baf14d
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1. Solved the problem that in the Qwen3 Moe model case, opening DP would
use an extra stream, causing ACLgraph sizes capture error
2. After experimentation, it was found that in many cases, some
operators would occupy more streams than expected. Therefore, the buffer
area for streams in ACLgraph was not large enough. After discussion,
extra 120 streams were added as buffer.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: main
- vLLM main:
0ae43dbf8c
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Update pre_commit runner
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: main
- vLLM main:
0ae43dbf8c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
**Background:**
A dynamic library named vllm_ascend_C.so will be generated when
compiling vLLM-Ascend, so when import vllm_ascend.vllm_ascend_C in
python, the interperter will search vllm_ascend_C.so and try to find a
external symbol named PyInit_vllm_ascend_C which is provided in
csrc/camem_allocator.cpp.
**Conclusion:**
The PyInit__C is redundent.
- vLLM version: v0.10.1.1
- vLLM main:
717fc00e98
Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
### What this PR does / why we need it?
Adjust the position of warm_up_atb.
### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
CI passed with existing test.
- vLLM version: main
- vLLM main:
b23fb78623
Signed-off-by: huangxialu <huangxialu1@huawei.com>
### What this PR does / why we need it?
Register AscendMultiHeadLatentAttention as CustomOP, following vllm changes
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: main
- vLLM main:
b23fb78623
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
Remove compatibility maintenance for vllm v0.10.1 and v0.10.1.1
### Does this PR introduce _any_ user-facing change?
branch main of vllm-ascend will not be compatible with vllm v0.10.1 and
v0.10.1.1
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
In reinforcement learning scenarios, weight updates are required, but
the current inference applies a transpose operation to the weights,
altering their shape. This causes a shape mismatch with the training
weights, triggering an error during weight updates.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
6fb2788163
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Really strange that `register_oot` doesn't work with `SharedFusedMoE`,
so we have to add this patch, for now.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
This PR won't have any effect in DeepSeek since we currently still stick
with the old `CustomDeepseekV2`.
- vLLM version: v0.10.1.1
- vLLM main:
0cdd213641
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Optimize rope by caching sin and cos at the first layer in Qwen Models.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
562663a044
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: ZYang6263 <51255902183@stu.ecnu.edu.cn>
Co-authored-by: ZYang6263 <zy626375@gmail.com>
1. Only run light e2e test before the PR is `ready` to reduce CI time.
2. Run full test once the PR is labled `ready` and `ready for test`
3. Run lint job on self host CPU container to avoid waiting much.
- vLLM version: v0.10.1.1
- vLLM main:
6910b56da2
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Flashcomm_v1 optim in Qwen Dense Models.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
5e537f45b4
Co-authored-by: 1024daniel <xxltju324@gmail.com>
### What this PR does / why we need it?
The current implementation will result in duplicate generation of
`sin_cos_cache` in rope when `kv_seqlen` > 4k, because the
initialization length of the `sin_cos_cache` is only 4k.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
After this PR merged, sin_cos_cache will not increase in forward func,
so `test_native_rope_deepseek_forward_cache_handling` is not necessary.
- vLLM version: v0.10.1.1
- vLLM main:
60f0843ef8
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM
- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
vllm-ascend won't contain model file anymore. Now pangu model file has
been moved to torchair module. The origin one can be removed.
Note: After this PR, pangu only works with torchair mode then.
- vLLM version: v0.10.1.1
- vLLM main:
8c892b1831
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>