rollback causal_conv1d_fn to torch ops & update qwen3Next doc (#5391)

### What this PR does / why we need it?
Roll back the causal_conv1d_fn op from the Triton to the Torch implementation to fix hanging issues; also update the Qwen3Next doc.
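For background, `causal_conv1d_fn` computes a 1-D convolution in which each output position depends only on the current and past inputs (left padding only, no look-ahead). A minimal pure-Python sketch of that semantics, for illustration only and unrelated to the actual Torch/Triton kernels being swapped here:

```python
def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] = sum_k w[k] * x[t - (K-1) + k],
    treating x as zero for negative indices (left padding only)."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]

# Each y[t] uses only x[0..t], never future inputs.
print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```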

- vLLM version: release/v0.13.0
- vLLM main: 254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
This commit is contained in:
LeeWenquan
2025-12-26 19:57:38 +08:00
committed by GitHub
parent 48854aef5c
commit 7685d0c239
2 changed files with 109 additions and 405 deletions


@@ -92,10 +92,8 @@ source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh
Run the following script to start the vLLM server on multiple NPUs:
For an Atlas A2 with 64 GB of NPU memory per card, `tensor-parallel-size` should be at least 4; with 32 GB per card, it should be at least 8.
```bash
-vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```
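As a rough sanity check on the tensor-parallel sizing above (an estimate assuming bf16 weights at ~2 bytes per parameter; real usage additionally needs KV cache and activations, which is why `--gpu-memory-utilization` matters):

```python
def weights_per_card_gb(params, tp, bytes_per_param=2):
    """Approximate model-weight memory per card under tensor parallelism."""
    return params * bytes_per_param / 1024**3 / tp

# ~149 GB of bf16 weights for an 80B-parameter model:
for tp, card_gb in [(4, 64), (8, 32)]:
    print(f"TP{tp}: ~{weights_per_card_gb(80e9, tp):.0f} GB weights per {card_gb} GB card")
```

TP4 leaves roughly 37 GB of weights on each 64 GB card, and TP8 roughly 19 GB on each 32 GB card, consistent with the minimum sizes stated above.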
Once your server is started, you can query the model with input prompts.
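For example, a request body for vLLM's OpenAI-compatible `/v1/chat/completions` endpoint can be built like this (a sketch; the host and port assume `vllm serve` defaults, and `max_tokens` is an arbitrary illustrative value):

```python
import json

def build_chat_request(prompt, model="Qwen/Qwen3-Next-80B-A3B-Instruct"):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

body = build_chat_request("Who are you?")
# Send with e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$(cat body.json)"
print(json.dumps(body))
```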
@@ -170,11 +168,11 @@ Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I
1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen3-Next-80B-A3B-Instruct` in `vllm-ascend:0.11.0rc3` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen3-Next-80B-A3B-Instruct` in `vllm-ascend:0.13.0rc1` for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
-| gsm8k | - | accuracy | gen | 96.3 |
+| gsm8k | - | accuracy | gen | 95.53 |
## Performance
@@ -201,3 +199,15 @@ vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random
```
After several minutes, you can get the performance evaluation result.
The performance result is:
**Hardware**: A3-752T, 2 nodes
**Deployment**: TP4 + Full Decode Only
**Input/Output**: 2k/2k
**Concurrency**: 32
**Performance**: 580 tps, TPOT 54 ms
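As a sanity check, the reported TPOT and concurrency roughly reproduce the throughput figure (a simplified steady-state decode estimate that ignores prefill time):

```python
def decode_tps(concurrency, tpot_ms):
    """Steady-state decode throughput: each concurrent stream
    emits one token per TPOT interval."""
    return concurrency * 1000.0 / tpot_ms

print(f"{decode_tps(32, 54):.0f} tokens/s")  # ~593, close to the measured 580 tps
```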