rollback causal_conv1d_fn to torch ops & update qwen3Next doc (#5391)

### What this PR does / why we need it?
Roll back the causal_conv1d_fn op from the Triton to the Torch implementation to fix hanging issues; also update the Qwen3Next doc.
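For background, `causal_conv1d_fn` computes a 1-D convolution in which each output position depends only on the current and past inputs (left padding only, no look-ahead). A minimal pure-Python sketch of that semantics, for illustration only and unrelated to the actual Torch/Triton kernels being swapped here:

```python
def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] = sum_k w[k] * x[t - (K-1) + k],
    treating x as zero for negative indices (left padding only)."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]

# Each y[t] uses only x[0..t], never future inputs.
print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```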

- vLLM version: release/v0.13.0
- vLLM main: 254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
This commit is contained in:
LeeWenquan
2025-12-26 19:57:38 +08:00
committed by GitHub
parent 48854aef5c
commit 7685d0c239
2 changed files with 109 additions and 405 deletions


@@ -92,10 +92,8 @@ source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh
Run the following script to start the vLLM server on multiple NPUs:
For an Atlas A2 with 64 GB of NPU memory per card, `tensor-parallel-size` should be at least 4; with 32 GB per card, it should be at least 8.
```bash
-vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```
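As a rough sanity check on the tensor-parallel sizing above (an estimate assuming bf16 weights at ~2 bytes per parameter; real usage additionally needs KV cache and activations, which is why `--gpu-memory-utilization` matters):

```python
def weights_per_card_gb(params, tp, bytes_per_param=2):
    """Approximate model-weight memory per card under tensor parallelism."""
    return params * bytes_per_param / 1024**3 / tp

# ~149 GB of bf16 weights for an 80B-parameter model:
for tp, card_gb in [(4, 64), (8, 32)]:
    print(f"TP{tp}: ~{weights_per_card_gb(80e9, tp):.0f} GB weights per {card_gb} GB card")
```

TP4 leaves roughly 37 GB of weights on each 64 GB card, and TP8 roughly 19 GB on each 32 GB card, consistent with the minimum sizes stated above.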
Once your server is started, you can query the model with input prompts.
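For example, a request body for vLLM's OpenAI-compatible `/v1/chat/completions` endpoint can be built like this (a sketch; the host and port assume `vllm serve` defaults, and `max_tokens` is an arbitrary illustrative value):

```python
import json

def build_chat_request(prompt, model="Qwen/Qwen3-Next-80B-A3B-Instruct"):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

body = build_chat_request("Who are you?")
# Send with e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$(cat body.json)"
print(json.dumps(body))
```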
@@ -170,11 +168,11 @@ Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I
1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `Qwen3-Next-80B-A3B-Instruct` in `vllm-ascend:0.11.0rc3` for reference only.
+2. After execution, you can get the result, here is the result of `Qwen3-Next-80B-A3B-Instruct` in `vllm-ascend:0.13.0rc1` for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
-| gsm8k | - | accuracy | gen | 96.3 |
+| gsm8k | - | accuracy | gen | 95.53 |
## Performance
@@ -201,3 +199,15 @@ vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random
```
After several minutes, you can get the performance evaluation result.
The performance result is:
**Hardware**: A3-752T, 2 nodes
**Deployment**: TP4 + Full Decode Only
**Input/Output**: 2k/2k
**Concurrency**: 32
**Performance**: 580 tps, TPOT 54 ms
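As a sanity check, the reported TPOT and concurrency roughly reproduce the throughput figure (a simplified steady-state decode estimate that ignores prefill time):

```python
def decode_tps(concurrency, tpot_ms):
    """Steady-state decode throughput: each concurrent stream
    emits one token per TPOT interval."""
    return concurrency * 1000.0 / tpot_ms

print(f"{decode_tps(32, 54):.0f} tokens/s")  # ~593, close to the measured 580 tps
```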