Commit Graph

18 Commits

Author SHA1 Message Date
starmountain1997
b6256e8bc9 Revert "[CI] fix DS3.2 single node cudagraph_sizes config (#6241)" (#6497)
# What this PR does / why we need it?
This PR reverts commit 8134146ab6, which
modified the DeepSeek V3.2 (W8A8) single-node nightly test
configuration. as there is no limit between tp_size and MTP.
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
N/A for a revert PR. The changes restore the previously known working
configuration.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-03 08:42:58 +08:00
starmountain1997
8134146ab6 [CI] fix DS3.2 single node cudagraph_sizes config (#6241)
# What this PR does / why we need it?
This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8)
model to ensure CI stability. The changes include:
1. Simplified nightly test matrix (nightly_test_a3.yaml):
- Temporarily reduced to only run deepseek3_2-w8a8 test case for
debugging
- Changed trigger from schedule/workflow_dispatch to support
push/pull_request for faster iteration
2. Updated DeepSeek V3.2 test configuration
(test_deepseek_v3_2_w8a8.py):
- Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32]
for better performance
- Increased max-num-seqs from 4 to 8
- Increased gpu-memory-utilization from 0.92 to 0.98
- Increased num_speculative_tokens from 2 to 3
3. Added PR checkout step (_e2e_nightly_single_node.yaml):
- Added ability to checkout a specific PR (#6241) for testing
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
Mock nightly test has passed, see
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241).

<img width="1053" height="714" alt="a2f2ee359febb13e1f6330b1bd3c116b"
src="https://github.com/user-attachments/assets/3262ad0f-adec-4c71-871f-d9cf2db06fbc"
/>


- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-02 11:47:32 +08:00
InSec
86b6ecac4c [CI][BugFix] Import error fix. (#6293)
### What this PR does / why we need it?
Fix the **import error** of qwen3-next nightly test.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: InSec <1790766300@qq.com>
2026-01-28 22:07:47 +08:00
InSec
595b57c4d4 [CI][BugFix] Qwen3-Next nightly test fix. (#6247)
### What this PR does / why we need it?
Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the
**full graph** mode.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
d68209402d

Signed-off-by: InSec <1790766300@qq.com>
2026-01-26 19:53:53 +08:00
starmountain1997
6c73b88dd6 [CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115)
### What this PR does / why we need it?

This PR enables FLASHCOMM1 communication optimization with layer
sharding for DeepSeek-V3.2 W8A8 model testing to
  validate PR #5702. The changes include:

  1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1
  improves performance for distributed inference
2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"]
4. Update baselines: Adjust performance baselines to reflect the
improvements from FLASHCOMM1 and layer sharding

### Does this PR introduce _any_ user-facing change?

No. This is a CI/test-only change that enables new communication
optimization features for testing purposes.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-23 19:48:37 +08:00
Nengjun Ma
ab676413e6 Default enable MLAPO (#5952)
### What this PR does / why we need it?
1) Default enable MLAPO for deepseek MLA Attention W8A8 models on PD
disagregation D Instance, for example: DeepSeekV3-W8A8,
DeepSeek-R1-W8A8.
2) Default enable MLAPO for DeepSeek SFA Attention W8A8 models,
currently is DeepSeek-V3.2-W8A8.

### Does this PR introduce _any_ user-facing change?
Don't need use manully to VLLM_ASCEND_ENABLE_MLAPO=1, to enable MLAPO
feature for deepseek w8a8 model

The effect of enabling MLAPO SFA model deployed on a single A3 Node:
Test
with:tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py
dataset: gsm8k-lite,without set MTP, FULL GRAPH, has 19% promote:
未默认开启 MLAPO 时:
├─────────────────────────┤
│                TTFT                      │ 14055.8836 ms   │
├─────────────────────────┤
│                ITL                         │ 66.8171 ms.          │
├─────────────────────────┤
│ Output Token Throughput  │ 104.9105 token/s │
├─────────────────────────┤
默认开启 MLAPO 时:
├─────────────────────────┤
│                TTFT                      │ 3753.1547 ms   │
├─────────────────────────┤
│                ITL.                        │ 61.4236  ms.       │
├─────────────────────────┤
│ Output Token Throughput  │ 125.2075 token/s│
├─────────────────────────┤

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-22 09:26:39 +08:00
Li Wang
839e03cbc9 [Nightly] Use Qwen repo for qwen3-next (#6064)
### What this PR does / why we need it?
Use Qwen repo for qwen3-next to make nightly test happy. see
https://github.com/vllm-project/vllm-ascend/actions/runs/21179025996/job/60915871441
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-21 10:39:12 +08:00
zhangxinyuehfad
750c06c78a [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#4633)
### What this PR does / why we need it?
Add DeepSeek-V3.2-W8A8 nightly ci test:

DeepSeek-V3.2-W8A8 1node DP2+TP8
:tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py

### Does this PR introduce _any_ user-facing change

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-20 21:05:15 +08:00
Icey
402872050a [Tests] move qwen3 performance test from nightly to e2e (#5980)
### What this PR does / why we need it?
Move the qwen3 performance test from nightly to e2e to intercept
performance degradation.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-20 17:08:43 +08:00
zhangxinyuehfad
372f979aa5 [CI] Add DeepSeek R1 W8A8 HMB nightly ci (#5874)
### What this PR does / why we need it?

Add DeepSeek R1 W8A8 HMB nightly ci

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 20:48:20 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depend on pr5817

### Does this PR introduce _any_ user-facing change?

before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

after this pr: 
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

### How was this patch tested?

#### test qwen3-235b eplb num_redundant_experts=16

without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
SILONG ZENG
7a6fde80b1 [CI]Add Kimi k2 nightly test (#5682)
### What this PR does / why we need it?
The PR add performance and accuracy tests for **Kimi-K2-Instruct-W8A8**
and **Kimi-K2-Thinking** models to the Nightly test suite.

#### Test Configuration
**Kimi-K2-Instruct-W8A8**
- model: vllm-ascend/Kimi-K2-Instruct-W8A8
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Unified Distributed Inference
- Parallelism: **DP4 + TP8 + EP** (Data Parallel 4, Tensor Parallel 8,
Expert Parallel enabled).
  - Optimization: **torchair graph**, **no-prefix-caching**.
  - Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8.
  - Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8.
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs2800.
  - Accuracy: vllm-ascend/gsm8k-lite.

**Kimi-K2-Thinking**
- Model: moonshotai/Kimi-K2-Thinking
- Hardware: A3, 1 Node (16 NPUs total)
- Architecture: Single Node Distributed Inference
- Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled).
  - Optimization: **no-prefix-caching**
- Benchmarks:
  - Performance: vllm-ascend/GSM8K-in3500-bs400.
  - Accuracy: vllm-ascend/gsm8k-lite.


### Does this PR introduce _any_ user-facing change?
**Yes.** This PR enhances the ```AisbenchRunner``` to support dynamic
configuration of the ```trust_remote_code``` flag. This allows the
AISBench client to successfully load tokenizers for models that require
custom code execution (e.g., **Kimi-K2-Thinking and
Kimi-K2-Instruct-W8A8**).

**Changes:**
1. ```AisbenchRunner.__init__ ```Added the ability to capture the
```trust_remote_code``` parameter from the case configuration.
``` python
         self.batch_size = aisbench_config["batch_size"]
         self.request_rate = aisbench_config.get("request_rate", 0)
+        self.trust_remote_code = aisbench_config.get("trust_remote_code", False)
         self.temperature = aisbench_config.get("temperature")
         self.top_k = aisbench_config.get("top_k")
```
2. ```AisbenchRunner._init_request_conf``` Added regex substitution to
inject the parameter into the generated dynamic configuration file.
``` python
         content = re.sub(r'batch_size.*', f'batch_size = {self.batch_size},',
                          content)
+        content = re.sub(r'trust_remote_code=.*',
+                         f'trust_remote_code={self.trust_remote_code},',
+                         content)
         content = content.replace("top_k", "#top_k")
         content = content.replace("seed", "#seed")
```

**Details:**
- New Config Key: Users can add ```"trust_remote_code": True``` to any
dictionary within the ```aisbench_cases``` list.
- Default Value: Defaults to ```False``` to maintain existing security
protocols for standard models.
- Impact: Resolves ```ValueError``` when benchmarking reasoning models
or models with custom tokenizers that previously failed during the
AISBench local initialization phase.

**User Example:**
Users can now enable custom code execution for specific models (like
Kimi-K2-Thinking) directly in their test suite:
```
# Now supported in test scripts:
aisbench_cases = [{
    "case_type": "performance",
    "request_conf": "vllm_api_stream_chat",
    "trust_remote_code": True,  # New user-facing parameter
    ...
}]
```
### How was this patch tested?
Actions:
- https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433

Result as following:

- **Kimi-K2-Instruct-W8A8**(25m25s)
1. Accuracy test
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                       96.88
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max           │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 34571.489 ms   │ 28657.8054 ms  │ 36294.1788 ms │ 34714.7329 ms  │ 35247.2724 ms  │ 35526.6758 ms  │ 36146.4314 ms  │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 2043.9136 ms   │ 627.4718 ms    │ 3532.3978 ms  │ 1906.0194 ms   │ 2307.7979 ms   │ 2883.8528 ms   │ 3283.7012 ms   │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 127.5591 ms    │ 106.4937 ms    │ 137.107 ms    │ 128.3135 ms    │ 129.5704 ms    │ 131.1332 ms    │ 134.1087 ms    │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 126.5571 ms    │ 0.0095 ms      │ 1340.783 ms   │ 104.1398 ms    │ 110.1272 ms    │ 119.6124 ms    │ 950.2924 ms    │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3516.6055      │ 3014.0         │ 3985.0        │ 3525.0         │ 3525.0         │ 3586.8         │ 3800.67        │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 256.0          │ 256.0          │ 256.0         │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 279430.9375 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 512               │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 512               │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 63.3452           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 64                │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 1.8323 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1800502           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 1720.5255 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 131072            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 6443.4598 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 469.0676 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 6912.5274 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```

- **Kimi-K2-Thinking**(43m51s)
1. Accuracy test
```
dataset    version    metric    mode      vllm-api-general-chat
---------  ---------  --------  ------  -----------------------
gsm8k      7cd45e     accuracy  gen                      100.00
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ total   │ 172384.3573 ms │ 34456.5517 ms  │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms  │ 204428.9502 ms │ 205468.6776 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ total   │ 138740.3228 ms │ 655.1066 ms    │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ total   │ 131.9374 ms    │ 90.6331 ms     │ 135.4144 ms    │ 132.405 ms     │ 132.948 ms     │ 133.7549 ms    │ 135.2543 ms    │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ total   │ 130.9028 ms    │ 0.0099 ms      │ 960.3683 ms    │ 116.9623 ms    │ 122.3127 ms    │ 132.0522 ms    │ 886.4662 ms    │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ total   │ 3514.575       │ 3014.0         │ 3843.0         │ 3525.0         │ 3525.0         │ 3588.0         │ 3801.08        │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ total   │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 256.0          │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s  │ 400 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 1166795.568 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 400               │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 400               │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 59.0967           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 64                │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 0.3428 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1405830           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 25.332 token/s    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 102400            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 1204.864 token/s  │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 87.7617 token/s   │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 1292.6258 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
2026-01-12 15:56:07 +08:00
1092626063
f63c1341d9 [Feature] GLM4.6 support mtp with fullgraph (#5460)
### What this PR does / why we need it?
GLM4.6 support mtp with fullgraph to improve performance

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
`
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV

vllm serve /weight/glm4.6_w8a8_with_float_mtp \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --seed 1024 \
  --served-model-name glm \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 1,
"model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \
--compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32],
"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --async-scheduling \
`

test case:
`
vllm bench serve \
  --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 22400 \
  --prefix-repetition-suffix-len 9600 \
  --prefix-repetition-output-len 1024 \
  --num-prompts 1 \
  --prefix-repetition-num-prefixes 1 \
  --ignore-eos \
  --model glm \
  --tokenizer /weight/glm4.6_w8a8_with_float_mtp \
  --seed 1000 \
  --host 0.0.0.0 \
  --port 8000 \
  --endpoint /v1/completions \
  --max-concurrency 1 \
  --request-rate 1

`
- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: 1092626063 <1092626063@qq.com>
2026-01-09 16:07:42 +08:00
InSec
2d713fee93 [CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746)
### What this PR does / why we need it?
Close the **Full Graph** mode to temporarily avoid accuracy issue for
**Qwen3-Next-80B-A3B-Instruct-W8A8**.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: InSec <1790766300@qq.com>
2026-01-09 15:55:13 +08:00
LeeWenquan
a3a74d6984 [CI] Add qwen3 next ci (#5395)
### What this PR does / why we need it?
Add Qwen3Next CI 

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-01-09 10:29:09 +08:00
Icey
137f28341d [Tests] Add qwen3-8b nightly test (#5597)
### What this PR does / why we need it?
Add qwen3-8b nightly test 

- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-07 18:42:05 +08:00
InSec
089ca2ddcc [Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616)
### What this PR does / why we need it?
There was an accuracy issue with **Qwen3-Next-80B-A3B-Instruct-W8A8**
model in the old version of **Triton-Ascend**, so, we are now adding one
nightly test to maintain it.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: IncSec <1790766300@qq.com>
2026-01-06 17:36:00 +08:00
Li Wang
e760aae1df [1/N] Refactor nightly test structure (#5479)
### What this PR does / why we need it?
This patch is a series of refactoring actions, including clarifying the
directory structure of nightly tests, refactoring the config retrieval
logic, and optimizing the workflow, etc. This is the first step:
refactoring the directory structure of nightly to make it more readable
and logical.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 19:03:02 +08:00