Commit Graph

6 Commits

Author SHA1 Message Date
1092626063
f63c1341d9 [Feature] GLM4.6 support mtp with fullgraph (#5460)
### What this PR does / why we need it?
GLM4.6 support mtp with fullgraph to improve performance

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
`
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV

vllm serve /weight/glm4.6_w8a8_with_float_mtp \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --seed 1024 \
  --served-model-name glm \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 1,
"model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \
--compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32],
"cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --async-scheduling \
`

test case:
`
vllm bench serve \
  --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 22400 \
  --prefix-repetition-suffix-len 9600 \
  --prefix-repetition-output-len 1024 \
  --num-prompts 1 \
  --prefix-repetition-num-prefixes 1 \
  --ignore-eos \
  --model glm \
  --tokenizer /weight/glm4.6_w8a8_with_float_mtp \
  --seed 1000 \
  --host 0.0.0.0 \
  --port 8000 \
  --endpoint /v1/completions \
  --max-concurrency 1 \
  --request-rate 1

`
- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: 1092626063 <1092626063@qq.com>
2026-01-09 16:07:42 +08:00
InSec
2d713fee93 [CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746)
### What this PR does / why we need it?
Close the **Full Graph** mode to temporarily avoid accuracy issue for
**Qwen3-Next-80B-A3B-Instruct-W8A8**.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: InSec <1790766300@qq.com>
2026-01-09 15:55:13 +08:00
LeeWenquan
a3a74d6984 [CI] Add qwen3 next ci (#5395)
### What this PR does / why we need it?
Add Qwen3Next CI 

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-01-09 10:29:09 +08:00
Icey
137f28341d [Tests] Add qwen3-8b nightly test (#5597)
### What this PR does / why we need it?
Add qwen3-8b nightly test 

- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-07 18:42:05 +08:00
InSec
089ca2ddcc [Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616)
### What this PR does / why we need it?
There was an accuracy issue with **Qwen3-Next-80B-A3B-Instruct-W8A8**
model in the old version of **Triton-Ascend**, so, we are now adding one
nightly test to maintain it.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: IncSec <1790766300@qq.com>
2026-01-06 17:36:00 +08:00
Li Wang
e760aae1df [1/N] Refactor nightly test structure (#5479)
### What this PR does / why we need it?
This patch is a series of refactoring actions, including clarifying the
directory structure of nightly tests, refactoring the config retrieval
logic, and optimizing the workflow, etc. This is the first step:
refactoring the directory structure of nightly to make it more readable
and logical.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 19:03:02 +08:00