1092626063
f63c1341d9
[Feature] GLM4.6 support mtp with fullgraph (#5460)
### What this PR does / why we need it?
Adds MTP (multi-token prediction) support with full-graph mode for GLM4.6 to improve performance.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```shell
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV
vllm serve /weight/glm4.6_w8a8_with_float_mtp \
--data-parallel-size 1 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name glm \
--max-model-len 35000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 16 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 1,
"model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \
--compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32],
"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--async-scheduling
```
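The capture sizes in the serve command can be sanity-checked against the batch limits with a little arithmetic. The sketch below assumes that each decode step schedules `1 + num_speculative_tokens` tokens per sequence, so the largest decode batch is `max-num-seqs` times that figure. That assumption is not stated in this PR, and the variable names are illustrative only.

```shell
# Values copied from the serve command above.
max_num_seqs=16
num_speculative_tokens=1
largest_capture_size=32

# Assumption: one verified token plus the speculative tokens per sequence per decode step.
tokens_per_seq=$((1 + num_speculative_tokens))
max_decode_tokens=$((max_num_seqs * tokens_per_seq))

echo "max_decode_tokens=$max_decode_tokens"
if [ "$max_decode_tokens" -le "$largest_capture_size" ]; then
  echo "largest capture size covers a full decode batch"
else
  echo "largest capture size is smaller than a full decode batch"
fi
```

Under that assumption, 16 sequences with one speculative token each give 32 tokens, which matches the largest entry in `cudagraph_capture_sizes`.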
Test case:
```shell
vllm bench serve \
--backend vllm \
--dataset-name prefix_repetition \
--prefix-repetition-prefix-len 22400 \
--prefix-repetition-suffix-len 9600 \
--prefix-repetition-output-len 1024 \
--num-prompts 1 \
--prefix-repetition-num-prefixes 1 \
--ignore-eos \
--model glm \
--tokenizer /weight/glm4.6_w8a8_with_float_mtp \
--seed 1000 \
--host 0.0.0.0 \
--port 8000 \
--endpoint /v1/completions \
--max-concurrency 1 \
--request-rate 1
```
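As a quick check that the benchmark request fits the server limits, the token budget can be worked out from the flags used above. This is a minimal sketch that assumes the prompt length is the prefix length plus the suffix length, and that a prompt longer than `--max-num-batched-tokens` is prefilled in chunks of at most that size. Neither point is stated in this PR.

```shell
# Values copied from the serve and bench commands above.
prefix_len=22400
suffix_len=9600
output_len=1024
max_model_len=35000
max_num_batched_tokens=16384

# Assumption: prompt = prefix + suffix tokens.
prompt_len=$((prefix_len + suffix_len))
total_len=$((prompt_len + output_len))

# Ceiling division: prefill steps needed if the prompt is chunked by the batch budget.
prefill_steps=$(( (prompt_len + max_num_batched_tokens - 1) / max_num_batched_tokens ))

echo "prompt_len=$prompt_len total_len=$total_len prefill_steps=$prefill_steps"
if [ "$total_len" -le "$max_model_len" ]; then
  echo "request fits within max-model-len"
else
  echo "request exceeds max-model-len"
fi
```

With these values the request uses 32000 prompt tokens plus 1024 output tokens, 33024 in total, which is below the 35000-token `--max-model-len`.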
- vLLM version: v0.13.0
- vLLM main: 5326c89803
Signed-off-by: 1092626063 <1092626063@qq.com>
2026-01-09 16:07:42 +08:00