Commit Graph

4073 Commits

Author SHA1 Message Date
Lianmin Zheng
43ad05907c [Auto Sync] Update scheduler.py, server_args.py (20251020) (#11875) 2025-10-20 17:41:19 -07:00
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kan Wu <wukanustc@gmail.com>
fzyzcjy
0917c5da8c Support mixing cutedsl and deepgemm backend (#11807) 2025-10-21 07:38:35 +08:00
penguin_wwy
184a4df697 Replace function call with set literal (#11867) 2025-10-21 01:39:16 +08:00
Qiaolin Yu
f7b1d8c5ab Fix acc len and gen throughput metrics when enabling overlap-spec (#11823) 2025-10-21 01:34:38 +08:00
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Cheng Wan
bfc3b3f786 [9/N] MoE Refactor: cleanup dispatcher interfaces (#11847) 2025-10-20 10:11:46 -07:00
Liangsheng Yin
da5bde4d16 Tiny fix main lint (#11862) 2025-10-20 19:57:24 +08:00
DarkSharpness
276e7b3e4e [Feature] New structural tag support (#10691) 2025-10-20 18:25:58 +08:00
ishandhanani
296f689242 fix(server_args): handle tokenizer init conflicts (#11776) 2025-10-20 00:27:19 -07:00
Shane A
d383e6616e [Model] Add Olmo 3 model support (#11396) 2025-10-19 23:59:16 -07:00
Shangming Cai
a2ba0bc3df Tiny clean up for PD module and doc (#11747) 2025-10-20 11:52:42 +08:00
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Ziming Huang
6d2d0ce285 [PD] Improve eagle acceptance rate by transferring draft model hidden states (#10801) 2025-10-20 11:52:18 +08:00
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Yuan Luo
271d3d0d50 Support mrope triton kernel and add unit test (#11722) 2025-10-20 11:51:07 +08:00
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
ykcombat
c4e81e64fb [Feature] Use current greenctx stream to communicate in PD-Multiplexing. (#11594) 2025-10-20 10:58:20 +08:00
harrisonlimh
c726d44cc7 Recapture cuda graph after model weight update to resolve IMA error (#11780) 2025-10-20 10:50:03 +08:00
huangtingwei
cae3956585 check master server for mooncake store (#10510) 2025-10-20 09:37:09 +08:00
Liu-congo
be0058bc05 [BugFix] replace the input_to_float8 used in dsv2 (#11612) 2025-10-19 19:34:13 -05:00
Signed-off-by: Liu-congo <1502632128@qq.com>
fzyzcjy
a8ba32798e Fix triton_kernels import error on some hardwares (#11831) 2025-10-20 08:14:47 +08:00
Johnny
252dc4e112 [NVIDIA] FA3/FA4 Fix (#11606) 2025-10-19 17:10:10 -07:00
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Baizhou Zhang
cbb5fc2edc [CI] Add CI test for DeepSeek V3.2 MTP (#11835) 2025-10-19 17:00:25 -07:00
Stefan He
4fff1ec1d9 Deterministic Mode: Add 1-stage triton kernel for prefill (#11147) 2025-10-20 01:47:36 +08:00
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Binyao Jiang <bijiang@linkedin.com>
Liangsheng Yin
7a020e0f3b [Test] Add basic matched stop for beta eagle (#11833) 2025-10-20 01:17:00 +08:00
Liangsheng Yin
48738af7f9 [CI] always print back trace in retry() (#11834) 2025-10-20 01:12:49 +08:00
Paiiii
efa473348b [Spec Decoding] Support MTP for dsv3.2 (#11652) 2025-10-19 23:44:22 +08:00
Co-authored-by: Paiiiiiiiiiiiiii <zengpai@baidu.com>
Liangsheng Yin
d658f0497e [overlap-spec] fix stop condition and trimming (#11819) 2025-10-19 22:00:20 +08:00
Liangsheng Yin
57e25de756 Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" (#11827) 2025-10-19 19:44:06 +08:00
fzyzcjy
12eb02e982 Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 (#11805) 2025-10-19 16:15:13 +08:00
fzyzcjy
002d037359 Avoid generation gets hanging when user specifies multiple event loops (#5162) 2025-10-19 16:12:49 +08:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
fzyzcjy
ce399e154c Make single-batch overlap compatible with NextN (#11804) 2025-10-19 16:10:44 +08:00
fzyzcjy
ea6275dfbc Tiny add hints when users send requests to wrong place (#11808) 2025-10-19 16:10:20 +08:00
narutolhy
eb7318f1c2 support tokenized batch request (#11091) 2025-10-19 07:05:02 +00:00
YAMY
80407b0493 Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads (#10788) 2025-10-19 11:37:43 +08:00
Liangsheng Yin
b288f4f440 Improve send_sone script (#11817) 2025-10-19 11:28:16 +08:00
tazjin
6d6ea5af0c fix: do not wrap invalid grammar objects during constrained generation (#11328) 2025-10-19 10:54:33 +08:00
Marin
1dacedd2db make sure logit bias is applied during eagle spec decoding verification (#11555) 2025-10-19 10:53:33 +08:00
ybyang
b5e14b2b78 [1/2][feature] support openai like classification api (#11618) 2025-10-18 19:32:48 -07:00
Qiaolin Yu
ebda73dc72 Use cutlass fp4 gemm by default (#11813) 2025-10-18 14:10:15 -07:00
b8zhong
f9a7d9b3dc support server arg override KV cache to bf16 to avoid slow cases (#11749) 2025-10-19 02:49:48 +08:00
Liangsheng Yin
a93f10a722 [overlap-spec] support page size > 1 (#11772) 2025-10-19 02:09:13 +08:00
Teng Ma
585e1223f0 [HiCache] feat: add more eviction policy (#11506) 2025-10-18 15:49:45 +00:00
fzyzcjy
a7043c6f0d Bump torch_memory_saver to avoid installing pre-release versions (#11797) 2025-10-18 01:20:42 -07:00
Lianmin Zheng
67e34c56d7 Fix install instructions and pyproject.tomls (#11781) 2025-10-18 01:08:01 -07:00
Yuwei An
1d726528f7 Eager Compiler for Torch Compile (#11803) 2025-10-18 15:18:52 +08:00
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Minglei Zhu
f4488e9dd9 set default attention backend for deterministic inference (#11801) 2025-10-18 00:01:24 -07:00
Zilin Zhu
e68a2b5b2f [RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded (#10152) 2025-10-18 14:29:35 +08:00
Zilin Zhu
31b9f19e54 [RL] support weight update with DP attention (#11669) 2025-10-18 14:26:19 +08:00
Jimmy
f7ab955455 fix(glm45): disable reduce scatter (#11665) 2025-10-18 12:19:20 +08:00
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Chang Su
ca240eefb4 [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798) 2025-10-17 20:49:43 -07:00
Cheng Wan
5b214b50b6 [Refactor] move deep_gemm_wrapper out of quantization (#11784) 2025-10-17 18:57:54 -07:00
Minglei Zhu
13219e1e48 completely remove mixed mode deterministic test as prefix mode could cover it (#11783) 2025-10-17 17:46:03 -07:00
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
fzyzcjy
33e9bbec35 Make single-batch overlap compatible with offloading (#11614) 2025-10-18 08:45:54 +08:00