3382 Commits

Author SHA1 Message Date
Chang Su
7b02c32679 [Bugfix](gemma3_mm): handle flatten_batch constraint for multiple images (#6562) 2025-05-23 18:11:54 -07:00
miter
fefa19fec0 Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524)
Signed-off-by: miter <miterv@outlook.com>
2025-05-23 15:20:21 -07:00
Shi Shuai
9c574585b3 fix: remove content=none test when tool called (#6347) 2025-05-23 15:12:55 -07:00
Byron Hsu
8233cc10fd [PD] Support logprob & Add failure test (#6558) 2025-05-23 14:29:20 -07:00
HandH1998
1b2e8f76d9 [2/2] Support Qserve (#6521) 2025-05-23 12:39:18 -07:00
Byron Hsu
d2e0881a34 [PD] support spec decode (#6507)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-05-23 12:03:05 -07:00
Li Hui
2f42749184 Fix topk inference performance reduce (#6474) 2025-05-23 02:58:31 -07:00
YanbingJiang
d8189660a9 Update sgl-kernel UTs for activation/topk/norm/rope kernels (#6452) 2025-05-23 02:03:15 -07:00
Chunyuan WU
3ded6235c9 Add fp8 fused_experts kernel for CPU in sgl-kernel and add UT (#6404) 2025-05-23 02:01:55 -07:00
blzheng
4ba1eea83f Add fp8 qkv_proj_with_rope kernel for CPU in sgl-kernel and add UT (#6493) 2025-05-23 00:14:46 -07:00
Chang Su
4685fbb888 [VLM] Support chunk prefill for VLM (#6355)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-05-22 20:32:41 -07:00
Byron Hsu
0a4fc73b48 [PD] Fix failure abort (#6535) 2025-05-22 20:32:03 -07:00
Yineng Zhang
a6970a17f3 misc: fix accept_length (#6536) 2025-05-22 14:27:10 -07:00
ryang
a6ae3af15e Support XiaomiMiMo inference with mtp (#6059) 2025-05-22 14:14:49 -07:00
Yineng Zhang
0b07c4a99f chore: upgrade sgl-kernel v0.1.4 (#6532) 2025-05-22 13:28:16 -07:00
lukec
fc0e3b9174 Support qwen3 deepep (#6120) 2025-05-22 11:04:45 -07:00
Yineng Zhang
d71f3f0a2a chore: bump sgl-kernel v0.1.4 (#6522) 2025-05-22 09:47:42 -07:00
shangmingc
58f10679e1 Fix missing http status import for PD failure handler (#6520)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-05-22 15:23:54 +08:00
fzyzcjy
7a80f56513 Support dynamically rebalancing experts using EPLB (#6469) 2025-05-21 23:13:21 -07:00
fzyzcjy
9484eba4ad Support logging expert balancedness metrics (#6482) 2025-05-21 23:05:33 -07:00
Zilin Zhu
e9feb48838 [RL] Remove the w13 weight_scale and input_scale for UnquantizedEPMoE… (#6308) 2025-05-21 22:03:15 -07:00
fzyzcjy
fc992a09f9 Support updating expert locations dynamically (#6388) 2025-05-21 21:59:33 -07:00
Yuan Luo
121f92c583 Add main for merge state tests (#6492)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-05-21 21:56:25 -07:00
Byron Hsu
3bde101099 [PD] Abort request if transfer fails (#6504) 2025-05-21 21:44:25 -07:00
Byron Hsu
7513558074 [PD] Add doc and simplify sender.send (#6019) 2025-05-21 21:22:21 -07:00
HandH1998
4d643f6c7a [1/2] Support Qserve (#6457)
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
2025-05-21 19:48:59 -07:00
Ke Bao
6ce0ed073b Apply constraint grammar to EAGLE (#6499)
Co-authored-by: merrymercy <lianminzheng@gmail.com>
2025-05-21 17:18:41 -07:00
fzyzcjy
969660c762 Recover from corrupted cache file in bench serving (#6510) 2025-05-21 17:13:54 -07:00
Xinyuan Tong
16d4f6801b doc: Update README.md with adding deepwiki badge to enable weekly auto-refresh (#6508) 2025-05-21 16:27:34 -07:00
Kyungmin Lee
ada268fd05 fix: EXAONE when using tie_word_embeddings (#5759) 2025-05-21 11:30:04 -07:00
blzheng
cfe48c5902 [CPU] Fix build issue (#6419) 2025-05-21 11:17:10 -07:00
Baizhou Zhang
d4c038daed [Fix]Fix capture fail bug for DeepSeek (#6275) 2025-05-21 11:11:20 -07:00
fzyzcjy
55f6005f53 Fix bench_one_batch_server (#6503) 2025-05-21 11:08:17 -07:00
fzyzcjy
7222e1dacc Let bench_one_batch_server use sharegpt data to make expert distribution more natural (#5573) 2025-05-21 02:08:43 -07:00
fzyzcjy
505eec4dc9 Tiny make Lint CI show diff (#6445) 2025-05-21 02:06:25 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
fzyzcjy
a071dc4084 Tiny add stage assertions to DeepEPDispatcher to avoid misuse (#6467) 2025-05-21 02:05:05 -07:00
fzyzcjy
a40aecc5a3 Fix num_qps_per_rank computation when providing custom DeepEP configuration (#6468) 2025-05-21 02:04:33 -07:00
fzyzcjy
d6e1d28c8a Refactor DeepSeek attention dispatching (#6476) 2025-05-21 02:03:39 -07:00
Zilin Zhu
7c347259ff [RL] allow weight updation with dp attention enabled (#6311) 2025-05-21 01:58:55 -07:00
Zilin Zhu
669caa0a3f [router] support http2 in router (#6487) 2025-05-21 01:42:45 -07:00
Jiajun Li
4024e1d2a8 Implement Siglip Vision model, and support BNB quantization for gemma3-mm (#5339) 2025-05-20 23:53:46 -07:00
HAI
5c0b38f369 aiter attention-backend (default enabled on AMD/ROCm) (#6381) 2025-05-20 22:52:41 -07:00
Yuan Luo
30ca18f423 Refactor group_concurrent_contiguous in NIXL (#6214)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-05-21 11:55:04 +08:00
Lianmin Zheng
03886917bd Disable all two stream overlap on amd (#6475) 2025-05-20 19:06:59 -07:00
Wenxuan Tan
66324895c6 [docs] Fix torch version (#6472) 2025-05-20 10:53:14 -07:00
fzyzcjy
13feffd082 Fix master CI for DeepSeek (#6447) 2025-05-20 00:31:42 -07:00
fzyzcjy
e98afbe042 Support dispatching logical to physical experts (#6385) 2025-05-19 22:13:55 -07:00
JieXin Liang
69af3ec35f [doc] add note for get_num_kv_splits in triton_backend (#6444) 2025-05-19 21:40:21 -07:00
YanbingJiang
32cc66efa5 Update extend/decode attention kernel for CPU in sgl-kernel and add UTs (#6405)
Co-authored-by: mingfeima <mingfei.ma@intel.com>
2025-05-19 21:23:17 -07:00