Commit Graph

172 Commits

Author SHA1 Message Date
fzyzcjy
c49c1d9226 Remove 200us slow concat kernel (part 2: srt) (#7020) 2025-06-13 15:19:31 -07:00
pansicheng
2f4ec752bc filter by num_hidden_layers (#7056) 2025-06-13 00:53:09 -07:00
fzyzcjy
f6ebba537a Support both approximate and exact expert distribution collection (#6964) 2025-06-09 20:56:17 -07:00
fzyzcjy
de1350ea20 Minor remove one kernel for DeepSeek (#6977) 2025-06-08 17:41:35 -07:00
Xiaoyu Zhang
3712abfaf9 Fuse routed scaling factor in deepseek (#6970) 2025-06-08 15:24:24 -07:00
Yineng Zhang
1fb76ebb93 Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" (#6968) 2025-06-07 21:02:49 -07:00
Pavani Majety
c2c4f57f63 [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (#6853) 2025-06-07 17:24:35 -07:00
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Xiaoyu Zhang
515ef4facb Fuse routed scaling factor in topk_reduce kernel (#6220) 2025-06-07 11:06:50 -07:00
JieXin Liang
22fe787852 [sgl-kernel] update deepgemm (#6942) 2025-06-06 23:24:41 -07:00
miter
f8eaaab817 [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel. (#6767) 2025-06-06 21:32:33 -07:00
Signed-off-by: miter <miterv@outlook.com>
HAI
b819381fec AITER backend extension and workload optimizations (#6838) 2025-06-05 23:00:18 -07:00
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
Cheng Wan
81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) 2025-06-04 15:53:22 -07:00
Cheng Wan
8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) 2025-06-03 17:48:24 -07:00
fzyzcjy
0ea330ca34 Fix wrong weight reference in dynamic EPLB (#6818) 2025-06-02 23:26:04 -07:00
Li Hui
69dd878b51 Fix shared experts fusion error (#6289) 2025-05-30 01:16:11 -07:00
Zilin Zhu
51cdd81f97 [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (#6265) 2025-05-29 16:28:10 -07:00
fzyzcjy
31589e177e Speed up when having padding tokens two-batch overlap (#6668) 2025-05-28 16:00:58 -07:00
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
fzyzcjy
541a985f85 Fuse routed_scaling_factor in DeepSeek (#6710) 2025-05-28 15:53:37 -07:00
HAI
183d9f969c DeepSeek: enable none block-quant FP8 quantizations (#6638) 2025-05-27 09:06:40 -07:00
fzyzcjy
32cd707002 Support TP in attention for two batch overlap (#6634) 2025-05-26 20:28:12 -07:00
fzyzcjy
0ca3e56802 Tiny fix missing expert location dispatch info (#6620) 2025-05-26 08:58:31 -07:00
Yi Zhang
65f091310c refactor qwen moe code, use communicator to support tp+dp (#6581) 2025-05-25 23:01:10 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
fzyzcjy
b2388433be Add back DeepSeek non-TBO branches (#6578) 2025-05-24 17:34:00 -07:00
fzyzcjy
a38376fa99 Refactor attention into multiple stages (#6477) 2025-05-24 17:33:25 -07:00
fzyzcjy
fc992a09f9 Support updating expert locations dynamically (#6388) 2025-05-21 21:59:33 -07:00
Baizhou Zhang
d4c038daed [Fix]Fix capture fail bug for DeepSeek (#6275) 2025-05-21 11:11:20 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
fzyzcjy
d6e1d28c8a Refactor DeepSeek attention dispatching (#6476) 2025-05-21 02:03:39 -07:00
Lianmin Zheng
03886917bd Disable all two stream overlap on amd (#6475) 2025-05-20 19:06:59 -07:00
fzyzcjy
13feffd082 Fix master CI for DeepSeek (#6447) 2025-05-20 00:31:42 -07:00
fzyzcjy
e98afbe042 Support dispatching logical to physical experts (#6385) 2025-05-19 22:13:55 -07:00
HAI
6317c5c61f Address performance regression: disable multiple streams on ROCm (#6412) 2025-05-19 21:16:20 -07:00
fzyzcjy
d0443275f0 Refactor DeepSeek logic into atomic operations (#6326) 2025-05-19 21:05:30 -07:00
fzyzcjy
1b19df4b2a Refactor communication logic of DeepSeek for extensibility and understandability (#6321) 2025-05-19 20:14:48 -07:00
fzyzcjy
f0653886a5 Expert distribution recording without overhead for EPLB (#4957) 2025-05-19 20:07:43 -07:00
fzyzcjy
72bfb0baf0 Refactor DeepSeek MoE layer to unify the two forward branches (#6325) 2025-05-18 15:34:36 -07:00
fzyzcjy
2716830802 Speed up when having padding tokens in DeepEP (#6175) 2025-05-17 16:44:05 -07:00
fzyzcjy
2df9d40aa6 Minor code cleanup refactor for DeepSeek models (#6324) 2025-05-16 19:06:03 -07:00
fzyzcjy
8dc191f237 Fix one wasted kernel in DeepSeek and minor refactor (#6316) 2025-05-16 19:05:33 -07:00
fzyzcjy
f194e14fb7 Reduce MoE memory usage (#6147) 2025-05-15 09:38:28 -07:00
Cheng Wan
b2e95f62b4 Fix two issues related to --moe-dense-tp-size=1 (#5657) 2025-05-12 23:51:39 -07:00
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>
Cheng Wan
25c83fff6a Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558) 2025-05-11 23:36:29 -07:00
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
applesaucethebun
2ce8793519 Add typo checker in pre-commit (#6179) 2025-05-11 12:55:00 +08:00
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
JieXin Liang
c178abdabc [fix] fix determine_n_share_experts_fusion (#6118) 2025-05-10 01:19:09 -07:00
xu-yfei
e30c273bc9 opt flashinfer mla cat (#5822) 2025-05-08 23:17:14 -07:00
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
JieXin Liang
5e02330137 [perf] dsv3 bmm fallback to bf16 (#5662) 2025-05-08 11:43:39 -07:00
lukec
acc816d8a2 DeepEP normal support deepgemm-contiguous (#5626) 2025-05-08 01:20:32 -07:00
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
Co-authored-by: ZhengHSI <zhenghsi@qq.com>
Baizhou Zhang
73600673bb Clean logs for DeepSeek-V3 launching (#6079) 2025-05-07 18:54:50 -07:00
JieXin Liang
b70957fcf8 [refactor] slightly tidy fp8 module (#5993) 2025-05-07 17:28:24 -07:00