Commit Graph

365 Commits

Author SHA1 Message Date
Ke Bao
799c4bb502 Fuse MLA set kv cache kernel (#5748) 2025-04-26 18:42:22 -07:00
ZXN
04d0123fd9 [Fix]: support deepseek-vl2-tiny model (#5552)
Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>
2025-04-26 17:52:53 +08:00
Ke Bao
c3948ba67e Reorder loop in shared expert weight loading (#5719) 2025-04-25 17:27:42 -07:00
Lianmin Zheng
5641a09458 Revert "[Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct)" (#5754) 2025-04-25 15:50:28 -07:00
Brayden Zhong
43fb95c2fa [Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct) (#5078)
Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
2025-04-25 15:24:09 +08:00
Baizhou Zhang
a14654dd68 Fix weight loading bug for Deepseek v3+nextn (#5684) 2025-04-24 21:29:56 +08:00
Yuhong Guo
5d93a950ee [BugFix] Fix combination of MTP and --n-share-experts-fusion with R1 (#5707) 2025-04-24 21:13:51 +08:00
Mick
c998d04b46 vlm: enable radix cache for qwen-vl models (#5349)
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
2025-04-23 20:35:05 -07:00
fzyzcjy
71d1785f2d Remove unnecessary torch.full in DeepSeek (#5601) 2025-04-22 21:24:29 -07:00
Baizhou Zhang
3f87f83116 Fuse q_a_proj and kv_a_proj (#5619) 2025-04-22 20:35:08 -07:00
Ke Bao
6b6e748775 Remove q concat in FA3 backend for DeepSeek decode (#5638) 2025-04-22 11:43:12 -07:00
lambert0312
76d17c7ecb Fix shared experts fusion error without quantization (#5632) 2025-04-22 09:22:26 -07:00
JieXin Liang
4418f599a5 Fix FA3 DeepSeek prefill performance regression (#5624)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-04-22 01:41:41 -07:00
Yineng Zhang
04f2abcb34 fix: gemma 3 not use softcap (#5622) 2025-04-22 01:16:08 -07:00
JieXin Liang
506be6b892 [fix] fix compile_deep_gemm missing kv_b_proj (#5620) 2025-04-22 00:06:36 -07:00
Ke Bao
11b23ae97b Remove extra copy in deepseek forward absorb (#5578)
Co-authored-by: saienduri <saimanas.enduri@amd.com>
2025-04-21 19:33:21 -07:00
JieXin Liang
c2942907d5 [feature] enable pre compile jit deep_gemm (#5580) 2025-04-21 16:52:53 -07:00
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
fzyzcjy
5239d79568 Speedup shared expert weight construction by avoiding cloning (#5188) 2025-04-20 18:12:01 -07:00
Brayden Zhong
b868526d94 Fix one more issue reported by torchfix (#4859) 2025-04-20 17:49:27 -07:00
Xiaoyu Zhang
d9dd529854 enable DeepSeek V3 shared_experts_fusion in sm90 (#5571) 2025-04-20 12:46:42 -07:00
fzyzcjy
0a0dd34e6a Fix BumpAllocator error when no input_ids (#5564) 2025-04-20 02:20:53 -07:00
JieXin Liang
99456bcacb [perf] introduce deep gemm group_gemm_masked as bmm (#5432) 2025-04-20 00:38:27 -07:00
fzyzcjy
613b197e57 Remove one kernel in per_tensor_quant_mla_fp8 (#5549) 2025-04-19 15:08:15 -07:00
Xiaoyu Zhang
d58e354472 simplify the control logic for using shared experts fusion (#5504) 2025-04-19 13:17:35 -07:00
yhyang201
4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) 2025-04-18 09:51:29 -07:00
fzyzcjy
53dcf38876 Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (#4836) 2025-04-17 21:38:26 -07:00
Michael Feil
1effba4c70 Configuration qwen2_moe.py - qkv_bias now in transformers (#5512) 2025-04-17 21:23:22 -07:00
Xuchun Shang
06d0a3d92b [Bug fix] use correct func path in deepseek (#5496)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
2025-04-17 02:41:41 -07:00
fzyzcjy
8beb356f0d Refactor DeepSeek decoder layer branches (#5205) 2025-04-17 02:11:11 -07:00
woodx
3bface15e6 Feat/support encoder model (like bert) (#4887) 2025-04-17 01:50:48 -07:00
Baizhou Zhang
4fb05583ef Deprecate disable-mla (#5481) 2025-04-17 01:43:14 -07:00
eigen
8f783c1943 [Model Support] unsloth/Phi-4-mini bnb model (#4982)
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-04-16 19:58:20 -07:00
Lianmin Zheng
177320a582 Clean up imports (#5467) 2025-04-16 15:26:49 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
ryang
bc24205b32 Support BNB quantization for llama/mllama (#5038)
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
2025-04-15 18:00:31 -07:00
Yangcheng Li
ee9d6ca677 [fix/misc] remove duplicate row in deepseek v2 model (#5279) 2025-04-14 18:41:24 -07:00
JieXin Liang
bdde237562 [perf] experimental enhance fp8 per-tensor quant (#5370) 2025-04-14 12:35:43 -07:00
yhyang201
072df75354 Support for Qwen2.5-VL Model in bitsandbytes Format (#5003) 2025-04-14 02:03:40 -07:00
yulei
adca585bfb [DeepEP] Reduce routed scaling overhead (#5277)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-04-13 16:03:09 -07:00
lambert0312
1b1b47a949 Fix w8a8_int8 model shared experts fusion load weights error (#5120) 2025-04-11 23:33:51 -07:00
Mick
34ef6c8135 [VLM] Adopt fast image processor by default (#5065) 2025-04-11 21:46:58 -07:00
Mick
e53a0b3d5b [fix] fix mrope positions not picked up (#5265) 2025-04-11 01:29:45 -07:00
fzyzcjy
cd7e32e2cb Optimize attention in llama4 (#5127) 2025-04-11 00:32:41 -07:00
fzyzcjy
e3c4bd3153 Fix DeepSeek error when using DeepEP mode (#5190) 2025-04-09 17:43:22 -07:00
Mick
fbebcb7aa4 model: support mllama4 (#5144) 2025-04-09 09:28:44 -07:00
HandH1998
4065248214 Support Llama4 fp8 inference (#5194)
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: zhyncs <me@zhyncs.com>
2025-04-09 20:14:34 +08:00
fzyzcjy
86a876d883 Optimize topk operation in llama4 (#5128) 2025-04-09 02:50:22 -07:00
Yineng Zhang
90caf06c00 fix: use DeepEPDispatcher on CUDA (#5180) 2025-04-08 21:56:53 -07:00
Jinyan Chen
bc3f6db2dd [Fix] DeepEP Compatibility with Low Latency (#5068)
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-04-08 20:31:31 -07:00