Commit Graph

393 Commits

Author SHA1 Message Date
Cheng Wan
ac5010e0ba Fix CUDA Graph Check under Deepep with DP FFN (#7451) 2025-06-22 20:35:58 -07:00
Liangsheng Yin
05c9bc8956 [minor] simplify the TokenToKVPoolAllocator (#7414) 2025-06-22 12:37:18 +08:00
Lifu Huang
1998ce4046 Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (#7412) 2025-06-21 16:09:19 -07:00
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
Stefan He
3774f07825 Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099) 2025-06-19 00:56:37 -07:00
YanbingJiang
094c116f7d Update python API of activation, topk, norm and rope and remove vllm dependency (#6614) 2025-06-17 22:11:50 -07:00
    Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
    Co-authored-by: jianan-gu <jianan.gu@intel.com>
    Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081) 2025-06-17 00:33:28 -07:00
    Co-authored-by: austindeng <austindeng@tencent.com>
    Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
    Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
    Co-authored-by: ch-wan <cwan39@gatech.edu>
KavioYu
873ae12cee support custom weight loader for model runner (#7122) 2025-06-16 16:28:15 -07:00
    Co-authored-by: kavioyu <kavioyu@tencent.com>
Lianmin Zheng
c64290dcb5 Use seq_len_fill_value in the cuda graph runners (#7233) 2025-06-16 15:57:07 -07:00
woodx
e30ef368ab Feat/support rerank (#6058) 2025-06-16 10:50:01 -07:00
fzyzcjy
b4c41f7276 Refactor DeepGEMM integration (#7150) 2025-06-13 20:41:03 -07:00
Lianmin Zheng
6b12d6a8d5 Simplify the heuristics for setting --mem-fraction-static (#7054) 2025-06-10 19:01:39 -07:00
kyle-pena-kuzco
b56de8f943 Open AI API hidden states (#6716) 2025-06-10 14:37:29 -07:00
Lianmin Zheng
4a102a2b02 Minor style fix in cuda_graph_runner.py (#7053) 2025-06-10 06:32:41 -07:00
Lianmin Zheng
6406408a70 Clean up server_args.py (#7037) 2025-06-10 05:34:29 -07:00
Byron Hsu
c2b16795b5 Add decode req pool (#6980) 2025-06-09 21:23:36 -07:00
Baizhou Zhang
6716b41786 Update default settings for blackwell (#7023) 2025-06-09 20:37:47 -07:00
Lianmin Zheng
dc0705a504 Simplify prepare_extend_after_decode (#6987) 2025-06-09 16:39:21 -07:00
Baizhou Zhang
971a0dfa32 Extend cuda graph capture bs for B200 (#6937) 2025-06-08 05:13:22 -07:00
fzyzcjy
2fc1299562 Remove unnecessary kernels of num_token_non_padded (#6965) 2025-06-08 05:09:17 -07:00
Lianmin Zheng
20d3ad3b58 Fix CI and triton moe Configs (#6974) 2025-06-08 05:06:46 -07:00
fzyzcjy
f5599ef124 Refactor global_server_args_dict (#6866) 2025-06-07 03:10:35 -07:00
fzyzcjy
c499591ac8 Add canary for EPLB rebalancing (#6895) 2025-06-07 03:09:33 -07:00
Lianmin Zheng
60fdad7cf3 Sync the changes on cuda graph runners (#6932) 2025-06-06 18:23:52 -07:00
HAI
b819381fec AITER backend extension and workload optimizations (#6838) 2025-06-05 23:00:18 -07:00
    Co-authored-by: wunhuang <wunhuang@amd.com>
    Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
fzyzcjy
0de5e7d40f Support layerwise rebalancing experts (#6851) 2025-06-05 00:05:52 -07:00
zyksir
8e3797be1c support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277) 2025-06-04 22:11:24 -07:00
Cheng Wan
81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) 2025-06-04 15:53:22 -07:00
Cheng Wan
8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) 2025-06-03 17:48:24 -07:00
fzyzcjy
b6d0ce9f78 Minor add metrics to expert location updater (#6816) 2025-06-02 23:59:11 -07:00
fzyzcjy
e05e29d178 Refactor CustomOp to avoid confusing bugs (#5382) 2025-06-02 10:27:36 -07:00
Lianmin Zheng
20fd53b8f6 Correctly abort the failed grammar requests & Improve the handling of abort (#6803) 2025-06-01 19:00:07 -07:00
Lianmin Zheng
2d72fc47cf Improve profiler and integrate profiler in bench_one_batch_server (#6787) 2025-05-31 15:53:55 -07:00
YanbingJiang
888cb175a6 Add intel_amx backend for Radix Attention for CPU (#6408) 2025-05-30 21:37:42 -07:00
    Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
    Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Jianan Ji
22630ca242 Support sliding window in triton backend (#6509) 2025-05-30 01:11:53 -07:00
Baizhou Zhang
f2bd3515fb Tune memory arguments on B200 (#6718) 2025-05-29 00:03:22 -07:00
fzyzcjy
ae6a5b2950 Minor refactor two-batch overlap (#6682) 2025-05-28 15:54:17 -07:00
fzyzcjy
673ff668f7 Speed up expert location update (#6661) 2025-05-27 10:00:09 -07:00
fzyzcjy
447be24228 Fix OOM when updating expert locations (#6660) 2025-05-27 09:59:53 -07:00
Yi Zhang
14d1075f2c fix qwen3moe eplb prefill bug (#6617) 2025-05-26 02:15:21 -07:00
fzyzcjy
5ccf8fe1a0 Hint users when weight update timeouts (#6570) 2025-05-25 09:13:17 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
kk
7a5e6ce1cb Fix GPU OOM (#6564) 2025-05-24 16:38:39 -07:00
    Co-authored-by: michael <michael.zhang@amd.com>
fzyzcjy
c4831e2fcf Fix accuracy is zero when enabling moe-dense-tp-size as in large scale EP (#6567) 2025-05-24 00:27:10 -07:00
Chang Su
4685fbb888 [VLM] Support chunk prefill for VLM (#6355) 2025-05-22 20:32:41 -07:00
    Co-authored-by: yizhang2077 <1109276519@qq.com>
ryang
a6ae3af15e Support XiaomiMiMo inference with mtp (#6059) 2025-05-22 14:14:49 -07:00
fzyzcjy
7a80f56513 Support dynamically rebalancing experts using EPLB (#6469) 2025-05-21 23:13:21 -07:00
fzyzcjy
fc992a09f9 Support updating expert locations dynamically (#6388) 2025-05-21 21:59:33 -07:00
Baizhou Zhang
d4c038daed [Fix]Fix capture fail bug for DeepSeek (#6275) 2025-05-21 11:11:20 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
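A listing in this Author / SHA1 / Message / Date shape can be regenerated from any local clone with `git log` pretty-format placeholders. The sketch below builds a throwaway repository so it is self-contained; the author name and commit message are illustrative, not taken from this log.

```shell
set -e
# Build a throwaway repo so the formatting command has something to show.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name='Example Author' -c user.email='author@example.com' \
    commit --allow-empty -q -m 'Example commit message (#0000)'
# %an = author name, %h = abbreviated SHA1, %s = subject, %ad = author date.
# --date=format: matches the "2025-06-22 20:35:58 -0700"-style timestamps above.
git log -n 1 --date=format:'%Y-%m-%d %H:%M:%S %z' \
    --pretty=format:'%an%n%h %s %ad'
```

Co-authored-by trailers, when present, can be shown with `%(trailers:key=Co-authored-by)` in the pretty format.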