# Commit Graph

325 commits
| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| fzyzcjy | `7e5071c92a` | Super tiny enable sole usage of expert distribution metrics and update doc (#6680) | 2025-05-29 08:11:38 -07:00 |
| Baizhou Zhang | `f2bd3515fb` | Tune memory arguments on B200 (#6718) | 2025-05-29 00:03:22 -07:00 |
| JieXin Liang | `535c838674` | [fix] more mem for draft_extend cuda_graph (#6726) | 2025-05-28 23:25:18 -07:00 |
| Ke Bao | `631950280a` | Support EAGLE draft extend CUDA graph (#6606)<br>Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu> | 2025-05-27 02:35:17 -07:00 |
| Lifu Huang | `477a101cbd` | Refactor LoRA handling to support adapter tensors in fused format (#6585) | 2025-05-26 21:51:54 -07:00 |
| fzyzcjy | `fe386acae6` | Automatically configure for EPLB-related args (#6628) | 2025-05-26 08:42:49 -07:00 |
| fzyzcjy | `0ca1811715` | Support fake perfectly balanced EP dispatch algorithm (#6571) | 2025-05-25 22:35:51 -07:00 |
| fzyzcjy | `0d47788025` | Support overlapping two batches (#4068) | 2025-05-24 17:39:07 -07:00 |
| Neo | `2e37fa07ba` | [FIX]remove ServerArgs duplicate code (#6485) | 2025-05-23 22:54:41 -07:00 |
| miter | `fefa19fec0` | Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524)<br>Signed-off-by: miter <miterv@outlook.com> | 2025-05-23 15:20:21 -07:00 |
| HandH1998 | `1b2e8f76d9` | [2/2] Support Qserve (#6521) | 2025-05-23 12:39:18 -07:00 |
| fzyzcjy | `7a80f56513` | Support dynamically rebalancing experts using EPLB (#6469) | 2025-05-21 23:13:21 -07:00 |
| fzyzcjy | `9484eba4ad` | Support logging expert balancedness metrics (#6482) | 2025-05-21 23:05:33 -07:00 |
| fzyzcjy | `ccfe5c009d` | Support redundant experts in expert parallel (#6461) | 2025-05-21 02:05:53 -07:00 |
| HAI | `5c0b38f369` | aiter attention-backend (default enabled on AMD/ROCm) (#6381) | 2025-05-20 22:52:41 -07:00 |
| fzyzcjy | `e98afbe042` | Support dispatching logical to physical experts (#6385) | 2025-05-19 22:13:55 -07:00 |
| fzyzcjy | `f0653886a5` | Expert distribution recording without overhead for EPLB (#4957) | 2025-05-19 20:07:43 -07:00 |
| Trevor Morris | `7adf245ba2` | [Metrics] Add KV events publishing (#6098) | 2025-05-19 14:19:54 -07:00 |
| fzyzcjy | `fd08c04821` | Support custom DeepEP tuning config (#6257) | 2025-05-17 17:09:42 -07:00 |
| Lianmin Zheng | `e07a6977e7` | Minor improvements of TokenizerManager / health check (#6327) | 2025-05-15 15:29:25 -07:00 |
| Yi Liu | `cfc9f9ab8d` | Fix gpu mem check on CPU (#6317)<br>Signed-off-by: yiliu30 <yi4.liu@intel.com> | 2025-05-15 09:37:45 -07:00 |
| Lianmin Zheng | `d18c6b3358` | Support incremental streaming of logprob/token_ids between scheduler and detokenizer (#6225)<br>Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 14:33:38 -07:00 |
| Lianmin Zheng | `e8e18dcdcc` | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00 |
| Ying Sheng | `bad7c26fdc` | [PP] Fix init_memory_pool desync & add PP for mixtral (#6223) | 2025-05-12 12:38:09 -07:00 |
| applesaucethebun | `d738ab52f8` | fix some typos (#6209)<br>Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-13 01:42:38 +08:00 |
| Lianmin Zheng | `fba8eccd7e` | Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)<br>Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 00:17:33 -07:00 |
| Cheng Wan | `25c83fff6a` | Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558)<br>Co-authored-by: liusy58 <liusy58@linux.alibaba.com> | 2025-05-11 23:36:29 -07:00 |
| Xiaoyu Zhang | `9f2c9568f0` | [doc] add a note for --n-share-experts-fusion args (#6154) | 2025-05-11 23:18:38 -07:00 |
| Lianmin Zheng | `03227c5fa6` | [CI] Reorganize the 8 gpu tests (#6192) | 2025-05-11 10:55:06 -07:00 |
| applesaucethebun | `2ce8793519` | Add typo checker in pre-commit (#6179)<br>Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-11 12:55:00 +08:00 |
| Zhu Chen | `fa7d7fd9e5` | [Feature] Add FlashAttention3 as a backend for VisionAttention (#5764)<br>Co-authored-by: othame <chenzhu_912@zju.edu.cn><br>Co-authored-by: Mick <mickjagger19@icloud.com><br>Co-authored-by: Yi Zhang <1109276519@qq.com> | 2025-05-08 10:01:19 -07:00 |
| fzyzcjy | `4c7b42424c` | Hint users DeepEP normal mode is incompatible with CUDA Graph (#5014) | 2025-05-07 22:40:59 +08:00 |
| Song Zhang | `00c2c1f08b` | [Feature] Support for Ascend NPU backend (#3853)<br>Signed-off-by: Song Zhang <gepin.zs@antgroup.com><br>Co-authored-by: 22dimensions <waitingwind@foxmail.com> | 2025-05-06 20:32:53 -07:00 |
| Liangsheng Yin | `a3e4e9bf9e` | Better PD initialization (#5751) | 2025-05-07 01:12:57 +08:00 |
| shangmingc | `56f6589ecb` | [PD] Optimize disaggregation ib device help info (#5781) | 2025-05-05 13:47:37 +08:00 |
| Ke Bao | `ebaba85655` | Update ci test and doc for MTP api change (#5952) | 2025-05-01 09:30:27 -07:00 |
| Ying Sheng | `11383cec3c` | [PP] Add pipeline parallelism (#5724) | 2025-04-30 18:18:07 -07:00 |
| Chang Su | `2b06484bd1` | feat: support pythonic tool call and index in tool call streaming (#5725) | 2025-04-29 17:30:44 -07:00 |
| JieXin Liang | `e4b6133b78` | [fix] relax mem_fraction_static for h200 (#5893)<br>Co-authored-by: alcanerian <alcanerian@gmail.com> | 2025-04-29 17:01:12 -07:00 |
| Ke Bao | `dd408ee481` | Auto set draft model path for MTP (#5793) | 2025-04-29 16:25:40 -07:00 |
| Qiaolin Yu | `8c0cfca87d` | Feat: support cuda graph for LoRA (#4115)<br>Co-authored-by: Beichen Ma <mabeichen12@gmail.com> | 2025-04-28 23:30:44 -07:00 |
| Lianmin Zheng | `849c83a0c0` | [CI] test chunked prefill more (#5798) | 2025-04-28 10:57:17 -07:00 |
| Trevor Morris | `84810da4ae` | Add Cutlass MLA attention backend (#5390) | 2025-04-27 20:58:53 -07:00 |
| Lianmin Zheng | `621e96bf9b` | [CI] Fix ci tests (#5769) | 2025-04-27 07:18:10 -07:00 |
| Kebe | `7e944246c3` | Add memory_saver check (#4986)<br>Signed-off-by: Kebe <mail@kebe7jun.com> | 2025-04-26 20:20:05 -07:00 |
| vzed | `df2cf583ce` | we fix the non existent access of decrypted_config_file (#5685) | 2025-04-26 18:32:37 -07:00 |
| Mick | `c998d04b46` | vlm: enable radix cache for qwen-vl models (#5349)<br>Co-authored-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-04-23 20:35:05 -07:00 |
| Yineng Zhang | `04f2abcb34` | fix: gemma 3 not use softcap (#5622) | 2025-04-22 01:16:08 -07:00 |
| Lianmin Zheng | `1343200299` | Clean up mem settings (#5610) | 2025-04-21 17:19:00 -07:00 |
| Liangsheng Yin | `e69a219074` | Enhance GPU memory settings (#5604) | 2025-04-21 15:15:00 -07:00 |