Commit Graph

347 Commits

Author SHA1 Message Date
Trevor Morris
5962e70d8d FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327) 2025-06-22 13:38:47 -07:00
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
KavioYu
873ae12cee support custom weight loader for model runner (#7122) 2025-06-16 16:28:15 -07:00
Co-authored-by: kavioyu <kavioyu@tencent.com>
Byron Hsu
7d316991b2 [PD] Update prefill.py (#7190) 2025-06-14 15:59:54 -07:00
Povilas Kanapickas
bd7cfbd2f8 [Fix] Reduce busy polling when scheduler is idle (#6026) 2025-06-12 14:58:22 -07:00
Lianmin Zheng
dbdf76ca98 Clean up docs for server args and sampling parameters (generated by grok) (#7076) 2025-06-10 19:55:42 -07:00
Lianmin Zheng
6b12d6a8d5 Simplify the heuristics for setting --mem-fraction-static (#7054) 2025-06-10 19:01:39 -07:00
kyle-pena-kuzco
b56de8f943 Open AI API hidden states (#6716) 2025-06-10 14:37:29 -07:00
Lianmin Zheng
6406408a70 Clean up server_args.py (#7037) 2025-06-10 05:34:29 -07:00
fzyzcjy
f6ebba537a Support both approximate and exact expert distribution collection (#6964) 2025-06-09 20:56:17 -07:00
Lianmin Zheng
dc0705a504 Simplify prepare_extend_after_decode (#6987) 2025-06-09 16:39:21 -07:00
Swipe4057
e1ce44cdb1 Disabling mixed chunked prefill when eagle is enabled (#6874) 2025-06-07 03:06:58 -07:00
Lianmin Zheng
60fdad7cf3 Sync the changes on cuda graph runners (#6932) 2025-06-06 18:23:52 -07:00
fzyzcjy
0de5e7d40f Support layerwise rebalancing experts (#6851) 2025-06-05 00:05:52 -07:00
zyksir
8e3797be1c support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277) 2025-06-04 22:11:24 -07:00
Cheng Wan
81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) 2025-06-04 15:53:22 -07:00
Marc Sun
37f1547587 [FEAT] Add transformers backend support (#5929) 2025-06-03 21:05:29 -07:00
Cheng Wan
8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) 2025-06-03 17:48:24 -07:00
Lianmin Zheng
2d72fc47cf Improve profiler and integrate profiler in bench_one_batch_server (#6787) 2025-05-31 15:53:55 -07:00
YanbingJiang
888cb175a6 Add intel_amx backend for Radix Attention for CPU (#6408) 2025-05-30 21:37:42 -07:00
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Baizhou Zhang
73def253b5 Fix mem_fraction_static for AMD CI (#6748) 2025-05-29 12:37:30 -07:00
fzyzcjy
3ab7d9b55e Support picking variants of EPLB algorithms (#6728) 2025-05-29 08:12:01 -07:00
fzyzcjy
7e5071c92a Super tiny enable sole usage of expert distribution metrics and update doc (#6680) 2025-05-29 08:11:38 -07:00
Baizhou Zhang
f2bd3515fb Tune memory arguments on B200 (#6718) 2025-05-29 00:03:22 -07:00
JieXin Liang
535c838674 [fix] more mem for draft_extend cuda_graph (#6726) 2025-05-28 23:25:18 -07:00
Ke Bao
631950280a Support EAGLE draft extend CUDA graph (#6606) 2025-05-27 02:35:17 -07:00
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>
Lifu Huang
477a101cbd Refactor LoRA handling to support adapter tensors in fused format (#6585) 2025-05-26 21:51:54 -07:00
fzyzcjy
fe386acae6 Automatically configure for EPLB-related args (#6628) 2025-05-26 08:42:49 -07:00
fzyzcjy
0ca1811715 Support fake perfectly balanced EP dispatch algorithm (#6571) 2025-05-25 22:35:51 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
Neo
2e37fa07ba [FIX]remove ServerArgs duplicate code (#6485) 2025-05-23 22:54:41 -07:00
miter
fefa19fec0 Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524) 2025-05-23 15:20:21 -07:00
Signed-off-by: miter <miterv@outlook.com>
HandH1998
1b2e8f76d9 [2/2] Support Qserve (#6521) 2025-05-23 12:39:18 -07:00
fzyzcjy
7a80f56513 Support dynamically rebalancing experts using EPLB (#6469) 2025-05-21 23:13:21 -07:00
fzyzcjy
9484eba4ad Support logging expert balancedness metrics (#6482) 2025-05-21 23:05:33 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
HAI
5c0b38f369 aiter attention-backend (default enabled on AMD/ROCm) (#6381) 2025-05-20 22:52:41 -07:00
fzyzcjy
e98afbe042 Support dispatching logical to physical experts (#6385) 2025-05-19 22:13:55 -07:00
fzyzcjy
f0653886a5 Expert distribution recording without overhead for EPLB (#4957) 2025-05-19 20:07:43 -07:00
Trevor Morris
7adf245ba2 [Metrics] Add KV events publishing (#6098) 2025-05-19 14:19:54 -07:00
fzyzcjy
fd08c04821 Support custom DeepEP tuning config (#6257) 2025-05-17 17:09:42 -07:00
Lianmin Zheng
e07a6977e7 Minor improvements of TokenizerManager / health check (#6327) 2025-05-15 15:29:25 -07:00
Yi Liu
cfc9f9ab8d Fix gpu mem check on CPU (#6317) 2025-05-15 09:37:45 -07:00
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Lianmin Zheng
d18c6b3358 Support incremental streaming of logprob/token_ids between scheduler and detokenizer (#6225) 2025-05-12 14:33:38 -07:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
Ying Sheng
bad7c26fdc [PP] Fix init_memory_pool desync & add PP for mixtral (#6223) 2025-05-12 12:38:09 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209) 2025-05-13 01:42:38 +08:00
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Lianmin Zheng
fba8eccd7e Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201) 2025-05-12 00:17:33 -07:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Cheng Wan
25c83fff6a Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558) 2025-05-11 23:36:29 -07:00
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Xiaoyu Zhang
9f2c9568f0 [doc] add a note for --n-share-experts-fusion args (#6154) 2025-05-11 23:18:38 -07:00