sglang

Author	SHA1	Message	Date
Ke Bao	6b6e748775	Remove q concat in FA3 backend for DeepSeek decode (#5638)	2025-04-22 11:43:12 -07:00
JieXin Liang	917324862e	[fix] reduce dp capture bs (#5634) Co-authored-by: alcanerian <alcanerian@gmail.com>	2025-04-22 11:08:45 -07:00
lukec	2ed96c7a8a	fix flashmla bug (#5272)	2025-04-22 10:36:23 -07:00
saltyfish66	2aa3f5e2d0	[feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (#5641) Co-authored-by: yuethe <yuethe@tencent.com>	2025-04-22 09:33:13 -07:00
lambert0312	76d17c7ecb	Fix shared experts fusion error without quantization (#5632)	2025-04-22 09:22:26 -07:00
Connector Switch	70d040f904	[NFC] Remove duplicate `compressed-tensors` (#5640)	2025-04-22 09:10:25 -07:00
JieXin Liang	4418f599a5	Fix FA3 DeepSeek prefill performance regression (#5624) Co-authored-by: ispobock <ispobaoke@gmail.com>	2025-04-22 01:41:41 -07:00
Yineng Zhang	04f2abcb34	fix: gemma 3 not use softcap (#5622)	2025-04-22 01:16:08 -07:00
JieXin Liang	506be6b892	[fix] fix compile_deep_gemm missing kv_b_proj (#5620)	2025-04-22 00:06:36 -07:00
JieXin Liang	2343d8df7d	[fix] force use deepgemm in compile_deep_gemm (#5618)	2025-04-21 21:36:02 -07:00
Ke Bao	11b23ae97b	Remove extra copy in deepseek forward absorb (#5578) Co-authored-by: saienduri <saimanas.enduri@amd.com>	2025-04-21 19:33:21 -07:00
Yineng Zhang	b9c87e781d	chore: bump v0.4.5.post3 (#5611)	2025-04-21 18:16:20 -07:00
michael-amd	968ef51562	Support aiter RMSNorm in AMD (#5510) Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>	2025-04-21 17:40:39 -07:00
Lianmin Zheng	1343200299	Clean up mem settings (#5610)	2025-04-21 17:19:00 -07:00
JieXin Liang	c2942907d5	[feature] enable pre compile jit deep_gemm (#5580)	2025-04-21 16:52:53 -07:00
Liangsheng Yin	e69a219074	Enhance GPU memory settings (#5604)	2025-04-21 15:15:00 -07:00
Byron Hsu	bf98d2e377	[PD] Support prefill overlap + Ensure no race condition (#5609)	2025-04-21 12:12:56 -07:00
Byron Hsu	e65b9f21e3	[PD] Support decode overlap schedule (#5608)	2025-04-21 12:06:16 -07:00
Trevor Morris	4dce1cc608	[PD] Add NIXL transfer backend (#5477)	2025-04-22 01:36:12 +08:00
Byron Hsu	deded17f38	[PD] Fix edge case and simplify large page size + chunked prefill (#5589)	2025-04-21 10:27:02 -07:00
shangmingc	f29a718f63	[PD] Fix generate endpoint of min_lb for PD (#5598) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>	2025-04-21 21:39:18 +08:00
Yongtong Wu	3f57b00a59	Support PD bootstrap fields on /v1/chat/completions endpoint (#5488)	2025-04-21 01:10:58 -07:00
fzyzcjy	453d412cdb	Tiny update error hint (#5037)	2025-04-21 00:47:47 -07:00
fzyzcjy	dc86f25a57	Tiny remove duplicated code (#5021)	2025-04-21 00:47:32 -07:00
Chuyue Sun	08289eaa3e	Support o1 model on Azure (#4980) Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>	2025-04-21 00:46:09 -07:00
Lucius	3b6d539f63	[Fix] Enhance DP Attention for IPv6 Compatibility (#4937)	2025-04-21 00:44:11 -07:00
lambert0312	c44f2869c9	Modify metrics service endpoint (#3443)	2025-04-21 00:35:38 -07:00
fzyzcjy	685d8980c3	Tiny add warning when cannot recognize bool env var (#5348)	2025-04-20 23:11:29 -07:00
Zhiqiang Xie	70645f4d7d	upstream hicache fixes (#5570)	2025-04-20 23:08:30 -07:00
Qingquan Song	188f0955fa	Add Speculative Decoding Eagle3 topk > 1 (#5318) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com>	2025-04-20 22:58:28 -07:00
Lianmin Zheng	eef9433b46	Fix flush cache (#5590)	2025-04-20 22:56:40 -07:00
JieXin Liang	97cb762bb6	[misc] remove is_cuda_available (#5319)	2025-04-20 18:16:51 -07:00
fzyzcjy	1195182040	Tiny add Engine.flush_cache API (#5241)	2025-04-20 18:15:03 -07:00
fzyzcjy	5239d79568	Speedup shared expert weight construction by avoid cloning (#5188)	2025-04-20 18:12:01 -07:00
Sundara Raman Ramachandran	f08154193c	Perform Batch Tokenization. (#5141)	2025-04-20 18:10:37 -07:00
fzyzcjy	5fc4b6004e	Add sanity check for max_running_requests (#5016)	2025-04-20 17:56:49 -07:00
Brayden Zhong	b868526d94	Fix one more issue reported by torchfix (#4859)	2025-04-20 17:49:27 -07:00
Juwan Yoo	502524e2da	compressed_tensors: port w8a16 fp8 from vllm (#4852)	2025-04-20 17:48:31 -07:00
Enrique Shockwave	4c7640079c	check marlin format before attempting conversion (#4675)	2025-04-20 17:47:09 -07:00
kyle-pena-kuzco	9f3bd2ad39	Feat: Implement JSON Mode (response_format.type="json_object") (#4733) Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>	2025-04-20 17:41:22 -07:00
Yi Zhou	fac17acf08	add function call parser for DeepSeek V3 (#5224)	2025-04-20 17:38:08 -07:00
Adarsh Shirawalmath	8b39274e34	[Feature] Prefill assistant response - add continue_final_message parameter (#4226) Co-authored-by: Chayenne <zhaochen20@outlook.com>	2025-04-20 17:37:18 -07:00
Byron Hsu	c951d312ed	[PD] Fix large page size + chunk prefill (#5588)	2025-04-20 17:21:54 -07:00
AmadeusW	dcb8232596	Fix ChatCompletionMessageGenericParam to allow for None content (#5452)	2025-04-20 17:15:38 -07:00
Yineng Zhang	66c0ff9e31	fix: use fa3 for gemma2 (#5586)	2025-04-20 17:02:09 -07:00
tarinkk	9a7e83e899	Fix enable chunked prefill for Llama4 (#5575)	2025-04-20 17:01:30 -07:00
lukec	417b44eba8	[Feat] upgrade pytorch2.6 (#5417)	2025-04-20 16:06:34 -07:00
fzyzcjy	475e2e378a	[PD] Fix server crash when using batch requests (#5531)	2025-04-20 16:02:23 -07:00
fzyzcjy	fba86b6b54	Tiny improve error message (#5526)	2025-04-20 16:00:15 -07:00
fzyzcjy	fa2f677e18	Fix torch memory saver not enabled in DP scenario (#5560)	2025-04-20 14:20:52 -07:00

1975 Commits