sglang

Author	SHA1	Message	Date
Chang Su	066cf44546	[OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397)	2025-05-18 13:05:38 -07:00
JieXin Liang	1f30c05d4a	[fix] fix fa3 forward_decode with spec_decode (#6395) Co-authored-by: Stefan He <hebiaobuaa@gmail.com>	2025-05-18 12:50:15 -07:00
Chang Su	205d5cb407	perf: Optimize local attention memory allocation in FlashAttentionBackend (#6356)	2025-05-17 01:45:46 -07:00
Chang Su	912788c095	perf: optimize local_block_table memory allocation (#6273)	2025-05-13 17:18:38 -07:00
applesaucethebun	2ce8793519	Add typo checker in pre-commit (#6179) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>	2025-05-11 12:55:00 +08:00
Chang Su	1940cdec61	[Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (#6162)	2025-05-09 15:33:02 -07:00
Minglei Zhu	79961afa82	optimize pad operations in fa3 to accelarate 100+us (#6077)	2025-05-07 23:40:08 -07:00
Lifu Huang	1acca3a2c6	FA3 speed up: skip len operation and get batch size directly from forward batch (#5969) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>	2025-05-02 00:26:12 -07:00
Stefan He	6fc175968c	Optimize a pad operation to accelerate 25us (#5945)	2025-05-01 10:48:55 -07:00
Ke Bao	799c4bb502	Fuse MLA set kv cache kernel (#5748)	2025-04-26 18:42:22 -07:00
Ke Bao	6b6e748775	Remove q concat in FA3 backend for DeepSeek decode (#5638)	2025-04-22 11:43:12 -07:00
Qingquan Song	188f0955fa	Add Speculative Decoding Eagle3 topk > 1 (#5318) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com>	2025-04-20 22:58:28 -07:00
Chang Su	c776234b45	Enable local attention during decode (#5479)	2025-04-17 02:07:43 -07:00
Baizhou Zhang	4fb05583ef	Deprecate disable-mla (#5481)	2025-04-17 01:43:14 -07:00
Baizhou Zhang	a42736bbb8	Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113)	2025-04-15 22:01:22 -07:00
Yineng Zhang	fa909dc3c4	feat: update model_specific_adjustment (#5344) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>	2025-04-15 14:45:15 -07:00
Yineng Zhang	57de7c6b5f	feat: use fa3 mla by default on hopper (#5210) Co-authored-by: yundai424 <yundai424@gmail.com> Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>	2025-04-12 01:09:25 -07:00
Qingquan Song	aea98512a8	Fix fa3 window size setup (#5316)	2025-04-11 23:37:52 -07:00
Chang Su	aee62d744b	Optimize GPU memory usage in FlashAttentionBackend's strided indexing (#5262) Co-authored-by: ch-wan <cwan39@gatech.edu>	2025-04-11 00:34:17 -07:00
Chang Su	aac531c53b	[Bugfix] Fix index out of bounds in local attention with large sequences (#5173)	2025-04-08 18:43:13 -07:00
Yun Dai	2695ab0537	Fix loading KV quantization scale; Enable modelopt kv cache (#4686) Co-authored-by: qingquansong <ustcsqq@gmail.com>	2025-04-08 09:11:35 -07:00
Yubo Wang	804d9f2e4c	Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760)	2025-04-07 23:20:51 -07:00
Chunan Zeng	a7c3f74bec	[FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103)	2025-04-07 22:58:08 -07:00
Stefan He	93470a1411	Refactor and Optimize FA3 Code (#5090) Co-authored-by: Qingquan Song <ustcsqq@gmail.com>	2025-04-07 11:52:42 -07:00
Chang Su	f04c80dc42	Add Llama4 support (#5092) Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: ispobock <ispobaoke@163.com>	2025-04-07 00:29:36 -07:00
Stefan He	ca8d02abd5	FA3 Spec Decoding to support top k = 1 and add cuda graph support (#5050) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Chunan Zeng <zcnrex@gmail.com>	2025-04-04 23:03:59 -07:00
Qingquan Song	e983e43248	Add Eagle Speculative Decoding to FA3 Backend (#4951) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: zcnrex <zcnrex@gmail.com>	2025-04-02 13:09:02 -07:00
Yineng Zhang	1c63e79756	use fa3 in sgl-kernel (#4954)	2025-03-31 16:14:49 -07:00
Baizhou Zhang	e62d60fe6d	[Fix] avoid stream sync and torch compile in prefill for fa3 backend (#4932)	2025-03-30 13:53:44 -07:00
Baizhou Zhang	20c90be23d	[Feature] Support FA3 backend for MLA (#4831)	2025-03-28 18:30:14 -07:00
Qingquan Song	6ffb6bd47a	Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (#4855)	2025-03-28 01:35:59 -07:00
Stefan He	26c0f13126	Support Page Size > 1 for FA3 (#4832) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-03-27 22:07:14 -07:00
Stefan He	1b9175cb23	[FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 (#4745) Co-authored-by: Yubo Wang <yubowang2019@gmail.com>	2025-03-27 00:45:11 -07:00
Stefan He	5d7edc8e55	Support FA3 as Attention backend by using `--attention-backend fa3` (#4680) Co-authored-by: qsong <qsong@linkedin.com> Co-authored-by: qingquansong <ustcsqq@gmail.com>	2025-03-23 23:28:11 -07:00

34 Commits