Commit Graph

204 Commits

Author SHA1 Message Date
Baizhou Zhang
bf203cb7a2 [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (#5992) 2025-05-04 09:49:13 -07:00
Lifu Huang
1acca3a2c6 FA3 speed up: skip len operation and get batch size directly from forward batch (#5969)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-02 00:26:12 -07:00
Stefan He
6fc175968c Optimize a pad operation to save ~25 µs (#5945) 2025-05-01 10:48:55 -07:00
ryang
4322c31e24 Support XiaomiMiMo/MiMo model inference (#5921) 2025-05-01 07:41:13 -07:00
Baizhou Zhang
799789afed Bump Flashinfer to 0.2.5 (#5870)
Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>
2025-04-29 19:50:57 -07:00
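The bump above pins the tree to FlashInfer 0.2.5. A minimal sketch of matching that pin in a local environment (assuming the PyPI `flashinfer-python` distribution; verify the exact package name and any CUDA/torch-specific wheel index against the repo's own install docs):

```
# Pin FlashInfer to the version named in commit 799789afed.
# Package name and plain-PyPI install are assumptions; some setups
# instead install from a CUDA/torch-specific wheel index.
pip install flashinfer-python==0.2.5
```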
pengcuo
8e5a6d3441 [Fix] Fix a bug for flashmla to run R1 model (#5875)
Co-authored-by: pengcuo <dgpengcuo@gmail.com>
2025-04-29 01:03:13 -07:00
Trevor Morris
8d463fe351 Cutlass MLA decode - fix dtype error (#5868) 2025-04-28 21:12:58 -07:00
Trevor Morris
84810da4ae Add Cutlass MLA attention backend (#5390) 2025-04-27 20:58:53 -07:00
Ke Bao
799c4bb502 Fuse MLA set kv cache kernel (#5748) 2025-04-26 18:42:22 -07:00
liwenju0
4d1e52abea Add an assertion to enhance the robustness of the operator (#5736) 2025-04-26 18:09:12 -07:00
Ke Bao
6b6e748775 Remove q concat in FA3 backend for DeepSeek decode (#5638) 2025-04-22 11:43:12 -07:00
lukec
2ed96c7a8a fix flashmla bug (#5272) 2025-04-22 10:36:23 -07:00
Qingquan Song
188f0955fa Add Speculative Decoding Eagle3 topk > 1 (#5318)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
2025-04-20 22:58:28 -07:00
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
Yineng Zhang
a6f892e5d0 Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) 2025-04-18 16:50:21 -07:00
yhyang201
4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) 2025-04-18 09:51:29 -07:00
Wenxuan Tan
bfa3922451 Avoid computing lse in Ragged Prefill when there's no prefix. (#5476)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-04-18 01:13:57 -07:00
Chang Su
c776234b45 Enable local attention during decode (#5479) 2025-04-17 02:07:43 -07:00
woodx
3bface15e6 Feat/support encoder model (like bert) (#4887) 2025-04-17 01:50:48 -07:00
Baizhou Zhang
4fb05583ef Deprecate disable-mla (#5481) 2025-04-17 01:43:14 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
Yineng Zhang
fa909dc3c4 feat: update model_specific_adjustment (#5344)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-15 14:45:15 -07:00
Yineng Zhang
57de7c6b5f feat: use fa3 mla by default on hopper (#5210)
Co-authored-by: yundai424 <yundai424@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-12 01:09:25 -07:00
Qingquan Song
aea98512a8 Fix fa3 window size setup (#5316) 2025-04-11 23:37:52 -07:00
Mick
e53a0b3d5b [fix] fix mrope positions not picked up (#5265) 2025-04-11 01:29:45 -07:00
Chang Su
aee62d744b Optimize GPU memory usage in FlashAttentionBackend's strided indexing (#5262)
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-04-11 00:34:17 -07:00
Chang Su
aac531c53b [Bugfix] Fix index out of bounds in local attention with large sequences (#5173) 2025-04-08 18:43:13 -07:00
Yun Dai
2695ab0537 Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-04-08 09:11:35 -07:00
Yubo Wang
804d9f2e4c Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760) 2025-04-07 23:20:51 -07:00
Chunan Zeng
a7c3f74bec [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103) 2025-04-07 22:58:08 -07:00
Stefan He
93470a1411 Refactor and Optimize FA3 Code (#5090)
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
2025-04-07 11:52:42 -07:00
Chang Su
f04c80dc42 Add Llama4 support (#5092)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@163.com>
2025-04-07 00:29:36 -07:00
Baizhou Zhang
efbae697b3 [Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052) 2025-04-05 01:23:02 -07:00
Stefan He
ca8d02abd5 FA3 Spec Decoding to support top k = 1 and add cuda graph support (#5050)
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>
2025-04-04 23:03:59 -07:00
Lianmin Zheng
74885a848b Revert "Replace enable_flashinfer_mla argument with attention_backend" (#5048) 2025-04-03 13:30:56 -07:00
Baizhou Zhang
e8999b13b7 Replace enable_flashinfer_mla argument with attention_backend (#5005) 2025-04-03 02:53:58 -07:00
Qingquan Song
e983e43248 Add Eagle Speculative Decoding to FA3 Backend (#4951)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>
2025-04-02 13:09:02 -07:00
Yineng Zhang
1c63e79756 use fa3 in sgl-kernel (#4954) 2025-03-31 16:14:49 -07:00
Baizhou Zhang
e62d60fe6d [Fix] avoid stream sync and torch compile in prefill for fa3 backend (#4932) 2025-03-30 13:53:44 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
Baizhou Zhang
20c90be23d [Feature] Support FA3 backend for MLA (#4831) 2025-03-28 18:30:14 -07:00
Qingquan Song
6ffb6bd47a Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (#4855) 2025-03-28 01:35:59 -07:00
Stefan He
26c0f13126 Support Page Size > 1 for FA3 (#4832)
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-03-27 22:07:14 -07:00
Stefan He
1b9175cb23 [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 (#4745)
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
2025-03-27 00:45:11 -07:00
Yi Pan
45fdf1f7f3 Fix shared memory OOM on sm86 GPUs. (#4797) 2025-03-26 10:41:53 -07:00
Thysrael
ced35a0649 fix(typo): fix reply to replay in base_attn_backend.py (#4784) 2025-03-26 00:19:12 -07:00
lukec
57eec0bfbc fix FlashMLA cudagraph config (#4691)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-24 21:06:58 -07:00
Stefan He
5d7edc8e55 Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
Co-authored-by: qsong <qsong@linkedin.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-03-23 23:28:11 -07:00
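The commit above makes FA3 a selectable attention backend via the flag named in its title. A launch sketch (the model path and port are illustrative placeholders, not from the log; per later commits in this history, FA3 targets Hopper-class GPUs):

```
# Start an SGLang server using the FlashAttention 3 backend.
# --attention-backend fa3 comes from commit 5d7edc8e55; the model
# path and port below are hypothetical placeholders.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --attention-backend fa3 \
  --port 30000
```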
Mick
11577cedb7 refactor: bug fixes and refactor for vlm (#4661) 2025-03-22 22:48:49 -07:00
JieXin Liang
9e93ef3f8e [fix] fix illegal mem access and clean up triton attention backend (#4571) 2025-03-20 02:01:52 -07:00