Commit Graph

63 Commits

Author SHA1 Message Date
Lianmin Zheng  86d10d220f  Update grok.py and tiktoken tokenizer (#9532)  2025-08-23 05:40:18 -07:00
Cheng Wan  c0fb25e949  DP Enhancement (#8280)  2025-07-24 21:36:21 -07:00
Lianmin Zheng  38af4f68a9  Fix grammar abort & Minor style fixes (#7204)  2025-06-14 22:49:41 -07:00
Ke Bao  799c4bb502  Fuse MLA set kv cache kernel (#5748)  2025-04-26 18:42:22 -07:00
Ke Bao  6b6e748775  Remove q concat in FA3 backend for DeepSeek decode (#5638)  2025-04-22 11:43:12 -07:00
woodx  3bface15e6  Feat/support encoder model (like bert) (#4887)  2025-04-17 01:50:48 -07:00
Yun Dai  2695ab0537  Fix loading KV quantization scale; Enable modelopt kv cache (#4686)  2025-04-08 09:11:35 -07:00
    Co-authored-by: qingquansong <ustcsqq@gmail.com>
Chang Su  f04c80dc42  Add Llama4 support (#5092)  2025-04-07 00:29:36 -07:00
    Co-authored-by: Cheng Wan <cwan39@gatech.edu>
    Co-authored-by: fzyzcjy <ch271828n@outlook.com>
    Co-authored-by: ispobock <ispobaoke@163.com>
Qubitium-ModelCloud  56a724eba3  [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790)  2025-03-05 01:11:00 -08:00
    Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
    Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
Lianmin Zheng  dc1881326f  Fix perf regression on small batch sizes (#3008)  2025-01-20 03:39:49 -08:00
bjmsong  0bb0f76311  Support FP8 E4M3 KV Cache (#2786)  2025-01-12 21:17:11 -08:00
    Co-authored-by: root <bjmsong@126.com>
Lianmin Zheng  641b7d0ae0  [Minor] Improve code style (#2422)  2024-12-09 06:30:35 -08:00
Ke Bao  ec52464dde  MLA prefill w/o weight absorption (#2349)  2024-12-05 01:50:28 +08:00
Xuehai Pan  62a4a339eb  docs: fix module docstrings and copyright headers (#2077)  2024-11-22 22:16:53 +08:00
Liangsheng Yin  100f5b8bc9  Simplify flashinfer dispatch (#1552)  2024-10-01 00:28:42 -07:00
Lianmin Zheng  36d5acfca5  Rename InputMetadata -> ForwardBatch (#1543)  2024-09-30 02:41:11 -07:00
Lianmin Zheng  fec185ce0c  Refactor attention backend (#1381)  2024-09-11 11:44:26 -07:00
Lianmin Zheng  46094e0c1b  Deprecate --disable-flashinfer and introduce --attention-backend (#1380)  2024-09-10 17:11:16 -07:00
Lianmin Zheng  3a6e8b6d78  [Minor] move triton attention kernels into a separate folder (#1379)  2024-09-10 15:15:08 -07:00
Liangsheng Yin  69b3bb9ae1  Unify forward mode (#1360)  2024-09-09 13:49:29 -07:00
Ke Bao  2c615d120f  [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204)  2024-08-25 17:38:11 -07:00
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
Ying Sheng  93d4e354d8  [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112)  2024-08-15 10:33:20 -07:00
Ying Sheng  14cb544d56  [Fix] fix flashinfer usage for window attention (#1107)  2024-08-15 00:53:24 -07:00
Ying Sheng  96a2093ef0  [Fix] Compatibility of window attention and cuda graph (#1090)  2024-08-14 10:37:01 -07:00
Ying Sheng  0909bb0d2f  [Feat] Add window attention for gemma-2 (#1056)  2024-08-13 17:01:26 -07:00
Lianmin Zheng  fb1f28cbbb  Clean up the comments and names under python/sglang/srt/layers (#1047)  2024-08-12 05:54:37 +00:00
Liangsheng Yin  87e8c090e9  Organize code (rename, movement) (#953)  2024-08-06 20:50:32 -07:00
Ke Bao  e1eae1fd15  Support MLA for DeepSeek-V2 with Triton - step 1 (#905)  2024-08-05 03:40:33 +10:00
Ying Sheng  e7487b08bc  Adjust default mem fraction to avoid OOM (#823)  2024-07-30 01:58:31 -07:00
Liangsheng Yin  cdcbde5fc3  Code structure refactor (#807)  2024-07-29 23:04:48 -07:00
Yineng Zhang  dd7e8b9421  chore: add copyright for srt (#790)  2024-07-28 23:07:12 +10:00
Lianmin Zheng  752e643007  Allow disabling flashinfer sampling kernel (#778)  2024-07-27 20:18:56 -07:00
Mingyi  a523a3c13a  Reduce hardcoded logic of kernel usage (#707)  2024-07-23 16:42:21 -07:00
Ying Sheng  cf99eab7d5  Fix flashinfer (#700)  2024-07-23 01:27:01 -07:00
Liangsheng Yin  69d19188fc  Decouple kv (#679)  2024-07-20 14:16:45 -07:00
Liangsheng Yin  a9ef49c12c  Detokenize incrementally when streaming (#653)  2024-07-18 17:57:40 -07:00
Liangsheng Yin  abd5385ac5  Move global_server_args_dict (#642)  2024-07-17 13:49:15 -07:00
Ying Sheng  5949b1ca0e  Fix memory pool index error (#616)  2024-07-13 16:45:11 -07:00
Liangsheng Yin  10143e1a5f  Memorypool chunked prefetch (#614)  2024-07-13 15:24:03 -07:00
Lianmin Zheng  396a69240f  Cleanup attention backend: flashinfer and triton (#611)  2024-07-12 18:21:11 -07:00
Lianmin Zheng  af4e7910e7  Clean up the usage of flashinfer (#610)  2024-07-12 13:00:03 -07:00
Lianmin Zheng  519e20cfda  Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py (#609)  2024-07-12 12:28:09 -07:00
Liangsheng Yin  f25b76c02a  add LogitsMetadata (#604)  2024-07-08 17:46:55 -07:00
Mingyi  f4e885b7c3  Reduce number of workspaces (#601)  2024-07-07 19:35:22 -07:00
Ying Sheng  dc1b8bcfaa  Format (#593)  2024-07-05 10:06:17 -07:00
Ying Sheng  5a57b8addd  Add Gemma2 (#592)  2024-07-05 09:48:54 -07:00
Ying Sheng  2a754e57b0  2x performance improvement for large prefill & Fix workspace conflicts (#579)  2024-07-03 16:14:57 -07:00
Ying Sheng  9380f50ff9  Turn on flashinfer by default (#578)  2024-07-02 02:25:07 -07:00
Yueyang Pan  d9ac639202  Fix flashinfer version (#576)  2024-07-01 22:08:39 -07:00
Lianmin Zheng  b7e2f800ac  Update flashinfer to 0.0.5 (#554)  2024-06-20 20:29:06 -07:00
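A listing in this Author / SHA1 / Message / Date shape can be reproduced locally with `git log` pretty-formatting. A minimal sketch, using a throwaway repository and a made-up commit so it runs anywhere (the author name and message here are illustrative, not taken from the history above):

```shell
# Sketch: produce an Author / SHA1 / Message / Date listing with `git log`.
# A temporary repo with one empty commit keeps the example self-contained.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.name='Example Author' -c user.email='author@example.com' \
    commit -q --allow-empty -m 'Update flashinfer to 0.0.5 (#554)'
# %an = author name, %h = abbreviated SHA, %s = subject, %ad = author date
out=$(git log --date='format:%Y-%m-%d %H:%M:%S' \
              --pretty='format:%an  %h  %s  %ad')
echo "$out"
```

Against a real checkout, the same `--pretty` format applied to the full history would yield the 63-commit listing above, one line per commit.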