Commit Graph

165 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Lianmin Zheng | a509552087 | [minor] Improve code style and compatibility (#1961) | 2024-11-08 02:19:41 -08:00 |
| Lianmin Zheng | 0abbf289a8 | Unify the model type checking (#1905) | 2024-11-03 12:25:39 -08:00 |
| Lianmin Zheng | b548801ddb | Update docs (#1839) | 2024-10-30 02:49:08 -07:00 |
| Lianmin Zheng | 86e0dde555 | Improve the user control of new_token_ratio (#1811) | 2024-10-26 16:39:41 -07:00 |
| Lianmin Zheng | 2b80978859 | Provide an argument to set the maximum batch size for cuda graph (#1809) | 2024-10-26 15:09:33 -07:00 |
| Lianmin Zheng | e646c5901e | Fix logprob in the overlapped mode (#1795) | 2024-10-25 11:06:57 -07:00 |
| yizhang2077 | def55bc876 | Qwen2vl support cuda graph and disable radix cache (#1780) | 2024-10-25 10:45:17 -04:00 |
| Lianmin Zheng | 86a2c473b7 | [Fix] Fix seq_lens_sum for cuda graph runner in padded cases (#1789) | 2024-10-24 21:26:05 -07:00 |
| Lianmin Zheng | 384d85ba35 | Re-introduce get_cuda_graph_seq_len_fill_value (#1783) | 2024-10-24 13:30:11 -07:00 |
| Lianmin Zheng | fc82f5a743 | [Fix] Fix cuda graph padding for triton attention backend (#1782) | 2024-10-24 12:33:15 -07:00 |
| Lianmin Zheng | 0089c4bc96 | [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer (#1779) | 2024-10-24 04:16:59 -07:00 |
| Lianmin Zheng | 05b3bf5e8e | Crash the server on warnings in CI (#1772) | 2024-10-23 16:27:13 -07:00 |
| Lianmin Zheng | ad4125d1a9 | Fuse more ops & Simplify token mapping (#1758) | 2024-10-22 23:20:43 -07:00 |
| Liangsheng Yin | 94cde10920 | Llama3.2 vision model support (#1551) | 2024-10-21 15:01:21 -07:00 |
| Lianmin Zheng | 09603c6dc9 | Maintain seq_lens_sum to make more FlashInfer operations non-blocking (#1741) | 2024-10-21 01:43:16 -07:00 |
| Lianmin Zheng | b121bc03a3 | Simplify batch result resolution (#1735) | 2024-10-20 19:47:14 -07:00 |
| yizhang2077 | 554fbf93cd | [Bugfix] qwen2vl forward_extend (#1727) | 2024-10-20 02:38:35 -07:00 |
| Lianmin Zheng | b48edff67f | Split the overlapped version of TpModelWorkerClient into a separate file (#1726) | 2024-10-20 00:29:29 -07:00 |
| Lianmin Zheng | 59cbf47626 | Unify the memory pool api and tp worker API (#1724) | 2024-10-19 23:19:26 -07:00 |
| Yineng Zhang | cbbc82b7b8 | Support qwen2 vl model (#1721) (Co-authored-by: yizhang2077 `<1109276519@qq.com>`, ispobock `<ISPObaoke@163.com>`) | 2024-10-19 21:44:38 -07:00 |
| Yineng Zhang | 8bee20f80b | Update vllm to 0.6.3 (#1711) (#1720) (Co-authored-by: Ke Bao `<ISPObaoke@163.com>`) | 2024-10-19 20:45:41 -07:00 |
| Lianmin Zheng | f0f8a7699b | Simplify the nan detection and greedy check in sampler (#1709) | 2024-10-18 20:21:24 -07:00 |
| Lianmin Zheng | 2bcfba1b08 | Skip unnecessary penalizer (#1707) | 2024-10-18 17:54:03 -07:00 |
| Lianmin Zheng | 392f2863c8 | Add dtype for more operations (#1705) | 2024-10-18 12:18:15 -07:00 |
| Lianmin Zheng | 6d0fa73ece | Simplify flashinfer utilities (#1704) | 2024-10-17 22:54:14 -07:00 |
| Shuo Yang | 061e546313 | Support double sparsity (#1459) | 2024-10-14 02:00:41 -07:00 |
| Lianmin Zheng | 9da5a60b18 | Add an option to disable penalizer (#1651) | 2024-10-12 17:53:23 -07:00 |
| Zhang, Liangang | 5d638c92f5 | [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480) | 2024-10-12 18:10:32 +00:00 |
| Lianmin Zheng | 23cc66f7b6 | Add back data parallelism (#1635) | 2024-10-11 07:22:48 -07:00 |
| Zhang, Liangang | 8275049ce3 | Add device support (#1607) | 2024-10-11 02:05:58 -07:00 |
| Amos You | c996e8ccd4 | [Minor] Fix logging typo (#1615) | 2024-10-08 21:11:19 -07:00 |
| Lianmin Zheng | 45473d4b2b | Make input_ids a torch.Tensor (#1568) | 2024-10-04 01:09:59 -07:00 |
| Lianmin Zheng | 32eb6e96f2 | Organize sampling batch info better (#1562) | 2024-10-03 18:29:49 -07:00 |
| Lianmin Zheng | 4ae0969c0a | Move status check in the memory pool to CPU (#1557) | 2024-10-02 18:23:35 -07:00 |
| Liangsheng Yin | 100f5b8bc9 | Simplify flashinfer dispatch (#1552) | 2024-10-01 00:28:42 -07:00 |
| Liangsheng Yin | 99ec439da4 | Organize Attention Backends (#1547) | 2024-09-30 15:54:18 -07:00 |
| Lianmin Zheng | 63ba2f8d7b | Clean up batch data structures: Introducing ModelWorkerBatch (#1544) | 2024-09-30 06:41:49 -07:00 |
| Lianmin Zheng | 36d5acfca5 | Rename InputMetadata -> ForwardBatch (#1543) | 2024-09-30 02:41:11 -07:00 |
| Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00 |
| Lianmin Zheng | f86c1e611f | Move scheduler code from tp_worker.py to scheduler.py (#1538) | 2024-09-29 17:42:45 -07:00 |
| Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00 |
| Liangsheng Yin | fd9ad817ec | Organize image inputs (#1531) | 2024-09-29 06:28:55 +00:00 |
| Lianmin Zheng | 9ae1db0bdc | [Fix] Ignore import error (#1513) | 2024-09-25 11:32:21 -07:00 |
| Ke Bao | 8d4ed42ad5 | MoE torch compile (#1497) | 2024-09-24 01:46:59 -07:00 |
| Lianmin Zheng | 2854a5ea9f | Fix the overhead due to penalizer in bench_latency (#1496) | 2024-09-23 07:38:14 -07:00 |
| Lianmin Zheng | 39bb49d156 | Update dockerfile to include datamodel_code_generator (#1492) | 2024-09-22 04:49:16 -07:00 |
| Lianmin Zheng | 2d346a57c2 | Fix padding in the cuda graph (#1469) | 2024-09-19 01:52:15 -07:00 |
| Lianmin Zheng | 7f24ea95c3 | Fuse top_k and top_k in the sampler (#1457) | 2024-09-18 04:35:35 -07:00 |
| Ke Bao | b3710d2c93 | Fix attention backend (#1448) | 2024-09-17 14:07:53 +00:00 |
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |