Commit Graph

51 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |
| Lianmin Zheng | 2fa5cec775 | Simplify sampler and its error handling (#1441) | 2024-09-16 21:23:31 -07:00 |
| Lianmin Zheng | 27b557aea7 | Clean up model loader (#1440) | 2024-09-16 18:16:27 -07:00 |
| Liangsheng Yin | 70b6802982 | Optimize conflicts between CUDA graph and vocab mask tensors (#1392) | 2024-09-13 20:27:53 -07:00 |
| Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00 |
| Lianmin Zheng | 3efa798116 | Support cuda graph in the triton attention backend (#1401) | 2024-09-12 00:36:55 -07:00 |
| Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Lianmin Zheng | 3a6e8b6d78 | [Minor] move triton attention kernels into a separate folder (#1379) | 2024-09-10 15:15:08 -07:00 |
| Liangsheng Yin | 69b3bb9ae1 | Unify forward mode (#1360) | 2024-09-09 13:49:29 -07:00 |
| Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>) | 2024-09-09 05:32:41 -07:00 |
| Lianmin Zheng | e4d68afcf0 | [Minor] Many cleanup (#1357) | 2024-09-09 04:14:11 -07:00 |
| Yineng Zhang | dc67d97693 | misc: speedup load safetensors (#1319) (Co-authored-by: ispobock <ISPObaoke@163.com>) | 2024-09-04 04:29:53 +10:00 |
| Lianmin Zheng | f64eae3a29 | [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308) | 2024-09-02 21:44:45 -07:00 |
| Liangsheng Yin | a5a134f39f | Fix bugs in sampler with CUDA graph / torch.compile (#1306) | 2024-09-02 23:18:48 +00:00 |
| Kai-Hsun Chen | 0836055324 | [Chore] Rename model_overide_args to model_override_args (#1284) (Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-09-01 03:14:56 -07:00 |
| Ke Bao | 6cb32ef92c | Support Triton fp8 e5m2 kv cache (#1286) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-09-01 19:46:40 +10:00 |
| Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00 |
| Lianmin Zheng | bf53bf5142 | [Fix] Fix llava on multi images (#1247) | 2024-08-28 06:33:05 -07:00 |
| Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00 |
| Yineng Zhang | 198974cd1a | feat: support sm75 with FlashInfer v0.1.6 (#1233) | 2024-08-28 18:39:12 +10:00 |
| Yineng Zhang | c5fe11a8e1 | chore: bump v0.2.14 (#1155) | 2024-08-27 00:28:24 +10:00 |
| Liangsheng Yin | 75ce37f401 | Move sampler into CUDA graph (#1201) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-26 07:02:50 -07:00 |
| Ke Bao | 2c615d120f | [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-25 17:38:11 -07:00 |
| Lianmin Zheng | 902278008a | [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) | 2024-08-25 14:46:34 -07:00 |
| Chayenne | 30b4f771b0 | Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-08-25 10:29:12 -07:00 |
| Lianmin Zheng | f6af3a6561 | Cleanup readme, llava examples, usage examples and nccl init (#1194) | 2024-08-24 08:02:23 -07:00 |
| Liangsheng Yin | 83e23c69b3 | Improve code style of sampler (#1168) | 2024-08-21 16:48:24 -07:00 |
| Lianmin Zheng | bea2bb9eea | Improve multi-node stability (#1171) | 2024-08-20 22:35:05 -07:00 |
| Shan Yu | cd10654e7e | [Feat] Support update weights without restart server (#1157) | 2024-08-20 13:48:24 -07:00 |
| Xu-Chen | ff2cfdb1a2 | [Feature] add disable-custom-all-reduce (#1148) (Co-authored-by: chenxu02 <chenxu02@zhihu.com>; Yineng Zhang <me@zhyncs.com>) | 2024-08-20 08:44:12 -07:00 |
| Lianmin Zheng | a8ae640328 | Improve docs and warnings (#1164) | 2024-08-20 08:31:29 -07:00 |
| Yineng Zhang | 9208591f05 | fix: use fp16 dtype for sm75 (#1136) | 2024-08-18 00:45:42 +10:00 |
| Lianmin Zheng | 5a261bd055 | Fix the deadlock in multi-node tp (#1122) | 2024-08-16 01:39:24 -07:00 |
| Ying Sheng | 93d4e354d8 | [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) | 2024-08-15 10:33:20 -07:00 |
| Yineng Zhang | 9195d1362a | misc: rm unused model_loader (#1110) | 2024-08-15 08:29:35 -07:00 |
| Ying Sheng | 14cb544d56 | [Fix] fix flashinfer usage for window attention (#1107) | 2024-08-15 00:53:24 -07:00 |
| Ying Sheng | 8d2d876fc8 | [Fix] fix the typo bug for window attention (#1106) | 2024-08-14 21:56:01 -07:00 |
| Lianmin Zheng | 326df4bab2 | Use a single workspace for flashinfer (#1077) | 2024-08-14 19:25:37 -07:00 |
| Ying Sheng | 96a2093ef0 | [Fix] Compatibility of window attention and cuda graph (#1090) | 2024-08-14 10:37:01 -07:00 |
| Lianmin Zheng | a59636bb5e | Update grok 1 model (#1095) | 2024-08-14 04:40:44 -07:00 |
| Ying Sheng | 0909bb0d2f | [Feat] Add window attention for gemma-2 (#1056) | 2024-08-13 17:01:26 -07:00 |
| Ying Sheng | e040a2450b | Add e5-mistral embedding model - step 3/3 (#988) | 2024-08-08 16:31:19 -07:00 |
| Liangsheng Yin | 1ac304eeb4 | Adjust InputeMetadata and ScheduleBatch (#981) | 2024-08-08 01:11:22 -07:00 |
| Liangsheng Yin | f724f1f1e9 | PrefillAdder abstraction (#968) | 2024-08-07 13:47:28 -07:00 |
| Liangsheng Yin | 87e8c090e9 | Organize code (rename, movement) (#953) | 2024-08-06 20:50:32 -07:00 |
| Ke Bao | e1eae1fd15 | Support MLA for DeepSeek-V2 with Triton - step 1 (#905) | 2024-08-05 03:40:33 +10:00 |
| Yineng Zhang | 6b8f66efe1 | misc: update cuda graph capture exception log (#894) | 2024-08-03 00:40:52 +10:00 |
| Liangsheng Yin | 6b0f2e9088 | Add --max-total-tokens (#840) | 2024-07-30 13:33:55 -07:00 |
| Ying Sheng | e7487b08bc | Adjust default mem fraction to avoid OOM (#823) | 2024-07-30 01:58:31 -07:00 |