sglang

Author	SHA1	Message	Date
Liangsheng Yin	1ece2cda3d	Fix bench latency benchmark (#1225)	2024-08-28 00:37:32 -07:00
Yineng Zhang	3602692c7c	feat: replace get_act_fn for gpt_bigcode (#1231)	2024-08-27 21:15:31 +10:00
havetc	909f34363b	[FIX] Wrong logger (#1230)	2024-08-27 20:10:46 +10:00
caiyueliang	2f1d92834f	[FEAT] Support batches cancel (#1222) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-26 23:28:26 +00:00
havetc	9935f97b3e	[FEAT] JSON constrained support (#1125) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-26 09:37:26 -07:00
Yineng Zhang	c5fe11a8e1	chore: bump v0.2.14 (#1155)	2024-08-27 00:28:24 +10:00
Liangsheng Yin	75ce37f401	Move sampler into CUDA graph (#1201) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-26 07:02:50 -07:00
Mingyi	97589a60a2	[CI] Parallelize unit tests in CI (#1219)	2024-08-26 04:54:02 +00:00
Liangsheng Yin	632d506d0b	minor: improve CI and dependencies (#1212)	2024-08-26 04:26:31 +00:00
Kaichen Zhang - NTU	3579162ab1	[Fix] Multi-images loading error (#1218)	2024-08-26 03:58:51 +00:00
Mingyi	7514b9f8d3	[CI] Fix CI (#1217)	2024-08-26 02:56:42 +00:00
Mingyi	158e8f1e2d	improve the threshold and ports in tests (#1215)	2024-08-25 19:02:08 -07:00
Ke Bao	2c615d120f	[Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-25 17:38:11 -07:00
Lianmin Zheng	15f1a49d2d	Update CI workflows (#1210)	2024-08-25 16:43:07 -07:00
Ying Sheng	308d024092	[CI] Fix the issue of unit test hanging (#1211)	2024-08-25 16:21:37 -07:00
Lianmin Zheng	902278008a	[Minor] Improve the function organization in TokenizerManager & improve loggers (#1208)	2024-08-25 14:46:34 -07:00
Chayenne	30b4f771b0	Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) Co-authored-by: Ying Sheng <sqy1415@gmail.com>	2024-08-25 10:29:12 -07:00
Kaichen Zhang - NTU	66e7dcaf70	[Fix] Fixing the multi-images error for llava-onevision (#1205)	2024-08-25 10:28:23 -07:00
Ying Sheng	1cb4da5c5f	[Fix] the issue of random order when input is a list (#1199)	2024-08-24 21:43:03 -07:00
Ying Sheng	e61d13acdf	[CI] Fix the problem of hf runner too slow (#1202)	2024-08-24 18:35:55 -07:00
Lianmin Zheng	f6af3a6561	Cleanup readme, llava examples, usage examples and nccl init (#1194)	2024-08-24 08:02:23 -07:00
Yineng Zhang	c9064e6fd9	feat: use gelu_tanh_and_mul (#1193)	2024-08-24 01:58:16 -07:00
Kaichen Zhang - NTU	a5b14ad043	[Feat/WIP] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. (#1123) Co-authored-by: Bo Li <drluodian@gmail.com>	2024-08-23 14:11:16 -07:00
Ying Sheng	5fafcac008	Fix benchmark script (#1185)	2024-08-22 09:03:25 +00:00
Liangsheng Yin	364d3d72a7	Fix broken penalty (#1184)	2024-08-22 08:16:35 +00:00
Lianmin Zheng	5623826f73	[Minor] Improve logging and rename the health check endpoint name (#1180)	2024-08-21 19:24:36 -07:00
Liangsheng Yin	83e23c69b3	Improve code style of sampler (#1168)	2024-08-21 16:48:24 -07:00
intervitens	068e9eae55	Support min-p sampling (#1167)	2024-08-21 22:49:32 +00:00
rainred	d6aeb9fa15	[Feature] Add a function to convert sampling_params to kwargs (#1170) Co-authored-by: lzhang <zhanglei@modelbest.cn>	2024-08-21 14:28:35 -07:00
Yineng Zhang	1fb9459908	fix: custom op fallback forward native when lower sm80 (#1177)	2024-08-21 14:26:35 -07:00
Lianmin Zheng	bea2bb9eea	Improve multi-node stability (#1171)	2024-08-20 22:35:05 -07:00
Shan Yu	cd10654e7e	[Feat] Support update weights without restart server (#1157)	2024-08-20 13:48:24 -07:00
Lucien	6242c399ab	Generate 1 token to verify the health of the inference service in /health (#1154) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-21 03:14:34 +10:00
Yineng Zhang	04707b09b7	misc: add hypervisor vendor (#1165)	2024-08-21 02:14:51 +10:00
Xu-Chen	ff2cfdb1a2	[Feature] add disable-custom-all-reduce (#1148) Co-authored-by: chenxu02 <chenxu02@zhihu.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-20 08:44:12 -07:00
Lianmin Zheng	a8ae640328	Improve docs and warnings (#1164)	2024-08-20 08:31:29 -07:00
Juwan Yoo	d8476818ef	feat: allow streaming for multi-prompt and/or parallel sampling (#1134)	2024-08-20 08:06:55 -07:00
Ke Bao	df191254ab	Optimize MLA/GQA/MQA Triton decoding (#1138) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-19 20:23:07 +10:00
yichuan~	b997a18d74	[Feat]Add support for optional start len of logprobs (#1035) Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>	2024-08-18 23:45:41 -07:00
min-xu-et	fa13b95d6b	fixed a typo (#1143)	2024-08-18 14:29:09 -07:00
Lianmin Zheng	3c1f5a9220	Fix duplicated imports in hf_transformers_utils.py (#1141)	2024-08-17 18:03:00 -07:00
Lianmin Zheng	57d0bd91ec	Improve benchmark (#1140)	2024-08-17 17:43:23 -07:00
Lianmin Zheng	cdc8d60752	Improve the code style: more comments and remove useless packages (#1139)	2024-08-17 14:37:52 -07:00
Yineng Zhang	9208591f05	fix: use fp16 dtype for sm75 (#1136)	2024-08-18 00:45:42 +10:00
Liangsheng Yin	f624f6a6cc	Fix port conflicts between local CI and runner CI. (#1131)	2024-08-16 15:12:38 -07:00
Liangsheng Yin	3694f8f996	Mixed style of chunked prefill (#1013)	2024-08-16 09:13:00 +00:00
Lianmin Zheng	5a261bd055	Fix the deadlock in multi-node tp (#1122)	2024-08-16 01:39:24 -07:00
Yineng Zhang	5bd953749b	chore: bump v0.2.13 (#1111)	2024-08-16 03:50:43 +10:00
Lianmin Zheng	0cb099e20a	set CUDA_DEVICE_MAX_CONNECTIONS=1 (#1113)	2024-08-16 03:47:39 +10:00
Ying Sheng	93d4e354d8	[Fix] Window attention compatible with RadixAttention and chunked prefill (#1112)	2024-08-15 10:33:20 -07:00

574 Commits