sglang

Author	SHA1	Message	Date
Jerry Zhang	a7c47e0f02	Add torchao quant (int4/int8/fp8) to llama models (#1341 ) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2024-09-09 05:32:41 -07:00
Lianmin Zheng	e4d68afcf0	[Minor] Many cleanup (#1357 )	2024-09-09 04:14:11 -07:00
Kai-Hsun Chen	c9b75917d5	[server] Passing `model_override_args` to `launch_server` via the CLI. (#1298 ) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>	2024-09-09 02:14:25 -07:00
Kaichen Zhang - NTU	662ecd9368	[Feat] Add modalities for vision server when handling pixel values for llava (#1346 )	2024-09-09 02:07:34 -07:00
Byron Hsu	8e6bdf851c	[triton] Support head_dim not 2^n in triton extend and decode attention (#1281 )	2024-09-09 01:30:24 -07:00
Liangsheng Yin	05bea6883c	Fix some online scheduling delay (#1345 )	2024-09-07 20:46:27 -07:00
Liangsheng Yin	ab4a83b259	Optimize schedule (#1339 )	2024-09-05 14:30:26 -07:00
Lianmin Zheng	eda7c09048	Remove useless fields in global_config.py (#1328 )	2024-09-04 05:37:32 -07:00
Yineng Zhang	a63c8275c6	chore: bump v0.3.0 (#1320 )	2024-09-04 04:32:18 +08:00
Yineng Zhang	dc67d97693	misc: speedup load safetensors (#1319 ) Co-authored-by: ispobock <ISPObaoke@163.com>	2024-09-04 04:29:53 +10:00
Lianmin Zheng	1e495e0847	[Fix] Fix select by ensuring each request has at least one token (#1318 )	2024-09-03 06:31:45 -07:00
Lianmin Zheng	12cb115d38	Fix llama2 weight loader (#1317 )	2024-09-03 05:32:14 -07:00
Jani Monoses	474317f2b6	Support Phi3 mini and medium (#1299 )	2024-09-02 21:49:40 -07:00
Lianmin Zheng	f64eae3a29	[Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308 )	2024-09-02 21:44:45 -07:00
Liangsheng Yin	a5a134f39f	Fix bugs in sampler with CUDA graph / torch.compile (#1306 )	2024-09-02 23:18:48 +00:00
Yineng Zhang	2561ed012c	feat: update nightly gsm8k eval (#1304 )	2024-09-03 01:18:41 +10:00
Lianmin Zheng	9999442756	Release v0.2.15 (#1295 )	2024-09-01 22:22:38 -07:00
Max Shawabkeh	6def9b018c	Fix hang when doing s += None. (#1297 ) Co-authored-by: max99x <mshawabkeh@jamandtea.studio>	2024-09-01 21:56:33 -07:00
Liangsheng Yin	47f20da223	Fix regex mask (#1296 )	2024-09-01 21:50:58 -07:00
Yineng Zhang	9b0805242e	fix: resolve fp8 for mixtral (#1290 )	2024-09-02 00:29:06 +10:00
Enrique Shockwave	32a4141d5a	Allow new lines during JSON generation (#1277 )	2024-09-01 03:42:29 -07:00
Kai-Hsun Chen	0836055324	[Chore] Rename model_overide_args to model_override_args (#1284 ) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-09-01 03:14:56 -07:00
Byron Hsu	00b19f198f	[triton] Remove the zero initialization of qk_acc by directly writing the result (#1288 )	2024-09-01 03:12:06 -07:00
Ke Bao	6cb32ef92c	Support Triton fp8 e5m2 kv cache (#1286 ) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-09-01 19:46:40 +10:00
Lianmin Zheng	761b2cebd6	[CI] merge all ci tests into one file (#1289 )	2024-09-01 02:36:56 -07:00
Yineng Zhang	54772f784a	feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 (#1285 ) Co-authored-by: ispobock <ispobaoke@163.com>	2024-09-01 17:28:06 +10:00
Lianmin Zheng	1b5d56f7f8	[CI] Add more multi-gpu tests (#1280 )	2024-09-01 00:27:25 -07:00
xiaobochen	d134c139a1	Optimize the update flashinfer indices (#1262 )	2024-08-31 23:40:28 -07:00
Yineng Zhang	52cefdbf57	fix: resolve the fp8 bug introduced by vLLM 0.5.5 (#1276 )	2024-09-01 00:44:29 +10:00
Christopher Chou	51c554d812	Allow more flexible assistant and system response (#1256 )	2024-08-30 11:51:44 -07:00
Lianmin Zheng	79ece2c51f	Report median instead of mean in bench_latency.py (#1269 )	2024-08-30 06:05:01 -07:00
김종곤	b7f8341014	EXAONE 3.0 Model Support (#1258 ) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-30 08:08:28 +00:00
Ke Bao	f414352ae6	Transpose mla weight offline (#1261 ) Co-authored-by: Yineng Zhang <me@zhyncs.com>	2024-08-30 16:45:40 +10:00
lxww302	a362340b33	fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader (#1260 )	2024-08-30 16:43:41 +10:00
Liangsheng Yin	381dd57bd6	Sampler cudagraph (#1253 )	2024-08-28 18:58:52 -07:00
Zhiqiang Xie	8153168c96	fix data racing due to mutable reference using deepcopy (#1255 )	2024-08-28 18:57:54 -07:00
Enrique Shockwave	6c34d6339c	make json_schema usable from gen (#1254 )	2024-08-28 18:57:10 -07:00
Yineng Zhang	13ac95b894	chore: bump v0.2.14.post2 (#1250 )	2024-08-28 18:46:33 +00:00
Yineng Zhang	492143bf32	fix: resolve qwen2 moe weight loader (#1252 )	2024-08-28 11:25:46 -07:00
Lianmin Zheng	0a97d7962d	[Fix] Fix OOM in llava base class (#1249 )	2024-08-28 08:45:49 -07:00
Yineng Zhang	c411f32e1c	feat: replace GeluAndMul (#1234 )	2024-08-28 14:07:02 +00:00
Lianmin Zheng	bf53bf5142	[Fix] Fix llava on multi images (#1247 )	2024-08-28 06:33:05 -07:00
Yineng Zhang	b1a540ec42	feat: update GemmaRMSNorm (#1232 )	2024-08-28 22:47:34 +10:00
Yineng Zhang	66975360e7	fix: increase max_new_tokens when testing generation models (#1244 )	2024-08-28 22:12:36 +10:00
Yineng Zhang	f25f4dfde5	hotfix: revert sampler CUDA Graph (#1242 )	2024-08-28 21:16:47 +10:00
Yineng Zhang	198974cd1a	feat: support sm75 with FlashInfer v0.1.6 (#1233 )	2024-08-28 18:39:12 +10:00
Lianmin Zheng	6cc38b2bf3	[Minor] Add more type annotations (#1237 )	2024-08-28 00:54:26 -07:00
Liangsheng Yin	1ece2cda3d	Fix bench latency benchmark (#1225 )	2024-08-28 00:37:32 -07:00
Yineng Zhang	3602692c7c	feat: replace get_act_fn for gpt_bigcode (#1231 )	2024-08-27 21:15:31 +10:00
havetc	909f34363b	[FIX] Wrong logger (#1230 )	2024-08-27 20:10:46 +10:00

621 Commits