Commit Graph

1120 Commits

Author  SHA1  Message  Date
Ke Wen  ece724910a  Make torch TP composable with torchao (#2436)  2024-12-11 04:21:42 -08:00
Ying Sheng  8586b72da0  [feat] Enable chunked prefill for llava-onevision (#2412)  2024-12-09 09:52:38 -08:00
Lianmin Zheng  641b7d0ae0  [Minor] Improve code style (#2422)  2024-12-09 06:30:35 -08:00
Lianmin Zheng  0ce091a82d  [Minor] Improve code style (#2419)  2024-12-09 03:05:59 -08:00
Lianmin Zheng  835f8afc77  Migrate llama_classification to use the /classify interface (#2417)  2024-12-08 23:30:51 -08:00
Xiaoyu Zhang  3844feb9bb  Add a unittest for fused_moe (#2416)  2024-12-08 22:46:10 -08:00
Byron Hsu  27f7bed7a7  reduce watchdog interval to 5s (#2410)  2024-12-08 21:17:31 -08:00
Lianmin Zheng  a6ca736c8e  Simplify stream_output (#2398)  2024-12-08 12:27:13 -08:00
Lianmin Zheng  cc858953a0  Fix recv_requests (#2405)  2024-12-08 04:08:04 -08:00
Yineng Zhang  6128f7cff5  fix: specify dtype with begin_forward aka plan (#2404)  2024-12-08 20:07:30 +08:00
Lianmin Zheng  a2486eb58f  Fix a bug with logprob streaming + chunked prefill (#2403)  2024-12-08 03:55:27 -08:00
Ke Bao  61dec545b0  Remove unused vars in the triton backend (#2401)  2024-12-08 03:37:03 -08:00
Ke Bao  7dc66fcb40  Optimize Triton decoding kernel for long context (#2394)  2024-12-08 01:17:37 -08:00
SangBin Cho  1f09e84b9a  nit: Remove busy waiting on scheduler (#2382)  2024-12-08 01:06:15 -08:00
Sangchun Ha (Patrick)  63dfab1bea  Fix shape error that occurred when loading lora weight of gemma2 model. (#2330)  2024-12-08 01:04:08 -08:00
Yineng Zhang  75ae968959  minor: update killall script (#2391)  2024-12-08 04:21:00 +08:00
HAI  95f93f493a  Fp8 MoE optimizations on AMD (#2388)  2024-12-07 21:18:26 +08:00
Yineng Zhang  aaac33fd8d  fix: update xgrammar v0.1.6 (#2390)  2024-12-07 21:09:16 +08:00
Yineng Zhang  d332aa3b0c  fix: resolve fp8 moe issue (#2387)  2024-12-07 19:28:53 +08:00
Lianmin Zheng  e5f227c0ee  Release v0.4.0.post1 (#2375)  2024-12-06 06:08:19 -08:00
Lianmin Zheng  0e7409adb6  Fix the overlap for xgrammar (#2377)  2024-12-06 05:49:29 -08:00
Lianmin Zheng  f5b2a3aa67  Use proc.join instead of busy waiting (#2374)  2024-12-06 02:01:23 -08:00
Qun Yang  37ee906f61  Add more support for intel Gaudi accelerators (#2357)  2024-12-06 01:16:33 -08:00
Xiaoyu Zhang  34b364e073  optimize cuda graph max_bs_settings on low-end gpus (#2360)  2024-12-06 01:13:04 -08:00
Yineng Zhang  84d96b3ae5  Move FP8 to SGLang (#2370) (Co-authored-by: HaiShaw <hixiao@gmail.com>)  2024-12-06 15:42:10 +08:00
xiaobochen  3d32e4a32c  Resubmit MoE-EP (#2371)  2024-12-06 15:05:21 +08:00
Lianmin Zheng  71e2a27753  Fix the cuda graph capture range for small #max-running-requests (#2359)  2024-12-06 14:13:57 +08:00
Ke Bao  4a63c181f1  Fix AWQ with enable MLA (#2364)  2024-12-06 00:46:48 +08:00
Lianmin Zheng  2b0fc5941d  [Minor] Code style improvements (#2355)  2024-12-04 19:02:08 -08:00
Jerry Zhang  9cc733b38c  move apply_torchao_config_ to model_runner (#2342)  2024-12-04 17:26:42 -08:00
Ke Wen  d693ec0427  Make torch TP composable with torch.compile (#2352)  2024-12-04 17:26:00 -08:00
Chayenne  786be44da5  Fix Docs CI When Compile Error (#2323)  2024-12-04 11:19:46 -08:00
Yineng Zhang  2db4469808  minor: limit the range of vllm versions (#2350)  2024-12-05 02:00:34 +08:00
Ata Fatahi  ed45e509df  Check gpu availability at server args creation (#2340) (Signed-off-by: Ata Fatahi <immrata@gmail.com>)  2024-12-05 01:53:02 +08:00
Ke Bao  ec52464dde  MLA prefill w/o weight absorption (#2349)  2024-12-05 01:50:28 +08:00
HAI  b2986d7aa5  Adding SGLang FP8 Utils (#2348)  2024-12-04 03:01:33 -08:00
Yineng Zhang  f8b0326934  chore: bump v0.4.0 (#2338)  2024-12-03 11:55:41 -08:00
Lianmin Zheng  1228f7ca69  Fix gptq for moe layers (#2300) (Co-authored-by: root <me@zhyncs.com>)  2024-12-03 23:12:33 +08:00
Lianmin Zheng  07ec07ad1f  Improve torch compile for fused moe (#2327)  2024-12-03 01:58:25 -08:00
Ying Sheng  aa47f64223  Revert "[feat] Enable chunked prefill for llava-onevision" (#2329)  2024-12-02 23:11:13 -08:00
Lianmin Zheng  3ddb1c4679  [Minor] Fix logger and style (#2325)  2024-12-02 20:45:53 -08:00
Ying Sheng  480e38a733  [feat] Enable chunked prefill for llava-onevision (#2281)  2024-12-02 20:19:02 -08:00
HAI  69e2d4fb66  Relax to include more AMD GPUs (#2319)  2024-12-02 19:05:58 -08:00
Yineng Zhang  85e1a6f3aa  Update model_loader deps and qqq quantization deps (#2220) (#2318) (Co-authored-by: HandH1998 <1335248067@qq.com>)  2024-12-02 23:22:13 +08:00
Lianmin Zheng  18108abe5d  [Minor] Fix code style (#2311)  2024-12-02 02:27:36 -08:00
HAI  c54bda300a  Use rocminfo instead of rocm-smi for more OS/WSL support (#2310)  2024-12-02 00:15:45 -08:00
Lianmin Zheng  3c79ad35ca  [Fix] Fix the padded hash value for image tokens (#2309)  2024-12-01 23:36:28 -08:00
Chayenne  983bfcf386  Online weight updates from torch.distributed (#2279)  2024-12-01 23:23:18 -08:00
Lianmin Zheng  5c18a03733  Fix logprob for completions (#2301)  2024-12-01 05:17:05 -08:00
Qun Yang  62c516ac45  Add a simple torch native attention backend (#2241)  2024-12-01 03:01:25 -08:00