Commit Graph

176 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Lianmin Zheng | dafb6a5266 | [Fix] Fix the style of test_large_max_new_tokens.py (#1638) | 2024-10-11 16:05:58 -07:00 |
| Byron Hsu | 862cd265e5 | [engine] support async and streaming (#1614) | 2024-10-11 15:26:25 -07:00 |
| Lianmin Zheng | 5d09ca5735 | Fix constrained decoding (#1634) | 2024-10-11 06:26:20 -07:00 |
| Lianmin Zheng | aba9eae4c6 | Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py (#1631) | 2024-10-11 05:03:20 -07:00 |
| Byron Hsu | e8613df071 | [Engine] Fix generate hanging issue after the first call (#1606) | 2024-10-08 04:26:56 +00:00 |
| Ke Bao | 68f8b60d22 | Fix chunked prefill condition (#1594) | 2024-10-07 06:34:14 +00:00 |
| Byron Hsu | 551a3a9d38 | Provide an offline engine API (#1567) | 2024-10-06 20:27:03 -07:00 |
| Byron Hsu | 17e998f1a8 | Test consistency for single and batch separately (#1590) | 2024-10-06 22:02:27 +00:00 |
| Ying Sheng | c98e84c21e | [Minor, Performance] Use torch.argmax for greedy sampling (#1589) | 2024-10-06 13:15:05 -07:00 |
| Lianmin Zheng | 9244f27f0a | [Minor] Improve the style and fix flaky tests (#1584) | 2024-10-06 00:10:48 -07:00 |
| Byron Hsu | 2422de5193 | Support min_tokens in sgl.gen (#1573) | 2024-10-05 21:51:12 -07:00 |
| Ying Sheng | 04b262cd91 | [Fix] Fix major performance bug in certain cases (#1563); Co-authored-by: hnyls2002 <hnyls2002@gmail.com> | 2024-10-04 08:51:11 +00:00 |
| Lianmin Zheng | 32eb6e96f2 | Organize sampling batch info better (#1562) | 2024-10-03 18:29:49 -07:00 |
| Minsang Song | e6852b0dd2 | [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' (#1536); Co-authored-by: Ying Sheng <sqy1415@gmail.com> | 2024-10-02 20:41:15 -07:00 |
| Theresa Barton | 2c7d0a5b8b | [Fix] Fix all the Huggingface paths (#1553) | 2024-10-02 10:12:07 -07:00 |
| Liangsheng Yin | 99ec439da4 | Organize Attention Backends (#1547) | 2024-09-30 15:54:18 -07:00 |
| Ying Sheng | 0f4fb19bc8 | [Fix, LoRA] fix LoRA with updates in main (#1545) | 2024-09-30 10:06:08 -07:00 |
| Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00 |
| Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00 |
| Ying Sheng | 9aa6553d2a | [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) | 2024-09-27 23:32:11 -07:00 |
| TianyiQ | 3c93187caf | Add support for tie_word_embeddings when loading weights + support for SmolLM (#1508) | 2024-09-24 21:50:20 -07:00 |
| Lianmin Zheng | fb2d0680e0 | [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) | 2024-09-24 21:37:33 -07:00 |
| Lianmin Zheng | 28b4d8e144 | Update test_srt_backend.py (#1502) | 2024-09-24 03:17:10 -07:00 |
| Yineng Zhang | 42a2d82ba7 | minor: add mla fp8 test (#1494) | 2024-09-23 20:40:17 +08:00 |
| Ying Sheng | e4780cf839 | [API, Feature] Support response prefill for openai API (#1490) | 2024-09-22 06:46:17 -07:00 |
| Lianmin Zheng | 13f1357ef0 | Add a unit test for data parallelism (#1489) | 2024-09-22 02:21:05 -07:00 |
| Lianmin Zheng | 167591e864 | Better unit tests for adding a new model (#1488) | 2024-09-22 01:50:37 -07:00 |
| Ke Bao | b8ccaf4d73 | Add MLA gsm8k eval (#1484) | 2024-09-21 11:16:13 +08:00 |
| Ke Bao | a68cb201dd | Fix triton head num (#1482) | 2024-09-21 10:25:20 +08:00 |
| Yineng Zhang | a6db88626e | minor: add quant eval compared with base (#1475) | 2024-09-20 01:57:19 +08:00 |
| Lianmin Zheng | 1acccb364a | Fix oom issues with fp8 for llama (#1454) | 2024-09-18 03:45:19 -07:00 |
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |
| Lianmin Zheng | 27b557aea7 | Clean up model loader (#1440) | 2024-09-16 18:16:27 -07:00 |
| Lianmin Zheng | 9ba1f09760 | [Fix] Fix logprob and normalized_logprob (#1428) | 2024-09-15 06:36:06 -07:00 |
| Lianmin Zheng | 9463bc1385 | Enable torch.compile for triton backend (#1422) | 2024-09-14 15:38:37 -07:00 |
| Yineng Zhang | e3fc4658f4 | fix: resolve nightly eval (#1426) | 2024-09-15 02:07:52 +10:00 |
| Ke Bao | 33b54e7c40 | Add pytorch sampling backend ut (#1425) | 2024-09-15 01:15:30 +10:00 |
| Lianmin Zheng | ad0ff62a4c | Balance test in CI (#1411) | 2024-09-12 23:29:44 -07:00 |
| Lianmin Zheng | 68be2f6d3b | [CI] Include triton backend and online serving benchmark into CI (#1408) | 2024-09-12 21:36:41 -07:00 |
| Ying Sheng | eb02c1618a | [Minor, CI] remove lora test from minimal suite (#1406) | 2024-09-12 16:49:50 -07:00 |
| Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00 |
| Lianmin Zheng | 3efa798116 | Support cuda graph in the triton attention backend (#1401) | 2024-09-12 00:36:55 -07:00 |
| Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00 |
| Byron Hsu | 8c0efa514d | remove assertion in triton attention and add a unit test (#1385) | 2024-09-11 03:22:07 -07:00 |
| Liangsheng Yin | 144bc70fcc | Organize flashinfer indices update (#1378) | 2024-09-10 17:38:59 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Lianmin Zheng | 6c7cb90365 | [Minor] improve kill scripts and torchao import (#1375) | 2024-09-11 04:27:03 +10:00 |
| zifeitong | 9144ed1067 | Support OpenAI API json_schema response format (#1363) | 2024-09-09 19:08:25 -07:00 |
| Ying Sheng | 689ff588ec | [CI] Return output logprobs in unit test (#1361) | 2024-09-09 13:05:13 -07:00 |
| Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2024-09-09 05:32:41 -07:00 |
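
A few of the commits above introduce user-facing APIs. The sketch below illustrates the offline engine API provided in #1567 together with the async and streaming support added in #1614. It is a rough usage sketch based on SGLang's documented `sglang.Engine` interface from this period, not code taken from the commits; the model path, prompts, and sampling parameters are placeholders.

```python
# Hedged sketch of the offline engine API (#1567) with the async/streaming
# support from #1614. Model path and sampling parameters are placeholders.
import asyncio

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
sampling_params = {"temperature": 0, "max_new_tokens": 32}

# Synchronous batch generation: one call, a list of output dicts.
prompts = ["The capital of France is", "1 + 1 ="]
for out in llm.generate(prompts, sampling_params):
    print(out["text"])

# Async streaming generation (the #1614 additions). Depending on the
# version, each chunk's "text" field may hold the cumulative output so far.
async def stream_one(prompt: str) -> None:
    generator = await llm.async_generate(prompt, sampling_params, stream=True)
    async for chunk in generator:
        print(chunk["text"], end="", flush=True)
    print()

asyncio.run(stream_one("Write a one-line summary of KV cache reuse."))
llm.shutdown()
```

Similarly, #1573 adds a `min_tokens` argument to `sgl.gen` in the frontend language, setting a floor on the number of tokens generated before the model may stop. A minimal sketch, assuming an SGLang server already running at the default local port; the prompt and token limits are illustrative:

```python
import sglang as sgl

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\n"
    # min_tokens (added in #1573) forces at least 8 generated tokens;
    # max_tokens still caps the length. Values here are illustrative.
    s += "A: " + sgl.gen("answer", min_tokens=8, max_tokens=64)

# Assumes a server launched via sglang.launch_server on port 30000.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = answer.run(question="What is chunked prefill?")
print(state["answer"])
```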