Commit Graph

176 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Lianmin Zheng | dafb6a5266 | [Fix] Fix the style of test_large_max_new_tokens.py (#1638) | 2024-10-11 16:05:58 -07:00 |
| Byron Hsu | 862cd265e5 | [engine] support async and streaming (#1614) | 2024-10-11 15:26:25 -07:00 |
| Lianmin Zheng | 5d09ca5735 | Fix constrained decoding (#1634) | 2024-10-11 06:26:20 -07:00 |
| Lianmin Zheng | aba9eae4c6 | Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py (#1631) | 2024-10-11 05:03:20 -07:00 |
| Byron Hsu | e8613df071 | [Engine] Fix generate hanging issue after the first call (#1606) | 2024-10-08 04:26:56 +00:00 |
| Ke Bao | 68f8b60d22 | Fix chunked prefill condition (#1594) | 2024-10-07 06:34:14 +00:00 |
| Byron Hsu | 551a3a9d38 | Provide an offline engine API (#1567) | 2024-10-06 20:27:03 -07:00 |
| Byron Hsu | 17e998f1a8 | Test consistency for single and batch separately (#1590) | 2024-10-06 22:02:27 +00:00 |
| Ying Sheng | c98e84c21e | [Minor, Performance] Use torch.argmax for greedy sampling (#1589) | 2024-10-06 13:15:05 -07:00 |
| Lianmin Zheng | 9244f27f0a | [Minor] Improve the style and fix flaky tests (#1584) | 2024-10-06 00:10:48 -07:00 |
| Byron Hsu | 2422de5193 | Support min_tokens in sgl.gen (#1573) | 2024-10-05 21:51:12 -07:00 |
| Ying Sheng | 04b262cd91 | [Fix] Fix major performance bug in certain cases (#1563); Co-authored-by: hnyls2002 <hnyls2002@gmail.com> | 2024-10-04 08:51:11 +00:00 |
| Lianmin Zheng | 32eb6e96f2 | Organize sampling batch info better (#1562) | 2024-10-03 18:29:49 -07:00 |
| Minsang Song | e6852b0dd2 | [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' (#1536); Co-authored-by: Ying Sheng <sqy1415@gmail.com> | 2024-10-02 20:41:15 -07:00 |
| Theresa Barton | 2c7d0a5b8b | [Fix] Fix all the Huggingface paths (#1553) | 2024-10-02 10:12:07 -07:00 |
| Liangsheng Yin | 99ec439da4 | Organize Attention Backends (#1547) | 2024-09-30 15:54:18 -07:00 |
| Ying Sheng | 0f4fb19bc8 | [Fix, LoRA] fix LoRA with updates in main (#1545) | 2024-09-30 10:06:08 -07:00 |
| Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00 |
| Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00 |
| Ying Sheng | 9aa6553d2a | [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) | 2024-09-27 23:32:11 -07:00 |
| TianyiQ | 3c93187caf | Add support for tie_word_embeddings when loading weights + support for SmolLM (#1508) | 2024-09-24 21:50:20 -07:00 |
| Lianmin Zheng | fb2d0680e0 | [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) | 2024-09-24 21:37:33 -07:00 |
| Lianmin Zheng | 28b4d8e144 | Update test_srt_backend.py (#1502) | 2024-09-24 03:17:10 -07:00 |
| Yineng Zhang | 42a2d82ba7 | minor: add mla fp8 test (#1494) | 2024-09-23 20:40:17 +08:00 |
| Ying Sheng | e4780cf839 | [API, Feature] Support response prefill for openai API (#1490) | 2024-09-22 06:46:17 -07:00 |
| Lianmin Zheng | 13f1357ef0 | Add a unit test for data parallelism (#1489) | 2024-09-22 02:21:05 -07:00 |
| Lianmin Zheng | 167591e864 | Better unit tests for adding a new model (#1488) | 2024-09-22 01:50:37 -07:00 |
| Ke Bao | b8ccaf4d73 | Add MLA gsm8k eval (#1484) | 2024-09-21 11:16:13 +08:00 |
| Ke Bao | a68cb201dd | Fix triton head num (#1482) | 2024-09-21 10:25:20 +08:00 |
| Yineng Zhang | a6db88626e | minor: add quant eval compared with base (#1475) | 2024-09-20 01:57:19 +08:00 |
| Lianmin Zheng | 1acccb364a | Fix oom issues with fp8 for llama (#1454) | 2024-09-18 03:45:19 -07:00 |
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |
| Lianmin Zheng | 27b557aea7 | Clean up model loader (#1440) | 2024-09-16 18:16:27 -07:00 |
| Lianmin Zheng | 9ba1f09760 | [Fix] Fix logprob and normalized_logprob (#1428) | 2024-09-15 06:36:06 -07:00 |
| Lianmin Zheng | 9463bc1385 | Enable torch.compile for triton backend (#1422) | 2024-09-14 15:38:37 -07:00 |
| Yineng Zhang | e3fc4658f4 | fix: resolve nightly eval (#1426) | 2024-09-15 02:07:52 +10:00 |
| Ke Bao | 33b54e7c40 | Add pytorch sampling backend ut (#1425) | 2024-09-15 01:15:30 +10:00 |
| Lianmin Zheng | ad0ff62a4c | Balance test in CI (#1411) | 2024-09-12 23:29:44 -07:00 |
| Lianmin Zheng | 68be2f6d3b | [CI] Include triton backend and online serving benchmark into CI (#1408) | 2024-09-12 21:36:41 -07:00 |
| Ying Sheng | eb02c1618a | [Minor, CI] remove lora test from minimal suite (#1406) | 2024-09-12 16:49:50 -07:00 |
| Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00 |
| Lianmin Zheng | 3efa798116 | Support cuda graph in the triton attention backend (#1401) | 2024-09-12 00:36:55 -07:00 |
| Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00 |
| Byron Hsu | 8c0efa514d | remove assertion in triton attention and add a unit test (#1385) | 2024-09-11 03:22:07 -07:00 |
| Liangsheng Yin | 144bc70fcc | Organize flashinfer indices update (#1378) | 2024-09-10 17:38:59 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Lianmin Zheng | 6c7cb90365 | [Minor] improve kill scripts and torchao import (#1375) | 2024-09-11 04:27:03 +10:00 |
| zifeitong | 9144ed1067 | Support OpenAI API json_schema response format (#1363) | 2024-09-09 19:08:25 -07:00 |
| Ying Sheng | 689ff588ec | [CI] Return output logprobs in unit test (#1361) | 2024-09-09 13:05:13 -07:00 |
| Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2024-09-09 05:32:41 -07:00 |
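
A few of the commits above introduce user-facing APIs. The sketch below illustrates the offline engine API provided in #1567 together with the async and streaming support added in #1614. It is a rough usage sketch based on SGLang's documented `sglang.Engine` interface from this period, not code taken from the commits; the model path, prompts, and sampling parameters are placeholders.

```python
# Hedged sketch of the offline engine API (#1567) with the async/streaming
# support from #1614. Model path and sampling parameters are placeholders.
import asyncio

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
sampling_params = {"temperature": 0, "max_new_tokens": 32}

# Synchronous batch generation: one call, a list of output dicts.
prompts = ["The capital of France is", "1 + 1 ="]
for out in llm.generate(prompts, sampling_params):
    print(out["text"])

# Async streaming generation (the #1614 additions). Depending on the
# version, each chunk's "text" field may hold the cumulative output so far.
async def stream_one(prompt: str) -> None:
    generator = await llm.async_generate(prompt, sampling_params, stream=True)
    async for chunk in generator:
        print(chunk["text"], end="", flush=True)
    print()

asyncio.run(stream_one("Write a one-line summary of KV cache reuse."))
llm.shutdown()
```

Similarly, #1573 adds a `min_tokens` argument to `sgl.gen` in the frontend language, setting a floor on the number of tokens generated before the model may stop. A minimal sketch, assuming an SGLang server already running at the default local port; the prompt and token limits are illustrative:

```python
import sglang as sgl

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\n"
    # min_tokens (added in #1573) forces at least 8 generated tokens;
    # max_tokens still caps the length. Values here are illustrative.
    s += "A: " + sgl.gen("answer", min_tokens=8, max_tokens=64)

# Assumes a server launched via sglang.launch_server on port 30000.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = answer.run(question="What is chunked prefill?")
print(state["answer"])
```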