Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00
Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00
Ying Sheng | 9aa6553d2a | [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) | 2024-09-27 23:32:11 -07:00
TianyiQ | 3c93187caf | Add support for tie_word_embeddings when loading weights + support for SmolLM (#1508) | 2024-09-24 21:50:20 -07:00
Lianmin Zheng | fb2d0680e0 | [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) | 2024-09-24 21:37:33 -07:00
Lianmin Zheng | 28b4d8e144 | Update test_srt_backend.py (#1502) | 2024-09-24 03:17:10 -07:00
Yineng Zhang | 42a2d82ba7 | minor: add mla fp8 test (#1494) | 2024-09-23 20:40:17 +08:00
Ying Sheng | e4780cf839 | [API, Feature] Support response prefill for openai API (#1490) | 2024-09-22 06:46:17 -07:00
Lianmin Zheng | 13f1357ef0 | Add a unit test for data parallelism (#1489) | 2024-09-22 02:21:05 -07:00
Lianmin Zheng | 167591e864 | Better unit tests for adding a new model (#1488) | 2024-09-22 01:50:37 -07:00
Ke Bao | b8ccaf4d73 | Add MLA gsm8k eval (#1484) | 2024-09-21 11:16:13 +08:00
Ke Bao | a68cb201dd | Fix triton head num (#1482) | 2024-09-21 10:25:20 +08:00
Yineng Zhang | a6db88626e | minor: add quant eval compared with base (#1475) | 2024-09-20 01:57:19 +08:00
Lianmin Zheng | 1acccb364a | Fix oom issues with fp8 for llama (#1454) | 2024-09-18 03:45:19 -07:00
Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00
Lianmin Zheng | 27b557aea7 | Clean up model loader (#1440) | 2024-09-16 18:16:27 -07:00
Lianmin Zheng | 9ba1f09760 | [Fix] Fix logprob and normalized_logprob (#1428) | 2024-09-15 06:36:06 -07:00
Lianmin Zheng | 9463bc1385 | Enable torch.compile for triton backend (#1422) | 2024-09-14 15:38:37 -07:00
Yineng Zhang | e3fc4658f4 | fix: resolve nightly eval (#1426) | 2024-09-15 02:07:52 +10:00
Ke Bao | 33b54e7c40 | Add pytorch sampling backend ut (#1425) | 2024-09-15 01:15:30 +10:00
Lianmin Zheng | ad0ff62a4c | Balance test in CI (#1411) | 2024-09-12 23:29:44 -07:00
Lianmin Zheng | 68be2f6d3b | [CI] Include triton backend and online serving benchmark into CI (#1408) | 2024-09-12 21:36:41 -07:00
Ying Sheng | eb02c1618a | [Minor, CI] remove lora test from minimal suite (#1406) | 2024-09-12 16:49:50 -07:00
Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00
Lianmin Zheng | 3efa798116 | Support cuda graph in the triton attention backend (#1401) | 2024-09-12 00:36:55 -07:00
Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00
Byron Hsu | 8c0efa514d | remove assertion in triton attention and add an unit test (#1385) | 2024-09-11 03:22:07 -07:00
Liangsheng Yin | 144bc70fcc | Organize flashinfer indices update (#1378) | 2024-09-10 17:38:59 -07:00
Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00
Lianmin Zheng | 6c7cb90365 | [Minor] improve kill scripts and torchao import (#1375) | 2024-09-11 04:27:03 +10:00
zifeitong | 9144ed1067 | Support OpenAI API json_schema response format (#1363) | 2024-09-09 19:08:25 -07:00
Ying Sheng | 689ff588ec | [CI] Return output logprobs in unit test (#1361) | 2024-09-09 13:05:13 -07:00
Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>) | 2024-09-09 05:32:41 -07:00
Lianmin Zheng | e4d68afcf0 | [Minor] Many cleanup (#1357) | 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen | c9b75917d5 | [server] Passing model_override_args to launch_server via the CLI. (#1298) (Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>) | 2024-09-09 02:14:25 -07:00
Kaichen Zhang - NTU | 662ecd9368 | [Feat] Add modalities for vision server when handling pixel values for llava (#1346) | 2024-09-09 02:07:34 -07:00
Lianmin Zheng | 843e63d809 | Fix the flaky test test_moe_eval_accuracy_large.py (#1326) | 2024-09-04 04:15:11 -07:00
Lianmin Zheng | 1e495e0847 | [Fix] Fix select by ensuring each request has at least one token (#1318) | 2024-09-03 06:31:45 -07:00
Yineng Zhang | 2561ed012c | feat: update nightly gsm8k eval (#1304) | 2024-09-03 01:18:41 +10:00
Lianmin Zheng | 58fa607622 | Fix the flaky tests in test_moe_eval_accuracy_large.py (#1293) | 2024-09-01 12:20:46 -07:00
Lianmin Zheng | 761b2cebd6 | [CI] merge all ci tests into one file (#1289) | 2024-09-01 02:36:56 -07:00
Lianmin Zheng | 1b5d56f7f8 | [CI] Add more multi-gpu tests (#1280) | 2024-09-01 00:27:25 -07:00
xiaobochen | d134c139a1 | Optimize the update flashinfer indices (#1262) | 2024-08-31 23:40:28 -07:00
Christopher Chou | 51c554d812 | Allow more flexible assistant and system response (#1256) | 2024-08-30 11:51:44 -07:00
Yineng Zhang | c411f32e1c | feat: replace GeluAndMul (#1234) | 2024-08-28 14:07:02 +00:00
Lianmin Zheng | bf53bf5142 | [Fix] Fix llava on multi images (#1247) | 2024-08-28 06:33:05 -07:00
Yineng Zhang | 66975360e7 | fix: increase max_new_tokens when testing generation models (#1244) | 2024-08-28 22:12:36 +10:00
yichuan~ | 5ff25cdf5b | [Minor] add delete test and delete tmp file on ci server (#1227) | 2024-08-26 22:04:52 -07:00
caiyueliang | 2f1d92834f | [FEAT] Support batches cancel (#1222) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-26 23:28:26 +00:00
Liangsheng Yin | c61a1b6f97 | Torch compile CI throughput test (#1223) | 2024-08-26 13:52:58 -07:00