900 Commits

Author SHA1 Message Date
Lianmin Zheng
3f0fe08d37 Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) 2024-09-29 20:28:45 -07:00
Liangsheng Yin
55b974f96f Process image in parallel (#1539) 2024-09-29 18:52:43 -07:00
Lianmin Zheng
f86c1e611f Move scheduler code from tp_worker.py to scheduler.py (#1538) 2024-09-29 17:42:45 -07:00
Xinyu Yang
acaffd233f [Fix] fix ipv6 url when warm up model (#1537) 2024-09-29 11:02:40 -07:00
Lianmin Zheng
048685430d Improve process creation (#1534) 2024-09-29 02:36:12 -07:00
Liangsheng Yin
fd9ad817ec Organize image inputs (#1531) 2024-09-29 06:28:55 +00:00
Lianmin Zheng
e165a9fc1b Make detokenizer_manager.py not asyncio (#1532) 2024-09-28 19:33:09 -07:00
Lianmin Zheng
4e4459b91f Multiple minor fixes (#1530) 2024-09-28 14:43:35 -07:00
Jeffrey Fong
065bb94753 Fix RuntimeEndpoint.select method (#1495) 2024-09-28 14:04:06 -07:00
Kylin
f42e9bfb52 [bugfix] Add modelscope package to avoid docker image without modelscope (#1520) 2024-09-28 12:43:22 -07:00
Ninglin Du
840c5dbcb3 [FIX] Catch syntax error of Regex Guide to avoid crash (#1521) 2024-09-28 12:42:06 -07:00
Jerry Zhang
63e845d0bb Add float8 dynamic quant to torchao_utils (#1528) 2024-09-28 12:27:54 -07:00
Ying Sheng
9aa6553d2a [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) 2024-09-27 23:32:11 -07:00
Ying Sheng
b1e330bcb0 [Event] Update meeting link (#1529) 2024-09-27 13:30:04 -07:00
Liangsheng Yin
4353acb469 minor: fix config (#1524) 2024-09-27 01:49:16 -07:00
Lianmin Zheng
9ae1db0bdc [Fix] Ignore import error (#1513) 2024-09-25 11:32:21 -07:00
Ying Sheng
37c5899fc2 Release v0.3.2 (#1512) 2024-09-25 14:17:09 +08:00
Ying Sheng
f39a0197fd Revert "kernel: use tensor cores for flashinfer gqa kernels" (#1511) 2024-09-24 22:50:31 -07:00
TianyiQ
3c93187caf Add support for tie_word_embeddings when loading weights + support for SmolLM (#1508) 2024-09-24 21:50:20 -07:00
Lianmin Zheng
fb2d0680e0 [Fix] Fix clean_up_tokenization_spaces in tokenizer (#1510) 2024-09-24 21:37:33 -07:00
Lianmin Zheng
067d8e16fc Simplify bench_latency.py (#1503) 2024-09-24 17:42:07 -07:00
luzengxiangcn
e6692bf4a5 debug radixcache stack_overflow (#1499) 2024-09-24 04:58:01 -07:00
Lianmin Zheng
28b4d8e144 Update test_srt_backend.py (#1502) 2024-09-24 03:17:10 -07:00
Lianmin Zheng
bc068e9618 [CI] Move AMD test to a separate file (#1500) 2024-09-24 02:06:28 -07:00
Ke Bao
8d4ed42ad5 MoE torch compile (#1497) 2024-09-24 01:46:59 -07:00
Lianmin Zheng
2854a5ea9f Fix the overhead due to penalizer in bench_latency (#1496) 2024-09-23 07:38:14 -07:00
Yineng Zhang
42a2d82ba7 minor: add mla fp8 test (#1494) 2024-09-23 20:40:17 +08:00
Ying Sheng
e4780cf839 [API, Feature] Support response prefill for openai API (#1490) 2024-09-22 06:46:17 -07:00
Lianmin Zheng
39bb49d156 Update dockerfile to include datamodel_code_generator (#1492) 2024-09-22 04:49:16 -07:00
Ying Sheng
6f3cf1297e [CI, AMD] Add AMD tests to CI (#1491) 2024-09-22 04:45:10 -07:00
Lianmin Zheng
13f1357ef0 Add a unit test for data parallelism (#1489) 2024-09-22 02:21:05 -07:00
wellhowtosay
2a99993cd9 Pr fix max workers (#1456) 2024-09-22 02:20:26 -07:00
Co-authored-by: baolujia <baolujia@shizhuang-inc.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Lianmin Zheng
167591e864 Better unit tests for adding a new model (#1488) 2024-09-22 01:50:37 -07:00
Yineng Zhang
441c22db8c doc: update backend (#1486) 2024-09-21 22:05:12 +08:00
Ran Chen
ce636ac441 fix incorrect links in documentation (#1481) 2024-09-21 20:36:23 +08:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Yineng Zhang
82136eb0b5 chore: bump v0.3.1.post3 (#1483) 2024-09-21 11:17:45 +08:00
Ke Bao
b8ccaf4d73 Add MLA gsm8k eval (#1484) 2024-09-21 11:16:13 +08:00
Ke Bao
a68cb201dd Fix triton head num (#1482) 2024-09-21 10:25:20 +08:00
Niklas Muennighoff
014982b5e0 Add OLMoE (#1476) 2024-09-20 10:32:49 +08:00
Yineng Zhang
a6db88626e minor: add quant eval compared with base (#1475) 2024-09-20 01:57:19 +08:00
Yineng Zhang
b4408b0d16 feat: update linear deps 1/N (#1305) 2024-09-19 20:53:11 +08:00
Lianmin Zheng
2cd7e181dd Fix env vars in bench_latency (#1472) 2024-09-19 03:19:26 -07:00
Lianmin Zheng
5ce55aee15 Release v0.3.1.post2 (#1470) 2024-09-19 02:03:38 -07:00
Lianmin Zheng
2d346a57c2 Fix padding in the cuda graph (#1469) 2024-09-19 01:52:15 -07:00
Li Bo
446ea33277 fix: creat new dict everytime for putting new frame (#1464) 2024-09-19 01:31:48 -07:00
Ying Sheng
8f527e2940 [Event] Add public meeting invite to README (#1458) 2024-09-18 23:53:22 +08:00
Lianmin Zheng
7f24ea95c3 Fuse top_k and top_k in the sampler (#1457) 2024-09-18 04:35:35 -07:00
Lianmin Zheng
1acccb364a Fix oom issues with fp8 for llama (#1454) 2024-09-18 03:45:19 -07:00
HAI
aa2750beb3 [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) (#1453) 2024-09-18 02:01:35 -07:00
Lianmin Zheng
5e62a6b706 Add bench_server_latency.py (#1452) 2024-09-18 00:56:06 -07:00