Commit Graph

73 Commits

Author SHA1 Message Date
Theresa Barton
2c7d0a5b8b [Fix] Fix all the Huggingface paths (#1553) 2024-10-02 10:12:07 -07:00
Ying Sheng
37963394aa [Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433) 2024-09-15 12:46:04 -07:00
Lianmin Zheng
e4d68afcf0 [Minor] Many cleanup (#1357) 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen
c9b75917d5 [server] Passing model_override_args to launch_server via the CLI. (#1298) 2024-09-09 02:14:25 -07:00
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Yineng Zhang
62f15eea5a docs: add conclusion (#1340) 2024-09-06 04:25:14 +10:00
Yineng Zhang
79794af52d docs: highlight ttft itl and throughput (#1337) 2024-09-06 00:00:06 +10:00
Yineng Zhang
3494b32c3a docs: update README (#1336) 2024-09-05 23:39:44 +10:00
Lianmin Zheng
57d0bd91ec Improve benchmark (#1140) 2024-08-17 17:43:23 -07:00
Lianmin Zheng
5a261bd055 Fix the deadlock in multi-node tp (#1122) 2024-08-16 01:39:24 -07:00
Lianmin Zheng
326df4bab2 Use a single workspace for flashinfer (#1077) 2024-08-14 19:25:37 -07:00
Yineng Zhang
1c2b5f5240 docs: update nsys usage (#1103) 2024-08-15 01:39:15 +08:00
Liangsheng Yin
a34dd86a7d Use dtype to control generate (#1082) 2024-08-14 15:58:07 +00:00
Co-authored-by: zhyncs <me@zhyncs.com>
Lianmin Zheng
a59636bb5e Update grok 1 model (#1095) 2024-08-14 04:40:44 -07:00
Meng, Peng
41bb1ab10d fix nsys cannot profile cuda kernel (#957) 2024-08-07 11:51:21 +08:00
Ke Bao
e1eae1fd15 Support MLA for DeepSeek-V2 with Triton - step 1 (#905) 2024-08-05 03:40:33 +10:00
Yineng Zhang
1edd4e07d6 chore: bump v0.2.7 (#830) 2024-07-30 20:41:10 +10:00
Yineng Zhang
a50c8a14b3 fix: use v0.2.5 for benchmark (#814) 2024-07-30 12:40:35 +10:00
Ying Sheng
db6089e6f3 Revert "Organize public APIs" (#815) 2024-07-29 19:40:28 -07:00
Liangsheng Yin
c8e9fed87a Organize public APIs (#809) 2024-07-29 15:34:16 -07:00
Yineng Zhang
768e05d08f fix benchmark (#743) 2024-07-26 21:26:13 +10:00
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Yineng Zhang
fded67441d misc: update bulid instruction (#724) 2024-07-25 17:08:11 +10:00
Yineng Zhang
97e0f7d250 docs: update comment (#721) 2024-07-25 10:51:18 +10:00
Ying Sheng
30d8e130e7 Improve benchmark scripts (#717) 2024-07-24 14:44:14 -07:00
Ying Sheng
08a3bd19cc docs: update doc (#716) 2024-07-24 20:44:03 +00:00
Yineng Zhang
321a963b01 misc: update doc (#715) 2024-07-24 13:05:46 -07:00
Yineng Zhang
2d3ae4e125 docs: update doc (#713) 2024-07-25 00:03:17 +10:00
Yineng Zhang
75f4ccb7dd docs: update README (#712) 2024-07-24 23:33:28 +10:00
Lianmin Zheng
490a1f39dd Fix cuda graph with flashinfer (#675) 2024-07-20 02:43:55 -07:00
zhyncs
2e341cd493 misc: add pre-commit config (#637) 2024-07-17 11:55:39 -07:00
Lianmin Zheng
41d1f67704 Fix flush cache (#627) 2024-07-15 20:44:04 -07:00
Ying Sheng
6a2941f4d0 Improve tensor parallel performance (#625) 2024-07-15 07:10:51 -07:00
Co-authored-by: Mingyi <wisclmy0611@gmail.com>
Mingyi
5ac8b80677 Simplify mem state (#623) 2024-07-15 02:01:09 -07:00
Ying Sheng
bae9541e4c Update benchmark script (#621) 2024-07-14 21:38:53 +00:00
Liangsheng Yin
564a898ad9 Optimize mem indices mangement (#619) 2024-07-13 23:39:37 -07:00
Lianmin Zheng
0feca02dd9 Improve benchmark scripts (#615) 2024-07-13 15:59:04 -07:00
Lianmin Zheng
65c6577696 Improve benchmark scripts & fix llava (#613) 2024-07-13 15:00:26 -07:00
Lianmin Zheng
665815969a Enable cuda graph by default (#612) 2024-07-13 05:29:46 -07:00
Liangsheng Yin
f25b76c02a add LogitsMetadata (#604) 2024-07-08 17:46:55 -07:00
Ying Sheng
dc1b8bcfaa Format (#593) 2024-07-05 10:06:17 -07:00
sglang
11616fc6bd Minor fix in compiler & format (#545) 2024-06-29 23:42:14 -07:00
Lianmin Zheng
945aa9beb2 Update readme (#568) 2024-06-27 11:37:49 -07:00
Lianmin Zheng
2e6e62e156 Increase the number of thread limitation for tp worker managers. (#567) 2024-06-26 09:33:45 -07:00
Lianmin Zheng
a385ee27bd Warmup cublas (#566) 2024-06-25 12:46:00 -07:00
Liangsheng Yin
92cb93f390 Fix latency benchmark (#557) 2024-06-22 15:11:04 +08:00
Ying Sheng
09593e9bc9 Multi-node Tensor Parallelism (#550) 2024-06-17 20:41:24 -07:00
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Liangsheng Yin
40e53d65cb Add disk cache for loading ShareGPT dataset. (#542) 2024-06-13 16:37:12 +08:00
Ying Sheng
fb9296f0ed Higher priority for user input of max_prefill_tokens & format (#540) 2024-06-12 21:48:40 -07:00
Ying Sheng
1374334d38 Fix dependency & crash issues (#539) 2024-06-12 21:23:19 -07:00
Lianmin Zheng
3bc01ac137 [Minor] improve code style 2024-06-03 18:11:34 -07:00
Lianmin Zheng
09de730dee Improve benchmark scripts & add more models (#484) 2024-05-27 14:13:26 -07:00