90 Commits

Author SHA1 Message Date
bjmsong
e21026690d benchmark decoding attention kernel with cudnn (#2467)
Co-authored-by: root <bjmsong@126.com>
2024-12-17 03:31:57 -08:00
Lianmin Zheng
56198b45d9 Add a benchmark script for in-batch prefix caching (#2494) 2024-12-16 18:49:02 -08:00
Xiaoyu Zhang
a0592c059f [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm (#2486) 2024-12-15 13:52:08 +08:00
bjmsong
f67723940d decoding attention kernel benchmark (#2425)
Co-authored-by: root <bjmsong@126.com>
2024-12-11 04:46:59 -08:00
Xiaoyu Zhang
3844feb9bb Add a unittest for fused_moe (#2416) 2024-12-08 22:46:10 -08:00
Lianmin Zheng
07ec07ad1f Improve torch compile for fused moe (#2327) 2024-12-03 01:58:25 -08:00
Lianmin Zheng
33deca81b5 Add more fused moe benchmark utilities (#2314) 2024-12-02 04:26:55 -08:00
Xiaoyu Zhang
262e370f78 [benchmark] Add fused_moe_triton benchmark and tuning tools (#2225)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
2024-11-29 13:36:45 -08:00
Henry Hyeonmok Ko
dbe1729395 Merged three native APIs into one: get_server_info (#2152) 2024-11-24 01:37:58 -08:00
Byron Hsu
cbedd1db1d [router] cache-aware load-balancing router v1 (#2114) 2024-11-23 08:34:48 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Lianmin Zheng
c29b98e043 Fix json benchmark (#2043) 2024-11-15 05:33:43 -08:00
DarkSharpness
954f4e6bd6 benchmark json schema (#2030) 2024-11-15 05:06:19 -08:00
Byron Hsu
f9633fa9b9 [rust] cache-aware DP - approx tree (#1934) 2024-11-10 21:57:32 -08:00
Xuehai Pan
a5e0defb5a minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926) 2024-11-06 13:46:04 +00:00
Lianmin Zheng
dd3809fad8 Fix engine unit test (#1701) 2024-10-17 09:53:32 -07:00
Ying Sheng
9c064bf78a [LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587) 2024-10-06 10:33:44 -07:00
Theresa Barton
2c7d0a5b8b [Fix] Fix all the Huggingface paths (#1553) 2024-10-02 10:12:07 -07:00
Ying Sheng
37963394aa [Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433) 2024-09-15 12:46:04 -07:00
Lianmin Zheng
e4d68afcf0 [Minor] Many cleanup (#1357) 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen
c9b75917d5 [server] Passing model_override_args to launch_server via the CLI. (#1298)
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
2024-09-09 02:14:25 -07:00
Yineng Zhang
62f15eea5a docs: add conclusion (#1340) 2024-09-06 04:25:14 +10:00
Yineng Zhang
79794af52d docs: highlight ttft itl and throughput (#1337) 2024-09-06 00:00:06 +10:00
Yineng Zhang
3494b32c3a docs: update README (#1336) 2024-09-05 23:39:44 +10:00
Lianmin Zheng
57d0bd91ec Improve benchmark (#1140) 2024-08-17 17:43:23 -07:00
Lianmin Zheng
5a261bd055 Fix the deadlock in multi-node tp (#1122) 2024-08-16 01:39:24 -07:00
Lianmin Zheng
326df4bab2 Use a single workspace for flashinfer (#1077) 2024-08-14 19:25:37 -07:00
Yineng Zhang
1c2b5f5240 docs: update nsys usage (#1103) 2024-08-15 01:39:15 +08:00
Liangsheng Yin
a34dd86a7d Use dtype to control generate (#1082)
Co-authored-by: zhyncs <me@zhyncs.com>
2024-08-14 15:58:07 +00:00
Lianmin Zheng
a59636bb5e Update grok 1 model (#1095) 2024-08-14 04:40:44 -07:00
Meng, Peng
41bb1ab10d fix nsys cannot profile cuda kernel (#957) 2024-08-07 11:51:21 +08:00
Ke Bao
e1eae1fd15 Support MLA for DeepSeek-V2 with Triton - step 1 (#905) 2024-08-05 03:40:33 +10:00
Yineng Zhang
1edd4e07d6 chore: bump v0.2.7 (#830) 2024-07-30 20:41:10 +10:00
Yineng Zhang
a50c8a14b3 fix: use v0.2.5 for benchmark (#814) 2024-07-30 12:40:35 +10:00
Ying Sheng
db6089e6f3 Revert "Organize public APIs" (#815) 2024-07-29 19:40:28 -07:00
Liangsheng Yin
c8e9fed87a Organize public APIs (#809) 2024-07-29 15:34:16 -07:00
Yineng Zhang
768e05d08f fix benchmark (#743)
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-07-26 21:26:13 +10:00
Yineng Zhang
fded67441d misc: update bulid instruction (#724) 2024-07-25 17:08:11 +10:00
Yineng Zhang
97e0f7d250 docs: update comment (#721) 2024-07-25 10:51:18 +10:00
Ying Sheng
30d8e130e7 Improve benchmark scripts (#717) 2024-07-24 14:44:14 -07:00
Ying Sheng
08a3bd19cc docs: update doc (#716) 2024-07-24 20:44:03 +00:00
Yineng Zhang
321a963b01 misc: update doc (#715) 2024-07-24 13:05:46 -07:00
Yineng Zhang
2d3ae4e125 docs: update doc (#713) 2024-07-25 00:03:17 +10:00
Yineng Zhang
75f4ccb7dd docs: update README (#712) 2024-07-24 23:33:28 +10:00
Lianmin Zheng
490a1f39dd Fix cuda graph with flashinfer (#675) 2024-07-20 02:43:55 -07:00
zhyncs
2e341cd493 misc: add pre-commit config (#637) 2024-07-17 11:55:39 -07:00
Lianmin Zheng
41d1f67704 Fix flush cache (#627) 2024-07-15 20:44:04 -07:00
Ying Sheng
6a2941f4d0 Improve tensor parallel performance (#625)
Co-authored-by: Mingyi <wisclmy0611@gmail.com>
2024-07-15 07:10:51 -07:00
Mingyi
5ac8b80677 Simplify mem state (#623) 2024-07-15 02:01:09 -07:00
Ying Sheng
bae9541e4c Update benchmark script (#621) 2024-07-14 21:38:53 +00:00