Commit Graph

44 Commits

Author SHA1 Message Date
Lianmin Zheng
3f0fe08d37 Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) 2024-09-29 20:28:45 -07:00
Lianmin Zheng
f86c1e611f Move scheduler code from tp_worker.py to scheduler.py (#1538) 2024-09-29 17:42:45 -07:00
Lianmin Zheng
067d8e16fc Simplify bench_latency.py (#1503) 2024-09-24 17:42:07 -07:00
Lianmin Zheng
2854a5ea9f Fix the overhead due to penalizer in bench_latency (#1496) 2024-09-23 07:38:14 -07:00
Lianmin Zheng
2cd7e181dd Fix env vars in bench_latency (#1472) 2024-09-19 03:19:26 -07:00
Lianmin Zheng
5e62a6b706 Add bench_server_latency.py (#1452) 2024-09-18 00:56:06 -07:00
Lianmin Zheng
899cf5c438 Remove deprecated configs (#1431) 2024-09-15 08:52:18 -07:00
Lianmin Zheng
9ba1f09760 [Fix] Fix logprob and normalized_logprob (#1428) 2024-09-15 06:36:06 -07:00
Lianmin Zheng
9463bc1385 Enable torch.compile for triton backend (#1422) 2024-09-14 15:38:37 -07:00
Liangsheng Yin
70b6802982 Optimize conflicts between CUDA graph and vocab mask tensors (#1392) 2024-09-13 20:27:53 -07:00
Lianmin Zheng
3a6e8b6d78 [Minor] move triton attention kernels into a separate folder (#1379) 2024-09-10 15:15:08 -07:00
Liangsheng Yin
69b3bb9ae1 Unify forward mode (#1360) 2024-09-09 13:49:29 -07:00
Kai-Hsun Chen
c9b75917d5 [server] Passing model_override_args to launch_server via the CLI. (#1298) 2024-09-09 02:14:25 -07:00
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Lianmin Zheng
1b5d56f7f8 [CI] Add more multi-gpu tests (#1280) 2024-09-01 00:27:25 -07:00
Lianmin Zheng
79ece2c51f Report median instead of mean in bench_latency.py (#1269) 2024-08-30 06:05:01 -07:00
Liangsheng Yin
381dd57bd6 Sampler cudagraph (#1253) 2024-08-28 18:58:52 -07:00
Yineng Zhang
f25f4dfde5 hotfix: revert sampler CUDA Graph (#1242) 2024-08-28 21:16:47 +10:00
Liangsheng Yin
1ece2cda3d Fix bench latency benchmark (#1225) 2024-08-28 00:37:32 -07:00
Lianmin Zheng
f6af3a6561 Cleanup readme, llava examples, usage examples and nccl init (#1194) 2024-08-24 08:02:23 -07:00
Ying Sheng
5fafcac008 Fix benchmark script (#1185) 2024-08-22 09:03:25 +00:00
Liangsheng Yin
83e23c69b3 Improve code style of sampler (#1168) 2024-08-21 16:48:24 -07:00
Liangsheng Yin
a34dd86a7d Use dtype to control generate (#1082) 2024-08-14 15:58:07 +00:00
Co-authored-by: zhyncs <me@zhyncs.com>
Lianmin Zheng
a59636bb5e Update grok 1 model (#1095) 2024-08-14 04:40:44 -07:00
Ying Sheng
0909bb0d2f [Feat] Add window attention for gemma-2 (#1056) 2024-08-13 17:01:26 -07:00
Liangsheng Yin
43fbb6d919 Fix input_ids && rename to fill_ids (#1021) 2024-08-10 16:24:12 -07:00
Mingyi
61728884d7 Fix benchmark latency (#1007) 2024-08-09 13:18:58 -07:00
Yineng Zhang
b568df5d03 fix: resolve correctness_test issue (#1002) 2024-08-09 23:21:42 +10:00
Liangsheng Yin
87e8c090e9 Organize code (rename, movement) (#953) 2024-08-06 20:50:32 -07:00
min-xu-et
ebf69964cd latency test enhancement - final part (#921) 2024-08-04 18:15:23 -07:00
min-xu-et
afd411d09f enhance latency test - part 2 (#915) 2024-08-04 12:27:25 -07:00
min-xu-et
539856455d latency test enhancement - part 1 (#909) 2024-08-03 22:44:58 -07:00
Liangsheng Yin
cdcbde5fc3 Code structure refactor (#807) 2024-07-29 23:04:48 -07:00
Ying Sheng
db6089e6f3 Revert "Organize public APIs" (#815) 2024-07-29 19:40:28 -07:00
Liangsheng Yin
c8e9fed87a Organize public APIs (#809) 2024-07-29 15:34:16 -07:00
Liangsheng Yin
3de2f30a27 Flashinfer sample kernel (#617) 2024-07-17 13:24:43 -07:00
Lianmin Zheng
41d1f67704 Fix flush cache (#627) 2024-07-15 20:44:04 -07:00
Liangsheng Yin
564a898ad9 Optimize mem indices mangement (#619) 2024-07-13 23:39:37 -07:00
Lianmin Zheng
665815969a Enable cuda graph by default (#612) 2024-07-13 05:29:46 -07:00
Lianmin Zheng
d9a6902986 Fix bench latency (#607) 2024-07-11 14:37:01 -07:00
Ying Sheng
dc1b8bcfaa Format (#593) 2024-07-05 10:06:17 -07:00
Ying Sheng
5a57b8addd Add Gemma2 (#592) 2024-07-05 09:48:54 -07:00
Ying Sheng
2a754e57b0 2x performance improvement for large prefill & Fix workspace conflicts (#579) 2024-07-03 16:14:57 -07:00
Ying Sheng
9ce89bc14b Update benchmark script (#571) 2024-06-28 00:44:22 -07:00
Lianmin Zheng
eb1ae6ae0c Add sglang.bench_latency for offline benchmark (#564) 2024-06-25 03:38:04 -07:00