Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00
Lianmin Zheng | f86c1e611f | Move scheduler code from tp_worker.py to scheduler.py (#1538) | 2024-09-29 17:42:45 -07:00
Lianmin Zheng | 067d8e16fc | Simplify bench_latency.py (#1503) | 2024-09-24 17:42:07 -07:00
Lianmin Zheng | 2854a5ea9f | Fix the overhead due to penalizer in bench_latency (#1496) | 2024-09-23 07:38:14 -07:00
Lianmin Zheng | 2cd7e181dd | Fix env vars in bench_latency (#1472) | 2024-09-19 03:19:26 -07:00
Lianmin Zheng | 5e62a6b706 | Add bench_server_latency.py (#1452) | 2024-09-18 00:56:06 -07:00
Lianmin Zheng | 899cf5c438 | Remove deprecated configs (#1431) | 2024-09-15 08:52:18 -07:00
Lianmin Zheng | 9ba1f09760 | [Fix] Fix logprob and normalized_logprob (#1428) | 2024-09-15 06:36:06 -07:00
Lianmin Zheng | 9463bc1385 | Enable torch.compile for triton backend (#1422) | 2024-09-14 15:38:37 -07:00
Liangsheng Yin | 70b6802982 | Optimize conflicts between CUDA graph and vocab mask tensors (#1392) | 2024-09-13 20:27:53 -07:00
Lianmin Zheng | 3a6e8b6d78 | [Minor] move triton attention kernels into a separate folder (#1379) | 2024-09-10 15:15:08 -07:00
Liangsheng Yin | 69b3bb9ae1 | Unify forward mode (#1360) | 2024-09-09 13:49:29 -07:00
Kai-Hsun Chen | c9b75917d5 | [server] Passing model_override_args to launch_server via the CLI. (#1298) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> | 2024-09-09 02:14:25 -07:00
Lianmin Zheng | 1b5d56f7f8 | [CI] Add more multi-gpu tests (#1280) | 2024-09-01 00:27:25 -07:00
Lianmin Zheng | 79ece2c51f | Report median instead of mean in bench_latency.py (#1269) | 2024-08-30 06:05:01 -07:00
Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00
Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00
Liangsheng Yin | 1ece2cda3d | Fix bench latency benchmark (#1225) | 2024-08-28 00:37:32 -07:00
Lianmin Zheng | f6af3a6561 | Cleanup readme, llava examples, usage examples and nccl init (#1194) | 2024-08-24 08:02:23 -07:00
Ying Sheng | 5fafcac008 | Fix benchmark script (#1185) | 2024-08-22 09:03:25 +00:00
Liangsheng Yin | 83e23c69b3 | Improve code style of sampler (#1168) | 2024-08-21 16:48:24 -07:00
Liangsheng Yin | a34dd86a7d | Use dtype to control generate (#1082) Co-authored-by: zhyncs <me@zhyncs.com> | 2024-08-14 15:58:07 +00:00
Lianmin Zheng | a59636bb5e | Update grok 1 model (#1095) | 2024-08-14 04:40:44 -07:00
Ying Sheng | 0909bb0d2f | [Feat] Add window attention for gemma-2 (#1056) | 2024-08-13 17:01:26 -07:00
Liangsheng Yin | 43fbb6d919 | Fix input_ids && rename to fill_ids (#1021) | 2024-08-10 16:24:12 -07:00
Mingyi | 61728884d7 | Fix benchmark latency (#1007) | 2024-08-09 13:18:58 -07:00
Yineng Zhang | b568df5d03 | fix: resolve correctness_test issue (#1002) | 2024-08-09 23:21:42 +10:00
Liangsheng Yin | 87e8c090e9 | Organize code (rename, movement) (#953) | 2024-08-06 20:50:32 -07:00
min-xu-et | ebf69964cd | latency test enhancement - final part (#921) | 2024-08-04 18:15:23 -07:00
min-xu-et | afd411d09f | enhance latency test - part 2 (#915) | 2024-08-04 12:27:25 -07:00
min-xu-et | 539856455d | latency test enhancement - part 1 (#909) | 2024-08-03 22:44:58 -07:00
Liangsheng Yin | cdcbde5fc3 | Code structure refactor (#807) | 2024-07-29 23:04:48 -07:00
Ying Sheng | db6089e6f3 | Revert "Organize public APIs" (#815) | 2024-07-29 19:40:28 -07:00
Liangsheng Yin | c8e9fed87a | Organize public APIs (#809) | 2024-07-29 15:34:16 -07:00
Liangsheng Yin | 3de2f30a27 | Flashinfer sample kernel (#617) | 2024-07-17 13:24:43 -07:00
Lianmin Zheng | 41d1f67704 | Fix flush cache (#627) | 2024-07-15 20:44:04 -07:00
Liangsheng Yin | 564a898ad9 | Optimize mem indices mangement (#619) | 2024-07-13 23:39:37 -07:00
Lianmin Zheng | 665815969a | Enable cuda graph by default (#612) | 2024-07-13 05:29:46 -07:00
Lianmin Zheng | d9a6902986 | Fix bench latency (#607) | 2024-07-11 14:37:01 -07:00
Ying Sheng | dc1b8bcfaa | Format (#593) | 2024-07-05 10:06:17 -07:00
Ying Sheng | 5a57b8addd | Add Gemma2 (#592) | 2024-07-05 09:48:54 -07:00
Ying Sheng | 2a754e57b0 | 2x performance improvement for large prefill & Fix workspace conflicts (#579) | 2024-07-03 16:14:57 -07:00
Ying Sheng | 9ce89bc14b | Update benchmark script (#571) | 2024-06-28 00:44:22 -07:00
Lianmin Zheng | eb1ae6ae0c | Add sglang.bench_latency for offline benchmark (#564) | 2024-06-25 03:38:04 -07:00