53 Commits

Author SHA1 Message Date
bjmsong
0bb0f76311 Support FP8 E4M3 KV Cache (#2786)
Co-authored-by: root <bjmsong@126.com>
2025-01-12 21:17:11 -08:00
Lianmin Zheng
641b7d0ae0 [Minor] Improve code style (#2422) 2024-12-09 06:30:35 -08:00
Ke Bao
ec52464dde MLA prefill w/o weight absorption (#2349) 2024-12-05 01:50:28 +08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Liangsheng Yin
100f5b8bc9 Simplify flashinfer dispatch (#1552) 2024-10-01 00:28:42 -07:00
Lianmin Zheng
36d5acfca5 Rename InputMetadata -> ForwardBatch (#1543) 2024-09-30 02:41:11 -07:00
Lianmin Zheng
fec185ce0c Refactor attention backend (#1381) 2024-09-11 11:44:26 -07:00
Lianmin Zheng
46094e0c1b Deprecate --disable-flashinfer and introduce --attention-backend (#1380) 2024-09-10 17:11:16 -07:00
Lianmin Zheng
3a6e8b6d78 [Minor] move triton attention kernels into a separate folder (#1379) 2024-09-10 15:15:08 -07:00
Liangsheng Yin
69b3bb9ae1 Unify forward mode (#1360) 2024-09-09 13:49:29 -07:00
Ke Bao
2c615d120f [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-08-25 17:38:11 -07:00
Ying Sheng
93d4e354d8 [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) 2024-08-15 10:33:20 -07:00
Ying Sheng
14cb544d56 [Fix] fix flashinfer usage for window attention (#1107) 2024-08-15 00:53:24 -07:00
Ying Sheng
96a2093ef0 [Fix] Compatibility of window attention and cuda graph (#1090) 2024-08-14 10:37:01 -07:00
Ying Sheng
0909bb0d2f [Feat] Add window attention for gemma-2 (#1056) 2024-08-13 17:01:26 -07:00
Lianmin Zheng
fb1f28cbbb Clean up the comments and names under python/sglang/srt/layers (#1047) 2024-08-12 05:54:37 +00:00
Liangsheng Yin
87e8c090e9 Organize code (rename, movement) (#953) 2024-08-06 20:50:32 -07:00
Ke Bao
e1eae1fd15 Support MLA for DeepSeek-V2 with Triton - step 1 (#905) 2024-08-05 03:40:33 +10:00
Ying Sheng
e7487b08bc Adjust default mem fraction to avoid OOM (#823) 2024-07-30 01:58:31 -07:00
Liangsheng Yin
cdcbde5fc3 Code structure refactor (#807) 2024-07-29 23:04:48 -07:00
Yineng Zhang
dd7e8b9421 chore: add copyright for srt (#790) 2024-07-28 23:07:12 +10:00
Lianmin Zheng
752e643007 Allow disabling flashinfer sampling kernel (#778) 2024-07-27 20:18:56 -07:00
Mingyi
a523a3c13a Reduce hardcoded logic of kernel usage (#707) 2024-07-23 16:42:21 -07:00
Ying Sheng
cf99eab7d5 Fix flashinfer (#700) 2024-07-23 01:27:01 -07:00
Liangsheng Yin
69d19188fc Decouple kv (#679) 2024-07-20 14:16:45 -07:00
Liangsheng Yin
a9ef49c12c Detokenize incrementally when streaming (#653) 2024-07-18 17:57:40 -07:00
Liangsheng Yin
abd5385ac5 Move global_server_args_dict (#642) 2024-07-17 13:49:15 -07:00
Ying Sheng
5949b1ca0e Fix memory pool index error (#616) 2024-07-13 16:45:11 -07:00
Liangsheng Yin
10143e1a5f Memorypool chunked prefetch (#614) 2024-07-13 15:24:03 -07:00
Lianmin Zheng
396a69240f Cleanup attention backend: flashinfer and triton (#611) 2024-07-12 18:21:11 -07:00
Lianmin Zheng
af4e7910e7 Clean up the usage of flashinfer (#610) 2024-07-12 13:00:03 -07:00
Lianmin Zheng
519e20cfda Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py (#609) 2024-07-12 12:28:09 -07:00
Liangsheng Yin
f25b76c02a add LogitsMetadata (#604) 2024-07-08 17:46:55 -07:00
Mingyi
f4e885b7c3 Reduce number of workspaces (#601) 2024-07-07 19:35:22 -07:00
Ying Sheng
dc1b8bcfaa Format (#593) 2024-07-05 10:06:17 -07:00
Ying Sheng
5a57b8addd Add Gemma2 (#592) 2024-07-05 09:48:54 -07:00
Ying Sheng
2a754e57b0 2x performance improvement for large prefill & Fix workspace conflicts (#579) 2024-07-03 16:14:57 -07:00
Ying Sheng
9380f50ff9 Turn on flashinfer by default (#578) 2024-07-02 02:25:07 -07:00
Yueyang Pan
d9ac639202 Fix flashinfer version (#576) 2024-07-01 22:08:39 -07:00
Lianmin Zheng
b7e2f800ac Update flashinfer to 0.0.5 (#554) 2024-06-20 20:29:06 -07:00
Ying Sheng
fb9296f0ed Higher priority for user input of max_prefill_tokens & format (#540) 2024-06-12 21:48:40 -07:00
Lianmin Zheng
f6dbd24043 Improve doc strings (#518) 2024-06-08 02:39:32 -07:00
Ying Sheng
0463f7fb52 Support data parallelism (static) (#480)
Co-authored-by: Ying Sheng <ying.sheng@databricks.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2024-05-27 21:24:10 -07:00
Lianmin Zheng
2cea6146d8 Improve logging & add logit cap (#471) 2024-05-24 03:48:53 -07:00
Liangsheng Yin
9acc6e3504 add .isort.cfg (#378) 2024-04-22 22:38:09 +08:00
Lianmin Zheng
4aa5dd2c5f Update version to v0.1.13 (#280) 2024-03-11 05:49:27 -07:00
Liangsheng Yin
1b35547927 Organize server_args (#277) 2024-03-11 20:06:52 +08:00
Liangsheng Yin
89885b31ef Gemma Support (#256) 2024-03-11 12:14:27 +08:00
Cody Yu
26c3494152 [Submodule] Change FlashInfer to import (#156) 2024-02-06 19:28:29 -08:00
Lianmin Zheng
23f05005fd Format code & move functions (#155) 2024-02-06 13:27:46 -08:00