| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Lianmin Zheng | 86d10d220f | Update grok.py and tiktoken tokenizer (#9532) | 2025-08-23 05:40:18 -07:00 |
| Cheng Wan | c0fb25e949 | DP Enhancement (#8280) | 2025-07-24 21:36:21 -07:00 |
| Lianmin Zheng | 38af4f68a9 | Fix grammar abort & Minor style fixes (#7204) | 2025-06-14 22:49:41 -07:00 |
| Ke Bao | 799c4bb502 | Fuse MLA set kv cache kernel (#5748) | 2025-04-26 18:42:22 -07:00 |
| Ke Bao | 6b6e748775 | Remove q concat in FA3 backend for DeepSeek decode (#5638) | 2025-04-22 11:43:12 -07:00 |
| woodx | 3bface15e6 | Feat/support encoder model (like bert) (#4887) | 2025-04-17 01:50:48 -07:00 |
| Yun Dai | 2695ab0537 | Fix loading KV quantization scale; Enable modelopt kv cache (#4686) (Co-authored-by: qingquansong \<ustcsqq@gmail.com\>) | 2025-04-08 09:11:35 -07:00 |
| Chang Su | f04c80dc42 | Add Llama4 support (#5092) (Co-authored-by: Cheng Wan \<cwan39@gatech.edu\>, fzyzcjy \<ch271828n@outlook.com\>, ispobock \<ispobaoke@163.com\>) | 2025-04-07 00:29:36 -07:00 |
| Qubitium-ModelCloud | 56a724eba3 | [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790) (Signed-off-by and Co-authored-by: ZX-ModelCloud \<zx@modelcloud.ai\>) | 2025-03-05 01:11:00 -08:00 |
| Lianmin Zheng | dc1881326f | Fix perf regression on small batch sizes (#3008) | 2025-01-20 03:39:49 -08:00 |
| bjmsong | 0bb0f76311 | Support FP8 E4M3 KV Cache (#2786) (Co-authored-by: root \<bjmsong@126.com\>) | 2025-01-12 21:17:11 -08:00 |
| Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00 |
| Ke Bao | ec52464dde | MLA prefill w/o weight absorption (#2349) | 2024-12-05 01:50:28 +08:00 |
| Xuehai Pan | 62a4a339eb | docs: fix module docstrings and copyright headers (#2077) | 2024-11-22 22:16:53 +08:00 |
| Liangsheng Yin | 100f5b8bc9 | Simplify flashinfer dispatch (#1552) | 2024-10-01 00:28:42 -07:00 |
| Lianmin Zheng | 36d5acfca5 | Rename InputMetadata -> ForwardBatch (#1543) | 2024-09-30 02:41:11 -07:00 |
| Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Lianmin Zheng | 3a6e8b6d78 | [Minor] move triton attention kernels into a separate folder (#1379) | 2024-09-10 15:15:08 -07:00 |
| Liangsheng Yin | 69b3bb9ae1 | Unify forward mode (#1360) | 2024-09-09 13:49:29 -07:00 |
| Ke Bao | 2c615d120f | [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) (Co-authored-by: Yineng Zhang \<me@zhyncs.com\>) | 2024-08-25 17:38:11 -07:00 |
| Ying Sheng | 93d4e354d8 | [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) | 2024-08-15 10:33:20 -07:00 |
| Ying Sheng | 14cb544d56 | [Fix] fix flashinfer usage for window attention (#1107) | 2024-08-15 00:53:24 -07:00 |
| Ying Sheng | 96a2093ef0 | [Fix] Compatibility of window attention and cuda graph (#1090) | 2024-08-14 10:37:01 -07:00 |
| Ying Sheng | 0909bb0d2f | [Feat] Add window attention for gemma-2 (#1056) | 2024-08-13 17:01:26 -07:00 |
| Lianmin Zheng | fb1f28cbbb | Clean up the comments and names under python/sglang/srt/layers (#1047) | 2024-08-12 05:54:37 +00:00 |
| Liangsheng Yin | 87e8c090e9 | Organize code (rename, movement) (#953) | 2024-08-06 20:50:32 -07:00 |
| Ke Bao | e1eae1fd15 | Support MLA for DeepSeek-V2 with Triton - step 1 (#905) | 2024-08-05 03:40:33 +10:00 |
| Ying Sheng | e7487b08bc | Adjust default mem fraction to avoid OOM (#823) | 2024-07-30 01:58:31 -07:00 |
| Liangsheng Yin | cdcbde5fc3 | Code structure refactor (#807) | 2024-07-29 23:04:48 -07:00 |
| Yineng Zhang | dd7e8b9421 | chore: add copyright for srt (#790) | 2024-07-28 23:07:12 +10:00 |
| Lianmin Zheng | 752e643007 | Allow disabling flashinfer sampling kernel (#778) | 2024-07-27 20:18:56 -07:00 |
| Mingyi | a523a3c13a | Reduce hardcoded logic of kernel usage (#707) | 2024-07-23 16:42:21 -07:00 |
| Ying Sheng | cf99eab7d5 | Fix flashinfer (#700) | 2024-07-23 01:27:01 -07:00 |
| Liangsheng Yin | 69d19188fc | Decouple kv (#679) | 2024-07-20 14:16:45 -07:00 |
| Liangsheng Yin | a9ef49c12c | Detokenize incrementally when streaming (#653) | 2024-07-18 17:57:40 -07:00 |
| Liangsheng Yin | abd5385ac5 | Move global_server_args_dict (#642) | 2024-07-17 13:49:15 -07:00 |
| Ying Sheng | 5949b1ca0e | Fix memory pool index error (#616) | 2024-07-13 16:45:11 -07:00 |
| Liangsheng Yin | 10143e1a5f | Memorypool chunked prefetch (#614) | 2024-07-13 15:24:03 -07:00 |
| Lianmin Zheng | 396a69240f | Cleanup attention backend: flashinfer and triton (#611) | 2024-07-12 18:21:11 -07:00 |
| Lianmin Zheng | af4e7910e7 | Clean up the usage of flashinfer (#610) | 2024-07-12 13:00:03 -07:00 |
| Lianmin Zheng | 519e20cfda | Code clean up: Remove deprecated prefill, move InputMetadata to infer_batch.py (#609) | 2024-07-12 12:28:09 -07:00 |
| Liangsheng Yin | f25b76c02a | add LogitsMetadata (#604) | 2024-07-08 17:46:55 -07:00 |
| Mingyi | f4e885b7c3 | Reduce number of workspaces (#601) | 2024-07-07 19:35:22 -07:00 |
| Ying Sheng | dc1b8bcfaa | Format (#593) | 2024-07-05 10:06:17 -07:00 |
| Ying Sheng | 5a57b8addd | Add Gemma2 (#592) | 2024-07-05 09:48:54 -07:00 |
| Ying Sheng | 2a754e57b0 | 2x performance improvement for large prefill & Fix workspace conflicts (#579) | 2024-07-03 16:14:57 -07:00 |
| Ying Sheng | 9380f50ff9 | Turn on flashinfer by default (#578) | 2024-07-02 02:25:07 -07:00 |
| Yueyang Pan | d9ac639202 | Fix flashinfer version (#576) | 2024-07-01 22:08:39 -07:00 |
| Lianmin Zheng | b7e2f800ac | Update flashinfer to 0.0.5 (#554) | 2024-06-20 20:29:06 -07:00 |