Commit Graph

84 Commits

Author SHA1 Message Date
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873) 2025-09-18 16:42:41 -07:00
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Yineng Zhang
de15d1405a Revert "Fix flashinfer version in sgl-kernel (#10135)" (#10310) 2025-09-11 01:27:58 -07:00
Lianmin Zheng
76a2c86b88 Fix flashinfer version in sgl-kernel (#10135) 2025-09-07 12:54:07 -07:00
Ben Barsdell
a12061df4c Fix cuda graph mode in flashinfer attn backend (#10056) 2025-09-06 22:59:48 -07:00
Lianmin Zheng
fd71b11b1d move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py (#9679) 2025-08-27 03:34:29 -07:00
Wenxuan Tan
0f587e80d3 Use Tensor Core Decode when gqa group size >= 4 (#8624) 2025-08-22 23:25:15 +08:00
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Nathan Wang
24eaebeb4b Fix FlashInfer GPU <-> CPU sync (#9409) 2025-08-20 15:26:12 -07:00
eigen
4dbf43601d fix: zero_init buffer (#9065) 2025-08-14 02:39:09 -07:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Lianmin Zheng
9e426466af Clean up allocators (#9134) 2025-08-13 13:56:04 -07:00
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
DarkSharpness
7ba5ad5766 [Fix] Fix flashinfer cpu <-> gpu synchronization (#8340) 2025-08-10 03:11:40 +00:00
Clay
cbdfb77123 Enable FlashInfer support encoder models and add head_dim padding workaround (#6230) 2025-07-19 19:30:16 -07:00
Hanming Lu
9379da77de SWA Prefix Cache (#7367) 2025-07-13 12:31:07 -07:00
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081) 2025-06-17 00:33:28 -07:00
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
Lianmin Zheng
c64290dcb5 Use seq_len_fill_value in the cuda graph runners (#7233) 2025-06-16 15:57:07 -07:00
Lianmin Zheng
b1286a116a [EAGLE] Refactor code for page size > 1 & more simplifications (#7213) 2025-06-16 03:04:29 -07:00
Lianmin Zheng
fff10809bf Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" (#7210) 2025-06-15 02:48:00 -07:00
Lianmin Zheng
5f1ab32717 [EAGLE] Refactor code for page size > 1 & more simplifications (#7163) 2025-06-14 23:16:23 -07:00
Jianan Ji
5f91c82526 [Feature] Support Flashinfer fmha on Blackwell (#6930) 2025-06-06 12:57:50 -07:00
Ke Bao
a2cb5913a0 Add draft extend CUDA graph for flashinfer backend (#6805) 2025-06-02 01:51:26 -07:00
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209) 2025-05-13 01:42:38 +08:00
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Wenxuan Tan
22da3d978f Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (#5555) 2025-05-05 10:32:17 -07:00
Baizhou Zhang
bf203cb7a2 [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (#5992) 2025-05-04 09:49:13 -07:00
ryang
4322c31e24 Support XiaomiMiMo/MiMo model inference (#5921) 2025-05-01 07:41:13 -07:00
Baizhou Zhang
799789afed Bump Flashinfer to 0.2.5 (#5870) 2025-04-29 19:50:57 -07:00
Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>
Yineng Zhang
a6f892e5d0 Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) 2025-04-18 16:50:21 -07:00
yhyang201
4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) 2025-04-18 09:51:29 -07:00
Wenxuan Tan
bfa3922451 Avoid computing lse in Ragged Prefill when there's no prefix. (#5476) 2025-04-18 01:13:57 -07:00
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Yun Dai
2695ab0537 Fix loading KV quantization scale; Enable modelopt kv cache (#4686) 2025-04-08 09:11:35 -07:00
Co-authored-by: qingquansong <ustcsqq@gmail.com>
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
JieXin Liang
9e93ef3f8e [fix] fix illegal mem access and clean up triton attention backend (#4571) 2025-03-20 02:01:52 -07:00
JieXin Liang
c0e9a36c5f Optimize Triton decoding kernel for dynamic workload (#4553) 2025-03-18 21:25:38 -07:00
Lianmin Zheng
82dec1f70b Remove redundant type conversion (#4513) 2025-03-17 05:57:35 -07:00
Lianmin Zheng
aa957102a9 Simplify tests & Fix trtllm custom allreduce registration (#4252) 2025-03-10 01:24:22 -07:00
yinfan98
ab7fba0ece Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant (#4147) 2025-03-06 11:50:07 -08:00
Lianmin Zheng
bc1534ff32 Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134) 2025-03-06 06:13:59 -08:00
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
Baizhou Zhang
fc91d08a8f [Revision] Add fast decode plan for flashinfer mla (#4012) 2025-03-05 11:20:41 -08:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986) 2025-03-05 08:06:07 -08:00
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Lianmin Zheng
2dd7d0c533 Revert "Fix nightly-test CI" (#4065) 2025-03-04 05:38:24 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988) 2025-03-03 00:12:04 -08:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
yinfan98
b4d34cd35d Fix nightly-test CI (#3826) 2025-03-02 23:14:45 -08:00
Lianmin Zheng
9e1014cf99 Revert "Add fast decode plan for flashinfer mla" (#4008) 2025-03-02 19:29:10 -08:00
Baizhou Zhang
fa56106731 Add fast decode plan for flashinfer mla (#3987) 2025-03-02 19:16:37 -08:00
Baizhou Zhang
90a4b7d98a [Feature]Support ragged prefill in flashinfer mla backend (#3967) 2025-02-28 18:13:56 -08:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
Ke Bao
e5ce395a6c Fix draft decode max batch size (#3676) 2025-02-18 23:03:26 +08:00
Yineng Zhang
714f3e6362 feat: support flashinfer mla with prefix cache (#3643) 2025-02-18 02:06:43 +08:00
Yineng Zhang
70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) 2025-02-14 08:50:14 +08:00
Ying Sheng
d23cb9a01e [Eagle] reduce one draft forward (#3468) 2025-02-10 20:21:49 +08:00
Yineng Zhang
27c4c9cf52 remove _grouped_size_compiled_for_decode_kernels (#3453) 2025-02-10 13:01:21 +08:00