Commit Graph

204 Commits

Author SHA1 Message Date
Baizhou Zhang
bf203cb7a2 [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (#5992) 2025-05-04 09:49:13 -07:00
Lifu Huang
1acca3a2c6 FA3 speed up: skip len operation and get batch size directly from forward batch (#5969)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-02 00:26:12 -07:00
Stefan He
6fc175968c Optimize a pad operation to save ~25 µs (#5945) 2025-05-01 10:48:55 -07:00
ryang
4322c31e24 Support XiaomiMiMo/MiMo model inference (#5921) 2025-05-01 07:41:13 -07:00
Baizhou Zhang
799789afed Bump Flashinfer to 0.2.5 (#5870)
Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>
2025-04-29 19:50:57 -07:00
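The bump above pins the tree to FlashInfer 0.2.5. A minimal sketch of matching that pin in a local environment (assuming the PyPI `flashinfer-python` distribution; verify the exact package name and any CUDA/torch-specific wheel index against the repo's own install docs):

```
# Pin FlashInfer to the version named in commit 799789afed.
# Package name and plain-PyPI install are assumptions; some setups
# instead install from a CUDA/torch-specific wheel index.
pip install flashinfer-python==0.2.5
```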
pengcuo
8e5a6d3441 [Fix] Fix a bug for flashmla to run R1 model (#5875)
Co-authored-by: pengcuo <dgpengcuo@gmail.com>
2025-04-29 01:03:13 -07:00
Trevor Morris
8d463fe351 Cutlass MLA decode - fix dtype error (#5868) 2025-04-28 21:12:58 -07:00
Trevor Morris
84810da4ae Add Cutlass MLA attention backend (#5390) 2025-04-27 20:58:53 -07:00
Ke Bao
799c4bb502 Fuse MLA set kv cache kernel (#5748) 2025-04-26 18:42:22 -07:00
liwenju0
4d1e52abea Add an assertion to enhance the robustness of the operator (#5736) 2025-04-26 18:09:12 -07:00
Ke Bao
6b6e748775 Remove q concat in FA3 backend for DeepSeek decode (#5638) 2025-04-22 11:43:12 -07:00
lukec
2ed96c7a8a fix flashmla bug (#5272) 2025-04-22 10:36:23 -07:00
Qingquan Song
188f0955fa Add Speculative Decoding Eagle3 topk > 1 (#5318)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
2025-04-20 22:58:28 -07:00
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
Yineng Zhang
a6f892e5d0 Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) 2025-04-18 16:50:21 -07:00
yhyang201
4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) 2025-04-18 09:51:29 -07:00
Wenxuan Tan
bfa3922451 Avoid computing lse in Ragged Prefill when there's no prefix. (#5476)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-04-18 01:13:57 -07:00
Chang Su
c776234b45 Enable local attention during decode (#5479) 2025-04-17 02:07:43 -07:00
woodx
3bface15e6 Feat/support encoder model (like bert) (#4887) 2025-04-17 01:50:48 -07:00
Baizhou Zhang
4fb05583ef Deprecate disable-mla (#5481) 2025-04-17 01:43:14 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
Yineng Zhang
fa909dc3c4 feat: update model_specific_adjustment (#5344)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-15 14:45:15 -07:00
Yineng Zhang
57de7c6b5f feat: use fa3 mla by default on hopper (#5210)
Co-authored-by: yundai424 <yundai424@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-12 01:09:25 -07:00
Qingquan Song
aea98512a8 Fix fa3 window size setup (#5316) 2025-04-11 23:37:52 -07:00
Mick
e53a0b3d5b [fix] fix mrope positions not picked up (#5265) 2025-04-11 01:29:45 -07:00
Chang Su
aee62d744b Optimize GPU memory usage in FlashAttentionBackend's strided indexing (#5262)
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-04-11 00:34:17 -07:00
Chang Su
aac531c53b [Bugfix] Fix index out of bounds in local attention with large sequences (#5173) 2025-04-08 18:43:13 -07:00
Yun Dai
2695ab0537 Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-04-08 09:11:35 -07:00
Yubo Wang
804d9f2e4c Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760) 2025-04-07 23:20:51 -07:00
Chunan Zeng
a7c3f74bec [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103) 2025-04-07 22:58:08 -07:00
Stefan He
93470a1411 Refactor and Optimize FA3 Code (#5090)
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
2025-04-07 11:52:42 -07:00
Chang Su
f04c80dc42 Add Llama4 support (#5092)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@163.com>
2025-04-07 00:29:36 -07:00
Baizhou Zhang
efbae697b3 [Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052) 2025-04-05 01:23:02 -07:00
Stefan He
ca8d02abd5 FA3 Spec Decoding to support top k = 1 and add cuda graph support (#5050)
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>
2025-04-04 23:03:59 -07:00
Lianmin Zheng
74885a848b Revert "Replace enable_flashinfer_mla argument with attention_backend" (#5048) 2025-04-03 13:30:56 -07:00
Baizhou Zhang
e8999b13b7 Replace enable_flashinfer_mla argument with attention_backend (#5005) 2025-04-03 02:53:58 -07:00
Qingquan Song
e983e43248 Add Eagle Speculative Decoding to FA3 Backend (#4951)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>
2025-04-02 13:09:02 -07:00
Yineng Zhang
1c63e79756 use fa3 in sgl-kernel (#4954) 2025-03-31 16:14:49 -07:00
Baizhou Zhang
e62d60fe6d [Fix] avoid stream sync and torch compile in prefill for fa3 backend (#4932) 2025-03-30 13:53:44 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
Baizhou Zhang
20c90be23d [Feature] Support FA3 backend for MLA (#4831) 2025-03-28 18:30:14 -07:00
Qingquan Song
6ffb6bd47a Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (#4855) 2025-03-28 01:35:59 -07:00
Stefan He
26c0f13126 Support Page Size > 1 for FA3 (#4832)
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-03-27 22:07:14 -07:00
Stefan He
1b9175cb23 [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 (#4745)
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
2025-03-27 00:45:11 -07:00
Yi Pan
45fdf1f7f3 Fix shared memory OOM on sm86 GPUs. (#4797) 2025-03-26 10:41:53 -07:00
Thysrael
ced35a0649 fix(typo): fix reply to replay in base_attn_backend.py (#4784) 2025-03-26 00:19:12 -07:00
lukec
57eec0bfbc fix FlashMLA cudagraph config (#4691)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-24 21:06:58 -07:00
Stefan He
5d7edc8e55 Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
Co-authored-by: qsong <qsong@linkedin.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-03-23 23:28:11 -07:00
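The commit above makes FA3 a selectable attention backend via the flag named in its title. A launch sketch (the model path and port are illustrative placeholders, not from the log; per later commits in this history, FA3 targets Hopper-class GPUs):

```
# Start an SGLang server using the FlashAttention 3 backend.
# --attention-backend fa3 comes from commit 5d7edc8e55; the model
# path and port below are hypothetical placeholders.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --attention-backend fa3 \
  --port 30000
```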
Mick
11577cedb7 refactor: bug fixes and refactor for vlm (#4661) 2025-03-22 22:48:49 -07:00
JieXin Liang
9e93ef3f8e [fix] fix illegal mem access and clean up triton attention backend (#4571) 2025-03-20 02:01:52 -07:00