Commit Graph

204 Commits

Author SHA1 Message Date
Qiaolin Yu
41650b0d70 feat: support compatibility between MTP and two-batch-overlap (#7225)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-06-27 01:10:27 -07:00
Yuhong Guo
a1c1ebe935 Fix FP8 KV Cache Support in FA3 Backend (#7148) 2025-06-25 02:14:40 -07:00
u4lr451
ed0a0b692c Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-23 17:34:13 -07:00
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-17 00:33:28 -07:00
Lianmin Zheng
c64290dcb5 Use seq_len_fill_value in the cuda graph runners (#7233) 2025-06-16 15:57:07 -07:00
Lianmin Zheng
7ddf8e83d2 [EAGLE] Fix draft kv cache layout for fa3 and topk > 1 (#7239) 2025-06-16 05:47:51 -07:00
Lianmin Zheng
b1286a116a [EAGLE] Refactor code for page size > 1 & more simplifications (#7213) 2025-06-16 03:04:29 -07:00
Lianmin Zheng
21615cc3fe Minor style and doc fix (#7228) 2025-06-16 01:03:13 -07:00
Lianmin Zheng
fff10809bf Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" (#7210) 2025-06-15 02:48:00 -07:00
Lianmin Zheng
5f1ab32717 [EAGLE] Refactor code for page size > 1 & more simplifications (#7163) 2025-06-14 23:16:23 -07:00
Lianmin Zheng
38af4f68a9 Fix grammar abort & Minor style fixes (#7204) 2025-06-14 22:49:41 -07:00
JieXin Liang
ed89837cf4 chore: upgrade sgl-kernel v0.1.8.post2 (#7186)
Co-authored-by: zhyncs <me@zhyncs.com>
2025-06-14 18:26:18 -07:00
JieXin Liang
ab1a4fa5cb [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla (#7184) 2025-06-14 12:45:41 -07:00
Lianmin Zheng
be2d985df8 Minor style change of triton backend (#7165) 2025-06-13 16:01:23 -07:00
fzyzcjy
c49c1d9226 Remove 200us slow concat kernel (part 2: srt) (#7020) 2025-06-13 15:19:31 -07:00
Binyao Jiang
22a6b9fc05 Remove unnecessary metadata_expand.max_seq_len_k operations in fa3 to… (#7140) 2025-06-12 23:25:52 -07:00
Mick
83d87685c5 vlm: adapt internvl to VisionAttention (#6870) 2025-06-11 01:16:04 -07:00
Lianmin Zheng
6b12d6a8d5 Simplify the heuristics for setting --mem-fraction-static (#7054) 2025-06-10 19:01:39 -07:00
kk
8ea7df6114 [WA] fix output data is nan in CI test "test_moe_eval_accuracy_large.py" (#7021)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-06-10 16:08:10 +00:00
Lianmin Zheng
2dae104dca Minor cleanup of fa3 backend (#6999) 2025-06-10 03:58:44 -07:00
fzyzcjy
e58423b2b9 Fix cutlass MLA gets almost zero accuracy (#6998) 2025-06-09 10:16:29 -07:00
Lianmin Zheng
9d5fa68b90 Use torch.compile to fuse flash attention decode metadata preparation (#6973) 2025-06-08 23:05:40 -07:00
Jianan Ji
5f91c82526 [Feature] Support Flashinfer fmha on Blackwell (#6930) 2025-06-06 12:57:50 -07:00
HAI
b819381fec AITER backend extension and workload optimizations (#6838)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
2025-06-05 23:00:18 -07:00
Ke Bao
a2cb5913a0 Add draft extend CUDA graph for flashinfer backend (#6805) 2025-06-02 01:51:26 -07:00
YanbingJiang
888cb175a6 Add intel_amx backend for Radix Attention for CPU (#6408)
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
2025-05-30 21:37:42 -07:00
Jianan Ji
22630ca242 Support sliding window in triton backend (#6509) 2025-05-30 01:11:53 -07:00
Ke Bao
7e41290082 Add draft extend CUDA graph for Triton backend (#6705) 2025-05-29 00:13:07 -07:00
fzyzcjy
ae6a5b2950 Minor refactor two-batch overlap (#6682) 2025-05-28 15:54:17 -07:00
Ke Bao
631950280a Support EAGLE draft extend CUDA graph (#6606)
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>
2025-05-27 02:35:17 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
kk
7a5e6ce1cb Fix GPU OOM (#6564)
Co-authored-by: michael <michael.zhang@amd.com>
2025-05-24 16:38:39 -07:00
HAI
5c0b38f369 aiter attention-backend (default enabled on AMD/ROCm) (#6381) 2025-05-20 22:52:41 -07:00
JieXin Liang
69af3ec35f [doc] add note for get_num_kv_splits in triton_backend (#6444) 2025-05-19 21:40:21 -07:00
Chang Su
066cf44546 [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397) 2025-05-18 13:05:38 -07:00
JieXin Liang
1f30c05d4a [fix] fix fa3 forward_decode with spec_decode (#6395)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-05-18 12:50:15 -07:00
Mick
01dd39bac1 refactor: minor refactors regarding multimodal processing (#6187) 2025-05-17 22:53:20 -07:00
Chang Su
205d5cb407 perf: Optimize local attention memory allocation in FlashAttentionBackend (#6356) 2025-05-17 01:45:46 -07:00
quinnrong94
2e4babdb0a [Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109)
Co-authored-by: Yingyi <yingyihuang2000@outlook.com>
Co-authored-by: neiltian <neiltian@tencent.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: kexueyu <kexueyu@tencent.com>
Co-authored-by: vincentmeng <vincentmeng@tencent.com>
Co-authored-by: pengmeng <pengmeng@tencent.com>
2025-05-15 00:48:09 -07:00
Chang Su
912788c095 perf: optimize local_block_table memory allocation (#6273) 2025-05-13 17:18:38 -07:00
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-13 01:42:38 +08:00
Lianmin Zheng
fba8eccd7e Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-05-12 00:17:33 -07:00
applesaucethebun
2ce8793519 Add typo checker in pre-commit (#6179)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-11 12:55:00 +08:00
Chang Su
1940cdec61 [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (#6162) 2025-05-09 15:33:02 -07:00
xu-yfei
e30c273bc9 opt flashinfer mla cat (#5822)
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
2025-05-08 23:17:14 -07:00
Zhu Chen
fa7d7fd9e5 [Feature] Add FlashAttention3 as a backend for VisionAttention (#5764)
Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
2025-05-08 10:01:19 -07:00
Minglei Zhu
79961afa82 optimize pad operations in fa3 to accelerate 100+us (#6077) 2025-05-07 23:40:08 -07:00
DefTruth
82653f6622 feat: Add a unified merge_state API (#5428) 2025-05-05 10:32:33 -07:00
Wenxuan Tan
22da3d978f Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (#5555) 2025-05-05 10:32:17 -07:00