Qiaolin Yu | 41650b0d70 | 2025-06-27 01:10:27 -07:00
feat: support compatibility between MTP and two-batch-overlap (#7225)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

Yuhong Guo | a1c1ebe935 | 2025-06-25 02:14:40 -07:00
Fix FP8 KV Cache Support in FA3 Backend (#7148)

u4lr451 | ed0a0b692c | 2025-06-23 17:34:13 -07:00
Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

u4lr451 | 10d60cd41b | 2025-06-17 00:33:28 -07:00
feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

Lianmin Zheng | c64290dcb5 | 2025-06-16 15:57:07 -07:00
Use seq_len_fill_value in the cuda graph runners (#7233)

Lianmin Zheng | 7ddf8e83d2 | 2025-06-16 05:47:51 -07:00
[EAGLE] Fix draft kv cache layout for fa3 and topk > 1 (#7239)

Lianmin Zheng | b1286a116a | 2025-06-16 03:04:29 -07:00
[EAGLE] Refactor code for page size > 1 & more simplifications (#7213)

Lianmin Zheng | 21615cc3fe | 2025-06-16 01:03:13 -07:00
Minor style and doc fix (#7228)

Lianmin Zheng | fff10809bf | 2025-06-15 02:48:00 -07:00
Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" (#7210)

Lianmin Zheng | 5f1ab32717 | 2025-06-14 23:16:23 -07:00
[EAGLE] Refactor code for page size > 1 & more simplifications (#7163)

Lianmin Zheng | 38af4f68a9 | 2025-06-14 22:49:41 -07:00
Fix grammar abort & Minor style fixes (#7204)

JieXin Liang | ed89837cf4 | 2025-06-14 18:26:18 -07:00
chore: upgrade sgl-kernel v0.1.8.post2 (#7186)
Co-authored-by: zhyncs <me@zhyncs.com>

JieXin Liang | ab1a4fa5cb | 2025-06-14 12:45:41 -07:00
[fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla (#7184)

Lianmin Zheng | be2d985df8 | 2025-06-13 16:01:23 -07:00
Minor style change of triton backend (#7165)

fzyzcjy | c49c1d9226 | 2025-06-13 15:19:31 -07:00
Remove 200us slow concat kernel (part 2: srt) (#7020)

Binyao Jiang | 22a6b9fc05 | 2025-06-12 23:25:52 -07:00
Remove unnecessary metadata_expand.max_seq_len_k operations in fa3 to… (#7140)

Mick | 83d87685c5 | 2025-06-11 01:16:04 -07:00
vlm: adapt internvl to VisionAttention (#6870)

Lianmin Zheng | 6b12d6a8d5 | 2025-06-10 19:01:39 -07:00
Simplify the heuristics for setting --mem-fraction-static (#7054)

kk | 8ea7df6114 | 2025-06-10 16:08:10 +00:00
[WA] fix output data is nan in CI test "test_moe_eval_accuracy_large.py" (#7021)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>

Lianmin Zheng | 2dae104dca | 2025-06-10 03:58:44 -07:00
Minor cleanup of fa3 backend (#6999)

fzyzcjy | e58423b2b9 | 2025-06-09 10:16:29 -07:00
Fix cutlass MLA gets almost zero accuracy (#6998)

Lianmin Zheng | 9d5fa68b90 | 2025-06-08 23:05:40 -07:00
Use torch.compile to fuse flash attention decode metadata preparation (#6973)

Jianan Ji | 5f91c82526 | 2025-06-06 12:57:50 -07:00
[Feature] Support Flashinfer fmha on Blackwell (#6930)

HAI | b819381fec | 2025-06-05 23:00:18 -07:00
AITER backend extension and workload optimizations (#6838)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>

Ke Bao | a2cb5913a0 | 2025-06-02 01:51:26 -07:00
Add draft extend CUDA graph for flashinfer backend (#6805)

YanbingJiang | 888cb175a6 | 2025-05-30 21:37:42 -07:00
Add intel_amx backend for Radix Attention for CPU (#6408)
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>

Jianan Ji | 22630ca242 | 2025-05-30 01:11:53 -07:00
Support sliding window in triton backend (#6509)

Ke Bao | 7e41290082 | 2025-05-29 00:13:07 -07:00
Add draft extend CUDA graph for Triton backend (#6705)

fzyzcjy | ae6a5b2950 | 2025-05-28 15:54:17 -07:00
Minor refactor two-batch overlap (#6682)

Ke Bao | 631950280a | 2025-05-27 02:35:17 -07:00
Support EAGLE draft extend CUDA graph (#6606)
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>

fzyzcjy | 0d47788025 | 2025-05-24 17:39:07 -07:00
Support overlapping two batches (#4068)

kk | 7a5e6ce1cb | 2025-05-24 16:38:39 -07:00
Fix GPU OOM (#6564)
Co-authored-by: michael <michael.zhang@amd.com>

HAI | 5c0b38f369 | 2025-05-20 22:52:41 -07:00
aiter attention-backend (default enabled on AMD/ROCm) (#6381)

JieXin Liang | 69af3ec35f | 2025-05-19 21:40:21 -07:00
[doc] add note for get_num_kv_splits in triton_backend (#6444)

Chang Su | 066cf44546 | 2025-05-18 13:05:38 -07:00
[OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397)

JieXin Liang | 1f30c05d4a | 2025-05-18 12:50:15 -07:00
[fix] fix fa3 forward_decode with spec_decode (#6395)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>

Mick | 01dd39bac1 | 2025-05-17 22:53:20 -07:00
refactor: minor refactors regarding multimodal processing (#6187)

Chang Su | 205d5cb407 | 2025-05-17 01:45:46 -07:00
perf: Optimize local attention memory allocation in FlashAttentionBackend (#6356)

quinnrong94 | 2e4babdb0a | 2025-05-15 00:48:09 -07:00
[Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109)
Co-authored-by: Yingyi <yingyihuang2000@outlook.com>
Co-authored-by: neiltian <neiltian@tencent.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: kexueyu <kexueyu@tencent.com>
Co-authored-by: vincentmeng <vincentmeng@tencent.com>
Co-authored-by: pengmeng <pengmeng@tencent.com>

Chang Su | 912788c095 | 2025-05-13 17:18:38 -07:00
perf: optimize local_block_table memory allocation (#6273)

Lianmin Zheng | e8e18dcdcc | 2025-05-12 12:53:26 -07:00
Revert "fix some typos" (#6244)

applesaucethebun | d738ab52f8 | 2025-05-13 01:42:38 +08:00
fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

Lianmin Zheng | fba8eccd7e | 2025-05-12 00:17:33 -07:00
Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>

applesaucethebun | 2ce8793519 | 2025-05-11 12:55:00 +08:00
Add typo checker in pre-commit (#6179)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

Chang Su | 1940cdec61 | 2025-05-09 15:33:02 -07:00
[Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (#6162)

xu-yfei | e30c273bc9 | 2025-05-08 23:17:14 -07:00
opt flashinfer mla cat (#5822)
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>

Zhu Chen | fa7d7fd9e5 | 2025-05-08 10:01:19 -07:00
[Feature] Add FlashAttention3 as a backend for VisionAttention (#5764)
Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>

Minglei Zhu | 79961afa82 | 2025-05-07 23:40:08 -07:00
optimize pad operations in fa3 to accelerate 100+us (#6077)

DefTruth | 82653f6622 | 2025-05-05 10:32:33 -07:00
feat: Add a unified merge_state API (#5428)

Wenxuan Tan | 22da3d978f | 2025-05-05 10:32:17 -07:00
Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (#5555)