066cf44546 | 2025-05-18 13:05:38 -07:00 | Chang Su | [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397)
1f30c05d4a | 2025-05-18 12:50:15 -07:00 | JieXin Liang | [fix] fix fa3 forward_decode with spec_decode (#6395)
    Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
205d5cb407 | 2025-05-17 01:45:46 -07:00 | Chang Su | perf: Optimize local attention memory allocation in FlashAttentionBackend (#6356)
912788c095 | 2025-05-13 17:18:38 -07:00 | Chang Su | perf: optimize local_block_table memory allocation (#6273)
2ce8793519 | 2025-05-11 12:55:00 +08:00 | applesaucethebun | Add typo checker in pre-commit (#6179)
    Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
1940cdec61 | 2025-05-09 15:33:02 -07:00 | Chang Su | [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (#6162)
79961afa82 | 2025-05-07 23:40:08 -07:00 | Minglei Zhu | optimize pad operations in fa3 to accelarate 100+us (#6077)
1acca3a2c6 | 2025-05-02 00:26:12 -07:00 | Lifu Huang | FA3 speed up: skip len operation and get batch size directly from forward batch (#5969)
    Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
6fc175968c | 2025-05-01 10:48:55 -07:00 | Stefan He | Optimize a pad operation to accelerate 25us (#5945)
799c4bb502 | 2025-04-26 18:42:22 -07:00 | Ke Bao | Fuse MLA set kv cache kernel (#5748)
6b6e748775 | 2025-04-22 11:43:12 -07:00 | Ke Bao | Remove q concat in FA3 backend for DeepSeek decode (#5638)
188f0955fa | 2025-04-20 22:58:28 -07:00 | Qingquan Song | Add Speculative Decoding Eagle3 topk > 1 (#5318)
    Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
    Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
c776234b45 | 2025-04-17 02:07:43 -07:00 | Chang Su | Enable local attention during decode (#5479)
4fb05583ef | 2025-04-17 01:43:14 -07:00 | Baizhou Zhang | Deprecate disable-mla (#5481)
a42736bbb8 | 2025-04-15 22:01:22 -07:00 | Baizhou Zhang | Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113)
fa909dc3c4 | 2025-04-15 14:45:15 -07:00 | Yineng Zhang | feat: update model_specific_adjustment (#5344)
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
57de7c6b5f | 2025-04-12 01:09:25 -07:00 | Yineng Zhang | feat: use fa3 mla by default on hopper (#5210)
    Co-authored-by: yundai424 <yundai424@gmail.com>
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
aea98512a8 | 2025-04-11 23:37:52 -07:00 | Qingquan Song | Fix fa3 window size setup (#5316)
aee62d744b | 2025-04-11 00:34:17 -07:00 | Chang Su | Optimize GPU memory usage in FlashAttentionBackend's strided indexing (#5262)
    Co-authored-by: ch-wan <cwan39@gatech.edu>
aac531c53b | 2025-04-08 18:43:13 -07:00 | Chang Su | [Bugfix] Fix index out of bounds in local attention with large sequences (#5173)
2695ab0537 | 2025-04-08 09:11:35 -07:00 | Yun Dai | Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
    Co-authored-by: qingquansong <ustcsqq@gmail.com>
804d9f2e4c | 2025-04-07 23:20:51 -07:00 | Yubo Wang | Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760)
a7c3f74bec | 2025-04-07 22:58:08 -07:00 | Chunan Zeng | [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103)
93470a1411 | 2025-04-07 11:52:42 -07:00 | Stefan He | Refactor and Optimize FA3 Code (#5090)
    Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
f04c80dc42 | 2025-04-07 00:29:36 -07:00 | Chang Su | Add Llama4 support (#5092)
    Co-authored-by: Cheng Wan <cwan39@gatech.edu>
    Co-authored-by: fzyzcjy <ch271828n@outlook.com>
    Co-authored-by: ispobock <ispobaoke@163.com>
ca8d02abd5 | 2025-04-04 23:03:59 -07:00 | Stefan He | FA3 Spec Decoding to support top k = 1 and add cuda graph support (#5050)
    Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
    Co-authored-by: Chunan Zeng <zcnrex@gmail.com>
e983e43248 | 2025-04-02 13:09:02 -07:00 | Qingquan Song | Add Eagle Speculative Decoding to FA3 Backend (#4951)
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
    Co-authored-by: zcnrex <zcnrex@gmail.com>
1c63e79756 | 2025-03-31 16:14:49 -07:00 | Yineng Zhang | use fa3 in sgl-kernel (#4954)
e62d60fe6d | 2025-03-30 13:53:44 -07:00 | Baizhou Zhang | [Fix] avoid stream sync and torch compile in prefill for fa3 backend (#4932)
20c90be23d | 2025-03-28 18:30:14 -07:00 | Baizhou Zhang | [Feature] Support FA3 backend for MLA (#4831)
6ffb6bd47a | 2025-03-28 01:35:59 -07:00 | Qingquan Song | Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (#4855)
26c0f13126 | 2025-03-27 22:07:14 -07:00 | Stefan He | Support Page Size > 1 for FA3 (#4832)
    Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
1b9175cb23 | 2025-03-27 00:45:11 -07:00 | Stefan He | [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 (#4745)
    Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
5d7edc8e55 | 2025-03-23 23:28:11 -07:00 | Stefan He | Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
    Co-authored-by: qsong <qsong@linkedin.com>
    Co-authored-by: qingquansong <ustcsqq@gmail.com>