62 Commits

Author SHA1 Message Date
narutolhy
839c93bd2d feat: add original logprobs to response (#8375)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
2025-08-29 11:43:57 -07:00
zyksir
aee094e430 add support for nvidia/gpt-oss-120b-Eagle3 (#9739) 2025-08-28 00:20:20 -07:00
Qiaolin Yu
9c0c1e30b2 Disable torch.compile for get_last_loc_large_page_size_large_top_k (#9507)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-08-22 02:05:02 -07:00
Qiaolin Yu
9ec314c6ac Support speculative decoding in the trtllm_mha attention backend (#9331)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-08-21 23:53:35 -07:00
pranavm-nvidia
64574ef8c0 Enables speculative decoding for the trtllm_mla attention backend (#9238) 2025-08-21 01:18:21 -07:00
Liangsheng Yin
eb19ccadae [bug] fix errors related to context length in SD (#9388) 2025-08-21 10:32:34 +08:00
zyksir
6a9d6ca33c fix unexpected answer in EAGLE mode (#9252) 2025-08-16 17:45:36 -07:00
valarLip
53f7874ae6 refine aiter_backend for mtp (#7279)
Co-authored-by: HAI <hixiao@gmail.com>
2025-08-08 11:06:02 -07:00
Cheng Wan
7a1f7fc504 [Feature] Hybrid EP and TP (#8590) 2025-07-31 02:53:25 -07:00
Cheng Wan
c0fb25e949 DP Enhancement (#8280) 2025-07-24 21:36:21 -07:00
Cheng Wan
6c903611ca Fix incorrect spec_num_draft_tokens in draft_extend (#7757) 2025-07-05 02:18:16 -07:00
lukec
886d344964 support llama4 eagle3 (#6985)
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-06-30 22:34:10 -07:00
u4lr451
ed0a0b692c Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-23 17:34:13 -07:00
Cheng Wan
f8d48fd311 Fix dtype for idle input in spec decoding (#7456) 2025-06-23 11:23:25 -07:00
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-06-17 00:33:28 -07:00
Lianmin Zheng
c64290dcb5 Use seq_len_fill_value in the cuda graph runners (#7233) 2025-06-16 15:57:07 -07:00
Lianmin Zheng
53a525bf33 [Eagle] Fix kernel call after updating speculative sampling kernels (#7231) 2025-06-16 07:25:59 -07:00
Lianmin Zheng
b1286a116a [EAGLE] Refactor code for page size > 1 & more simplifications (#7213) 2025-06-16 03:04:29 -07:00
Lianmin Zheng
fff10809bf Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" (#7210) 2025-06-15 02:48:00 -07:00
Lianmin Zheng
5f1ab32717 [EAGLE] Refactor code for page size > 1 & more simplifications (#7163) 2025-06-14 23:16:23 -07:00
kyle-pena-kuzco
b56de8f943 Open AI API hidden states (#6716) 2025-06-10 14:37:29 -07:00
Lianmin Zheng
dc0705a504 Simplify prepare_extend_after_decode (#6987) 2025-06-09 16:39:21 -07:00
Lianmin Zheng
0c1f03a23d Sync cuda graph runners (#6976) 2025-06-08 16:12:25 -07:00
Ke Bao
a2cb5913a0 Add draft extend CUDA graph for flashinfer backend (#6805) 2025-06-02 01:51:26 -07:00
Ke Bao
7e41290082 Add draft extend CUDA graph for Triton backend (#6705) 2025-05-29 00:13:07 -07:00
Ke Bao
631950280a Support EAGLE draft extend CUDA graph (#6606)
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>
2025-05-27 02:35:17 -07:00
Ke Bao
6ce0ed073b Apply constraint grammar to EAGLE (#6499)
Co-authored-by: merrymercy <lianminzheng@gmail.com>
2025-05-21 17:18:41 -07:00
Lifu Huang
3cf1473a09 Use monotonic clock for interval measurement (#6211)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-17 16:49:18 -07:00
quinnrong94
2e4babdb0a [Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109)
Co-authored-by: Yingyi <yingyihuang2000@outlook.com>
Co-authored-by: neiltian <neiltian@tencent.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: kexueyu <kexueyu@tencent.com>
Co-authored-by: vincentmeng <vincentmeng@tencent.com>
Co-authored-by: pengmeng <pengmeng@tencent.com>
2025-05-15 00:48:09 -07:00
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-13 01:42:38 +08:00
Lianmin Zheng
fba8eccd7e Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-05-12 00:17:33 -07:00
applesaucethebun
2ce8793519 Add typo checker in pre-commit (#6179)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-11 12:55:00 +08:00
Ying Sheng
11383cec3c [PP] Add pipeline parallelism (#5724) 2025-04-30 18:18:07 -07:00
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
u4lr451
211c7b31b8 Fix: Incorrect parameters passed to forward_batch_generation (#5506) (#5511) 2025-04-17 18:49:59 -07:00
fzyzcjy
86a876d883 Optimize topk operation in llama4 (#5128) 2025-04-09 02:50:22 -07:00
Baizhou Zhang
efbae697b3 [Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052) 2025-04-05 01:23:02 -07:00
Qingquan Song
e983e43248 Add Eagle Speculative Decoding to FA3 Backend (#4951)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>
2025-04-02 13:09:02 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
Brayden Zhong
e84f4ba0ab [Misc] Fix issues reported by torchfix (#4837) 2025-03-27 20:10:32 -07:00
James Liu
9e0186f352 [Feature] Support EAGLE 3 (#4247) 2025-03-18 07:35:23 -07:00
Ying Sheng
1b859295f4 [Eagle] Remove the greedy branch and some redundant code (#4363)
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-16 02:48:55 -07:00
Baizhou Zhang
9fb48f951f Support nextn for flashinfer mla attention backend (#4218) 2025-03-09 00:01:54 -08:00
Lianmin Zheng
d4017a6b63 [EAGLE] many fixes for eagle (#4195)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-07 22:12:13 -08:00
Lianmin Zheng
bc1534ff32 Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134)
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-06 06:13:59 -08:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
2025-03-05 08:06:07 -08:00
Lianmin Zheng
77a3954bf7 Simplify eagle tests and TP sync in grammar backend (#4066) 2025-03-04 13:40:40 -08:00
William
0d4e3228cf [Feature] Add test for speculative_token_map (#4016) 2025-03-04 04:26:24 -08:00
Ke Bao
9fafa62db7 Share target model embed and head weights for nextn (#4033) 2025-03-03 13:30:04 -08:00