narutolhy
|
839c93bd2d
|
feat: add original logprobs to response (#8375)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
|
2025-08-29 11:43:57 -07:00 |
|
zyksir
|
aee094e430
|
add support for nvidia/gpt-oss-120b-Eagle3 (#9739)
|
2025-08-28 00:20:20 -07:00 |
|
Qiaolin Yu
|
9c0c1e30b2
|
Disable torch.compile for get_last_loc_large_page_size_large_top_k (#9507)
Co-authored-by: ispobock <ispobaoke@gmail.com>
|
2025-08-22 02:05:02 -07:00 |
|
Qiaolin Yu
|
9ec314c6ac
|
Support speculative decoding in the trtllm_mha attention backend (#9331)
Co-authored-by: ispobock <ispobaoke@gmail.com>
|
2025-08-21 23:53:35 -07:00 |
|
pranavm-nvidia
|
64574ef8c0
|
Enables speculative decoding for the trtllm_mla attention backend (#9238)
|
2025-08-21 01:18:21 -07:00 |
|
Liangsheng Yin
|
eb19ccadae
|
[bug] fix errors related to context length in SD (#9388)
|
2025-08-21 10:32:34 +08:00 |
|
zyksir
|
6a9d6ca33c
|
fix unexpected answer in EAGLE mode (#9252)
|
2025-08-16 17:45:36 -07:00 |
|
valarLip
|
53f7874ae6
|
refine aiter_backend for mtp (#7279)
Co-authored-by: HAI <hixiao@gmail.com>
|
2025-08-08 11:06:02 -07:00 |
|
Cheng Wan
|
7a1f7fc504
|
[Feature] Hybrid EP and TP (#8590)
|
2025-07-31 02:53:25 -07:00 |
|
Cheng Wan
|
c0fb25e949
|
DP Enhancement (#8280)
|
2025-07-24 21:36:21 -07:00 |
|
Cheng Wan
|
6c903611ca
|
Fix incorrect spec_num_draft_tokens in draft_extend (#7757)
|
2025-07-05 02:18:16 -07:00 |
|
lukec
|
886d344964
|
support llama4 eagle3 (#6985)
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
|
2025-06-30 22:34:10 -07:00 |
|
u4lr451
|
ed0a0b692c
|
Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
|
2025-06-23 17:34:13 -07:00 |
|
Cheng Wan
|
f8d48fd311
|
Fix dtype for idle input in spec decoding (#7456)
|
2025-06-23 11:23:25 -07:00 |
|
u4lr451
|
10d60cd41b
|
feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
|
2025-06-17 00:33:28 -07:00 |
|
Lianmin Zheng
|
c64290dcb5
|
Use seq_len_fill_value in the cuda graph runners (#7233)
|
2025-06-16 15:57:07 -07:00 |
|
Lianmin Zheng
|
53a525bf33
|
[Eagle] Fix kernel call after updating speculative sampling kernels (#7231)
|
2025-06-16 07:25:59 -07:00 |
|
Lianmin Zheng
|
b1286a116a
|
[EAGLE] Refactor code for page size > 1 & more simplifications (#7213)
|
2025-06-16 03:04:29 -07:00 |
|
Lianmin Zheng
|
fff10809bf
|
Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" (#7210)
|
2025-06-15 02:48:00 -07:00 |
|
Lianmin Zheng
|
5f1ab32717
|
[EAGLE] Refactor code for page size > 1 & more simplifications (#7163)
|
2025-06-14 23:16:23 -07:00 |
|
kyle-pena-kuzco
|
b56de8f943
|
Open AI API hidden states (#6716)
|
2025-06-10 14:37:29 -07:00 |
|
Lianmin Zheng
|
dc0705a504
|
Simplify prepare_extend_after_decode (#6987)
|
2025-06-09 16:39:21 -07:00 |
|
Lianmin Zheng
|
0c1f03a23d
|
Sync cuda graph runners (#6976)
|
2025-06-08 16:12:25 -07:00 |
|
Ke Bao
|
a2cb5913a0
|
Add draft extend CUDA graph for flashinfer backend (#6805)
|
2025-06-02 01:51:26 -07:00 |
|
Ke Bao
|
7e41290082
|
Add draft extend CUDA graph for Triton backend (#6705)
|
2025-05-29 00:13:07 -07:00 |
|
Ke Bao
|
631950280a
|
Support EAGLE draft extend CUDA graph (#6606)
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>
|
2025-05-27 02:35:17 -07:00 |
|
Ke Bao
|
6ce0ed073b
|
Apply constraint grammar to EAGLE (#6499)
Co-authored-by: merrymercy <lianminzheng@gmail.com>
|
2025-05-21 17:18:41 -07:00 |
|
Lifu Huang
|
3cf1473a09
|
Use monotonic clock for interval measurement (#6211)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
|
2025-05-17 16:49:18 -07:00 |
|
quinnrong94
|
2e4babdb0a
|
[Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109)
Co-authored-by: Yingyi <yingyihuang2000@outlook.com>
Co-authored-by: neiltian <neiltian@tencent.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: kexueyu <kexueyu@tencent.com>
Co-authored-by: vincentmeng <vincentmeng@tencent.com>
Co-authored-by: pengmeng <pengmeng@tencent.com>
|
2025-05-15 00:48:09 -07:00 |
|
Lianmin Zheng
|
e8e18dcdcc
|
Revert "fix some typos" (#6244)
|
2025-05-12 12:53:26 -07:00 |
|
applesaucethebun
|
d738ab52f8
|
fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-05-13 01:42:38 +08:00 |
|
Lianmin Zheng
|
fba8eccd7e
|
Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-05-12 00:17:33 -07:00 |
|
applesaucethebun
|
2ce8793519
|
Add typo checker in pre-commit (#6179)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-05-11 12:55:00 +08:00 |
|
Ying Sheng
|
11383cec3c
|
[PP] Add pipeline parallelism (#5724)
|
2025-04-30 18:18:07 -07:00 |
|
JieXin Liang
|
97cb762bb6
|
[misc] remove is_cuda_available (#5319)
|
2025-04-20 18:16:51 -07:00 |
|
u4lr451
|
211c7b31b8
|
Fix: Incorrect parameters passed to forward_batch_generation (#5506) (#5511)
|
2025-04-17 18:49:59 -07:00 |
|
fzyzcjy
|
86a876d883
|
Optimize topk operation in llama4 (#5128)
|
2025-04-09 02:50:22 -07:00 |
|
Baizhou Zhang
|
efbae697b3
|
[Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052)
|
2025-04-05 01:23:02 -07:00 |
|
Qingquan Song
|
e983e43248
|
Add Eagle Speculative Decoding to FA3 Backend (#4951)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: zcnrex <zcnrex@gmail.com>
|
2025-04-02 13:09:02 -07:00 |
|
Lianmin Zheng
|
b26bc86b36
|
Support page size > 1 + eagle (#4908)
|
2025-03-30 00:46:23 -07:00 |
|
Brayden Zhong
|
e84f4ba0ab
|
[Misc] Fix issues reported by torchfix (#4837)
|
2025-03-27 20:10:32 -07:00 |
|
James Liu
|
9e0186f352
|
[Feature] Support EAGLE 3 (#4247)
|
2025-03-18 07:35:23 -07:00 |
|
Ying Sheng
|
1b859295f4
|
[Eagle] Remove the greedy branch and some redundant code (#4363)
Co-authored-by: Sehoon Kim <sehoon@x.ai>
|
2025-03-16 02:48:55 -07:00 |
|
Baizhou Zhang
|
9fb48f951f
|
Support nextn for flashinfer mla attention backend (#4218)
|
2025-03-09 00:01:54 -08:00 |
|
Lianmin Zheng
|
d4017a6b63
|
[EAGLE] many fixes for eagle (#4195)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
|
2025-03-07 22:12:13 -08:00 |
|
Lianmin Zheng
|
bc1534ff32
|
Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134)
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
|
2025-03-06 06:13:59 -08:00 |
|
Ying Sheng
|
d3d4d76758
|
[Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
|
2025-03-05 08:06:07 -08:00 |
|
Lianmin Zheng
|
77a3954bf7
|
Simplify eagle tests and TP sync in grammar backend (#4066)
|
2025-03-04 13:40:40 -08:00 |
|
William
|
0d4e3228cf
|
[Feature] Add test for speculative_token_map (#4016)
|
2025-03-04 04:26:24 -08:00 |
|
Ke Bao
|
9fafa62db7
|
Share target model embed and head weights for nextn (#4033)
|
2025-03-03 13:30:04 -08:00 |
|