Zhihao Zhang
|
24f7cb1ece
|
[speculative decoding] rename lookahead to ngram (#11010)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
|
2025-09-28 21:06:59 -07:00 |
|
Liangsheng Yin
|
4a762041d7
|
move environ into sglang.srt to avoid breaking SRT auto sync. (#10791)
|
2025-09-23 02:04:20 -07:00 |
|
Liangsheng Yin
|
632b7d8cc9
|
Use simulate acc len from sglang.environ (#10771)
|
2025-09-23 12:59:50 +08:00 |
|
Yineng Zhang
|
dab4663b4e
|
[Auto Sync] Update .clang-format (20250919) (#10670)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
2025-09-19 12:31:44 -07:00 |
|
Yineng Zhang
|
610a6d6e86
|
fix: resolve sync issue (#10668)
|
2025-09-19 12:05:39 -07:00 |
|
Zhihao Zhang
|
e7bc600304
|
[Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
|
2025-09-18 16:42:41 -07:00 |
|
lijin23
|
05b01ef4da
|
fix duplicated logger in eager_utils (#10410)
|
2025-09-13 22:25:40 -07:00 |
|
Yi Zhang
|
297d374510
|
support qwen3_next blackwell (#10403)
|
2025-09-13 17:18:26 +08:00 |
|
Stefan He
|
6c18ab46a2
|
[Qwen3-Next] switch to triton and cache conv states to accelerate MTP from 300 tok/s to 341 tok/s (#10335)
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
|
2025-09-11 11:59:48 -07:00 |
|
Yi Zhang
|
30c6e1f569
|
Qwen3-Next support (#10233)
Co-authored-by: cao1zhg <114661107+cao1zhg@users.noreply.github.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
|
2025-09-11 04:11:49 -07:00 |
|
Baizhou Zhang
|
8ad700f735
|
Cleaning codes for speculative attention mode (#10149)
|
2025-09-08 17:38:06 -07:00 |
|
cicirori
|
8c5930f08a
|
Add speculator attention backend switch (#9981)
|
2025-09-07 21:44:36 -07:00 |
|
Qiaolin Yu
|
8cda5a622c
|
Standalone speculative decoding (#10090)
|
2025-09-07 20:55:09 -07:00 |
|
Lzhang-hub
|
37d83c6e6d
|
Qwen2.5-VL eagle3 infer (#8801)
|
2025-09-07 20:44:34 -07:00 |
|
Ximingwang-09
|
df397a72e8
|
[feat] Add P/D attention select for draft model (#9755)
Co-authored-by: 纬杭 <ximing.wxm@antgroup.com>
|
2025-09-02 22:47:23 -07:00 |
|
narutolhy
|
839c93bd2d
|
feat: add original logprobs to response (#8375)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
|
2025-08-29 11:43:57 -07:00 |
|
zyksir
|
aee094e430
|
add support for nvidia/gpt-oss-120b-Eagle3 (#9739)
|
2025-08-28 00:20:20 -07:00 |
|
datdo-msft
|
110a65989b
|
[MTP] Force greedy sampling on AMD (#9127)
|
2025-08-22 11:14:43 -07:00 |
|
Qiaolin Yu
|
9c0c1e30b2
|
Disable torch.compile for get_last_loc_large_page_size_large_top_k (#9507)
Co-authored-by: ispobock <ispobaoke@gmail.com>
|
2025-08-22 02:05:02 -07:00 |
|
Qiaolin Yu
|
9ec314c6ac
|
Support speculative decoding in the trtllm_mha attention backend (#9331)
Co-authored-by: ispobock <ispobaoke@gmail.com>
|
2025-08-21 23:53:35 -07:00 |
|
pranavm-nvidia
|
64574ef8c0
|
Enables speculative decoding for the trtllm_mla attention backend (#9238)
|
2025-08-21 01:18:21 -07:00 |
|
Liangsheng Yin
|
eb19ccadae
|
[bug] fix errors related to context length in SD (#9388)
|
2025-08-21 10:32:34 +08:00 |
|
jiapingW
|
e99729c9f3
|
Fixed the issue where eagle3 TPOT was not as good as without eagle3. (#9404)
|
2025-08-20 16:42:01 -07:00 |
|
Even Zhou
|
de2dd73831
|
Revert "[feature] Rework Ascend NPU graph support" (#9385)
|
2025-08-20 00:35:10 -07:00 |
|
Even Zhou
|
3680d6f88b
|
[feature] Rework Ascend NPU graph support (#9350)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com>
Co-authored-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>
|
2025-08-19 20:32:27 -07:00 |
|
Even Zhou
|
f4fafacc5d
|
Revert "[feature] Ascend NPU graph support (#8027)" (#9348)
|
2025-08-19 10:11:23 -07:00 |
|
zyksir
|
6a9d6ca33c
|
fix unexpected answer in EAGLE mode (#9252)
|
2025-08-16 17:45:36 -07:00 |
|
VDV1985
|
94371dbbd6
|
[feature] Ascend NPU graph support (#8027)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com>
Co-authored-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>
|
2025-08-16 17:25:17 -07:00 |
|
Cheng Wan
|
b87aacb5c5
|
[DP Attention] Refactor: adding some utility functions (#9136)
|
2025-08-13 21:08:06 -07:00 |
|
valarLip
|
53f7874ae6
|
refine aiter_backend for mtp (#7279)
Co-authored-by: HAI <hixiao@gmail.com>
|
2025-08-08 11:06:02 -07:00 |
|
Cheng Wan
|
cb099d2095
|
[CUDA Graph] save cuda graph memory by using next_token_logits_buffer (#8579)
|
2025-08-03 03:06:47 -07:00 |
|
Cheng Wan
|
7a1f7fc504
|
[Feature] Hybrid EP and TP (#8590)
|
2025-07-31 02:53:25 -07:00 |
|
Lianmin Zheng
|
ed2e313eb6
|
Clean up server_args, triton cache manager (#8332)
|
2025-07-25 14:14:51 -07:00 |
|
Cheng Wan
|
c0fb25e949
|
DP Enhancement (#8280)
|
2025-07-24 21:36:21 -07:00 |
|
J
|
0e7a5b2694
|
fix: prevent crashes due to logit bias dimension mismatch (#7685)
|
2025-07-23 15:30:55 -07:00 |
|
Jay Zhou
|
99aefa037e
|
Fix eagle3 cuda graph (#8163)
|
2025-07-20 15:28:06 +08:00 |
|
Lianmin Zheng
|
5589b75024
|
Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (#7756)
Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com>
|
2025-07-05 12:17:05 -07:00 |
|
Cheng Wan
|
6c903611ca
|
Fix incorrect spec_num_draft_tokens in draft_extend (#7757)
|
2025-07-05 02:18:16 -07:00 |
|
lukec
|
886d344964
|
support llama4 eagle3 (#6985)
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
|
2025-06-30 22:34:10 -07:00 |
|
Lianmin Zheng
|
071a1f51ae
|
[Minor] clean up multimodal processor and tokenizer manager (#7624)
|
2025-06-29 02:50:14 -07:00 |
|
u4lr451
|
ed0a0b692c
|
Performance: Enable cuda graph for dp idle batch (#7269)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
|
2025-06-23 17:34:13 -07:00 |
|
Cheng Wan
|
f8d48fd311
|
Fix dtype for idle input in spec decoding (#7456)
|
2025-06-23 11:23:25 -07:00 |
|
Cheng Wan
|
ac5010e0ba
|
Fix CUDA Graph Check under Deepep with DP FFN (#7451)
|
2025-06-22 20:35:58 -07:00 |
|
Liangsheng Yin
|
05c9bc8956
|
[minor] simplify the TokenToKVPoolAllocator (#7414)
|
2025-06-22 12:37:18 +08:00 |
|
Cheng Wan
|
e879d8b7a8
|
[Feature] Comprehensive Hybrid Parallelism Support (#6389)
|
2025-06-20 14:43:11 -07:00 |
|
u4lr451
|
10d60cd41b
|
feat: mtp support dp-attention (#6081)
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
|
2025-06-17 00:33:28 -07:00 |
|
Lianmin Zheng
|
1a9c2c9214
|
Fix AMD speculative decoding (#7252)
|
2025-06-16 17:01:33 -07:00 |
|
Lianmin Zheng
|
c64290dcb5
|
Use seq_len_fill_value in the cuda graph runners (#7233)
|
2025-06-16 15:57:07 -07:00 |
|
Lianmin Zheng
|
53a525bf33
|
[Eagle] Fix kernel call after updating speculative sampling kernels (#7231)
|
2025-06-16 07:25:59 -07:00 |
|
Lianmin Zheng
|
b1286a116a
|
[EAGLE] Refactor code for page size > 1 & more simplifications (#7213)
|
2025-06-16 03:04:29 -07:00 |
|