184 Commits

Author SHA1 Message Date
Zhiqiang Xie
a169b9f813 Fix oom error for large page size (#4913)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-03-30 21:34:21 -07:00
Baizhou Zhang
e62d60fe6d [Fix] avoid stream sync and torch compile in prefill for fa3 backend (#4932) 2025-03-30 13:53:44 -07:00
Lianmin Zheng
4ede6770cd Fix retract for page size > 1 (#4914) 2025-03-30 02:57:15 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
Stefan He
1b9175cb23 [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 (#4745)
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
2025-03-27 00:45:11 -07:00
Mick
1e86457c90 model: Minicpmo (#3023) 2025-03-24 20:08:40 -07:00
Mick
11577cedb7 refactor: bug fixes and refactor for vlm (#4661) 2025-03-22 22:48:49 -07:00
Zhiqiang Xie
ecbfe58bb0 Bug fix for metrics counter (#4660) 2025-03-22 13:39:21 -07:00
Byron Hsu
c7c7dbebbe [PD] Release initial code (#4654)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Ying1123 <sqy1415@gmail.com>
Co-authored-by: merrymercy <lianminzheng@gmail.com>
Co-authored-by: makro
Co-authored-by: dhou-xai
2025-03-21 14:47:47 -07:00
Jinyan Chen
f44db16c8e [Feature] Integrate DeepEP into SGLang (#4232)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
2025-03-19 08:16:31 -07:00
Mick
d373a48c98 fix: second_per_grid_ts should be used to get mrope position (#3682) 2025-03-17 18:12:38 -07:00
萝卜菜
d6d21640d3 [Feature] Support Deepseek-VL2 (#2798)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
2025-03-16 23:07:59 -07:00
Mick
9d02bb3e2a Urgent model support: support gemma-3-it (#4424) 2025-03-16 17:37:32 -07:00
lukec
a53fe428f9 Support FlashMLA backend (#4472)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-16 09:07:06 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
Qiaolin Yu
85d2365d33 Fix the output of hidden states after HTTP requests (#4269) 2025-03-13 14:54:06 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
Lianmin Zheng
e35a93fa8a Move output processing logic from scheduler.py into a separate file (#4354) 2025-03-12 16:21:49 -07:00
Zhiqiang Xie
10b544ae9b Hierarchical Caching Refactoring and Fixing TP issue (#4082) 2025-03-12 11:22:35 -07:00
Mick
ff2ce0b86f refactor: move image processors to separate files (#4229) 2025-03-11 12:35:35 -07:00
Baizhou Zhang
fc91d08a8f [Revision] Add fast decode plan for flashinfer mla (#4012) 2025-03-05 11:20:41 -08:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
2025-03-05 08:06:07 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
66301e124f Improve code styles (#4021) 2025-03-03 03:20:23 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
Lianmin Zheng
9e1014cf99 Revert "Add fast decode plan for flashinfer mla" (#4008) 2025-03-02 19:29:10 -08:00
Baizhou Zhang
fa56106731 Add fast decode plan for flashinfer mla (#3987) 2025-03-02 19:16:37 -08:00
Qiaolin Yu
40782f05d7 Refactor: Move return_hidden_states to the generate input (#3985)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
2025-03-01 17:51:29 -08:00
Baizhou Zhang
90a4b7d98a [Feature]Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-28 18:13:56 -08:00
Qiaolin Yu
d6898dd253 Add return hidden state in the native API (#3897)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-26 22:06:54 -08:00
Yineng Zhang
714f3e6362 feat: support flashinfer mla with prefix cache (#3643) 2025-02-18 02:06:43 +08:00
Yineng Zhang
70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) 2025-02-14 08:50:14 +08:00
Jackmin801
5f0e7de339 [Feat] Return hidden states (experimental) (#3364)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-10 15:54:37 -08:00
Lianmin Zheng
53cef81587 Improve weight loading and code style (#3174) 2025-01-27 03:00:41 -08:00
Lianmin Zheng
1dda8c5e4c Return more infos for computing average acceptance length (#3152) 2025-01-26 04:51:54 -08:00
Lianmin Zheng
d1a0863251 Add a test case for cached_tokens (#3145) 2025-01-26 01:39:28 -08:00
Lianmin Zheng
3d8f1c9bcf Use int64 as indices for set_kv_buffer (#3039) 2025-01-21 19:46:09 -08:00
996_icu
b730aa6b9e [EAGLE] Fix some boundary situation when retract reqs and req's max token = 1 (#2939)
Co-authored-by: josephyou <josephyou@tencent.com>
2025-01-20 17:46:43 -08:00
Hongpeng Guo
583697cd71 [Enhancement] Custom Logit Processor Improvement (#2998)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
2025-01-20 02:00:35 -08:00
Hongpeng Guo
e403d23757 [Feature] Add sampler custom logits processor (#2396)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
2025-01-19 14:46:53 -08:00
Lianmin Zheng
7906d1d298 Remove the unused write_with_records (#2972) 2025-01-18 20:20:23 -08:00
Chang Su
4d4cdb3fe7 Frontend: better error message handling for FINISH_ABORT in scheduler.py (#2956) 2025-01-18 19:37:30 -08:00
Yang Zheng
2bd18e2d76 Memory pool: Minor optimize to avoid to (#2901) 2025-01-18 19:35:12 -08:00
Mick
3d93f84a00 [Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-01-18 14:14:19 -08:00
Chunyuan WU
63051738a9 Enable CPU device on SGLang (#2806) 2025-01-16 21:22:53 -08:00
Lianmin Zheng
bc6915e3b9 Improve type annotation and styles (#2926) 2025-01-16 12:51:11 -08:00
Lianmin Zheng
8b6ce52e92 Support multi-node DP attention (#2925)
Co-authored-by: dhou-xai <dhou@x.ai>
2025-01-16 11:15:00 -08:00
Lianmin Zheng
f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) 2025-01-15 04:54:14 -08:00
Lianmin Zheng
b8574f6953 Clean up eagle code (#2756) 2025-01-06 14:54:18 -08:00
Lianmin Zheng
ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 02:09:08 -08:00