Commit Graph

232 Commits

Author SHA1 Message Date
lukec
a53fe428f9 Support FlashMLA backend (#4472)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-16 09:07:06 -07:00
Ying Sheng
1b859295f4 [Eagle] Remove the greedy branch and some redundant code (#4363)
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-16 02:48:55 -07:00
vikram singh shekhawat
bf63ee54ed Auto-detect device if not specified in server arguments. (#4423) 2025-03-15 21:13:51 -07:00
wangyu
1ce4878d31 feat(remote_model): support variable remote backend for model loader (#3964)
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
2025-03-14 00:40:44 -07:00
Lianmin Zheng
8e66fbecee Improve DP attention (#4390)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-03-13 08:23:56 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
William
56c39a05a2 Remove the choices in --speculative-eagle-topk argument (#4329) 2025-03-12 21:19:16 -07:00
Lianmin Zheng
e35a93fa8a Move output processing logic from scheduler.py into a separate file (#4354) 2025-03-12 16:21:49 -07:00
Lianmin Zheng
5a6400eec5 Test no vllm custom allreduce (#4256) 2025-03-10 10:08:25 -07:00
Lianmin Zheng
730d084f2a Minor style fix for sgl-kernel (#4243) 2025-03-09 20:15:13 -07:00
Lianmin Zheng
4a05bdfa86 Revert "Check eagle server args" (#4242) 2025-03-09 18:53:33 -07:00
Ying Sheng
34c8898755 Check eagle server args (#4217) 2025-03-09 01:10:43 -08:00
HandH1998
0dd6cda288 Apply sgl w8a8 fp8 kernel (#3148) 2025-03-09 00:03:32 -08:00
Yineng Zhang
eb61f5c9af Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" (#4186) 2025-03-07 10:27:52 -08:00
HAI
0beea4503f ROCm: Flex Attention Enablement with custom backends (#4178)
Co-authored-by: linsun12 <linsun12@amd.com>
2025-03-07 04:38:53 -08:00
Lianmin Zheng
286e6540a6 Remove prefill-only-one-req (#4117) 2025-03-05 20:58:48 -08:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
2025-03-05 08:06:07 -08:00
Xihuai Wang
95575aa76a Reasoning parser (#4000)
Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
2025-03-03 21:16:36 -08:00
Ke Bao
9fafa62db7 Share target model embed and head weights for nextn (#4033) 2025-03-03 13:30:04 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
66301e124f Improve code styles (#4021) 2025-03-03 03:20:23 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
Zhousx
7fbab730bd [feat] add small vocab table for eagle's draft model[1]. (#3822)
Co-authored-by: Achazwl <323163497@qq.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-03-02 18:58:45 -08:00
Baizhou Zhang
90a4b7d98a [Feature]Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-28 18:13:56 -08:00
fzyzcjy
e3e0bc50a9 [Feature] SPMD for SGLang + Verl (#3852) 2025-02-28 09:53:10 -08:00
Qiaolin Yu
d6898dd253 Add return hidden state in the native API (#3897)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-26 22:06:54 -08:00
JC1DA
7551498a69 [Feature] Support llguidance for constrained decoding (#3298) 2025-02-26 10:41:49 -08:00
Shenggui Li
c0bb9eb3b3 [improve] made timeout configurable (#3803) 2025-02-25 00:26:08 -08:00
Lianmin Zheng
f2388f6b95 Revert "Rename TokenizerManager to StdOrchestrator" (#3828) 2025-02-24 14:47:59 -08:00
Lianmin Zheng
c9745ee082 Fix pandas dependency in CI (#3818) 2025-02-24 05:56:57 -08:00
Lianmin Zheng
27a46317b6 Fix dependency (#3813) 2025-02-24 03:50:58 -08:00
fzyzcjy
45360b2fa9 Improve: Rename TokenizerManager to StdOrchestrator (#3116) 2025-02-23 00:30:58 -08:00
Ke Bao
862dd76c76 Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 (#3582) 2025-02-15 05:28:34 +08:00
Yineng Zhang
70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) 2025-02-14 08:50:14 +08:00
Ata Fatahi
b8318aec48 Make NCCL NVLS configurable (#3502) 2025-02-12 03:25:06 +08:00
Jackmin801
5f0e7de339 [Feat] Return hidden states (experimental) (#3364)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-10 15:54:37 -08:00
Xiaoyu Zhang
2f47d710ae refine some typo (#3473) 2025-02-10 23:35:44 +08:00
Baizhou Zhang
70817a7eae [Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-03 22:09:13 -08:00
Wen-Heng (Jack) Chung
d9eb9358cc Tune paged attention parameters for AMD GPU. (#3255) 2025-02-01 17:29:45 -08:00
Zhiqiang Xie
08104b56de Sanity check to prevent performance regression (#3171)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-01-27 12:28:17 -08:00
Lianmin Zheng
53cef81587 Improve weight loading and code style (#3174) 2025-01-27 03:00:41 -08:00
YAMY
b045841bae Feature/function calling update (#2700)
Co-authored-by: Mingyuan Ma <mamingyuan2001@berkeley.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
2025-01-26 09:57:51 -08:00
Ke Wen
862bcff833 Support loading of larger models with on-the-fly quantization (#3061) 2025-01-22 21:33:17 -08:00
Lianmin Zheng
73401fd016 Sync distributed package from vllm 0.6.4.post1 (#3010) 2025-01-20 04:57:14 -08:00
Hongpeng Guo
e403d23757 [Feature] Add sampler custom logits processor (#2396)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
2025-01-19 14:46:53 -08:00
Chunyuan WU
63051738a9 Enable CPU device on SGLang (#2806) 2025-01-16 21:22:53 -08:00
Chang Su
a8ccacc8b8 [Frontend] Fix request length check and add option to disallow auto truncation in scheduler (#2876) 2025-01-16 14:51:19 -08:00
Lianmin Zheng
bc6915e3b9 Improve type annotation and styles (#2926) 2025-01-16 12:51:11 -08:00
Lianmin Zheng
8b6ce52e92 Support multi-node DP attention (#2925)
Co-authored-by: dhou-xai <dhou@x.ai>
2025-01-16 11:15:00 -08:00
Ke Bao
cc0485bef2 Support w8a8 int8 quantization config (#2881) 2025-01-14 17:07:49 +08:00