Lianmin Zheng
|
74885a848b
|
Revert "Replace enable_flashinfer_mla argument with attention_backend" (#5048)
|
2025-04-03 13:30:56 -07:00 |
|
fzyzcjy
|
8e10fec9a8
|
Small refactor DeepEPMode to clean up code a bit (#4992)
|
2025-04-03 02:56:44 -07:00 |
|
Baizhou Zhang
|
e8999b13b7
|
Replace enable_flashinfer_mla argument with attention_backend (#5005)
|
2025-04-03 02:53:58 -07:00 |
|
Jinyan Chen
|
23c764b18a
|
[Feature] Support DeepEP Low Latency (#4767)
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: ch-wan <cwan39@gatech.edu>
|
2025-04-01 09:23:25 -07:00 |
|
Lianmin Zheng
|
4ede6770cd
|
Fix retract for page size > 1 (#4914)
|
2025-03-30 02:57:15 -07:00 |
|
Lianmin Zheng
|
b26bc86b36
|
Support page size > 1 + eagle (#4908)
|
2025-03-30 00:46:23 -07:00 |
|
fzyzcjy
|
cf29fe9e78
|
Fix Engine error when enabling DP attention (#4648)
|
2025-03-27 22:17:30 -07:00 |
|
Vincent
|
e2e2ab70e0
|
IPv6 support (#3949)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-03-27 21:42:13 -07:00 |
|
tarinkk
|
7f19e083c1
|
Support (1 <= dp < tp) in the dp attention in DeepEP (#4770)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
|
2025-03-27 17:09:35 -07:00 |
|
Xiaoyu Zhang
|
04e3ff6975
|
Support compressed tensors fp8w8a8 (#4743)
|
2025-03-26 13:21:25 -07:00 |
|
Stefan He
|
5d7edc8e55
|
Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
Co-authored-by: qsong <qsong@linkedin.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
|
2025-03-23 23:28:11 -07:00 |
|
Byron Hsu
|
c7c7dbebbe
|
[PD] Release initial code (#4654)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Ying1123 <sqy1415@gmail.com>
Co-authored-by: merrymercy <lianminzheng@gmail.com>
Co-authored-by: makro
Co-authored-by: dhou-xai
|
2025-03-21 14:47:47 -07:00 |
|
lukec
|
b6944f97a6
|
Support FlashMLA backend cuda graph (#4514)
Co-authored-by: yinfan98 <1106310035@qq.com>
Co-authored-by: Hongbosherlock <hongbosherlock@gmail.com>
Co-authored-by: ispobock <ispobaoke@163.com>
|
2025-03-19 08:25:34 -07:00 |
|
Jinyan Chen
|
f44db16c8e
|
[Feature] Integrate DeepEP into SGLang (#4232)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
|
2025-03-19 08:16:31 -07:00 |
|
James Liu
|
9e0186f352
|
[Feature] Support EAGLE 3 (#4247)
|
2025-03-18 07:35:23 -07:00 |
|
Zhiqiang Xie
|
a98290aea3
|
Unit test for Hierarchical Caching (#4486)
|
2025-03-17 17:45:00 -07:00 |
|
mlmz
|
452db50808
|
Constraint Decoding: Set xgrammar as the default grammar backend (#4386)
|
2025-03-16 18:53:43 -07:00 |
|
woodx
|
48efec7b05
|
Feature: support code completion (#3612)
|
2025-03-16 18:26:19 -07:00 |
|
lukec
|
a53fe428f9
|
Support FlashMLA backend (#4472)
Co-authored-by: yinfan98 <1106310035@qq.com>
|
2025-03-16 09:07:06 -07:00 |
|
Ying Sheng
|
1b859295f4
|
[Eagle] Remove the greedy branch and some redundant code (#4363)
Co-authored-by: Sehoon Kim <sehoon@x.ai>
|
2025-03-16 02:48:55 -07:00 |
|
vikram singh shekhawat
|
bf63ee54ed
|
Auto-detect device if not specified in server arguments. (#4423)
|
2025-03-15 21:13:51 -07:00 |
|
wangyu
|
1ce4878d31
|
feat(remote_model): support variable remote backend for model loader (#3964)
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
|
2025-03-14 00:40:44 -07:00 |
|
Lianmin Zheng
|
8e66fbecee
|
Improve DP attention (#4390)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-03-13 08:23:56 -07:00 |
|
Lianmin Zheng
|
c76040e31b
|
Support page size > 1 (#4356)
|
2025-03-12 22:22:39 -07:00 |
|
William
|
56c39a05a2
|
Remove the choices in --speculative-eagle-topk argument (#4329)
|
2025-03-12 21:19:16 -07:00 |
|
Lianmin Zheng
|
e35a93fa8a
|
Move output processing logic from scheduler.py into a separate file (#4354)
|
2025-03-12 16:21:49 -07:00 |
|
Lianmin Zheng
|
5a6400eec5
|
Test no vllm custom allreduce (#4256)
|
2025-03-10 10:08:25 -07:00 |
|
Lianmin Zheng
|
730d084f2a
|
Minor style fix for sgl-kernel (#4243)
|
2025-03-09 20:15:13 -07:00 |
|
Lianmin Zheng
|
4a05bdfa86
|
Revert "Check eagle server args" (#4242)
|
2025-03-09 18:53:33 -07:00 |
|
Ying Sheng
|
34c8898755
|
Check eagle server args (#4217)
|
2025-03-09 01:10:43 -08:00 |
|
HandH1998
|
0dd6cda288
|
Apply sgl w8a8 fp8 kernel (#3148)
|
2025-03-09 00:03:32 -08:00 |
|
Yineng Zhang
|
eb61f5c9af
|
Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" (#4186)
|
2025-03-07 10:27:52 -08:00 |
|
HAI
|
0beea4503f
|
ROCm: Flex Attention Enablement with custom backends (#4178)
Co-authored-by: linsun12 <linsun12@amd.com>
|
2025-03-07 04:38:53 -08:00 |
|
Lianmin Zheng
|
286e6540a6
|
Remove prefill-only-one-req (#4117)
|
2025-03-05 20:58:48 -08:00 |
|
Ying Sheng
|
d3d4d76758
|
[Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
|
2025-03-05 08:06:07 -08:00 |
|
Xihuai Wang
|
95575aa76a
|
Reasoning parser (#4000)
Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
|
2025-03-03 21:16:36 -08:00 |
|
Ke Bao
|
9fafa62db7
|
Share target model embed and head weights for nextn (#4033)
|
2025-03-03 13:30:04 -08:00 |
|
Lianmin Zheng
|
935cda944b
|
Misc clean up; Remove the support of jump forward (#4032)
|
2025-03-03 07:02:14 -08:00 |
|
Lianmin Zheng
|
66301e124f
|
Improve code styles (#4021)
|
2025-03-03 03:20:23 -08:00 |
|
Lianmin Zheng
|
ac2387279e
|
Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
|
2025-03-03 00:12:04 -08:00 |
|
Zhousx
|
7fbab730bd
|
[feat] add small vocab table for eagle's draft model[1]. (#3822)
Co-authored-by: Achazwl <323163497@qq.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
|
2025-03-02 18:58:45 -08:00 |
|
Baizhou Zhang
|
90a4b7d98a
|
[Feature]Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
|
2025-02-28 18:13:56 -08:00 |
|
fzyzcjy
|
e3e0bc50a9
|
[Feature] SPMD for SGLang + Verl (#3852)
|
2025-02-28 09:53:10 -08:00 |
|
Qiaolin Yu
|
d6898dd253
|
Add return hidden state in the native API (#3897)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
|
2025-02-26 22:06:54 -08:00 |
|
JC1DA
|
7551498a69
|
[Feature] Support llguidance for constrained decoding (#3298)
|
2025-02-26 10:41:49 -08:00 |
|
Shenggui Li
|
c0bb9eb3b3
|
[improve] made timeout configurable (#3803)
|
2025-02-25 00:26:08 -08:00 |
|
Lianmin Zheng
|
f2388f6b95
|
Revert "Rename TokenizerManager to StdOrchestrator" (#3828)
|
2025-02-24 14:47:59 -08:00 |
|
Lianmin Zheng
|
c9745ee082
|
Fix pandas dependency in CI (#3818)
|
2025-02-24 05:56:57 -08:00 |
|
Lianmin Zheng
|
27a46317b6
|
Fix dependency (#3813)
|
2025-02-24 03:50:58 -08:00 |
|
fzyzcjy
|
45360b2fa9
|
Improve: Rename TokenizerManager to StdOrchestrator (#3116)
|
2025-02-23 00:30:58 -08:00 |
|