58 Commits

Author SHA1 Message Date
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-13 01:42:38 +08:00
Cheng Wan
25c83fff6a Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558)
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
2025-05-11 23:36:29 -07:00
Ximingwang-09
921e4a8185 [Docs]Delete duplicate content (#6146)
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
2025-05-10 15:02:15 -07:00
Zhu Chen
fa7d7fd9e5 [Feature] Add FlashAttention3 as a backend for VisionAttention (#5764)
Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
2025-05-08 10:01:19 -07:00
Baizhou Zhang
8f508cc77f Update doc for MLA attention backends (#6034) 2025-05-07 18:51:05 -07:00
Baizhou Zhang
fee37d9e8d [Doc]Fix description for dp_size argument (#6063) 2025-05-08 00:04:22 +08:00
Wenxuan Tan
22da3d978f Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (#5555) 2025-05-05 10:32:17 -07:00
Qiaolin Yu
8c0cfca87d Feat: support cuda graph for LoRA (#4115)
Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
2025-04-28 23:30:44 -07:00
Lianmin Zheng
849c83a0c0 [CI] test chunked prefill more (#5798) 2025-04-28 10:57:17 -07:00
Baizhou Zhang
f48b007c1d [Doc] Recover history of server_arguments.md (#5851) 2025-04-28 10:48:21 -07:00
Michael Yao
966eb90865 [Docs] Replace lists with tables for cleanup and readability in server_arguments (#5276) 2025-04-28 00:36:10 -07:00
Trevor Morris
84810da4ae Add Cutlass MLA attention backend (#5390) 2025-04-27 20:58:53 -07:00
Lianmin Zheng
155890e4d1 [Minor] fix documentations (#5756) 2025-04-26 17:48:43 -07:00
Michael Yao
92bb64bc86 [Doc] Fix a 404 link to llama-405b (#5615)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-04-21 20:39:37 -07:00
Yineng Zhang
a6f892e5d0 Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) 2025-04-18 16:50:21 -07:00
Wenxuan Tan
bfa3922451 Avoid computing lse in Ragged Prefill when there's no prefix. (#5476)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-04-18 01:13:57 -07:00
Baizhou Zhang
6fb29ffd9e Deprecate enable-flashinfer-mla and enable-flashmla (#5480) 2025-04-17 01:43:33 -07:00
Baizhou Zhang
4fb05583ef Deprecate disable-mla (#5481) 2025-04-17 01:43:14 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
Mick
34ef6c8135 [VLM] Adopt fast image processor by default (#5065) 2025-04-11 21:46:58 -07:00
Baizhou Zhang
efbae697b3 [Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052) 2025-04-05 01:23:02 -07:00
Lianmin Zheng
74885a848b Revert "Replace enable_flashinfer_mla argument with attention_backend" (#5048) 2025-04-03 13:30:56 -07:00
Baizhou Zhang
e8999b13b7 Replace enable_flashinfer_mla argument with attention_backend (#5005) 2025-04-03 02:53:58 -07:00
Jinyan Chen
23c764b18a [Feature] Support DeepEP Low Latency (#4767)
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-04-01 09:23:25 -07:00
tarinkk
7f19e083c1 Support (1 <= dp < tp) in the dp attention in DeepEP (#4770)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
2025-03-27 17:09:35 -07:00
Jiří Suchomel
f60f293195 [k8s] Clarified the usage of shared memory. (#4341) 2025-03-27 08:53:19 -07:00
Jinyan Chen
f44db16c8e [Feature] Integrate DeepEP into SGLang (#4232)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
2025-03-19 08:16:31 -07:00
Chayenne
e1a5e7e47d docs: hot fix torch compile cache (#4442) 2025-03-14 19:05:59 -07:00
Jun Liu
14344caa38 [docs] Update outdated description about torch.compile (#3844) 2025-03-12 22:09:38 -07:00
Peter Pan
016033188c docs: add parameter --log-requests-level (#4335) 2025-03-12 21:19:37 -07:00
Chayenne
ebddb65aed Docs: add torch compile cache (#4151)
Co-authored-by: ybyang <ybyang7@iflytek.com>
2025-03-06 14:27:09 -08:00
samzong
d2d0d061d9 fix cross-reference error and spelling mistakes (#4101)
Signed-off-by: samzong <samzong.lu@gmail.com>
2025-03-05 16:39:02 -08:00
Qiaolin Yu
357671e216 Add examples for server token-in-token-out (#4103)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-05 13:16:31 -08:00
Baizhou Zhang
fc91d08a8f [Revision] Add fast decode plan for flashinfer mla (#4012) 2025-03-05 11:20:41 -08:00
Mick
583d6af71b example: add vlm to token in & out example (#3941)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-04 22:18:26 -08:00
Chayenne
146ac8df07 Add examples in sampling parameters (#4039) 2025-03-03 13:04:32 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
Chayenne
728e175fc4 Add examples to token-in-token-out for LLM (#4010) 2025-03-02 21:03:49 -08:00
Lianmin Zheng
9e1014cf99 Revert "Add fast decode plan for flashinfer mla" (#4008) 2025-03-02 19:29:10 -08:00
Baizhou Zhang
fa56106731 Add fast decode plan for flashinfer mla (#3987) 2025-03-02 19:16:37 -08:00
Zhousx
7fbab730bd [feat] add small vocab table for eagle's draft model[1]. (#3822)
Co-authored-by: Achazwl <323163497@qq.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-03-02 18:58:45 -08:00
Baizhou Zhang
90a4b7d98a [Feature]Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-28 18:13:56 -08:00
Baizhou Zhang
3e02526b1f [Doc] Add experimental tag for flashinfer mla (#3925) 2025-02-27 01:55:36 -08:00
Baizhou Zhang
71ed01833d [doc] Update document for flashinfer mla (#3907) 2025-02-26 20:40:45 -08:00
simveit
44a2c4bd56 Docs: improve link to docs (#3860)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-26 10:29:25 -08:00
Yuanheng Zhao
3758d209a0 [Doc] Fix typo in server-argument description (#3641) 2025-02-24 16:57:13 -08:00
Baizhou Zhang
ac05310098 [Docs] Modify ep related server args and remove cublas part of deepseek (#3732) 2025-02-21 03:37:56 +08:00
Mick
7711ac6ed0 doc: emphasize and notify the usage of chat_template (#3589)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-15 00:10:32 -08:00