sglang

Author	SHA1	Message	Date
Baizhou Zhang	20c90be23d	[Feature] Support FA3 backend for MLA (#4831 )	2025-03-28 18:30:14 -07:00
tarinkk	7f19e083c1	Support (1 <= dp < tp) in the dp attention in DeepEP (#4770 ) Co-authored-by: Cheng Wan <cwan39@gatech.edu>	2025-03-27 17:09:35 -07:00
fzyzcjy	92bb49a7f9	Patch PyTorch's bug that cross-process tensor transfer will lead to wrong device (#4565 )	2025-03-27 00:22:33 -07:00
fzyzcjy	26f07294f1	Warn users when release_memory_occupation is called without memory saver enabled (#4566 )	2025-03-26 00:18:14 -07:00
Stefan He	5d7edc8e55	Support FA3 as Attention backend by using `--attention-backend fa3` (#4680 ) Co-authored-by: qsong <qsong@linkedin.com> Co-authored-by: qingquansong <ustcsqq@gmail.com>	2025-03-23 23:28:11 -07:00
Mick	11577cedb7	refactor: bug fixes and refactor for vlm (#4661 )	2025-03-22 22:48:49 -07:00
Jinyan Chen	f44db16c8e	[Feature] Integrate DeepEP into SGLang (#4232 ) Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: Xuting Zhou <xutingz@nvidia.com>	2025-03-19 08:16:31 -07:00
aoshen524	588865f0e0	[Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274 ) Co-authored-by: ShenAo1111 <1377693092@qq.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-03-18 20:33:07 -07:00
James Liu	9e0186f352	[Feature] Support EAGLE 3 (#4247 )	2025-03-18 07:35:23 -07:00
Mick	d373a48c98	fix: second_per_grid_ts should be used to get mrope position (#3682 )	2025-03-17 18:12:38 -07:00
萝卜菜	d6d21640d3	[Feature] Support Deepseek-VL2 (#2798 ) Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Yi Zhang <1109276519@qq.com>	2025-03-16 23:07:59 -07:00
lukec	a53fe428f9	Support FlashMLA backend (#4472 ) Co-authored-by: yinfan98 <1106310035@qq.com>	2025-03-16 09:07:06 -07:00
wangyu	1ce4878d31	feat(remote_model): support variable remote backend for model loader (#3964 ) Signed-off-by: wangyu <wangyu.steph@bytedance.com>	2025-03-14 00:40:44 -07:00
Lianmin Zheng	45de89719c	Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367 )	2025-03-12 23:45:52 -07:00
Meng, Hengyu	71046fcd71	[XPU][CPU] Enable the native path of DeepSeek (#4086 ) Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>	2025-03-12 22:26:29 -07:00
Lianmin Zheng	c76040e31b	Support page size > 1 (#4356 )	2025-03-12 22:22:39 -07:00
Lianmin Zheng	e35a93fa8a	Move output processing logic from scheduler.py into a separate file (#4354 )	2025-03-12 16:21:49 -07:00
Lianmin Zheng	d40ee62b5d	Update nightly tests (#4352 )	2025-03-12 15:36:13 -07:00
Lianmin Zheng	00d25a7f5e	Fix quantization and nightly tests (#4258 )	2025-03-10 03:06:21 -07:00
Lianmin Zheng	08c4d764a5	lazy import attn backends (#4200 )	2025-03-08 00:41:35 -08:00
Lianmin Zheng	d4017a6b63	[EAGLE] many fixes for eagle (#4195 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: Sehoon Kim <sehoon@x.ai>	2025-03-07 22:12:13 -08:00
Yineng Zhang	eb61f5c9af	Revert "ROCm: Flex Attention Enablement with custom backends (#4178 )" (#4186 )	2025-03-07 10:27:52 -08:00
HAI	0beea4503f	ROCm: Flex Attention Enablement with custom backends (#4178 ) Co-authored-by: linsun12 <linsun12@amd.com>	2025-03-07 04:38:53 -08:00
Lianmin Zheng	98c73d71cb	[Minor] make the `__init__` function of model_runner.py shorter (#4132 )	2025-03-06 01:51:12 -08:00
Zhiqiang Xie	aee30630d8	Add a pointer to the real KV cache pool (#4113 )	2025-03-05 21:39:07 -08:00
Ying Sheng	d3d4d76758	[Eagle] Refactor eagle speculative decoding (#3986 ) Co-authored-by: Ke Bao <ISPObaoke@163.com>	2025-03-05 08:06:07 -08:00
Lianmin Zheng	e074d84e5b	[Minor] more code cleanup (#4077 )	2025-03-04 21:23:47 -08:00
Chen Shengzhi	61261b3996	[XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. (#3954 )	2025-03-04 04:05:56 -08:00
Lianmin Zheng	ac2387279e	Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: dhou-xai <dhou@x.ai> Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>	2025-03-03 00:12:04 -08:00
Baizhou Zhang	90a4b7d98a	[Feature]Support ragged prefill in flashinfer mla backend (#3967 ) Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>	2025-02-28 18:13:56 -08:00
fzyzcjy	e3e0bc50a9	[Feature] SPMD for SGLang + Verl (#3852 )	2025-02-28 09:53:10 -08:00
Shenggui Li	c0bb9eb3b3	[improve] made timeout configurable (#3803 )	2025-02-25 00:26:08 -08:00
Baizhou Zhang	b110084654	Refactor flashinfer logic for deepseek v3 and fix accuracy bug (#3785 )	2025-02-24 04:07:25 -08:00
Yineng Zhang	714f3e6362	feat: support flashinfer mla with prefix cache (#3643 )	2025-02-18 02:06:43 +08:00
Yineng Zhang	70f894b810	feat: support flashinfer mla attention for deepseek v3 (#3550 )	2025-02-14 08:50:14 +08:00
Baizhou Zhang	70817a7eae	[Feature] Define backends and add Triton backend for Lora (#3161 ) Co-authored-by: Ying Sheng <sqy1415@gmail.com>	2025-02-03 22:09:13 -08:00
Yineng Zhang	013021b6a1	refactor EAGLE 2 (#3269 ) Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: merrymercy <lianminzheng@gmail.com> Co-authored-by: Ying1123 <sqy1415@gmail.com>	2025-02-03 20:52:30 +08:00
Lianmin Zheng	53cef81587	Improve weight loading and code style (#3174 )	2025-01-27 03:00:41 -08:00
Ke Wen	862bcff833	Support loading of larger models with on-the-fly quantization (#3061 )	2025-01-22 21:33:17 -08:00
Lianmin Zheng	89cd923581	Roll back to use vllm custom allreduce (#3006 )	2025-01-20 04:03:15 -08:00
Lianmin Zheng	7906d1d298	Remove the unused write_with_records (#2972 )	2025-01-18 20:20:23 -08:00
Mick	3d93f84a00	[Feature] Support minicpmv v2.6 (#2785 ) Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: yizhang2077 <1109276519@qq.com>	2025-01-18 14:14:19 -08:00
Yineng Zhang	5dc54f1a62	feat: remove vllm distributed (#2907 ) Co-authored-by: Zhangyi <1109276519@qq.com>	2025-01-17 22:31:51 +08:00
Chunyuan WU	63051738a9	Enable CPU device on SGLang (#2806 )	2025-01-16 21:22:53 -08:00
Lianmin Zheng	bc6915e3b9	Improve type annotation and styles (#2926 )	2025-01-16 12:51:11 -08:00
Lianmin Zheng	8b6ce52e92	Support multi-node DP attention (#2925 ) Co-authored-by: dhou-xai <dhou@x.ai>	2025-01-16 11:15:00 -08:00
Lianmin Zheng	46d4431889	Add a new api configure_logging to allow dumping the requests (#2875 )	2025-01-13 14:24:00 -08:00
fzyzcjy	923f518337	CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630 )	2025-01-13 11:38:51 -08:00
bjmsong	0bb0f76311	Support FP8 E4M3 KV Cache (#2786 ) Co-authored-by: root <bjmsong@126.com>	2025-01-12 21:17:11 -08:00
Chang Su	f290bd4332	[Bugfix] Fix embedding model hangs with `--enable-metrics` (#2822 )	2025-01-10 13:14:51 -08:00

191 Commits