Baizhou Zhang
|
20c90be23d
|
[Feature] Support FA3 backend for MLA (#4831)
|
2025-03-28 18:30:14 -07:00 |
|
tarinkk
|
7f19e083c1
|
Support (1 <= dp < tp) in the dp attention in DeepEP (#4770)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
|
2025-03-27 17:09:35 -07:00 |
|
fzyzcjy
|
92bb49a7f9
|
Patch PyTorch's bug that cross-process tensor transfer will lead to wrong device (#4565)
|
2025-03-27 00:22:33 -07:00 |
|
fzyzcjy
|
26f07294f1
|
Warn users when release_memory_occupation is called without memory saver enabled (#4566)
|
2025-03-26 00:18:14 -07:00 |
|
Stefan He
|
5d7edc8e55
|
Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
Co-authored-by: qsong <qsong@linkedin.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
|
2025-03-23 23:28:11 -07:00 |
|
Mick
|
11577cedb7
|
refactor: bug fixes and refactor for vlm (#4661)
|
2025-03-22 22:48:49 -07:00 |
|
Jinyan Chen
|
f44db16c8e
|
[Feature] Integrate DeepEP into SGLang (#4232)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
|
2025-03-19 08:16:31 -07:00 |
|
aoshen524
|
588865f0e0
|
[Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
|
2025-03-18 20:33:07 -07:00 |
|
James Liu
|
9e0186f352
|
[Feature] Support EAGLE 3 (#4247)
|
2025-03-18 07:35:23 -07:00 |
|
Mick
|
d373a48c98
|
fix: second_per_grid_ts should be used to get mrope position (#3682)
|
2025-03-17 18:12:38 -07:00 |
|
萝卜菜
|
d6d21640d3
|
[Feature] Support Deepseek-VL2 (#2798)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
|
2025-03-16 23:07:59 -07:00 |
|
lukec
|
a53fe428f9
|
Support FlashMLA backend (#4472)
Co-authored-by: yinfan98 <1106310035@qq.com>
|
2025-03-16 09:07:06 -07:00 |
|
wangyu
|
1ce4878d31
|
feat(remote_model): support variable remote backend for model loader (#3964)
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
|
2025-03-14 00:40:44 -07:00 |
|
Lianmin Zheng
|
45de89719c
|
Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367)
|
2025-03-12 23:45:52 -07:00 |
|
Meng, Hengyu
|
71046fcd71
|
[XPU][CPU] Enable the native path of DeepSeek (#4086)
Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
|
2025-03-12 22:26:29 -07:00 |
|
Lianmin Zheng
|
c76040e31b
|
Support page size > 1 (#4356)
|
2025-03-12 22:22:39 -07:00 |
|
Lianmin Zheng
|
e35a93fa8a
|
Move output processing logic from scheduler.py into a separate file (#4354)
|
2025-03-12 16:21:49 -07:00 |
|
Lianmin Zheng
|
d40ee62b5d
|
Update nightly tests (#4352)
|
2025-03-12 15:36:13 -07:00 |
|
Lianmin Zheng
|
00d25a7f5e
|
Fix quantization and nightly tests (#4258)
|
2025-03-10 03:06:21 -07:00 |
|
Lianmin Zheng
|
08c4d764a5
|
lazy import attn backends (#4200)
|
2025-03-08 00:41:35 -08:00 |
|
Lianmin Zheng
|
d4017a6b63
|
[EAGLE] many fixes for eagle (#4195)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
|
2025-03-07 22:12:13 -08:00 |
|
Yineng Zhang
|
eb61f5c9af
|
Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" (#4186)
|
2025-03-07 10:27:52 -08:00 |
|
HAI
|
0beea4503f
|
ROCm: Flex Attention Enablement with custom backends (#4178)
Co-authored-by: linsun12 <linsun12@amd.com>
|
2025-03-07 04:38:53 -08:00 |
|
Lianmin Zheng
|
98c73d71cb
|
[Minor] make the __init__ function of model_runner.py shorter (#4132)
|
2025-03-06 01:51:12 -08:00 |
|
Zhiqiang Xie
|
aee30630d8
|
Add a pointer to the real KV cache pool (#4113)
|
2025-03-05 21:39:07 -08:00 |
|
Ying Sheng
|
d3d4d76758
|
[Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
|
2025-03-05 08:06:07 -08:00 |
|
Lianmin Zheng
|
e074d84e5b
|
[Minor] more code cleanup (#4077)
|
2025-03-04 21:23:47 -08:00 |
|
Chen Shengzhi
|
61261b3996
|
[XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. (#3954)
|
2025-03-04 04:05:56 -08:00 |
|
Lianmin Zheng
|
ac2387279e
|
Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
|
2025-03-03 00:12:04 -08:00 |
|
Baizhou Zhang
|
90a4b7d98a
|
[Feature] Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
|
2025-02-28 18:13:56 -08:00 |
|
fzyzcjy
|
e3e0bc50a9
|
[Feature] SPMD for SGLang + Verl (#3852)
|
2025-02-28 09:53:10 -08:00 |
|
Shenggui Li
|
c0bb9eb3b3
|
[improve] made timeout configurable (#3803)
|
2025-02-25 00:26:08 -08:00 |
|
Baizhou Zhang
|
b110084654
|
Refactor flashinfer logic for deepseek v3 and fix accuracy bug (#3785)
|
2025-02-24 04:07:25 -08:00 |
|
Yineng Zhang
|
714f3e6362
|
feat: support flashinfer mla with prefix cache (#3643)
|
2025-02-18 02:06:43 +08:00 |
|
Yineng Zhang
|
70f894b810
|
feat: support flashinfer mla attention for deepseek v3 (#3550)
|
2025-02-14 08:50:14 +08:00 |
|
Baizhou Zhang
|
70817a7eae
|
[Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
|
2025-02-03 22:09:13 -08:00 |
|
Yineng Zhang
|
013021b6a1
|
refactor EAGLE 2 (#3269)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: merrymercy <lianminzheng@gmail.com>
Co-authored-by: Ying1123 <sqy1415@gmail.com>
|
2025-02-03 20:52:30 +08:00 |
|
Lianmin Zheng
|
53cef81587
|
Improve weight loading and code style (#3174)
|
2025-01-27 03:00:41 -08:00 |
|
Ke Wen
|
862bcff833
|
Support loading of larger models with on-the-fly quantization (#3061)
|
2025-01-22 21:33:17 -08:00 |
|
Lianmin Zheng
|
89cd923581
|
Roll back to use vllm custom allreduce (#3006)
|
2025-01-20 04:03:15 -08:00 |
|
Lianmin Zheng
|
7906d1d298
|
Remove the unused write_with_records (#2972)
|
2025-01-18 20:20:23 -08:00 |
|
Mick
|
3d93f84a00
|
[Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
|
2025-01-18 14:14:19 -08:00 |
|
Yineng Zhang
|
5dc54f1a62
|
feat: remove vllm distributed (#2907)
Co-authored-by: Zhangyi <1109276519@qq.com>
|
2025-01-17 22:31:51 +08:00 |
|
Chunyuan WU
|
63051738a9
|
Enable CPU device on SGLang (#2806)
|
2025-01-16 21:22:53 -08:00 |
|
Lianmin Zheng
|
bc6915e3b9
|
Improve type annotation and styles (#2926)
|
2025-01-16 12:51:11 -08:00 |
|
Lianmin Zheng
|
8b6ce52e92
|
Support multi-node DP attention (#2925)
Co-authored-by: dhou-xai <dhou@x.ai>
|
2025-01-16 11:15:00 -08:00 |
|
Lianmin Zheng
|
46d4431889
|
Add a new api configure_logging to allow dumping the requests (#2875)
|
2025-01-13 14:24:00 -08:00 |
|
fzyzcjy
|
923f518337
|
CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630)
|
2025-01-13 11:38:51 -08:00 |
|
bjmsong
|
0bb0f76311
|
Support FP8 E4M3 KV Cache (#2786)
Co-authored-by: root <bjmsong@126.com>
|
2025-01-12 21:17:11 -08:00 |
|
Chang Su
|
f290bd4332
|
[Bugfix] Fix embedding model hangs with --enable-metrics (#2822)
|
2025-01-10 13:14:51 -08:00 |
|