Commit Graph

193 Commits

Author SHA1 Message Date
Yineng Zhang
3289c1207d Update the retry count (#5051) 2025-04-03 17:07:38 -07:00
Xiaoyu Zhang
e9c6ce461d sgl scaled_fp8_quant support output padding (#4861) 2025-04-02 23:53:57 +08:00
Lianmin Zheng
4ede6770cd Fix retract for page size > 1 (#4914) 2025-03-30 02:57:15 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
fzyzcjy
8690c40bb0 Improve stack trace of retry errors (#4845) 2025-03-29 08:21:31 -07:00
Lianmin Zheng
47e6628aae Fix CI tests (#4853) 2025-03-28 00:28:35 -07:00
Pan Lyu
c913ed4046 support clip embedding model (#4506) 2025-03-27 00:18:15 -07:00
fzyzcjy
fa3c9e0668 Fix popen_launch_server wait for 20 minutes when child process exits (#4777) 2025-03-26 00:32:19 -07:00
fzyzcjy
26f07294f1 Warn users when release_memory_occupation is called without memory saver enabled (#4566) 2025-03-26 00:18:14 -07:00
fzyzcjy
15ddd84322 Add retry for flaky tests in CI (#4755) 2025-03-25 16:53:12 -07:00
Stefan He
4c584fc632 Fix circular imports in gptq.py and unblock test explorer (#4736) 2025-03-24 18:07:08 -07:00
Stefan He
5d7edc8e55 Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
Co-authored-by: qsong <qsong@linkedin.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-03-23 23:28:11 -07:00
Yun Dai
8cd4250401 [quantization] fix channelwise conversion with scalar weight scale (#4596) 2025-03-22 00:47:52 -07:00
aoshen524
588865f0e0 [Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-03-18 20:33:07 -07:00
JieXin Liang
0212d2e288 [Fix] use torch.inference_mode() instead of torch.no_grad() (#4372) 2025-03-16 22:54:16 -07:00
Byron Hsu
8cc300f536 Fix router test (#4483) 2025-03-16 22:49:47 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086)
Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
2025-03-12 22:26:29 -07:00
Stefan He
e0917e6bd0 Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% (#4215)
Co-authored-by: Stefan He <bhe@linkedin.com>
2025-03-12 00:08:03 -07:00
lukec
dce303e279 linear support deepgemm (#4199)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-11 00:38:37 -07:00
HandH1998
2ac189edc8 Amd test fp8 (#4261) 2025-03-10 10:12:09 -07:00
Lianmin Zheng
00d25a7f5e Fix quantization and nightly tests (#4258) 2025-03-10 03:06:21 -07:00
Lianmin Zheng
fbd560028a Auto balance CI tests (#4238) 2025-03-09 21:05:55 -07:00
HandH1998
0dd6cda288 Apply sgl w8a8 fp8 kernel (#3148) 2025-03-09 00:03:32 -08:00
Pan Lyu
361971b859 Add Support for Qwen2-VL Multi-modal Embedding Models (#3694) 2025-03-06 16:46:20 -08:00
Mick
583d6af71b example: add vlm to token in & out example (#3941)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-04 22:18:26 -08:00
Lianmin Zheng
77a3954bf7 Simplify eagle tests and TP sync in grammar backend (#4066) 2025-03-04 13:40:40 -08:00
Xihuai Wang
95575aa76a Reasoning parser (#4000)
Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
2025-03-03 21:16:36 -08:00
Qiaolin Yu
57a404fd55 Remove outdated test utils and fix links for the doc of sampling params (#3999) 2025-03-03 09:41:38 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
1a8f995c46 remove cache configs in model definitions (#4031) 2025-03-03 05:00:50 -08:00
Lianmin Zheng
66301e124f Improve code styles (#4021) 2025-03-03 03:20:23 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
fzyzcjy
e3e0bc50a9 [Feature] SPMD for SGLang + Verl (#3852) 2025-02-28 09:53:10 -08:00
lukec
21463e321a Expert Parallelism (EP) Support for DeepSeek V3/R1 (#3602)
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: laixin <q865809639@gmail.com>
2025-02-26 02:29:37 -08:00
Shenggui Li
c0bb9eb3b3 [improve] made timeout configurable (#3803) 2025-02-25 00:26:08 -08:00
aoshen524
e79f7420be [Fix] Fix bugs and refactor codes in lora for better scalability. (#3652)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-02-20 11:51:57 -08:00
Baizhou Zhang
70817a7eae [Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-03 22:09:13 -08:00
Lianmin Zheng
53cef81587 Improve weight loading and code style (#3174) 2025-01-27 03:00:41 -08:00
Lianmin Zheng
a4331cd260 Add accuracy and latency tests of eagle into CI (#3027) 2025-01-21 02:55:14 -08:00
Lianmin Zheng
287d07a669 Misc fixes for eagle (flush_cache, CPU overhead) (#3014) 2025-01-20 20:27:38 -08:00
Lianmin Zheng
60b2a44a80 Fix flaky tests in test_programs.py (#3022) 2025-01-20 16:50:39 -08:00
Lianmin Zheng
03464890e0 Separate two entry points: Engine and HTTP server (#2996)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-01-19 22:09:24 -08:00
Lianmin Zheng
61f42b5732 Move sgl.Runtime under sglang/lang (#2990) 2025-01-19 17:10:29 -08:00
Mick
3d93f84a00 [Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-01-18 14:14:19 -08:00
bjmsong
d3024f4fc8 support e4m3 kvcache in qwen2 & add kv scaling facotr json (#2894)
Co-authored-by: bjmsong <bjmsong@126.com>
2025-01-18 11:43:22 +08:00
Lianmin Zheng
8f2c522aba Improve benchmark scripts and error message printing (#2922) 2025-01-16 06:24:31 -08:00
Lianmin Zheng
f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) 2025-01-15 04:54:14 -08:00
Lianmin Zheng
b22f3f6475 Fix nightly accuracy tests (#2780) 2025-01-07 21:02:35 -08:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00