aoshen524
|
588865f0e0
|
[Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
|
2025-03-18 20:33:07 -07:00 |
|
JieXin Liang
|
0212d2e288
|
[Fix] use torch.inference_mode() instead of torch.no_grad() (#4372)
|
2025-03-16 22:54:16 -07:00 |
|
Byron Hsu
|
8cc300f536
|
Fix router test (#4483)
|
2025-03-16 22:49:47 -07:00 |
|
Lianmin Zheng
|
45de89719c
|
Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367)
|
2025-03-12 23:45:52 -07:00 |
|
Meng, Hengyu
|
71046fcd71
|
[XPU][CPU] Enable the native path of DeepSeek (#4086)
Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
|
2025-03-12 22:26:29 -07:00 |
|
Stefan He
|
e0917e6bd0
|
Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% (#4215)
Co-authored-by: Stefan He <bhe@linkedin.com>
|
2025-03-12 00:08:03 -07:00 |
|
lukec
|
dce303e279
|
linear support deepgemm (#4199)
Co-authored-by: yinfan98 <1106310035@qq.com>
|
2025-03-11 00:38:37 -07:00 |
|
HandH1998
|
2ac189edc8
|
Amd test fp8 (#4261)
|
2025-03-10 10:12:09 -07:00 |
|
Lianmin Zheng
|
00d25a7f5e
|
Fix quantization and nightly tests (#4258)
|
2025-03-10 03:06:21 -07:00 |
|
Lianmin Zheng
|
fbd560028a
|
Auto balance CI tests (#4238)
|
2025-03-09 21:05:55 -07:00 |
|
HandH1998
|
0dd6cda288
|
Apply sgl w8a8 fp8 kernel (#3148)
|
2025-03-09 00:03:32 -08:00 |
|
Pan Lyu
|
361971b859
|
Add Support for Qwen2-VL Multi-modal Embedding Models (#3694)
|
2025-03-06 16:46:20 -08:00 |
|
Mick
|
583d6af71b
|
example: add vlm to token in & out example (#3941)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
|
2025-03-04 22:18:26 -08:00 |
|
Lianmin Zheng
|
77a3954bf7
|
Simplify eagle tests and TP sync in grammar backend (#4066)
|
2025-03-04 13:40:40 -08:00 |
|
Xihuai Wang
|
95575aa76a
|
Reasoning parser (#4000)
Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
|
2025-03-03 21:16:36 -08:00 |
|
Qiaolin Yu
|
57a404fd55
|
Remove outdated test utils and fix links for the doc of sampling params (#3999)
|
2025-03-03 09:41:38 -08:00 |
|
Lianmin Zheng
|
935cda944b
|
Misc clean up; Remove the support of jump forward (#4032)
|
2025-03-03 07:02:14 -08:00 |
|
Lianmin Zheng
|
1a8f995c46
|
remove cache configs in model definitions (#4031)
|
2025-03-03 05:00:50 -08:00 |
|
Lianmin Zheng
|
66301e124f
|
Improve code styles (#4021)
|
2025-03-03 03:20:23 -08:00 |
|
Lianmin Zheng
|
ac2387279e
|
Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
|
2025-03-03 00:12:04 -08:00 |
|
fzyzcjy
|
e3e0bc50a9
|
[Feature] SPMD for SGLang + Verl (#3852)
|
2025-02-28 09:53:10 -08:00 |
|
lukec
|
21463e321a
|
Expert Parallelism (EP) Support for DeepSeek V3/R1 (#3602)
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: laixin <q865809639@gmail.com>
|
2025-02-26 02:29:37 -08:00 |
|
Shenggui Li
|
c0bb9eb3b3
|
[improve] made timeout configurable (#3803)
|
2025-02-25 00:26:08 -08:00 |
|
aoshen524
|
e79f7420be
|
[Fix] Fix bugs and refactor codes in lora for better scalability. (#3652)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
|
2025-02-20 11:51:57 -08:00 |
|
Baizhou Zhang
|
70817a7eae
|
[Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
|
2025-02-03 22:09:13 -08:00 |
|
Lianmin Zheng
|
53cef81587
|
Improve weight loading and code style (#3174)
|
2025-01-27 03:00:41 -08:00 |
|
Lianmin Zheng
|
a4331cd260
|
Add accuracy and latency tests of eagle into CI (#3027)
|
2025-01-21 02:55:14 -08:00 |
|
Lianmin Zheng
|
287d07a669
|
Misc fixes for eagle (flush_cache, CPU overhead) (#3014)
|
2025-01-20 20:27:38 -08:00 |
|
Lianmin Zheng
|
60b2a44a80
|
Fix flaky tests in test_programs.py (#3022)
|
2025-01-20 16:50:39 -08:00 |
|
Lianmin Zheng
|
03464890e0
|
Separate two entry points: Engine and HTTP server (#2996)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
|
2025-01-19 22:09:24 -08:00 |
|
Lianmin Zheng
|
61f42b5732
|
Move sgl.Runtime under sglang/lang (#2990)
|
2025-01-19 17:10:29 -08:00 |
|
Mick
|
3d93f84a00
|
[Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
|
2025-01-18 14:14:19 -08:00 |
|
bjmsong
|
d3024f4fc8
|
support e4m3 kvcache in qwen2 & add kv scaling factor json (#2894)
Co-authored-by: bjmsong <bjmsong@126.com>
|
2025-01-18 11:43:22 +08:00 |
|
Lianmin Zheng
|
8f2c522aba
|
Improve benchmark scripts and error message printing (#2922)
|
2025-01-16 06:24:31 -08:00 |
|
Lianmin Zheng
|
f65c13b559
|
Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902)
|
2025-01-15 04:54:14 -08:00 |
|
Lianmin Zheng
|
b22f3f6475
|
Fix nightly accuracy tests (#2780)
|
2025-01-07 21:02:35 -08:00 |
|
Lianmin Zheng
|
bdc1acf6cd
|
Misc fix for min_p_sampling, --cuda-graph-bs (#2761)
|
2025-01-07 02:52:53 -08:00 |
|
Xingyao Wang
|
1acbaf1b5a
|
Add generator-style run_batch function (#2513)
Co-authored-by: openhands <openhands@all-hands.dev>
|
2025-01-06 15:04:55 -08:00 |
|
HandH1998
|
53aed988cb
|
Refactor MoE (#2575)
Co-authored-by: zhyncs <me@zhyncs.com>
|
2024-12-26 00:02:14 +08:00 |
|
Lianmin Zheng
|
a6ca736c8e
|
Simplify stream_output (#2398)
|
2024-12-08 12:27:13 -08:00 |
|
Lianmin Zheng
|
a2486eb58f
|
Fix a bug with logprob streaming + chunked prefill (#2403)
|
2024-12-08 03:55:27 -08:00 |
|
Lianmin Zheng
|
ccaf1f997c
|
[CI] Print summary on github actions (#2274)
|
2024-11-29 23:48:54 -08:00 |
|
Chayenne
|
7d5d1d3d29
|
update weights from disk (#2265)
|
2024-11-30 01:17:00 +00:00 |
|
bjmsong
|
01017d4c20
|
Support LoRA in Completion API (#2243)
Co-authored-by: root <bjmsong@126.com>
|
2024-11-29 16:13:38 -08:00 |
|
Lianmin Zheng
|
b2ccf36d4d
|
Fix memory leak during abort (#2238)
|
2024-11-28 02:22:15 -08:00 |
|
Lianmin Zheng
|
d4fc1a70e3
|
Crash the server correctly during error (#2231)
|
2024-11-28 00:22:39 -08:00 |
|
Lianmin Zheng
|
fb6e04a0c2
|
Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2222)
|
2024-11-27 02:52:46 -08:00 |
|
Lianmin Zheng
|
6997e28f6e
|
Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" (#2221)
|
2024-11-27 02:02:01 -08:00 |
|
Lianmin Zheng
|
a0e58740a8
|
Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2217)
|
2024-11-27 01:13:41 -08:00 |
|
Lianmin Zheng
|
8e1adb8441
|
Allow overwrite flashinfer use_tensorcore (#2169)
|
2024-11-24 20:58:17 -08:00 |
|