Commit Graph

1455 Commits

Author SHA1 Message Date
Yineng Zhang
6128f7cff5 fix: specify dtype with begin_forward aka plan (#2404) 2024-12-08 20:07:30 +08:00
Lianmin Zheng
a2486eb58f Fix a bug with logprob streaming + chunked prefill (#2403) 2024-12-08 03:55:27 -08:00
Ke Bao
61dec545b0 Remove unused vars in the triton backend (#2401) 2024-12-08 03:37:03 -08:00
Lianmin Zheng
96db0f666d Update killall_sglang.sh (#2397) 2024-12-08 01:56:26 -08:00
Ke Bao
7dc66fcb40 Optimize Triton decoding kernel for long context (#2394) 2024-12-08 01:17:37 -08:00
SangBin Cho
1f09e84b9a nit: Remove busy waiting on scheduler (#2382) 2024-12-08 01:06:15 -08:00
Sangchun Ha (Patrick)
63dfab1bea Fix shape error that occurred when loading lora weight of gemma2 model. (#2330) 2024-12-08 01:04:08 -08:00
Byron Hsu
ef995dae1e [router] Health check on worker before adding to the router (#2392) 2024-12-07 15:39:54 -08:00
Yineng Zhang
75ae968959 minor: update killall script (#2391) 2024-12-08 04:21:00 +08:00
HAI
95f93f493a Fp8 MoE optimizations on AMD (#2388) 2024-12-07 21:18:26 +08:00
Yineng Zhang
aaac33fd8d fix: update xgrammar v0.1.6 (#2390) 2024-12-07 21:09:16 +08:00
Yineng Zhang
d332aa3b0c fix: resolve fp8 moe issue (#2387) 2024-12-07 19:28:53 +08:00
Byron Hsu
c36736c841 [router] Add remove worker api (#2380) 2024-12-06 17:16:03 -08:00
Byron Hsu
1bf9e34745 [router] add remove tenant method in the radix tree (#2379) 2024-12-06 11:53:15 -08:00
Byron Hsu
499c85f131 [Router] remove duplicate char count (#2378) 2024-12-06 11:26:07 -08:00
Lianmin Zheng
e5f227c0ee Release v0.4.0.post1 (#2375) 2024-12-06 06:08:19 -08:00
Lianmin Zheng
0e7409adb6 Fix the overlap for xgrammar (#2377) 2024-12-06 05:49:29 -08:00
vchzls
3cde5eb629 docs: Improve instructions for supporting new models (#2363) 2024-12-06 04:27:17 -08:00
Co-authored-by: zhaohoulong <zhaohoulong@xiaomi.com>
Lianmin Zheng
f5b2a3aa67 Use proc.join instead of busy waiting (#2374) 2024-12-06 02:01:23 -08:00
Yineng Zhang
f68175967c docs: update adoption (Meituan) (#2373) 2024-12-06 01:59:26 -08:00
Byron Hsu
67b657945a [router] support /add_worker api (#2369) 2024-12-06 01:17:04 -08:00
Qun Yang
37ee906f61 Add more support for intel Gaudi accelerators (#2357) 2024-12-06 01:16:33 -08:00
Xiaoyu Zhang
34b364e073 optimize cuda graph max_bs_settings on low-end gpus (#2360) 2024-12-06 01:13:04 -08:00
Yineng Zhang
84d96b3ae5 Move FP8 to SGLang (#2370) 2024-12-06 15:42:10 +08:00
Co-authored-by: HaiShaw <hixiao@gmail.com>
xiaobochen
3d32e4a32c Resubmit MoE-EP (#2371) 2024-12-06 15:05:21 +08:00
Byron Hsu
64fceab8af [router] use 2-gpu-runner (#2368) 2024-12-06 14:13:57 +08:00
Lianmin Zheng
71e2a27753 Fix the cuda graph capture range for small #max-running-requests (#2359) 2024-12-06 14:13:57 +08:00
Ke Bao
4a63c181f1 Fix AWQ with enable MLA (#2364) 2024-12-06 00:46:48 +08:00
Lianmin Zheng
2b0fc5941d [Minor] Code style improvements (#2355) 2024-12-04 19:02:08 -08:00
Jerry Zhang
9cc733b38c move apply_torchao_config_ to model_runner (#2342) 2024-12-04 17:26:42 -08:00
Ke Wen
d693ec0427 Make torch TP composable with torch.compile (#2352) 2024-12-04 17:26:00 -08:00
Chayenne
18ea841f40 Add Docs For SGLang Native Router (#2308) 2024-12-04 15:41:22 -08:00
Chayenne
786be44da5 Fix Docs CI When Compile Error (#2323) 2024-12-04 11:19:46 -08:00
Yineng Zhang
2db4469808 minor: limit the range of vllm versions (#2350) 2024-12-05 02:00:34 +08:00
Ata Fatahi
ed45e509df Check gpu availability at server args creation (#2340) 2024-12-05 01:53:02 +08:00
Signed-off-by: Ata Fatahi <immrata@gmail.com>
Ke Bao
ec52464dde MLA prefill w/o weight absorption (#2349) 2024-12-05 01:50:28 +08:00
Yineng Zhang
eb0c1f5373 docs: add SGLang v0.4 blog (#2341) 2024-12-05 01:24:51 +08:00
HAI
b2986d7aa5 Adding SGLang FP8 Utils (#2348) 2024-12-04 03:01:33 -08:00
Yineng Zhang
f8b0326934 chore: bump v0.4.0 (#2338) 2024-12-03 11:55:41 -08:00
Byron Hsu
0495796517 [router] Copy license when publishing & bump version (#2339) 2024-12-03 10:27:43 -08:00
Lianmin Zheng
1228f7ca69 Fix gptq for moe layers (#2300) 2024-12-03 23:12:33 +08:00
Co-authored-by: root <me@zhyncs.com>
Yineng Zhang
fda628d8f2 fix: resolve cmake url for Dockerfile.dev (#2335) 2024-12-03 21:22:19 +08:00
Lianmin Zheng
07ec07ad1f Improve torch compile for fused moe (#2327) 2024-12-03 01:58:25 -08:00
Ata Fatahi
83b340e371 Add missing license for router wheel (#2324) 2024-12-03 00:06:25 -08:00
Signed-off-by: Ata Fatahi <immrata@gmail.com>
HAI
0639bf15d1 ROCm Container: set SGLANG_SET_CPU_AFFINITY=1 (#2328) 2024-12-02 23:20:33 -08:00
Ying Sheng
aa47f64223 Revert "[feat] Enable chunked prefill for llava-onevision" (#2329) 2024-12-02 23:11:13 -08:00
Lianmin Zheng
3ddb1c4679 [Minor] Fix logger and style (#2325) 2024-12-02 20:45:53 -08:00
Ying Sheng
480e38a733 [feat] Enable chunked prefill for llava-onevision (#2281) 2024-12-02 20:19:02 -08:00
HAI
69e2d4fb66 Relax to include more AMD GPUs (#2319) 2024-12-02 19:05:58 -08:00
Yineng Zhang
85e1a6f3aa Update model_loader deps and qqq quantization deps (#2220) (#2318) 2024-12-02 23:22:13 +08:00
Co-authored-by: HandH1998 <1335248067@qq.com>