Commit Graph

1443 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Byron Hsu | c36736c841 | [router] Add remove worker api (#2380) | 2024-12-06 17:16:03 -08:00 |
| Byron Hsu | 1bf9e34745 | [router] add remove tenant method in the radix tree (#2379) | 2024-12-06 11:53:15 -08:00 |
| Byron Hsu | 499c85f131 | [Router] remove duplicate char count (#2378) | 2024-12-06 11:26:07 -08:00 |
| Lianmin Zheng | e5f227c0ee | Release v0.4.0.post1 (#2375) | 2024-12-06 06:08:19 -08:00 |
| Lianmin Zheng | 0e7409adb6 | Fix the overlap for xgrammar (#2377) | 2024-12-06 05:49:29 -08:00 |
| vchzls | 3cde5eb629 | docs: Improve instructions for supporting new models (#2363) (Co-authored-by: zhaohoulong <zhaohoulong@xiaomi.com>) | 2024-12-06 04:27:17 -08:00 |
| Lianmin Zheng | f5b2a3aa67 | Use proc.join instead of busy waiting (#2374) | 2024-12-06 02:01:23 -08:00 |
| Yineng Zhang | f68175967c | docs: update adoption (Meituan) (#2373) | 2024-12-06 01:59:26 -08:00 |
| Byron Hsu | 67b657945a | [router] support /add_worker api (#2369) | 2024-12-06 01:17:04 -08:00 |
| Qun Yang | 37ee906f61 | Add more support for intel Gaudi accelerators (#2357) | 2024-12-06 01:16:33 -08:00 |
| Xiaoyu Zhang | 34b364e073 | optimize cuda graph max_bs_settings on low-end gpus (#2360) | 2024-12-06 01:13:04 -08:00 |
| Yineng Zhang | 84d96b3ae5 | Move FP8 to SGLang (#2370) (Co-authored-by: HaiShaw <hixiao@gmail.com>) | 2024-12-06 15:42:10 +08:00 |
| xiaobochen | 3d32e4a32c | Resubmit MoE-EP (#2371) | 2024-12-06 15:05:21 +08:00 |
| Byron Hsu | 64fceab8af | [router] use 2-gpu-runner (#2368) | 2024-12-06 14:13:57 +08:00 |
| Lianmin Zheng | 71e2a27753 | Fix the cuda graph capture range for small #max-running-requests (#2359) | 2024-12-06 14:13:57 +08:00 |
| Ke Bao | 4a63c181f1 | Fix AWQ with enable MLA (#2364) | 2024-12-06 00:46:48 +08:00 |
| Lianmin Zheng | 2b0fc5941d | [Minor] Code style improvements (#2355) | 2024-12-04 19:02:08 -08:00 |
| Jerry Zhang | 9cc733b38c | move apply_torchao_config_ to model_runner (#2342) | 2024-12-04 17:26:42 -08:00 |
| Ke Wen | d693ec0427 | Make torch TP composable with torch.compile (#2352) | 2024-12-04 17:26:00 -08:00 |
| Chayenne | 18ea841f40 | Add Docs For SGLang Native Router (#2308) | 2024-12-04 15:41:22 -08:00 |
| Chayenne | 786be44da5 | Fix Docs CI When Compile Error (#2323) | 2024-12-04 11:19:46 -08:00 |
| Yineng Zhang | 2db4469808 | minor: limit the range of vllm versions (#2350) | 2024-12-05 02:00:34 +08:00 |
| Ata Fatahi | ed45e509df | Check gpu availability at server args creation (#2340) (Signed-off-by: Ata Fatahi <immrata@gmail.com>) | 2024-12-05 01:53:02 +08:00 |
| Ke Bao | ec52464dde | MLA prefill w/o weight absorption (#2349) | 2024-12-05 01:50:28 +08:00 |
| Yineng Zhang | eb0c1f5373 | docs: add SGLang v0.4 blog (#2341) | 2024-12-05 01:24:51 +08:00 |
| HAI | b2986d7aa5 | Adding SGLang FP8 Utils (#2348) | 2024-12-04 03:01:33 -08:00 |
| Yineng Zhang | f8b0326934 | chore: bump v0.4.0 (#2338) | 2024-12-03 11:55:41 -08:00 |
| Byron Hsu | 0495796517 | [router] Copy license when publishing & bump version (#2339) | 2024-12-03 10:27:43 -08:00 |
| Lianmin Zheng | 1228f7ca69 | Fix gptq for moe layers (#2300) (Co-authored-by: root <me@zhyncs.com>) | 2024-12-03 23:12:33 +08:00 |
| Yineng Zhang | fda628d8f2 | fix: resolve cmake url for Dockerfile.dev (#2335) | 2024-12-03 21:22:19 +08:00 |
| Lianmin Zheng | 07ec07ad1f | Improve torch compile for fused moe (#2327) | 2024-12-03 01:58:25 -08:00 |
| Ata Fatahi | 83b340e371 | Add missing license for router wheel (#2324) (Signed-off-by: Ata Fatahi <immrata@gmail.com>) | 2024-12-03 00:06:25 -08:00 |
| HAI | 0639bf15d1 | ROCm Container: set SGLANG_SET_CPU_AFFINITY=1 (#2328) | 2024-12-02 23:20:33 -08:00 |
| Ying Sheng | aa47f64223 | Revert "[feat] Enable chunked prefill for llava-onevision" (#2329) | 2024-12-02 23:11:13 -08:00 |
| Lianmin Zheng | 3ddb1c4679 | [Minor] Fix logger and style (#2325) | 2024-12-02 20:45:53 -08:00 |
| Ying Sheng | 480e38a733 | [feat] Enable chunked prefill for llava-onevision (#2281) | 2024-12-02 20:19:02 -08:00 |
| HAI | 69e2d4fb66 | Relax to include more AMD GPUs (#2319) | 2024-12-02 19:05:58 -08:00 |
| Yineng Zhang | 85e1a6f3aa | Update model_loader deps and qqq quantization deps (#2220) (#2318) (Co-authored-by: HandH1998 <1335248067@qq.com>) | 2024-12-02 23:22:13 +08:00 |
| Lianmin Zheng | 33deca81b5 | Add more fused moe benchmark utilities (#2314) | 2024-12-02 04:26:55 -08:00 |
| Lianmin Zheng | 18108abe5d | [Minor] Fix code style (#2311) | 2024-12-02 02:27:36 -08:00 |
| HAI | c54bda300a | Use rocminfo instead of rocm-smi for more OS/WSL support (#2310) | 2024-12-02 00:15:45 -08:00 |
| Lianmin Zheng | 3c79ad35ca | [Fix] Fix the padded hash value for image tokens (#2309) | 2024-12-01 23:36:28 -08:00 |
| Chayenne | 983bfcf386 | Online weight updates from torch.distributed (#2279) | 2024-12-01 23:23:18 -08:00 |
| Yineng Zhang | 28bc60dcab | misc: update build setup (#2306) | 2024-12-02 02:03:49 +08:00 |
| Yineng Zhang | 7301a39b13 | fix: resolve CodeQL cpp issue (#2305) | 2024-12-01 23:55:19 +08:00 |
| Yineng Zhang | 47eb139f81 | feat: use warp reduce as a simple example (#2304) | 2024-12-01 22:43:50 +08:00 |
| Lianmin Zheng | 5c18a03733 | Fix logprob for completions (#2301) | 2024-12-01 05:17:05 -08:00 |
| Yineng Zhang | 5c91a315d7 | feat: support sgl-kernel pypi (#2302) | 2024-12-01 20:11:21 +08:00 |
| Yineng Zhang | 3dbd73d319 | minor: rm unused _grouped_size_compiled_for_decode_kernels (#2299) | 2024-12-01 19:24:12 +08:00 |
| Yineng Zhang | e9a6203dee | feat: skip good first issue (#2298) | 2024-12-01 19:18:57 +08:00 |