Ata Fatahi | ce094a5d79 | Clean up GPU memory after killing sglang processes (#2457); Signed-off-by: Ata Fatahi <immrata@gmail.com> | 2024-12-17 03:42:40 -08:00 |
bjmsong | e21026690d | benchmark decoding attention kernel with cudnn (#2467); Co-authored-by: root <bjmsong@126.com> | 2024-12-17 03:31:57 -08:00 |
Lianmin Zheng | bd6196163e | Small fix for the order of apply_torchao_config (#2495) | 2024-12-16 19:21:11 -08:00 |
Lianmin Zheng | 56198b45d9 | Add a benchmark script for in-batch prefix caching (#2494) | 2024-12-16 18:49:02 -08:00 |
Lianmin Zheng | ba36b5520a | Revert "Small fixes for torchao quant" (#2493) | 2024-12-16 15:04:16 -08:00 |
Lianmin Zheng | 9cd9dc83b3 | Temporarily disable unit test of torch native attention backend (#2492) | 2024-12-16 14:17:27 -08:00 |
Lianmin Zheng | 7a1aecb938 | Simplify pytorch sampling kernel and logit processor (#2491) | 2024-12-16 14:11:09 -08:00 |
Jerry Zhang | 82699474fd | Small fixes for torchao quant (#2476) | 2024-12-16 14:08:12 -08:00 |
Yineng Zhang | 7154b4b1df | minor: update flashinfer nightly (#2490) | 2024-12-16 23:02:49 +08:00 |
xiaobochen | b532a5fd16 | fix moe-ep accuracy issue for fp8 (#2489) | 2024-12-16 20:54:02 +08:00 |
Xiaoyu Zhang | a0592c059f | [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm (#2486) | 2024-12-15 13:52:08 +08:00 |
Yineng Zhang | e8dbdf75bc | fix typo (#2487) | 2024-12-15 13:44:55 +08:00 |
yizhang2077 | e04d3f2897 | adapt tensorrt llm custom all reduce to sgl-kernel (#2481); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-12-15 13:15:59 +08:00 |
Yineng Zhang | 5f2595be43 | hotfix: checking for HIP (#2485) | 2024-12-15 02:47:26 +08:00 |
Ke Bao | 0ba2c58947 | Remove cuda graph batch size adjustment for dp attention (#2484) | 2024-12-14 23:53:54 +08:00 |
Yineng Zhang | fccbfa3752 | format: add clang-format for sgl-kernel (#2483) | 2024-12-14 22:36:04 +08:00 |
Ke Bao | 2f9bd0fafd | Fix correctness issue for triton decoding kernel (#2479) | 2024-12-14 16:50:54 +08:00 |
Lianmin Zheng | 5282a4735f | [Minor] Fix grok model loader (#2473) | 2024-12-12 14:34:47 -08:00 |
Yineng Zhang | f0ed9c353e | feat: support dev image (#2469) | 2024-12-13 02:23:52 +08:00 |
Ata Fatahi | e3b3acfa6f | Rename rust folder to sgl-router (#2464); Signed-off-by: Ata Fatahi <immrata@gmail.com> | 2024-12-12 09:40:41 -08:00 |
Yineng Zhang | 2673fa29d4 | fix: set runtime path (#2466) | 2024-12-12 18:05:48 +08:00 |
Yineng Zhang | dedaf8cd48 | minor: update pypi tag (#2463) | 2024-12-12 15:21:45 +08:00 |
Yineng Zhang | 32ed016041 | chore: bump v0.0.2 for sgl-kernel (#2462) | 2024-12-12 14:58:05 +08:00 |
Ata Fatahi | 6efa9e4a6d | Bump sglang-router to 0.1.1 (#2459); Signed-off-by: Ata Fatahi <immrata@gmail.com> | 2024-12-11 17:40:03 -08:00 |
Ata Fatahi | 7791fd9948 | Include version info into the router package (#2456); Signed-off-by: Ata Fatahi <immrata@gmail.com>; Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> | 2024-12-11 17:31:20 -08:00 |
Ata Fatahi | 2ac36b9a7b | Make request payload size configurable (#2444); Signed-off-by: Ata Fatahi <immrata@gmail.com> | 2024-12-11 16:55:21 -08:00 |
Byron Hsu | 2d60a5ee75 | Update v0.1.0.md | 2024-12-11 13:48:18 -08:00 |
Byron Hsu | 2e4a5907c9 | [router] Release router 0.1.0 with dynamic scaling and fault tolerance (#2455) | 2024-12-11 13:42:35 -08:00 |
Byron Hsu | c0ee46fe10 | [router] Update doc for dynamic scaling and fault tolerance (#2454) | 2024-12-11 13:11:42 -08:00 |
SangBin Cho | 9208618b3e | [Core] in batch prefix caching by delay scheduling (#2442) | 2024-12-11 12:51:50 -08:00 |
Byron Hsu | 864bf2ba00 | [router] remove main.rs because only lib.rs is used for py binding (#2453) | 2024-12-11 12:13:19 -08:00 |
Byron Hsu | a4cca7fc53 | [router] Add retries based fault tolerance (#2452) | 2024-12-11 12:13:08 -08:00 |
Fred Reiss | 993956c6b1 | Add support for IBM Granite 3.x models (#2437) | 2024-12-11 06:30:23 -08:00 |
Lianmin Zheng | f8548295d6 | Fix warmup in bench_offline_throughput.py (#2449) | 2024-12-11 06:16:01 -08:00 |
Lianmin Zheng | 959735fc9e | Fix model loader for more quantization formats (#2448) | 2024-12-11 05:21:23 -08:00 |
bjmsong | f67723940d | decoding attention kernel benchmark (#2425); Co-authored-by: root <bjmsong@126.com> | 2024-12-11 04:46:59 -08:00 |
Yineng Zhang | 626a99ac13 | chore: update ao v0.7.0 (#2447) | 2024-12-11 04:44:28 -08:00 |
Ke Wen | ece724910a | Make torch TP composable with torchao (#2436) | 2024-12-11 04:21:42 -08:00 |
Byron Hsu | 0fb88aaa77 | [router] Use borrow if possible to save cost (#2441) | 2024-12-11 01:38:50 -08:00 |
Byron Hsu | d4de9a6235 | [router] Refactor: decouple select and send stage (#2440) | 2024-12-11 00:51:21 -08:00 |
Yineng Zhang | 7310aede97 | fix: compatible with PEP 440 (#2435) | 2024-12-11 06:48:45 +08:00 |
Yineng Zhang | 5de9a58eca | fix: use manylinux2014_x86_64 tag (#2434) | 2024-12-11 06:17:41 +08:00 |
Yineng Zhang | 56fcd8e8a5 | feat: support sgl-kernel PyPI (#2433); Co-authored-by: Zhangyi <1109276519@qq.com> | 2024-12-11 06:06:19 +08:00 |
Adarsh Shirawalmath | 2b340adfb1 | Typo fix in router.md (#2424) | 2024-12-09 21:49:40 -08:00 |
Ying Sheng | 8586b72da0 | [feat] Enable chunked prefill for llava-onevision (#2412) | 2024-12-09 09:52:38 -08:00 |
Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00 |
Lianmin Zheng | 0ce091a82d | [Minor] Improve code style (#2419) | 2024-12-09 03:05:59 -08:00 |
Lianmin Zheng | 835f8afc77 | Migrate llama_classification to use the /classify interface (#2417) | 2024-12-08 23:30:51 -08:00 |
Xiaoyu Zhang | 3844feb9bb | Add a unittest for fused_moe (#2416) | 2024-12-08 22:46:10 -08:00 |
Byron Hsu | 27f7bed7a7 | reduce watchdog interval to 5s (#2410) | 2024-12-08 21:17:31 -08:00 |