Byron Hsu | 864bf2ba00 | [router] remove main.rs because only lib.rs is used for py binding (#2453) | 2024-12-11 12:13:19 -08:00 |
Byron Hsu | a4cca7fc53 | [router] Add retries based fault tolerance (#2452) | 2024-12-11 12:13:08 -08:00 |
Fred Reiss | 993956c6b1 | Add support for IBM Granite 3.x models (#2437) | 2024-12-11 06:30:23 -08:00 |
Lianmin Zheng | f8548295d6 | Fix warmup in bench_offline_throughput.py (#2449) | 2024-12-11 06:16:01 -08:00 |
Lianmin Zheng | 959735fc9e | Fix model loader for more quantization formats (#2448) | 2024-12-11 05:21:23 -08:00 |
bjmsong | f67723940d | decoding attention kernel benchmark (#2425) Co-authored-by: root <bjmsong@126.com> | 2024-12-11 04:46:59 -08:00 |
Yineng Zhang | 626a99ac13 | chore: update ao v0.7.0 (#2447) | 2024-12-11 04:44:28 -08:00 |
Ke Wen | ece724910a | Make torch TP composable with torchao (#2436) | 2024-12-11 04:21:42 -08:00 |
Byron Hsu | 0fb88aaa77 | [router] Use borrow if possible to save cost (#2441) | 2024-12-11 01:38:50 -08:00 |
Byron Hsu | d4de9a6235 | [router] Refactor: decouple select and send stage (#2440) | 2024-12-11 00:51:21 -08:00 |
Yineng Zhang | 7310aede97 | fix: compatible with PEP 440 (#2435) | 2024-12-11 06:48:45 +08:00 |
Yineng Zhang | 5de9a58eca | fix: use manylinux2014_x86_64 tag (#2434) | 2024-12-11 06:17:41 +08:00 |
Yineng Zhang | 56fcd8e8a5 | feat: support sgl-kernel PyPI (#2433) Co-authored-by: Zhangyi <1109276519@qq.com> | 2024-12-11 06:06:19 +08:00 |
Adarsh Shirawalmath | 2b340adfb1 | Typo fix in router.md (#2424) | 2024-12-09 21:49:40 -08:00 |
Ying Sheng | 8586b72da0 | [feat] Enable chunked prefill for llava-onevision (#2412) | 2024-12-09 09:52:38 -08:00 |
Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00 |
Lianmin Zheng | 0ce091a82d | [Minor] Improve code style (#2419) | 2024-12-09 03:05:59 -08:00 |
Lianmin Zheng | 835f8afc77 | Migrate llama_classification to use the /classify interface (#2417) | 2024-12-08 23:30:51 -08:00 |
Xiaoyu Zhang | 3844feb9bb | Add a unittest for fused_moe (#2416) | 2024-12-08 22:46:10 -08:00 |
Byron Hsu | 27f7bed7a7 | reduce watchdog interval to 5s (#2410) | 2024-12-08 21:17:31 -08:00 |
Byron Hsu | 6387098f5f | [router] add health checking in router init (#2393) | 2024-12-08 17:17:37 -08:00 |
Byron Hsu | 2a717c5078 | [Router] fix interrupt from terminal (#2413) | 2024-12-08 16:58:41 -08:00 |
Byron Hsu | a1e697b25b | [router] Improve cleanup logic (#2411) | 2024-12-08 15:24:02 -08:00 |
Lianmin Zheng | a6ca736c8e | Simplify stream_output (#2398) | 2024-12-08 12:27:13 -08:00 |
Yineng Zhang | f62055b528 | minor: add random flashinfer vs triton use case (#2409) | 2024-12-09 04:15:21 +08:00 |
Yineng Zhang | 74bc9184c3 | minor: add random use case (#2408) | 2024-12-09 03:21:35 +08:00 |
Yineng Zhang | 0f8eb15323 | feat: support custom task runner (#2407) | 2024-12-09 02:29:55 +08:00 |
Yineng Zhang | 67470bbb28 | minor: update correct measurement unit (#2406) | 2024-12-08 20:55:04 +08:00 |
Lianmin Zheng | cc858953a0 | Fix recv_requests (#2405) | 2024-12-08 04:08:04 -08:00 |
Yineng Zhang | 6128f7cff5 | fix: specify dtype with begin_forward aka plan (#2404) | 2024-12-08 20:07:30 +08:00 |
Lianmin Zheng | a2486eb58f | Fix a bug with logprob streaming + chunked prefill (#2403) | 2024-12-08 03:55:27 -08:00 |
Ke Bao | 61dec545b0 | Remove unused vars in the triton backend (#2401) | 2024-12-08 03:37:03 -08:00 |
Lianmin Zheng | 96db0f666d | Update killall_sglang.sh (#2397) | 2024-12-08 01:56:26 -08:00 |
Ke Bao | 7dc66fcb40 | Optimize Triton decoding kernel for long context (#2394) | 2024-12-08 01:17:37 -08:00 |
SangBin Cho | 1f09e84b9a | nit: Remove busy waiting on scheduler (#2382) | 2024-12-08 01:06:15 -08:00 |
Sangchun Ha (Patrick) | 63dfab1bea | Fix shape error that occurred when loading lora weight of gemma2 model. (#2330) | 2024-12-08 01:04:08 -08:00 |
Byron Hsu | ef995dae1e | [router] Health check on worker before adding to the router (#2392) | 2024-12-07 15:39:54 -08:00 |
Yineng Zhang | 75ae968959 | minor: update killall script (#2391) | 2024-12-08 04:21:00 +08:00 |
HAI | 95f93f493a | Fp8 MoE optimizations on AMD (#2388) | 2024-12-07 21:18:26 +08:00 |
Yineng Zhang | aaac33fd8d | fix: update xgrammar v0.1.6 (#2390) | 2024-12-07 21:09:16 +08:00 |
Yineng Zhang | d332aa3b0c | fix: resolve fp8 moe issue (#2387) | 2024-12-07 19:28:53 +08:00 |
Byron Hsu | c36736c841 | [router] Add remove worker api (#2380) | 2024-12-06 17:16:03 -08:00 |
Byron Hsu | 1bf9e34745 | [router] add remove tenant method in the radix tree (#2379) | 2024-12-06 11:53:15 -08:00 |
Byron Hsu | 499c85f131 | [Router] remove duplicate char count (#2378) | 2024-12-06 11:26:07 -08:00 |
Lianmin Zheng | e5f227c0ee | Release v0.4.0.post1 (#2375) | 2024-12-06 06:08:19 -08:00 |
Lianmin Zheng | 0e7409adb6 | Fix the overlap for xgrammar (#2377) | 2024-12-06 05:49:29 -08:00 |
vchzls | 3cde5eb629 | docs: Improve instructions for supporting new models (#2363) Co-authored-by: zhaohoulong <zhaohoulong@xiaomi.com> | 2024-12-06 04:27:17 -08:00 |
Lianmin Zheng | f5b2a3aa67 | Use proc.join instead of busy waiting (#2374) | 2024-12-06 02:01:23 -08:00 |
Yineng Zhang | f68175967c | docs: update adoption (Meituan) (#2373) | 2024-12-06 01:59:26 -08:00 |
Byron Hsu | 67b657945a | [router] support /add_worker api (#2369) | 2024-12-06 01:17:04 -08:00 |