Commit Graph

1527 Commits

Author SHA1 Message Date
Lei
19ba2b0ea9 Add lora_paths to v1_chat_generate_request (#2529) 2024-12-22 02:23:33 -08:00
Yineng Zhang
4e1e3cff20 fix #2528 (#2541) 2024-12-22 00:14:41 +08:00
Yineng Zhang
8f4d04e540 chore: bump v0.4.0.post2 (#2525) 2024-12-21 21:16:34 +08:00
Jerry Zhang
feb2b768ba Add integration with gemlite weight only quant (#2528) 2024-12-21 00:25:25 +08:00
Yineng Zhang
d95a5f5bf5 fix followup #2517 (#2524) 2024-12-19 23:24:30 +08:00
Yineng Zhang
4b83db24f1 fix: continue to use flashinfer 0.1.6 temporarily (#2517) 2024-12-19 14:03:24 +08:00
Yineng Zhang
64456cf023 docs: update README (#2516) 2024-12-19 13:44:02 +08:00
Yineng Zhang
bb4a922023 feat: add llama3 eval (#2515) 2024-12-19 13:37:09 +08:00
Lianmin Zheng
21e9e63ad5 Print progress bar during cuda graph capture (#2502) 2024-12-17 06:33:46 -08:00
Lianmin Zheng
1fc84cf60b Update readme (#2500) 2024-12-17 04:33:36 -08:00
    Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
    Co-authored-by: "yixin-huang1" <yixinhuang1@berkeley.edu>
Lianmin Zheng
361ea8d912 Fix openai protocols and pass top_k, min_p (#2499) 2024-12-17 04:14:14 -08:00
Lei
33c5ff2845 Add lora_path to chat completion (#2438) 2024-12-17 03:47:49 -08:00
Hui Liu
5ce9daea59 ROCm support for sglang.check_env (#2426) 2024-12-17 03:45:14 -08:00
Ata Fatahi
ce094a5d79 Clean up GPU memory after killing sglang processes (#2457) 2024-12-17 03:42:40 -08:00
    Signed-off-by: Ata Fatahi <immrata@gmail.com>
bjmsong
e21026690d benchmark decoding attention kernel with cudnn (#2467) 2024-12-17 03:31:57 -08:00
    Co-authored-by: root <bjmsong@126.com>
Lianmin Zheng
bd6196163e Small fix for the order of apply_torchao_config (#2495) 2024-12-16 19:21:11 -08:00
Lianmin Zheng
56198b45d9 Add a benchmark script for in-batch prefix caching (#2494) 2024-12-16 18:49:02 -08:00
Lianmin Zheng
ba36b5520a Revert "Small fixes for torchao quant" (#2493) 2024-12-16 15:04:16 -08:00
Lianmin Zheng
9cd9dc83b3 Temporarily disable unit test of torch native attention backend (#2492) 2024-12-16 14:17:27 -08:00
Lianmin Zheng
7a1aecb938 Simplify pytorch sampling kernel and logit processor (#2491) 2024-12-16 14:11:09 -08:00
Jerry Zhang
82699474fd Small fixes for torchao quant (#2476) 2024-12-16 14:08:12 -08:00
Yineng Zhang
7154b4b1df minor: update flashinfer nightly (#2490) 2024-12-16 23:02:49 +08:00
xiaobochen
b532a5fd16 fix moe-ep accuracy issue for fp8 (#2489) 2024-12-16 20:54:02 +08:00
Xiaoyu Zhang
a0592c059f [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm (#2486) 2024-12-15 13:52:08 +08:00
Yineng Zhang
e8dbdf75bc fix typo (#2487) 2024-12-15 13:44:55 +08:00
yizhang2077
e04d3f2897 adapt tensorrt llm custom all reduce to sgl-kernel (#2481) 2024-12-15 13:15:59 +08:00
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
Yineng Zhang
5f2595be43 hotfix: checking for HIP (#2485) 2024-12-15 02:47:26 +08:00
Ke Bao
0ba2c58947 Remove cuda graph batch size adjustment for dp attention (#2484) 2024-12-14 23:53:54 +08:00
Yineng Zhang
fccbfa3752 format: add clang-format for sgl-kernel (#2483) 2024-12-14 22:36:04 +08:00
Ke Bao
2f9bd0fafd Fix correctness issue for triton decoding kernel (#2479) 2024-12-14 16:50:54 +08:00
Lianmin Zheng
5282a4735f [Minor] Fix grok model loader (#2473) 2024-12-12 14:34:47 -08:00
Yineng Zhang
f0ed9c353e feat: support dev image (#2469) 2024-12-13 02:23:52 +08:00
Ata Fatahi
e3b3acfa6f Rename rust folder to sgl-router (#2464) 2024-12-12 09:40:41 -08:00
    Signed-off-by: Ata Fatahi <immrata@gmail.com>
Yineng Zhang
2673fa29d4 fix: set runtime path (#2466) 2024-12-12 18:05:48 +08:00
Yineng Zhang
dedaf8cd48 minor: update pypi tag (#2463) 2024-12-12 15:21:45 +08:00
Yineng Zhang
32ed016041 chore: bump v0.0.2 for sgl-kernel (#2462) 2024-12-12 14:58:05 +08:00
Ata Fatahi
6efa9e4a6d Bump sglang-router to 0.1.1 (#2459) 2024-12-11 17:40:03 -08:00
    Signed-off-by: Ata Fatahi <immrata@gmail.com>
Ata Fatahi
7791fd9948 Include version info into the router package (#2456) 2024-12-11 17:31:20 -08:00
    Signed-off-by: Ata Fatahi <immrata@gmail.com>
    Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Ata Fatahi
2ac36b9a7b Make request payload size configurable (#2444) 2024-12-11 16:55:21 -08:00
    Signed-off-by: Ata Fatahi <immrata@gmail.com>
Byron Hsu
2d60a5ee75 Update v0.1.0.md 2024-12-11 13:48:18 -08:00
Byron Hsu
2e4a5907c9 [router] Release router 0.1.0 with dynamic scaling and fault tolerance (#2455) 2024-12-11 13:42:35 -08:00
Byron Hsu
c0ee46fe10 [router] Update doc for dynamic scaling and fault tolerance (#2454) 2024-12-11 13:11:42 -08:00
SangBin Cho
9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) 2024-12-11 12:51:50 -08:00
Byron Hsu
864bf2ba00 [router] remove main.rs because only lib.rs is used for py binding (#2453) 2024-12-11 12:13:19 -08:00
Byron Hsu
a4cca7fc53 [router] Add retries based fault tolerance (#2452) 2024-12-11 12:13:08 -08:00
Fred Reiss
993956c6b1 Add support for IBM Granite 3.x models (#2437) 2024-12-11 06:30:23 -08:00
Lianmin Zheng
f8548295d6 Fix warmup in bench_offline_throughput.py (#2449) 2024-12-11 06:16:01 -08:00
Lianmin Zheng
959735fc9e Fix model loader for more quantization formats (#2448) 2024-12-11 05:21:23 -08:00
bjmsong
f67723940d decoding attention kernel benchmark (#2425)
Co-authored-by: root <bjmsong@126.com>
2024-12-11 04:46:59 -08:00
Yineng Zhang
626a99ac13 chore: update ao v0.7.0 (#2447) 2024-12-11 04:44:28 -08:00