Commit Graph

  • 21e9e63ad5 Print progress bar during cuda graph capture (#2502) Lianmin Zheng 2024-12-17 06:20:44 -08:00
  • 1fc84cf60b Update readme (#2500) Lianmin Zheng 2024-12-17 04:33:36 -08:00
  • 361ea8d912 Fix openai protocols and pass top_k, min_p (#2499) Lianmin Zheng 2024-12-17 04:14:14 -08:00
  • 33c5ff2845 Add lora_path to chat completion (#2438) Lei 2024-12-17 03:47:49 -08:00
  • 5ce9daea59 ROCm support for sglang.check_env (#2426) Hui Liu 2024-12-17 03:45:14 -08:00
  • ce094a5d79 Clean up GPU memory after killing sglang processes (#2457) Ata Fatahi 2024-12-17 06:42:40 -05:00
  • e21026690d benchmark decoding attention kernel with cudnn (#2467) bjmsong 2024-12-17 19:31:57 +08:00
  • bd6196163e Small fix for the order of apply_torchao_config (#2495) Lianmin Zheng 2024-12-16 19:21:11 -08:00
  • 56198b45d9 Add a benchmark script for in-batch prefix caching (#2494) Lianmin Zheng 2024-12-16 18:49:02 -08:00
  • ba36b5520a Revert "Small fixes for torchao quant" (#2493) Lianmin Zheng 2024-12-16 15:04:16 -08:00
  • 9cd9dc83b3 Temporarily disable unit test of torch native attention backend (#2492) Lianmin Zheng 2024-12-16 14:17:27 -08:00
  • 7a1aecb938 Simplify pytorch sampling kernel and logit processor (#2491) Lianmin Zheng 2024-12-16 14:11:09 -08:00
  • 82699474fd Small fixes for torchao quant (#2476) Jerry Zhang 2024-12-16 14:08:12 -08:00
  • 7154b4b1df minor: update flashinfer nightly (#2490) Yineng Zhang 2024-12-16 23:02:49 +08:00
  • b532a5fd16 fix moe-ep accuracy issue for fp8 (#2489) xiaobochen 2024-12-16 20:54:02 +08:00
  • a0592c059f [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm (#2486) Xiaoyu Zhang 2024-12-15 13:52:08 +08:00
  • e8dbdf75bc fix typo (#2487) Yineng Zhang 2024-12-15 13:44:55 +08:00
  • e04d3f2897 adapt tensorrt llm custom all reduce to sgl-kernel (#2481) yizhang2077 2024-12-15 13:15:59 +08:00
  • 5f2595be43 hotfix: checking for HIP (#2485) Yineng Zhang 2024-12-15 02:47:26 +08:00
  • 0ba2c58947 Remove cuda graph batch size adjustment for dp attention (#2484) Ke Bao 2024-12-14 23:53:54 +08:00
  • fccbfa3752 format: add clang-format for sgl-kernel (#2483) Yineng Zhang 2024-12-14 22:36:04 +08:00
  • 2f9bd0fafd Fix correctness issue for triton decoding kernel (#2479) Ke Bao 2024-12-14 16:50:54 +08:00
  • 5282a4735f [Minor] Fix grok model loader (#2473) Lianmin Zheng 2024-12-12 14:34:47 -08:00
  • f0ed9c353e feat: support dev image (#2469) Yineng Zhang 2024-12-13 02:23:52 +08:00
  • e3b3acfa6f Rename rust folder to sgl-router (#2464) Ata Fatahi 2024-12-12 12:40:41 -05:00
  • 2673fa29d4 fix: set runtime path (#2466) Yineng Zhang 2024-12-12 18:05:48 +08:00
  • dedaf8cd48 minor: update pypi tag (#2463) Yineng Zhang 2024-12-12 15:21:45 +08:00
  • 32ed016041 chore: bump v0.0.2 for sgl-kernel (#2462) Yineng Zhang 2024-12-12 14:58:05 +08:00
  • 6efa9e4a6d Bump sglang-router to 0.1.1 (#2459) Ata Fatahi 2024-12-11 20:40:03 -05:00
  • 7791fd9948 Include version info into the router package (#2456) Ata Fatahi 2024-12-11 20:31:20 -05:00
  • 2ac36b9a7b Make request payload size configurable (#2444) Ata Fatahi 2024-12-11 19:55:21 -05:00
  • 2d60a5ee75 Update v0.1.0.md Byron Hsu 2024-12-11 13:48:18 -08:00
  • 2e4a5907c9 [router] Release router 0.1.0 with dynamic scaling and fault tolerance (#2455) Byron Hsu 2024-12-11 13:42:35 -08:00
  • c0ee46fe10 [router] Update doc for dynamic scaling and fault tolerance (#2454) Byron Hsu 2024-12-11 13:11:42 -08:00
  • 9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) SangBin Cho 2024-12-11 12:51:50 -08:00
  • 864bf2ba00 [router] remove main.rs because only lib.rs is used for py binding (#2453) Byron Hsu 2024-12-11 12:13:19 -08:00
  • a4cca7fc53 [router] Add retries based fault tolerance (#2452) Byron Hsu 2024-12-11 12:13:08 -08:00
  • 993956c6b1 Add support for IBM Granite 3.x models (#2437) Fred Reiss 2024-12-11 06:30:23 -08:00
  • f8548295d6 Fix warmup in bench_offline_throughput.py (#2449) Lianmin Zheng 2024-12-11 06:16:01 -08:00
  • 959735fc9e Fix model loader for more quantization formats (#2448) Lianmin Zheng 2024-12-11 05:21:23 -08:00
  • f67723940d decoding attention kernel benchmark (#2425) bjmsong 2024-12-11 20:46:59 +08:00
  • 626a99ac13 chore: update ao v0.7.0 (#2447) Yineng Zhang 2024-12-11 20:44:28 +08:00
  • ece724910a Make torch TP composable with torchao (#2436) Ke Wen 2024-12-11 04:21:42 -08:00
  • 0fb88aaa77 [router] Use borrow if possible to save cost (#2441) Byron Hsu 2024-12-11 01:38:50 -08:00
  • d4de9a6235 [router] Refactor: decouple select and send stage (#2440) Byron Hsu 2024-12-11 00:51:21 -08:00
  • 7310aede97 fix: compatible with PEP 440 (#2435) Yineng Zhang 2024-12-11 06:48:45 +08:00
  • 5de9a58eca fix: use manylinux2014_x86_64 tag (#2434) Yineng Zhang 2024-12-11 06:17:41 +08:00
  • 56fcd8e8a5 feat: support sgl-kernel PyPI (#2433) Yineng Zhang 2024-12-11 06:06:19 +08:00
  • 2b340adfb1 Typo fix in router.md (#2424) Adarsh Shirawalmath 2024-12-10 11:19:40 +05:30
  • 8586b72da0 [feat] Enable chunked prefill for llava-onevision (#2412) Ying Sheng 2024-12-09 09:52:38 -08:00
  • 641b7d0ae0 [Minor] Improve code style (#2422) Lianmin Zheng 2024-12-09 06:30:35 -08:00
  • 0ce091a82d [Minor] Improve code style (#2419) Lianmin Zheng 2024-12-09 03:05:59 -08:00
  • 835f8afc77 Migrate llama_classification to use the /classify interface (#2417) Lianmin Zheng 2024-12-08 23:30:51 -08:00
  • 3844feb9bb Add a unittest for fused_moe (#2416) Xiaoyu Zhang 2024-12-09 14:46:10 +08:00
  • 27f7bed7a7 reduce watchdog interval to 5s (#2410) Byron Hsu 2024-12-08 21:17:31 -08:00
  • 6387098f5f [router] add health checking in router init (#2393) Byron Hsu 2024-12-08 17:17:37 -08:00
  • 2a717c5078 [Router] fix interrupt from terminal (#2413) Byron Hsu 2024-12-08 16:58:41 -08:00
  • a1e697b25b [router] Improve cleanup logic (#2411) Byron Hsu 2024-12-08 15:24:02 -08:00
  • a6ca736c8e Simplify stream_output (#2398) Lianmin Zheng 2024-12-08 12:27:13 -08:00
  • f62055b528 minor: add random flashinfer vs triton use case (#2409) Yineng Zhang 2024-12-09 04:15:21 +08:00
  • 74bc9184c3 minor: add random use case (#2408) Yineng Zhang 2024-12-09 03:21:35 +08:00
  • 0f8eb15323 feat: support custom task runner (#2407) Yineng Zhang 2024-12-09 02:29:55 +08:00
  • 67470bbb28 minor: update correct measurement unit (#2406) Yineng Zhang 2024-12-08 20:55:04 +08:00
  • cc858953a0 Fix recv_requests (#2405) Lianmin Zheng 2024-12-08 04:08:04 -08:00
  • 6128f7cff5 fix: specify dtype with begin_forward aka plan (#2404) Yineng Zhang 2024-12-08 20:07:30 +08:00
  • a2486eb58f Fix a bug with logprob streaming + chunked prefill (#2403) Lianmin Zheng 2024-12-08 03:55:27 -08:00
  • 61dec545b0 Remove unused vars in the triton backend (#2401) Ke Bao 2024-12-08 19:37:03 +08:00
  • 96db0f666d Update killall_sglang.sh (#2397) Lianmin Zheng 2024-12-08 01:56:26 -08:00
  • 7dc66fcb40 Optimize Triton decoding kernel for long context (#2394) Ke Bao 2024-12-08 17:17:37 +08:00
  • 1f09e84b9a nit: Remove busy waiting on scheduler (#2382) SangBin Cho 2024-12-08 01:06:15 -08:00
  • 63dfab1bea Fix shape error that occurred when loading lora weight of gemma2 model. (#2330) Sangchun Ha (Patrick) 2024-12-08 18:04:08 +09:00
  • ef995dae1e [router] Health check on worker before adding to the router (#2392) Byron Hsu 2024-12-07 15:39:54 -08:00
  • 75ae968959 minor: update killall script (#2391) Yineng Zhang 2024-12-08 04:21:00 +08:00
  • 95f93f493a Fp8 MoE optimizations on AMD (#2388) HAI 2024-12-07 05:18:26 -08:00
  • aaac33fd8d fix: update xgrammar v0.1.6 (#2390) Yineng Zhang 2024-12-07 21:09:16 +08:00
  • d332aa3b0c fix: resolve fp8 moe issue (#2387) Yineng Zhang 2024-12-07 19:28:53 +08:00
  • c36736c841 [router] Add remove worker api (#2380) Byron Hsu 2024-12-06 17:16:03 -08:00
  • 1bf9e34745 [router] add remove tenant method in the radix tree (#2379) Byron Hsu 2024-12-06 11:53:15 -08:00
  • 499c85f131 [Router] remove duplicate char count (#2378) Byron Hsu 2024-12-06 11:26:07 -08:00
  • e5f227c0ee Release v0.4.0.post1 (#2375) Lianmin Zheng 2024-12-06 06:08:19 -08:00
  • 0e7409adb6 Fix the overlap for xgrammar (#2377) Lianmin Zheng 2024-12-06 05:49:29 -08:00
  • 3cde5eb629 docs: Improve instructions for supporting new models (#2363) vchzls 2024-12-06 20:27:17 +08:00
  • f5b2a3aa67 Use proc.join instead of busy waiting (#2374) Lianmin Zheng 2024-12-06 02:01:23 -08:00
  • f68175967c docs: update adoption (Meituan) (#2373) Yineng Zhang 2024-12-06 17:59:26 +08:00
  • 67b657945a [router] support /add_worker api (#2369) Byron Hsu 2024-12-06 01:17:04 -08:00
  • 37ee906f61 Add more support for intel Gaudi accelerators (#2357) Qun Yang 2024-12-06 17:16:33 +08:00
  • 34b364e073 optimize cuda graph max_bs_settings on low-end gpus (#2360) Xiaoyu Zhang 2024-12-06 17:13:04 +08:00
  • 84d96b3ae5 Move FP8 to SGLang (#2370) Yineng Zhang 2024-12-06 15:42:10 +08:00
  • 3d32e4a32c Resubmit MoE-EP (#2371) xiaobochen 2024-12-06 15:05:21 +08:00
  • 64fceab8af [router] use 2-gpu-runner (#2368) Byron Hsu 2024-12-05 17:46:21 -08:00
  • 71e2a27753 Fix the cuda graph capture range for small #max-running-requests (#2359) Lianmin Zheng 2024-12-05 13:42:47 -08:00
  • 4a63c181f1 Fix AWQ with enable MLA (#2364) Ke Bao 2024-12-06 00:46:48 +08:00
  • 2b0fc5941d [Minor] Code style improvements (#2355) Lianmin Zheng 2024-12-04 19:02:08 -08:00
  • 9cc733b38c move apply_torchao_config_ to model_runner (#2342) Jerry Zhang 2024-12-04 17:26:42 -08:00
  • d693ec0427 Make torch TP composable with torch.compile (#2352) Ke Wen 2024-12-04 17:26:00 -08:00
  • 18ea841f40 Add Docs For SGLang Native Router (#2308) Chayenne 2024-12-04 15:41:22 -08:00
  • 786be44da5 Fix Docs CI When Compile Error (#2323) Chayenne 2024-12-04 11:19:46 -08:00
  • 2db4469808 minor: limit the range of vllm versions (#2350) Yineng Zhang 2024-12-05 02:00:34 +08:00
  • ed45e509df Check gpu availability at server args creation (#2340) Ata Fatahi 2024-12-04 09:53:02 -08:00
  • ec52464dde MLA prefill w/o weight absorption (#2349) Ke Bao 2024-12-05 01:50:28 +08:00
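A listing in the format above (abbreviated hash, subject, author, date) can be reproduced with `git log` using a custom pretty format. This is a minimal sketch, assuming a local checkout of the repository; the `--abbrev=10` width and bullet prefix are chosen here to match the layout above, not taken from any project tooling:

```shell
# Print recent commits as "  • <hash> <subject> <author> <date>",
# matching the 10-character abbreviated hashes shown above.
git log -n 100 --abbrev=10 \
  --date=format:'%Y-%m-%d %H:%M:%S %z' \
  --pretty=format:'  • %h %s %an %ad'
```

`%h` respects `--abbrev`, `%s` is the commit subject (which already contains the `(#NNNN)` PR reference when the commit was merged via a pull request), and `--date=format:` controls the timestamp layout.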