| Author | Commit | Message | Date |
|---|---|---|---|
| Hui Liu | 5ce9daea59 | ROCm support for sglang.check_env (#2426) | 2024-12-17 03:45:14 -08:00 |
| Lianmin Zheng | bd6196163e | Small fix for the order of apply_torchao_config (#2495) | 2024-12-16 19:21:11 -08:00 |
| Lianmin Zheng | 56198b45d9 | Add a benchmark script for in-batch prefix caching (#2494) | 2024-12-16 18:49:02 -08:00 |
| Lianmin Zheng | ba36b5520a | Revert "Small fixes for torchao quant" (#2493) | 2024-12-16 15:04:16 -08:00 |
| Lianmin Zheng | 7a1aecb938 | Simplify pytorch sampling kernel and logit processor (#2491) | 2024-12-16 14:11:09 -08:00 |
| Jerry Zhang | 82699474fd | Small fixes for torchao quant (#2476) | 2024-12-16 14:08:12 -08:00 |
| xiaobochen | b532a5fd16 | fix moe-ep accuracy issue for fp8 (#2489) | 2024-12-16 20:54:02 +08:00 |
| Yineng Zhang | 5f2595be43 | hotfix: checking for HIP (#2485) | 2024-12-15 02:47:26 +08:00 |
| Ke Bao | 0ba2c58947 | Remove cuda graph batch size adjustment for dp attention (#2484) | 2024-12-14 23:53:54 +08:00 |
| Ke Bao | 2f9bd0fafd | Fix correctness issue for triton decoding kernel (#2479) | 2024-12-14 16:50:54 +08:00 |
| Lianmin Zheng | 5282a4735f | [Minor] Fix grok model loader (#2473) | 2024-12-12 14:34:47 -08:00 |
| SangBin Cho | 9208618b3e | [Core] in batch prefix caching by delay scheduling (#2442) | 2024-12-11 12:51:50 -08:00 |
| Fred Reiss | 993956c6b1 | Add support for IBM Granite 3.x models (#2437) | 2024-12-11 06:30:23 -08:00 |
| Lianmin Zheng | f8548295d6 | Fix warmup in bench_offline_throughput.py (#2449) | 2024-12-11 06:16:01 -08:00 |
| Lianmin Zheng | 959735fc9e | Fix model loader for more quantization formats (#2448) | 2024-12-11 05:21:23 -08:00 |
| Yineng Zhang | 626a99ac13 | chore: update ao v0.7.0 (#2447) | 2024-12-11 04:44:28 -08:00 |
| Ke Wen | ece724910a | Make torch TP composable with torchao (#2436) | 2024-12-11 04:21:42 -08:00 |
| Ying Sheng | 8586b72da0 | [feat] Enable chunked prefill for llava-onevision (#2412) | 2024-12-09 09:52:38 -08:00 |
| Lianmin Zheng | 641b7d0ae0 | [Minor] Improve code style (#2422) | 2024-12-09 06:30:35 -08:00 |
| Lianmin Zheng | 0ce091a82d | [Minor] Improve code style (#2419) | 2024-12-09 03:05:59 -08:00 |
| Lianmin Zheng | 835f8afc77 | Migrate llama_classification to use the /classify interface (#2417) | 2024-12-08 23:30:51 -08:00 |
| Xiaoyu Zhang | 3844feb9bb | Add a unittest for fused_moe (#2416) | 2024-12-08 22:46:10 -08:00 |
| Byron Hsu | 27f7bed7a7 | reduce watchdog interval to 5s (#2410) | 2024-12-08 21:17:31 -08:00 |
| Lianmin Zheng | a6ca736c8e | Simplify stream_output (#2398) | 2024-12-08 12:27:13 -08:00 |
| Lianmin Zheng | cc858953a0 | Fix recv_requests (#2405) | 2024-12-08 04:08:04 -08:00 |
| Yineng Zhang | 6128f7cff5 | fix: specify dtype with begin_forward aka plan (#2404) | 2024-12-08 20:07:30 +08:00 |
| Lianmin Zheng | a2486eb58f | Fix a bug with logprob streaming + chunked prefill (#2403) | 2024-12-08 03:55:27 -08:00 |
| Ke Bao | 61dec545b0 | Remove unused vars in the triton backend (#2401) | 2024-12-08 03:37:03 -08:00 |
| Ke Bao | 7dc66fcb40 | Optimize Triton decoding kernel for long context (#2394) | 2024-12-08 01:17:37 -08:00 |
| SangBin Cho | 1f09e84b9a | nit: Remove busy waiting on scheduler (#2382) | 2024-12-08 01:06:15 -08:00 |
| Sangchun Ha (Patrick) | 63dfab1bea | Fix shape error that occurred when loading lora weight of gemma2 model. (#2330) | 2024-12-08 01:04:08 -08:00 |
| Yineng Zhang | 75ae968959 | minor: update killall script (#2391) | 2024-12-08 04:21:00 +08:00 |
| HAI | 95f93f493a | Fp8 MoE optimizations on AMD (#2388) | 2024-12-07 21:18:26 +08:00 |
| Yineng Zhang | aaac33fd8d | fix: update xgrammar v0.1.6 (#2390) | 2024-12-07 21:09:16 +08:00 |
| Yineng Zhang | d332aa3b0c | fix: resolve fp8 moe issue (#2387) | 2024-12-07 19:28:53 +08:00 |
| Lianmin Zheng | e5f227c0ee | Release v0.4.0.post1 (#2375) | 2024-12-06 06:08:19 -08:00 |
| Lianmin Zheng | 0e7409adb6 | Fix the overlap for xgrammar (#2377) | 2024-12-06 05:49:29 -08:00 |
| Lianmin Zheng | f5b2a3aa67 | Use proc.join instead of busy waiting (#2374) | 2024-12-06 02:01:23 -08:00 |
| Qun Yang | 37ee906f61 | Add more support for intel Gaudi accelerators (#2357) | 2024-12-06 01:16:33 -08:00 |
| Xiaoyu Zhang | 34b364e073 | optimize cuda graph max_bs_settings on low-end gpus (#2360) | 2024-12-06 01:13:04 -08:00 |
| Yineng Zhang | 84d96b3ae5 | Move FP8 to SGLang (#2370) (Co-authored-by: HaiShaw <hixiao@gmail.com>) | 2024-12-06 15:42:10 +08:00 |
| xiaobochen | 3d32e4a32c | Resubmit MoE-EP (#2371) | 2024-12-06 15:05:21 +08:00 |
| Lianmin Zheng | 71e2a27753 | Fix the cuda graph capture range for small #max-running-requests (#2359) | 2024-12-06 14:13:57 +08:00 |
| Ke Bao | 4a63c181f1 | Fix AWQ with enable MLA (#2364) | 2024-12-06 00:46:48 +08:00 |
| Lianmin Zheng | 2b0fc5941d | [Minor] Code style improvements (#2355) | 2024-12-04 19:02:08 -08:00 |
| Jerry Zhang | 9cc733b38c | move apply_torchao_config_ to model_runner (#2342) | 2024-12-04 17:26:42 -08:00 |
| Ke Wen | d693ec0427 | Make torch TP composable with torch.compile (#2352) | 2024-12-04 17:26:00 -08:00 |
| Chayenne | 786be44da5 | Fix Docs CI When Compile Error (#2323) | 2024-12-04 11:19:46 -08:00 |
| Yineng Zhang | 2db4469808 | minor: limit the range of vllm versions (#2350) | 2024-12-05 02:00:34 +08:00 |
| Ata Fatahi | ed45e509df | Check gpu availability at server args creation (#2340) (Signed-off-by: Ata Fatahi <immrata@gmail.com>) | 2024-12-05 01:53:02 +08:00 |