Commit Graph

1120 Commits

Author  SHA1  Message  Date
Ke Wen  ece724910a  Make torch TP composable with torchao (#2436)  2024-12-11 04:21:42 -08:00
Ying Sheng  8586b72da0  [feat] Enable chunked prefill for llava-onevision (#2412)  2024-12-09 09:52:38 -08:00
Lianmin Zheng  641b7d0ae0  [Minor] Improve code style (#2422)  2024-12-09 06:30:35 -08:00
Lianmin Zheng  0ce091a82d  [Minor] Improve code style (#2419)  2024-12-09 03:05:59 -08:00
Lianmin Zheng  835f8afc77  Migrate llama_classification to use the /classify interface (#2417)  2024-12-08 23:30:51 -08:00
Xiaoyu Zhang  3844feb9bb  Add a unittest for fused_moe (#2416)  2024-12-08 22:46:10 -08:00
Byron Hsu  27f7bed7a7  reduce watchdog interval to 5s (#2410)  2024-12-08 21:17:31 -08:00
Lianmin Zheng  a6ca736c8e  Simplify stream_output (#2398)  2024-12-08 12:27:13 -08:00
Lianmin Zheng  cc858953a0  Fix recv_requests (#2405)  2024-12-08 04:08:04 -08:00
Yineng Zhang  6128f7cff5  fix: specify dtype with begin_forward aka plan (#2404)  2024-12-08 20:07:30 +08:00
Lianmin Zheng  a2486eb58f  Fix a bug with logprob streaming + chunked prefill (#2403)  2024-12-08 03:55:27 -08:00
Ke Bao  61dec545b0  Remove unused vars in the triton backend (#2401)  2024-12-08 03:37:03 -08:00
Ke Bao  7dc66fcb40  Optimize Triton decoding kernel for long context (#2394)  2024-12-08 01:17:37 -08:00
SangBin Cho  1f09e84b9a  nit: Remove busy waiting on scheduler (#2382)  2024-12-08 01:06:15 -08:00
Sangchun Ha (Patrick)  63dfab1bea  Fix shape error that occurred when loading lora weight of gemma2 model. (#2330)  2024-12-08 01:04:08 -08:00
Yineng Zhang  75ae968959  minor: update killall script (#2391)  2024-12-08 04:21:00 +08:00
HAI  95f93f493a  Fp8 MoE optimizations on AMD (#2388)  2024-12-07 21:18:26 +08:00
Yineng Zhang  aaac33fd8d  fix: update xgrammar v0.1.6 (#2390)  2024-12-07 21:09:16 +08:00
Yineng Zhang  d332aa3b0c  fix: resolve fp8 moe issue (#2387)  2024-12-07 19:28:53 +08:00
Lianmin Zheng  e5f227c0ee  Release v0.4.0.post1 (#2375)  2024-12-06 06:08:19 -08:00
Lianmin Zheng  0e7409adb6  Fix the overlap for xgrammar (#2377)  2024-12-06 05:49:29 -08:00
Lianmin Zheng  f5b2a3aa67  Use proc.join instead of busy waiting (#2374)  2024-12-06 02:01:23 -08:00
Qun Yang  37ee906f61  Add more support for intel Gaudi accelerators (#2357)  2024-12-06 01:16:33 -08:00
Xiaoyu Zhang  34b364e073  optimize cuda graph max_bs_settings on low-end gpus (#2360)  2024-12-06 01:13:04 -08:00
Yineng Zhang  84d96b3ae5  Move FP8 to SGLang (#2370) (Co-authored-by: HaiShaw <hixiao@gmail.com>)  2024-12-06 15:42:10 +08:00
xiaobochen  3d32e4a32c  Resubmit MoE-EP (#2371)  2024-12-06 15:05:21 +08:00
Lianmin Zheng  71e2a27753  Fix the cuda graph capture range for small #max-running-requests (#2359)  2024-12-06 14:13:57 +08:00
Ke Bao  4a63c181f1  Fix AWQ with enable MLA (#2364)  2024-12-06 00:46:48 +08:00
Lianmin Zheng  2b0fc5941d  [Minor] Code style improvements (#2355)  2024-12-04 19:02:08 -08:00
Jerry Zhang  9cc733b38c  move apply_torchao_config_ to model_runner (#2342)  2024-12-04 17:26:42 -08:00
Ke Wen  d693ec0427  Make torch TP composable with torch.compile (#2352)  2024-12-04 17:26:00 -08:00
Chayenne  786be44da5  Fix Docs CI When Compile Error (#2323)  2024-12-04 11:19:46 -08:00
Yineng Zhang  2db4469808  minor: limit the range of vllm versions (#2350)  2024-12-05 02:00:34 +08:00
Ata Fatahi  ed45e509df  Check gpu availability at server args creation (#2340) (Signed-off-by: Ata Fatahi <immrata@gmail.com>)  2024-12-05 01:53:02 +08:00
Ke Bao  ec52464dde  MLA prefill w/o weight absorption (#2349)  2024-12-05 01:50:28 +08:00
HAI  b2986d7aa5  Adding SGLang FP8 Utils (#2348)  2024-12-04 03:01:33 -08:00
Yineng Zhang  f8b0326934  chore: bump v0.4.0 (#2338)  2024-12-03 11:55:41 -08:00
Lianmin Zheng  1228f7ca69  Fix gptq for moe layers (#2300) (Co-authored-by: root <me@zhyncs.com>)  2024-12-03 23:12:33 +08:00
Lianmin Zheng  07ec07ad1f  Improve torch compile for fused moe (#2327)  2024-12-03 01:58:25 -08:00
Ying Sheng  aa47f64223  Revert "[feat] Enable chunked prefill for llava-onevision" (#2329)  2024-12-02 23:11:13 -08:00
Lianmin Zheng  3ddb1c4679  [Minor] Fix logger and style (#2325)  2024-12-02 20:45:53 -08:00
Ying Sheng  480e38a733  [feat] Enable chunked prefill for llava-onevision (#2281)  2024-12-02 20:19:02 -08:00
HAI  69e2d4fb66  Relax to include more AMD GPUs (#2319)  2024-12-02 19:05:58 -08:00
Yineng Zhang  85e1a6f3aa  Update model_loader deps and qqq quantization deps (#2220) (#2318) (Co-authored-by: HandH1998 <1335248067@qq.com>)  2024-12-02 23:22:13 +08:00
Lianmin Zheng  18108abe5d  [Minor] Fix code style (#2311)  2024-12-02 02:27:36 -08:00
HAI  c54bda300a  Use rocminfo instead of rocm-smi for more OS/WSL support (#2310)  2024-12-02 00:15:45 -08:00
Lianmin Zheng  3c79ad35ca  [Fix] Fix the padded hash value for image tokens (#2309)  2024-12-01 23:36:28 -08:00
Chayenne  983bfcf386  Online weight updates from torch.distributed (#2279)  2024-12-01 23:23:18 -08:00
Lianmin Zheng  5c18a03733  Fix logprob for completions (#2301)  2024-12-01 05:17:05 -08:00
Qun Yang  62c516ac45  Add a simple torch native attention backend (#2241)  2024-12-01 03:01:25 -08:00