Commit Graph

165 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Lianmin Zheng | a509552087 | [minor] Improve code style and compatibility (#1961) | 2024-11-08 02:19:41 -08:00 |
| Lianmin Zheng | 0abbf289a8 | Unify the model type checking (#1905) | 2024-11-03 12:25:39 -08:00 |
| Lianmin Zheng | b548801ddb | Update docs (#1839) | 2024-10-30 02:49:08 -07:00 |
| Lianmin Zheng | 86e0dde555 | Improve the user control of new_token_ratio (#1811) | 2024-10-26 16:39:41 -07:00 |
| Lianmin Zheng | 2b80978859 | Provide an argument to set the maximum batch size for cuda graph (#1809) | 2024-10-26 15:09:33 -07:00 |
| Lianmin Zheng | e646c5901e | Fix logprob in the overlapped mode (#1795) | 2024-10-25 11:06:57 -07:00 |
| yizhang2077 | def55bc876 | Qwen2vl support cuda graph and disable radix cache (#1780) | 2024-10-25 10:45:17 -04:00 |
| Lianmin Zheng | 86a2c473b7 | [Fix] Fix seq_lens_sum for cuda graph runner in padded cases (#1789) | 2024-10-24 21:26:05 -07:00 |
| Lianmin Zheng | 384d85ba35 | Re-introduce get_cuda_graph_seq_len_fill_value (#1783) | 2024-10-24 13:30:11 -07:00 |
| Lianmin Zheng | fc82f5a743 | [Fix] Fix cuda graph padding for triton attention backend (#1782) | 2024-10-24 12:33:15 -07:00 |
| Lianmin Zheng | 0089c4bc96 | [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer (#1779) | 2024-10-24 04:16:59 -07:00 |
| Lianmin Zheng | 05b3bf5e8e | Crash the server on warnings in CI (#1772) | 2024-10-23 16:27:13 -07:00 |
| Lianmin Zheng | ad4125d1a9 | Fuse more ops & Simplify token mapping (#1758) | 2024-10-22 23:20:43 -07:00 |
| Liangsheng Yin | 94cde10920 | Llama3.2 vision model support (#1551) | 2024-10-21 15:01:21 -07:00 |
| Lianmin Zheng | 09603c6dc9 | Maintain seq_lens_sum to make more FlashInfer operations non-blocking (#1741) | 2024-10-21 01:43:16 -07:00 |
| Lianmin Zheng | b121bc03a3 | Simplify batch result resolution (#1735) | 2024-10-20 19:47:14 -07:00 |
| yizhang2077 | 554fbf93cd | [Bugfix] qwen2vl forward_extend (#1727) | 2024-10-20 02:38:35 -07:00 |
| Lianmin Zheng | b48edff67f | Split the overlapped version of TpModelWorkerClient into a separate file (#1726) | 2024-10-20 00:29:29 -07:00 |
| Lianmin Zheng | 59cbf47626 | Unify the memory pool api and tp worker API (#1724) | 2024-10-19 23:19:26 -07:00 |
| Yineng Zhang | cbbc82b7b8 | Support qwen2 vl model (#1721) (Co-authored-by: yizhang2077 `<1109276519@qq.com>`, ispobock `<ISPObaoke@163.com>`) | 2024-10-19 21:44:38 -07:00 |
| Yineng Zhang | 8bee20f80b | Update vllm to 0.6.3 (#1711) (#1720) (Co-authored-by: Ke Bao `<ISPObaoke@163.com>`) | 2024-10-19 20:45:41 -07:00 |
| Lianmin Zheng | f0f8a7699b | Simplify the nan detection and greedy check in sampler (#1709) | 2024-10-18 20:21:24 -07:00 |
| Lianmin Zheng | 2bcfba1b08 | Skip unnecessary penalizer (#1707) | 2024-10-18 17:54:03 -07:00 |
| Lianmin Zheng | 392f2863c8 | Add dtype for more operations (#1705) | 2024-10-18 12:18:15 -07:00 |
| Lianmin Zheng | 6d0fa73ece | Simplify flashinfer utilities (#1704) | 2024-10-17 22:54:14 -07:00 |
| Shuo Yang | 061e546313 | Support double sparsity (#1459) | 2024-10-14 02:00:41 -07:00 |
| Lianmin Zheng | 9da5a60b18 | Add an option to disable penalizer (#1651) | 2024-10-12 17:53:23 -07:00 |
| Zhang, Liangang | 5d638c92f5 | [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480) | 2024-10-12 18:10:32 +00:00 |
| Lianmin Zheng | 23cc66f7b6 | Add back data parallelism (#1635) | 2024-10-11 07:22:48 -07:00 |
| Zhang, Liangang | 8275049ce3 | Add device support (#1607) | 2024-10-11 02:05:58 -07:00 |
| Amos You | c996e8ccd4 | [Minor] Fix logging typo (#1615) | 2024-10-08 21:11:19 -07:00 |
| Lianmin Zheng | 45473d4b2b | Make input_ids a torch.Tensor (#1568) | 2024-10-04 01:09:59 -07:00 |
| Lianmin Zheng | 32eb6e96f2 | Organize sampling batch info better (#1562) | 2024-10-03 18:29:49 -07:00 |
| Lianmin Zheng | 4ae0969c0a | Move status check in the memory pool to CPU (#1557) | 2024-10-02 18:23:35 -07:00 |
| Liangsheng Yin | 100f5b8bc9 | Simplify flashinfer dispatch (#1552) | 2024-10-01 00:28:42 -07:00 |
| Liangsheng Yin | 99ec439da4 | Organize Attention Backends (#1547) | 2024-09-30 15:54:18 -07:00 |
| Lianmin Zheng | 63ba2f8d7b | Clean up batch data structures: Introducing ModelWorkerBatch (#1544) | 2024-09-30 06:41:49 -07:00 |
| Lianmin Zheng | 36d5acfca5 | Rename InputMetadata -> ForwardBatch (#1543) | 2024-09-30 02:41:11 -07:00 |
| Lianmin Zheng | 3f0fe08d37 | Let ModelRunner take InputMetadata as input, instead of ScheduleBatch (#1541) | 2024-09-29 20:28:45 -07:00 |
| Lianmin Zheng | f86c1e611f | Move scheduler code from tp_worker.py to scheduler.py (#1538) | 2024-09-29 17:42:45 -07:00 |
| Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00 |
| Liangsheng Yin | fd9ad817ec | Organize image inputs (#1531) | 2024-09-29 06:28:55 +00:00 |
| Lianmin Zheng | 9ae1db0bdc | [Fix] Ignore import error (#1513) | 2024-09-25 11:32:21 -07:00 |
| Ke Bao | 8d4ed42ad5 | MoE torch compile (#1497) | 2024-09-24 01:46:59 -07:00 |
| Lianmin Zheng | 2854a5ea9f | Fix the overhead due to penalizer in bench_latency (#1496) | 2024-09-23 07:38:14 -07:00 |
| Lianmin Zheng | 39bb49d156 | Update dockerfile to include datamodel_code_generator (#1492) | 2024-09-22 04:49:16 -07:00 |
| Lianmin Zheng | 2d346a57c2 | Fix padding in the cuda graph (#1469) | 2024-09-19 01:52:15 -07:00 |
| Lianmin Zheng | 7f24ea95c3 | Fuse top_k and top_k in the sampler (#1457) | 2024-09-18 04:35:35 -07:00 |
| Ke Bao | b3710d2c93 | Fix attention backend (#1448) | 2024-09-17 14:07:53 +00:00 |
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |