Commit Graph

115 Commits

Author
SHA1 Message Date
Lianmin Zheng
4936be8acc Revert "Revert "[FEAT] Support GGUF format"" (#2287) 2024-11-30 22:14:48 -08:00
Lianmin Zheng
7e4c6dd8da Revert "[FEAT] Support GGUF format" (#2285) 2024-11-30 19:03:26 -08:00
Yang Zheng
883c955489 [FEAT] Support GGUF format (#2215)
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
2024-11-30 00:44:48 -08:00
Ying Sheng
8b48496aaf Revert "Revert "Add simple CPU offloading support"" (#2253)
Co-authored-by: Jani Monoses <jani.monoses@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-28 23:58:54 -08:00
Ying Sheng
4057ea82c9 Revert "Add simple CPU offloading support" (#2252)
We'll re-add the commit to correctly ack Kaichao's authorship
2024-11-28 23:36:55 -08:00
Lianmin Zheng
d4fc1a70e3 Crash the server correctly during error (#2231) 2024-11-28 00:22:39 -08:00
Lianmin Zheng
fed4c6946a Release v0.3.6.post2 (#2214)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-11-27 03:35:30 -08:00
Lianmin Zheng
fb6e04a0c2 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2222) 2024-11-27 02:52:46 -08:00
Lianmin Zheng
6997e28f6e Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" (#2221) 2024-11-27 02:02:01 -08:00
Lianmin Zheng
a0e58740a8 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2217) 2024-11-27 01:13:41 -08:00
HAI
10189d08dd [Performance]: Process affinity to CPU cores with multiple sockets support (#2171) 2024-11-25 14:57:32 -08:00
Lianmin Zheng
8e1adb8441 Allow overwrite flashinfer use_tensorcore (#2169) 2024-11-24 20:58:17 -08:00
Yineng Zhang
e3938b2f9c feat: update other MoE models deps (#2156) 2024-11-24 21:36:34 +08:00
Yineng Zhang
b509db5832 feat: remove the dependency on FusedMoE (#2153) 2024-11-24 20:09:27 +08:00
Jani Monoses
d98fa1e93d Add simple CPU offloading support. (#2081) 2024-11-23 06:23:53 +00:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Yineng Zhang
766192610e feat: update torch 2.5.1 (#2069) 2024-11-18 21:29:13 +08:00
Lianmin Zheng
df7fe4521a Crash the CI jobs on model import errors (#2072) 2024-11-17 22:18:11 -08:00
Lianmin Zheng
11f881d173 Deprecate --disable-flashinfer and --disable-flashinfer-sampling (#2065) 2024-11-17 16:20:58 -08:00
Lianmin Zheng
38625e2139 Remove monkey_patch_vllm_dummy_weight_loader (#2064) 2024-11-17 15:48:12 -08:00
Lianmin Zheng
c1f401fc58 Revert "chore: update torch v2.5.1" (#2063) 2024-11-17 15:29:38 -08:00
Yineng Zhang
3b878863f7 chore: update torch v2.5.1 (#1849) 2024-11-18 00:06:00 +08:00
Lianmin Zheng
f719d9aebc Launch dp ranks in parallel (#2053)
Co-authored-by: Haotian Liu <6631389+haotian-liu@users.noreply.github.com>
2024-11-16 17:39:39 -08:00
HAI
2ffe0a7363 Add get_amdgpu_memory_capacity() (#2049) 2024-11-15 22:51:48 -08:00
Lianmin Zheng
b01df48cf2 [Fix] Adjust default chunked prefill size and cuda graph max bs according to GPU memory capacity (#2044) 2024-11-15 06:21:57 -08:00
Lianmin Zheng
1929c06762 Simplify prometheus metrics (#1981)
Co-authored-by: Mohit Reddy <mohitreddy1996@users.noreply.github.com>
2024-11-10 04:39:32 -08:00
Lianmin Zheng
9c939a3d8b Clean up metrics code (#1972) 2024-11-09 15:43:20 -08:00
Lianmin Zheng
a509552087 [minor] Improve code style and compatibility (#1961) 2024-11-08 02:19:41 -08:00
Lianmin Zheng
0abbf289a8 Unify the model type checking (#1905) 2024-11-03 12:25:39 -08:00
Lianmin Zheng
86fc0d79d0 Add a watch dog thread (#1816) 2024-10-27 02:00:50 -07:00
Liangsheng Yin
a628dd8e31 Set ZMQ buffer size heuristic (#1801) 2024-10-25 23:15:56 -07:00
Liangsheng Yin
1e8903414a Fix possible ZMQ hanging (#1800) 2024-10-25 23:07:07 -07:00
Liangsheng Yin
94cde10920 Llama3.2 vision model support (#1551) 2024-10-21 15:01:21 -07:00
Yineng Zhang
cbbc82b7b8 Support qwen2 vl model (#1721)
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: ispobock <ISPObaoke@163.com>
2024-10-19 21:44:38 -07:00
Yineng Zhang
8bee20f80b Update vllm to 0.6.3 (#1711) (#1720)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
2024-10-19 20:45:41 -07:00
Zeng Zhongchao
2782132be8 Add date to logging messages (#1623) (#1679) 2024-10-16 18:54:55 -07:00
Michael Feil
b0facb3316 add orjson for jsonresponse (#1688) 2024-10-16 18:14:30 -07:00
Lianmin Zheng
9116b2896f Add a new event loop (#1677) 2024-10-16 01:33:20 -07:00
Ying Sheng
4876117171 [Fix] fix eos trim inconsistency (#1650) 2024-10-13 01:07:09 -07:00
Zhang, Liangang
8275049ce3 Add device support (#1607) 2024-10-11 02:05:58 -07:00
Ying Sheng
c5325aba75 [Profile] Add pytorch profiler (#1604) 2024-10-07 14:37:16 -07:00
Lianmin Zheng
ebbc42d989 Optimize broadcast & Reorg code (#1598) 2024-10-07 13:19:23 -07:00
Lianmin Zheng
6a5b352aaf Use is_flashinfer_available to replace is_hip for flashinfer check (#1596)
Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>
2024-10-06 22:54:05 -07:00
Lianmin Zheng
b6aad70ab1 [Fix] Fix the case where prompt_len = 0 (#1593) 2024-10-06 20:30:02 -07:00
Lianmin Zheng
9244f27f0a [Minor] Improve the style and fix flaky tests (#1584) 2024-10-06 00:10:48 -07:00
Lianmin Zheng
114bbc8651 Use ipc instead of tcp in zmq (#1566) 2024-10-04 00:45:52 -07:00
Xinyu Yang
acaffd233f [Fix] fix ipv6 url when warm up model (#1537) 2024-09-29 11:02:40 -07:00
Lianmin Zheng
048685430d Improve process creation (#1534) 2024-09-29 02:36:12 -07:00
Ying Sheng
9aa6553d2a [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B (#1525) 2024-09-27 23:32:11 -07:00
Yineng Zhang
b4408b0d16 feat: update linear deps 1/N (#1305) 2024-09-19 20:53:11 +08:00