Commit Graph

180 Commits

Author SHA1 Message Date
Lianmin Zheng
4936be8acc Revert "Revert "[FEAT] Support GGUF format"" (#2287) 2024-11-30 22:14:48 -08:00
Lianmin Zheng
7e4c6dd8da Revert "[FEAT] Support GGUF format" (#2285) 2024-11-30 19:03:26 -08:00
Yang Zheng
883c955489 [FEAT] Support GGUF format (#2215) 2024-11-30 00:44:48 -08:00
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
Chayenne
7d1485d376 Add get weights by parameter name for llama (#2266) 2024-11-29 23:36:38 -08:00
Lianmin Zheng
afe1e46586 [Minor] fix the style for multimodal models (#2257) 2024-11-29 04:24:20 -08:00
Lianmin Zheng
f50a6cf443 Fix hash collision for multi modal models (#2256) 2024-11-29 03:15:58 -08:00
Ying Sheng
8b48496aaf Revert "Revert "Add simple CPU offloading support"" (#2253) 2024-11-28 23:58:54 -08:00
Co-authored-by: Jani Monoses <jani.monoses@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Ying Sheng
4057ea82c9 Revert "Add simple CPU offloading support" (#2252) 2024-11-28 23:36:55 -08:00
We'll re-add the commit to correctly ack Kaichao's authorship
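The two entries above (together with the later re-land in #2253) illustrate the common revert-then-reland pattern: a commit is reverted, and the revert itself is later reverted to restore the change while keeping history and authorship intact. A minimal sketch in a throwaway repository (all file names and identities here are hypothetical):

```shell
# Sketch of the revert-then-reland pattern from the log above,
# run inside a temporary throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "example"
git config user.email "example@example.com"

# Land the original change.
echo "cpu offloading" > feature.txt
git add feature.txt
git commit -q -m 'Add simple CPU offloading support'
feat=$(git rev-parse HEAD)

# Revert it (e.g. to re-land later with correct authorship).
git revert --no-edit "$feat" > /dev/null
test ! -f feature.txt   # the change is gone

# Revert the revert: the original change is restored.
git revert --no-edit "$(git rev-parse HEAD)" > /dev/null
test -f feature.txt     # the change is back
```

Reverting the revert, rather than cherry-picking the original commit, keeps a linear, auditable record of both the removal and the restoration.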
Ying Sheng
b7038fec9b [fix] Fix prefix caching for multi-image/video (#2239) 2024-11-28 12:08:13 -08:00
Jani Monoses
db674e3d24 Add OLMo2 model. (#2233) 2024-11-28 00:15:20 -08:00
Lianmin Zheng
dd5eba4c88 Remove fused_moe_grok (#2223) 2024-11-27 14:28:55 -08:00
Ying Sheng
37c8a5761f [feat] Support session control for vision language models (#2210) 2024-11-27 00:03:29 -08:00
Lianmin Zheng
be0124bda0 Rename triton_fused_moe -> fused_moe_triton (#2163) 2024-11-24 08:12:35 -08:00
Yineng Zhang
e3938b2f9c feat: update other MoE models deps (#2156) 2024-11-24 21:36:34 +08:00
Yineng Zhang
b509db5832 feat: remove the dependency on FusedMoE (#2153) 2024-11-24 20:09:27 +08:00
Jani Monoses
d98fa1e93d Add simple CPU offloading support. (#2081) 2024-11-23 06:23:53 +00:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Lianmin Zheng
dfec7fca06 Rename sglang.bench_latency to sglang.bench_one_batch (#2118) 2024-11-21 20:07:48 -08:00
James Xu
f6f713797b Add support for Qwen2-VL-based embedding models (#2055) 2024-11-21 14:24:25 -08:00
Jerry Zhang
7f8fcd39cd Turn off autotune for scaled mm for fp8 dynamic quant in torchao (#2116) 2024-11-21 12:19:49 -08:00
Jani Monoses
66318ffe96 Rename layer_idx to layer_id for consistency (#2078) 2024-11-18 13:00:02 -08:00
Lianmin Zheng
4af3f889fc Simplify flashinfer indices update for prefill (#2074) 2024-11-18 00:02:36 -08:00
Co-authored-by: kavioyu <kavioyu@tencent.com>
Co-authored-by: kavioyu <kavioyu@gmail.com>
Lianmin Zheng
df7fe4521a Crash the CI jobs on model import errors (#2072) 2024-11-17 22:18:11 -08:00
Tanjiro
8c280cee55 add phi-3 small support (#2062) 2024-11-17 18:47:43 -08:00
Co-authored-by: Tushar Goel <114812108+AI-Tushar@users.noreply.github.com>
Lianmin Zheng
ebaa2f3199 Rename arguments --disable-nan-detection to --enable-nan-detection (#2066) 2024-11-17 16:53:44 -08:00
Ke Bao
976bc302e5 Support DP MLA (#1970) 2024-11-16 09:01:43 +00:00
Ke Wen
cf2489762b Add Tensor Parallel to torch_native_llama (#1876) 2024-11-15 21:26:00 -08:00
Lianmin Zheng
530ae1bdc8 Fix weight loading for tied word embedding when TP > 1 (#2009) 2024-11-11 17:52:42 -08:00
Lianmin Zheng
59a5ba9be0 [Minor] Remove unused imports (#2006) 2024-11-11 15:36:14 -08:00
RangiLyu
f18b9c7252 support internlm2-reward (#1994) 2024-11-11 15:09:58 -08:00
yizhang2077
a8aad9357d qwen2vl fix bug for #1971 #1897 (#1984) 2024-11-10 08:10:45 -08:00
aqweteddy
4ade15dd32 Adjust reward model's score module and pooler module order for reducing computation (#1956) 2024-11-08 00:10:54 -08:00
aqweteddy
f16eb15d0d Gemma2 reward model support (#1954) 2024-11-07 22:42:27 -08:00
Chayenne
c77c1e05ba fix black in pre-commit (#1940) 2024-11-08 07:42:47 +08:00
Xuehai Pan
a5e0defb5a minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926) 2024-11-06 13:46:04 +00:00
Lianmin Zheng
2ce32db6fb Let reward model take text inputs instead of message lists (#1907) 2024-11-03 13:27:12 -08:00
Co-authored-by: Kyle Corbitt <kyle@corbt.com>
Lianmin Zheng
0abbf289a8 Unify the model type checking (#1905) 2024-11-03 12:25:39 -08:00
Ke Bao
16eb33ffe2 Update vocab embedding deps and add TP switch (#1856) 2024-10-31 20:13:07 -07:00
DanielC12321
5e00ddebc0 Add new model: Gpt2 (#1833) 2024-10-29 17:52:33 -07:00
HAI
54dd3ea122 [FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 m… (#1835) 2024-10-29 13:58:03 -07:00
Liangsheng Yin
94cde10920 Llama3.2 vision model support (#1551) 2024-10-21 15:01:21 -07:00
sixgod
45d5af2416 Add GLM-4 TextGeneration Model support for SGLang (#1736) 2024-10-21 04:08:30 +00:00
Yineng Zhang
cbbc82b7b8 Support qwen2 vl model (#1721) 2024-10-19 21:44:38 -07:00
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: ispobock <ISPObaoke@163.com>
Yineng Zhang
8bee20f80b Update vllm to 0.6.3 (#1711) (#1720) 2024-10-19 20:45:41 -07:00
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Lianmin Zheng
d17d19e5b8 Fix mixed batch for multi modal models (#1702) 2024-10-17 10:27:26 -07:00
Jani Monoses
5ab20cceba Use SGLang imports for linear layer (#1696) 2024-10-17 07:50:01 -07:00
Jani Monoses
a5114b6f91 Add OLMo model (#1676) 2024-10-16 00:11:18 -07:00
Byron Hsu
56503d9bc9 [1/N] Remove CacheConfig import in all model files (#1658) 2024-10-14 09:06:34 -07:00
Lianmin Zheng
6a5b352aaf Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) 2024-10-06 22:54:05 -07:00
Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>
Jerry Zhang
9b0926ceeb Add llama implementation with no tensor parallel linears (#1561) 2024-10-05 11:22:27 -07:00