Commit Graph

165 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| xiaobochen | 3d32e4a32c | Resubmit MoE-EP (#2371) | 2024-12-06 15:05:21 +08:00 |
| Lianmin Zheng | 71e2a27753 | Fix the cuda graph capture range for small #max-running-requests (#2359) | 2024-12-06 14:13:57 +08:00 |
| Lianmin Zheng | 2b0fc5941d | [Minor] Code style improvements (#2355) | 2024-12-04 19:02:08 -08:00 |
| Jerry Zhang | 9cc733b38c | move apply_torchao_config_ to model_runner (#2342) | 2024-12-04 17:26:42 -08:00 |
| Lianmin Zheng | 07ec07ad1f | Improve torch compile for fused moe (#2327) | 2024-12-03 01:58:25 -08:00 |
| Ying Sheng | aa47f64223 | Revert "[feat] Enable chunked prefill for llava-onevision" (#2329) | 2024-12-02 23:11:13 -08:00 |
| Lianmin Zheng | 3ddb1c4679 | [Minor] Fix logger and style (#2325) | 2024-12-02 20:45:53 -08:00 |
| Ying Sheng | 480e38a733 | [feat] Enable chunked prefill for llava-onevision (#2281) | 2024-12-02 20:19:02 -08:00 |
| Yineng Zhang | 85e1a6f3aa | Update model_loader deps and qqq quantization deps (#2220) (#2318) (Co-authored-by: HandH1998 \<1335248067@qq.com\>) | 2024-12-02 23:22:13 +08:00 |
| Lianmin Zheng | 18108abe5d | [Minor] Fix code style (#2311) | 2024-12-02 02:27:36 -08:00 |
| Chayenne | 983bfcf386 | Online weight updates from torch.distributed (#2279) | 2024-12-01 23:23:18 -08:00 |
| Qun Yang | 62c516ac45 | Add a simple torch native attention backend (#2241) | 2024-12-01 03:01:25 -08:00 |
| Lianmin Zheng | 4936be8acc | Revert "Revert "[FEAT] Support GGUF format"" (#2287) | 2024-11-30 22:14:48 -08:00 |
| Lianmin Zheng | 7e4c6dd8da | Revert "[FEAT] Support GGUF format" (#2285) | 2024-11-30 19:03:26 -08:00 |
| Yang Zheng | 883c955489 | [FEAT] Support GGUF format (#2215) (Co-authored-by: Yang Zheng(SW)(Alex) \<you@example.com\>) | 2024-11-30 00:44:48 -08:00 |
| Chayenne | 7d1485d376 | Add get weights by parameter name for llama (#2266) | 2024-11-29 23:36:38 -08:00 |
| Chayenne | 7d5d1d3d29 | udate weights from disk (#2265) | 2024-11-30 01:17:00 +00:00 |
| Lianmin Zheng | 94e167ea5a | Fix the default chunked prefill size (#2268) | 2024-11-29 16:03:32 -08:00 |
| Ying Sheng | 8b48496aaf | Revert "Revert "Add simple CPU offloading support"" (#2253) (Co-authored-by: Jani Monoses \<jani.monoses@gmail.com\>; youkaichao \<youkaichao@gmail.com\>) | 2024-11-28 23:58:54 -08:00 |
| Ying Sheng | 4057ea82c9 | Revert "Add simple CPU offloading support" (#2252) (We'll re-add the commit to correctly ack Kaichao's authorship) | 2024-11-28 23:36:55 -08:00 |
| Rin Intachuen | 1aea19f64b | Input_embeds support (#2052) | 2024-11-25 16:35:04 -08:00 |
| Lianmin Zheng | 3c5538f781 | Update CI threshold (#2186) | 2024-11-25 15:24:17 -08:00 |
| Lianmin Zheng | c4336b2b60 | Use custom allreduce w/ torch.compile (#2185) | 2024-11-25 14:55:01 -08:00 |
| Lianmin Zheng | 5652c56535 | Update CI threshold & Improve code style (#2159) | 2024-11-24 06:29:38 -08:00 |
| Lianmin Zheng | a78d8f8db3 | [CI] Fix test cases (#2137) | 2024-11-23 01:00:07 -08:00 |
| Jani Monoses | d98fa1e93d | Add simple CPU offloading support. (#2081) | 2024-11-23 06:23:53 +00:00 |
| Ankur Neog | 865233e256 | Add initial support for intel Gaudi accelerators (#2121) | 2024-11-22 20:22:23 -08:00 |
| Lianmin Zheng | 66d4859acf | Revert "Only stream output on tp rank 0" (#2130) | 2024-11-22 15:46:16 -08:00 |
| Lianmin Zheng | e1b63624d7 | Only stream output on tp rank 0 (#2124) | 2024-11-22 15:13:44 -08:00 |
| Xuehai Pan | 62a4a339eb | docs: fix module docstrings and copyright headers (#2077) | 2024-11-22 22:16:53 +08:00 |
| Lianmin Zheng | 722530fa01 | Enable overlap scheduler by default for the triton attention backend (#2105) | 2024-11-20 02:58:35 -08:00 |
| Lianmin Zheng | 7d671e4ad2 | Enable overlap by default (#2067) | 2024-11-19 22:07:58 -08:00 |
| Lianmin Zheng | ffd20fcd03 | Make constrained decoding work for overlap scheduler (#2095) | 2024-11-19 15:04:43 -08:00 |
| HAI | e57c3e12b8 | Use native fp8 format on MI300X (#2094) | 2024-11-19 14:06:29 -08:00 |
| Yineng Zhang | 766192610e | feat: update torch 2.5.1 (#2069) | 2024-11-18 21:29:13 +08:00 |
| Lianmin Zheng | 4af3f889fc | Simplify flashinfer indices update for prefill (#2074) (Co-authored-by: kavioyu \<kavioyu@tencent.com\>; kavioyu \<kavioyu@gmail.com\>) | 2024-11-18 00:02:36 -08:00 |
| Lianmin Zheng | df7fe4521a | Crash the CI jobs on model import errors (#2072) | 2024-11-17 22:18:11 -08:00 |
| Lianmin Zheng | a9e90b4bce | [Minor] Fix styles for overlap mode (#2068) | 2024-11-17 19:49:20 -08:00 |
| DarkSharpness | 9c745d078e | [Performance] Update xgrammar-related constrained decoding (#2056) | 2024-11-17 16:58:49 -08:00 |
| Lianmin Zheng | ebaa2f3199 | Rename arguments --disable-nan-detection to --enable-nan-detection (#2066) | 2024-11-17 16:53:44 -08:00 |
| Ke Bao | 62832bb272 | Support cuda graph for DP attention (#2061) | 2024-11-17 16:29:20 -08:00 |
| Lianmin Zheng | 38625e2139 | Remove monkey_patch_vllm_dummy_weight_loader (#2064) | 2024-11-17 15:48:12 -08:00 |
| Lianmin Zheng | c1f401fc58 | Revert "chore: update torch v2.5.1" (#2063) | 2024-11-17 15:29:38 -08:00 |
| Yineng Zhang | 3b878863f7 | chore: update torch v2.5.1 (#1849) | 2024-11-18 00:06:00 +08:00 |
| Lianmin Zheng | edad373135 | Fix illegal memory access in overlap mode & Use more fused triton kernels for building meta data (#2051) | 2024-11-16 16:14:23 -08:00 |
| Ke Bao | 976bc302e5 | Support DP MLA (#1970) | 2024-11-16 09:01:43 +00:00 |
| Ke Wen | cf2489762b | Add Tensor Parallel to torch_native_llama (#1876) | 2024-11-15 21:26:00 -08:00 |
| Patrick Yi | 13ce3e4b5d | Add download_dir ServerArgs property (#2027) | 2024-11-13 23:26:56 -08:00 |
| DarkSharpness | 125b1199c5 | support parallel grammar preprocessing (#1996) (Co-authored-by: Lianmin Zheng \<lianminzheng@gmail.com\>) | 2024-11-12 08:45:28 -08:00 |
| yizhang2077 | a8aad9357d | qwen2vl fix bug for #1971 #1897 (#1984) | 2024-11-10 08:10:45 -08:00 |