| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| xiaobochen | 3d32e4a32c | Resubmit MoE-EP (#2371) | 2024-12-06 15:05:21 +08:00 |
| Lianmin Zheng | 71e2a27753 | Fix the cuda graph capture range for small #max-running-requests (#2359) | 2024-12-06 14:13:57 +08:00 |
| Lianmin Zheng | 2b0fc5941d | [Minor] Code style improvements (#2355) | 2024-12-04 19:02:08 -08:00 |
| Jerry Zhang | 9cc733b38c | move apply_torchao_config_ to model_runner (#2342) | 2024-12-04 17:26:42 -08:00 |
| Lianmin Zheng | 07ec07ad1f | Improve torch compile for fused moe (#2327) | 2024-12-03 01:58:25 -08:00 |
| Ying Sheng | aa47f64223 | Revert "[feat] Enable chunked prefill for llava-onevision" (#2329) | 2024-12-02 23:11:13 -08:00 |
| Lianmin Zheng | 3ddb1c4679 | [Minor] Fix logger and style (#2325) | 2024-12-02 20:45:53 -08:00 |
| Ying Sheng | 480e38a733 | [feat] Enable chunked prefill for llava-onevision (#2281) | 2024-12-02 20:19:02 -08:00 |
| Yineng Zhang | 85e1a6f3aa | Update model_loader deps and qqq quantization deps (#2220) (#2318) (Co-authored-by: HandH1998 <1335248067@qq.com>) | 2024-12-02 23:22:13 +08:00 |
| Lianmin Zheng | 18108abe5d | [Minor] Fix code style (#2311) | 2024-12-02 02:27:36 -08:00 |
| Chayenne | 983bfcf386 | Online weight updates from torch.distributed (#2279) | 2024-12-01 23:23:18 -08:00 |
| Qun Yang | 62c516ac45 | Add a simple torch native attention backend (#2241) | 2024-12-01 03:01:25 -08:00 |
| Lianmin Zheng | 4936be8acc | Revert "Revert "[FEAT] Support GGUF format"" (#2287) | 2024-11-30 22:14:48 -08:00 |
| Lianmin Zheng | 7e4c6dd8da | Revert "[FEAT] Support GGUF format" (#2285) | 2024-11-30 19:03:26 -08:00 |
| Yang Zheng | 883c955489 | [FEAT] Support GGUF format (#2215) (Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>) | 2024-11-30 00:44:48 -08:00 |
| Chayenne | 7d1485d376 | Add get weights by parameter name for llama (#2266) | 2024-11-29 23:36:38 -08:00 |
| Chayenne | 7d5d1d3d29 | udate weights from disk (#2265) | 2024-11-30 01:17:00 +00:00 |
| Lianmin Zheng | 94e167ea5a | Fix the default chunked prefill size (#2268) | 2024-11-29 16:03:32 -08:00 |
| Ying Sheng | 8b48496aaf | Revert "Revert "Add simple CPU offloading support"" (#2253) (Co-authored-by: Jani Monoses <jani.monoses@gmail.com>, youkaichao <youkaichao@gmail.com>) | 2024-11-28 23:58:54 -08:00 |
| Ying Sheng | 4057ea82c9 | Revert "Add simple CPU offloading support" (#2252) ("We'll re-add the commit to correctly ack Kaichao's authorship") | 2024-11-28 23:36:55 -08:00 |
| Rin Intachuen | 1aea19f64b | Input_embeds support (#2052) | 2024-11-25 16:35:04 -08:00 |
| Lianmin Zheng | 3c5538f781 | Update CI threshold (#2186) | 2024-11-25 15:24:17 -08:00 |
| Lianmin Zheng | c4336b2b60 | Use custom allreduce w/ torch.compile (#2185) | 2024-11-25 14:55:01 -08:00 |
| Lianmin Zheng | 5652c56535 | Update CI threshold & Improve code style (#2159) | 2024-11-24 06:29:38 -08:00 |
| Lianmin Zheng | a78d8f8db3 | [CI] Fix test cases (#2137) | 2024-11-23 01:00:07 -08:00 |
| Jani Monoses | d98fa1e93d | Add simple CPU offloading support. (#2081) | 2024-11-23 06:23:53 +00:00 |
| Ankur Neog | 865233e256 | Add initial support for intel Gaudi accelerators (#2121) | 2024-11-22 20:22:23 -08:00 |
| Lianmin Zheng | 66d4859acf | Revert "Only stream output on tp rank 0" (#2130) | 2024-11-22 15:46:16 -08:00 |
| Lianmin Zheng | e1b63624d7 | Only stream output on tp rank 0 (#2124) | 2024-11-22 15:13:44 -08:00 |
| Xuehai Pan | 62a4a339eb | docs: fix module docstrings and copyright headers (#2077) | 2024-11-22 22:16:53 +08:00 |
| Lianmin Zheng | 722530fa01 | Enable overlap scheduler by default for the triton attention backend (#2105) | 2024-11-20 02:58:35 -08:00 |
| Lianmin Zheng | 7d671e4ad2 | Enable overlap by default (#2067) | 2024-11-19 22:07:58 -08:00 |
| Lianmin Zheng | ffd20fcd03 | Make constrained decoding work for overlap scheduler (#2095) | 2024-11-19 15:04:43 -08:00 |
| HAI | e57c3e12b8 | Use native fp8 format on MI300X (#2094) | 2024-11-19 14:06:29 -08:00 |
| Yineng Zhang | 766192610e | feat: update torch 2.5.1 (#2069) | 2024-11-18 21:29:13 +08:00 |
| Lianmin Zheng | 4af3f889fc | Simplify flashinfer indices update for prefill (#2074) (Co-authored-by: kavioyu <kavioyu@tencent.com>, kavioyu <kavioyu@gmail.com>) | 2024-11-18 00:02:36 -08:00 |
| Lianmin Zheng | df7fe4521a | Crash the CI jobs on model import errors (#2072) | 2024-11-17 22:18:11 -08:00 |
| Lianmin Zheng | a9e90b4bce | [Minor] Fix styles for overlap mode (#2068) | 2024-11-17 19:49:20 -08:00 |
| DarkSharpness | 9c745d078e | [Performance] Update xgrammar-related constrained decoding (#2056) | 2024-11-17 16:58:49 -08:00 |
| Lianmin Zheng | ebaa2f3199 | Rename arguments --disable-nan-detection to --enable-nan-detection (#2066) | 2024-11-17 16:53:44 -08:00 |
| Ke Bao | 62832bb272 | Support cuda graph for DP attention (#2061) | 2024-11-17 16:29:20 -08:00 |
| Lianmin Zheng | 38625e2139 | Remove monkey_patch_vllm_dummy_weight_loader (#2064) | 2024-11-17 15:48:12 -08:00 |
| Lianmin Zheng | c1f401fc58 | Revert "chore: update torch v2.5.1" (#2063) | 2024-11-17 15:29:38 -08:00 |
| Yineng Zhang | 3b878863f7 | chore: update torch v2.5.1 (#1849) | 2024-11-18 00:06:00 +08:00 |
| Lianmin Zheng | edad373135 | Fix illegal memory access in overlap mode & Use more fused triton kernels for building meta data (#2051) | 2024-11-16 16:14:23 -08:00 |
| Ke Bao | 976bc302e5 | Support DP MLA (#1970) | 2024-11-16 09:01:43 +00:00 |
| Ke Wen | cf2489762b | Add Tensor Parallel to torch_native_llama (#1876) | 2024-11-15 21:26:00 -08:00 |
| Patrick Yi | 13ce3e4b5d | Add download_dir ServerArgs property (#2027) | 2024-11-13 23:26:56 -08:00 |
| DarkSharpness | 125b1199c5 | support parallel grammar preprocessing (#1996) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>) | 2024-11-12 08:45:28 -08:00 |
| yizhang2077 | a8aad9357d | qwen2vl fix bug for #1971 #1897 (#1984) | 2024-11-10 08:10:45 -08:00 |