Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2024-09-09 05:32:41 -07:00
Lianmin Zheng | e4d68afcf0 | [Minor] Many cleanup (#1357) | 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen | c9b75917d5 | [server] Passing model_override_args to launch_server via the CLI. (#1298); Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com> | 2024-09-09 02:14:25 -07:00
Kaichen Zhang - NTU | 662ecd9368 | [Feat] Add modalities for vision server when handling pixel values for llava (#1346) | 2024-09-09 02:07:34 -07:00
Byron Hsu | 8e6bdf851c | [triton] Support head_dim not 2^n in triton extend and decode attention (#1281) | 2024-09-09 01:30:24 -07:00
Liangsheng Yin | 05bea6883c | Fix some online scheduling delay (#1345) | 2024-09-07 20:46:27 -07:00
Liangsheng Yin | ab4a83b259 | Optimize schedule (#1339) | 2024-09-05 14:30:26 -07:00
Lianmin Zheng | eda7c09048 | Remove useless fields in global_config.py (#1328) | 2024-09-04 05:37:32 -07:00
Yineng Zhang | a63c8275c6 | chore: bump v0.3.0 (#1320) | 2024-09-04 04:32:18 +08:00
Yineng Zhang | dc67d97693 | misc: speedup load safetensors (#1319); Co-authored-by: ispobock <ISPObaoke@163.com> | 2024-09-04 04:29:53 +10:00
Lianmin Zheng | 1e495e0847 | [Fix] Fix select by ensuring each request has at least one token (#1318) | 2024-09-03 06:31:45 -07:00
Lianmin Zheng | 12cb115d38 | Fix llama2 weight loader (#1317) | 2024-09-03 05:32:14 -07:00
Jani Monoses | 474317f2b6 | Support Phi3 mini and medium (#1299) | 2024-09-02 21:49:40 -07:00
Lianmin Zheng | f64eae3a29 | [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308) | 2024-09-02 21:44:45 -07:00
Liangsheng Yin | a5a134f39f | Fix bugs in sampler with CUDA graph / torch.compile (#1306) | 2024-09-02 23:18:48 +00:00
Yineng Zhang | 2561ed012c | feat: update nightly gsm8k eval (#1304) | 2024-09-03 01:18:41 +10:00
Lianmin Zheng | 9999442756 | Release v0.2.15 (#1295) | 2024-09-01 22:22:38 -07:00
Max Shawabkeh | 6def9b018c | Fix hang when doing s += None. (#1297); Co-authored-by: max99x <mshawabkeh@jamandtea.studio> | 2024-09-01 21:56:33 -07:00
Liangsheng Yin | 47f20da223 | Fix regex mask (#1296) | 2024-09-01 21:50:58 -07:00
Yineng Zhang | 9b0805242e | fix: resolve fp8 for mixtral (#1290) | 2024-09-02 00:29:06 +10:00
Enrique Shockwave | 32a4141d5a | Allow new lines during JSON generation (#1277) | 2024-09-01 03:42:29 -07:00
Kai-Hsun Chen | 0836055324 | [Chore] Rename model_overide_args to model_override_args (#1284); Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-09-01 03:14:56 -07:00
Byron Hsu | 00b19f198f | [triton] Remove the zero initialization of qk_acc by directly writing the result (#1288) | 2024-09-01 03:12:06 -07:00
Ke Bao | 6cb32ef92c | Support Triton fp8 e5m2 kv cache (#1286); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-09-01 19:46:40 +10:00
Lianmin Zheng | 761b2cebd6 | [CI] merge all ci tests into one file (#1289) | 2024-09-01 02:36:56 -07:00
Yineng Zhang | 54772f784a | feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 (#1285); Co-authored-by: ispobock <ispobaoke@163.com> | 2024-09-01 17:28:06 +10:00
Lianmin Zheng | 1b5d56f7f8 | [CI] Add more multi-gpu tests (#1280) | 2024-09-01 00:27:25 -07:00
xiaobochen | d134c139a1 | Optimize the update flashinfer indices (#1262) | 2024-08-31 23:40:28 -07:00
Yineng Zhang | 52cefdbf57 | fix: resolve the fp8 bug introduced by vLLM 0.5.5 (#1276) | 2024-09-01 00:44:29 +10:00
Christopher Chou | 51c554d812 | Allow more flexible assistant and system response (#1256) | 2024-08-30 11:51:44 -07:00
Lianmin Zheng | 79ece2c51f | Report median instead of mean in bench_latency.py (#1269) | 2024-08-30 06:05:01 -07:00
김종곤 | b7f8341014 | EXAONE 3.0 Model Support (#1258); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-08-30 08:08:28 +00:00
Ke Bao | f414352ae6 | Transpose mla weight offline (#1261); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-08-30 16:45:40 +10:00
lxww302 | a362340b33 | fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader (#1260) | 2024-08-30 16:43:41 +10:00
Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00
Zhiqiang Xie | 8153168c96 | fix data racing due to mutable reference using deepcopy (#1255) | 2024-08-28 18:57:54 -07:00
Enrique Shockwave | 6c34d6339c | make json_schema usable from gen (#1254) | 2024-08-28 18:57:10 -07:00
Yineng Zhang | 13ac95b894 | chore: bump v0.2.14.post2 (#1250) | 2024-08-28 18:46:33 +00:00
Yineng Zhang | 492143bf32 | fix: resolve qwen2 moe weight loader (#1252) | 2024-08-28 11:25:46 -07:00
Lianmin Zheng | 0a97d7962d | [Fix] Fix OOM in llava base class (#1249) | 2024-08-28 08:45:49 -07:00
Yineng Zhang | c411f32e1c | feat: replace GeluAndMul (#1234) | 2024-08-28 14:07:02 +00:00
Lianmin Zheng | bf53bf5142 | [Fix] Fix llava on multi images (#1247) | 2024-08-28 06:33:05 -07:00
Yineng Zhang | b1a540ec42 | feat: update GemmaRMSNorm (#1232) | 2024-08-28 22:47:34 +10:00
Yineng Zhang | 66975360e7 | fix: increase max_new_tokens when testing generation models (#1244) | 2024-08-28 22:12:36 +10:00
Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00
Yineng Zhang | 198974cd1a | feat: support sm75 with FlashInfer v0.1.6 (#1233) | 2024-08-28 18:39:12 +10:00
Lianmin Zheng | 6cc38b2bf3 | [Minor] Add more type annotations (#1237) | 2024-08-28 00:54:26 -07:00
Liangsheng Yin | 1ece2cda3d | Fix bench latency benchmark (#1225) | 2024-08-28 00:37:32 -07:00
Yineng Zhang | 3602692c7c | feat: replace get_act_fn for gpt_bigcode (#1231) | 2024-08-27 21:15:31 +10:00
havetc | 909f34363b | [FIX] Wrong logger (#1230) | 2024-08-27 20:10:46 +10:00