Commit Graph

618 Commits

Author | SHA1 | Message | Date
Kaichen Zhang - NTU | 662ecd9368 | [Feat] Add modalities for vision server when handling pixel values for llava (#1346) | 2024-09-09 02:07:34 -07:00
Byron Hsu | 8e6bdf851c | [triton] Support head_dim not 2^n in triton extend and decode attention (#1281) | 2024-09-09 01:30:24 -07:00
Liangsheng Yin | 05bea6883c | Fix some online scheduling delay (#1345) | 2024-09-07 20:46:27 -07:00
Liangsheng Yin | ab4a83b259 | Optimize schedule (#1339) | 2024-09-05 14:30:26 -07:00
Lianmin Zheng | eda7c09048 | Remove useless fields in global_config.py (#1328) | 2024-09-04 05:37:32 -07:00
Yineng Zhang | a63c8275c6 | chore: bump v0.3.0 (#1320) | 2024-09-04 04:32:18 +08:00
Yineng Zhang | dc67d97693 | misc: speedup load safetensors (#1319) (Co-authored-by: ispobock <ISPObaoke@163.com>) | 2024-09-04 04:29:53 +10:00
Lianmin Zheng | 1e495e0847 | [Fix] Fix select by ensuring each request has at least one token (#1318) | 2024-09-03 06:31:45 -07:00
Lianmin Zheng | 12cb115d38 | Fix llama2 weight loader (#1317) | 2024-09-03 05:32:14 -07:00
Jani Monoses | 474317f2b6 | Support Phi3 mini and medium (#1299) | 2024-09-02 21:49:40 -07:00
Lianmin Zheng | f64eae3a29 | [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308) | 2024-09-02 21:44:45 -07:00
Liangsheng Yin | a5a134f39f | Fix bugs in sampler with CUDA graph / torch.compile (#1306) | 2024-09-02 23:18:48 +00:00
Yineng Zhang | 2561ed012c | feat: update nightly gsm8k eval (#1304) | 2024-09-03 01:18:41 +10:00
Lianmin Zheng | 9999442756 | Release v0.2.15 (#1295) | 2024-09-01 22:22:38 -07:00
Max Shawabkeh | 6def9b018c | Fix hang when doing s += None. (#1297) (Co-authored-by: max99x <mshawabkeh@jamandtea.studio>) | 2024-09-01 21:56:33 -07:00
Liangsheng Yin | 47f20da223 | Fix regex mask (#1296) | 2024-09-01 21:50:58 -07:00
Yineng Zhang | 9b0805242e | fix: resolve fp8 for mixtral (#1290) | 2024-09-02 00:29:06 +10:00
Enrique Shockwave | 32a4141d5a | Allow new lines during JSON generation (#1277) | 2024-09-01 03:42:29 -07:00
Kai-Hsun Chen | 0836055324 | [Chore] Rename model_overide_args to model_override_args (#1284) (Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-09-01 03:14:56 -07:00
Byron Hsu | 00b19f198f | [triton] Remove the zero initialization of qk_acc by directly writing the result (#1288) | 2024-09-01 03:12:06 -07:00
Ke Bao | 6cb32ef92c | Support Triton fp8 e5m2 kv cache (#1286) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-09-01 19:46:40 +10:00
Lianmin Zheng | 761b2cebd6 | [CI] merge all ci tests into one file (#1289) | 2024-09-01 02:36:56 -07:00
Yineng Zhang | 54772f784a | feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 (#1285) (Co-authored-by: ispobock <ispobaoke@163.com>) | 2024-09-01 17:28:06 +10:00
Lianmin Zheng | 1b5d56f7f8 | [CI] Add more multi-gpu tests (#1280) | 2024-09-01 00:27:25 -07:00
xiaobochen | d134c139a1 | Optimize the update flashinfer indices (#1262) | 2024-08-31 23:40:28 -07:00
Yineng Zhang | 52cefdbf57 | fix: resolve the fp8 bug introduced by vLLM 0.5.5 (#1276) | 2024-09-01 00:44:29 +10:00
Christopher Chou | 51c554d812 | Allow more flexible assistant and system response (#1256) | 2024-08-30 11:51:44 -07:00
Lianmin Zheng | 79ece2c51f | Report median instead of mean in bench_latency.py (#1269) | 2024-08-30 06:05:01 -07:00
김종곤 | b7f8341014 | EXAONE 3.0 Model Support (#1258) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-30 08:08:28 +00:00
Ke Bao | f414352ae6 | Transpose mla weight offline (#1261) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-30 16:45:40 +10:00
lxww302 | a362340b33 | fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader (#1260) | 2024-08-30 16:43:41 +10:00
Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00
Zhiqiang Xie | 8153168c96 | fix data racing due to mutable reference using deepcopy (#1255) | 2024-08-28 18:57:54 -07:00
Enrique Shockwave | 6c34d6339c | make json_schema usable from gen (#1254) | 2024-08-28 18:57:10 -07:00
Yineng Zhang | 13ac95b894 | chore: bump v0.2.14.post2 (#1250) | 2024-08-28 18:46:33 +00:00
Yineng Zhang | 492143bf32 | fix: resolve qwen2 moe weight loader (#1252) | 2024-08-28 11:25:46 -07:00
Lianmin Zheng | 0a97d7962d | [Fix] Fix OOM in llava base class (#1249) | 2024-08-28 08:45:49 -07:00
Yineng Zhang | c411f32e1c | feat: replace GeluAndMul (#1234) | 2024-08-28 14:07:02 +00:00
Lianmin Zheng | bf53bf5142 | [Fix] Fix llava on multi images (#1247) | 2024-08-28 06:33:05 -07:00
Yineng Zhang | b1a540ec42 | feat: update GemmaRMSNorm (#1232) | 2024-08-28 22:47:34 +10:00
Yineng Zhang | 66975360e7 | fix: increase max_new_tokens when testing generation models (#1244) | 2024-08-28 22:12:36 +10:00
Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00
Yineng Zhang | 198974cd1a | feat: support sm75 with FlashInfer v0.1.6 (#1233) | 2024-08-28 18:39:12 +10:00
Lianmin Zheng | 6cc38b2bf3 | [Minor] Add more type annotations (#1237) | 2024-08-28 00:54:26 -07:00
Liangsheng Yin | 1ece2cda3d | Fix bench latency benchmark (#1225) | 2024-08-28 00:37:32 -07:00
Yineng Zhang | 3602692c7c | feat: replace get_act_fn for gpt_bigcode (#1231) | 2024-08-27 21:15:31 +10:00
havetc | 909f34363b | [FIX] Wrong logger (#1230) | 2024-08-27 20:10:46 +10:00
caiyueliang | 2f1d92834f | [FEAT] Support batches cancel (#1222) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-26 23:28:26 +00:00
havetc | 9935f97b3e | [FEAT] JSON constrained support (#1125) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-26 09:37:26 -07:00
Yineng Zhang | c5fe11a8e1 | chore: bump v0.2.14 (#1155) | 2024-08-27 00:28:24 +10:00