commit     | date                      | author              | message
---------- | ------------------------- | ------------------- | -------
712216928f | 2024-09-12 16:46:14 -07:00 | Ying Sheng          | [Feature] Initial support for multi-LoRA serving (#1307)
c33d82a211 | 2024-09-12 01:47:52 -07:00 | hxer7963            | Add Support for XVERSE Models (Dense and MoE) to sglang (#1397)
    Co-authored-by: will he <hexin@xverse.cn>
    Co-authored-by: root <root@localhost.localdomain>
    Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
8234e663e9 | 2024-09-12 01:10:26 -07:00 | Kaichen Zhang - NTU | [Minor Fix] Fix llava modalities issue for single-image (#1402)
debbdb5178 | 2024-09-12 00:38:18 -07:00 | Zihao Ye            | kernel: use tensor cores for flashinfer gqa kernels (#1403)
3efa798116 | 2024-09-12 00:36:55 -07:00 | Lianmin Zheng       | Support cuda graph in the triton attention backend (#1401)
2a71be5e25 | 2024-09-11 23:46:51 -07:00 | William             | Fix README format (#1399)
4462137777 | 2024-09-12 05:40:45 +08:00 | Liangsheng Yin      | Add no commit to main rule (#1393)
fec185ce0c | 2024-09-11 11:44:26 -07:00 | Lianmin Zheng       | Refactor attention backend (#1381)
c03cece42f | 2024-09-11 04:50:04 -07:00 | Lianmin Zheng       | Improve error reporting during server launch (#1390)
15c75e4146 | 2024-09-11 04:36:21 -07:00 | Lianmin Zheng       | [Fix] Fix --disable-flashinfer (#1389)
224200e3c2 | 2024-09-11 03:55:24 -07:00 | Vectory             | BaiChuan2 Model (#1367)
    Co-authored-by: wanpenghan <wanpenghan@sohu-inc.com>
8c0efa514d | 2024-09-11 03:22:07 -07:00 | Byron Hsu           | remove assertion in triton attention and add an unit test (#1385)
144bc70fcc | 2024-09-10 17:38:59 -07:00 | Liangsheng Yin      | Organize flashinfer indices update (#1378)
46094e0c1b | 2024-09-10 17:11:16 -07:00 | Lianmin Zheng       | Deprecate --disable-flashinfer and introduce --attention-backend (#1380)
3a6e8b6d78 | 2024-09-10 15:15:08 -07:00 | Lianmin Zheng       | [Minor] move triton attention kernels into a separate folder (#1379)
fbb4754cb8 | 2024-09-10 13:10:36 -07:00 | Liangsheng Yin      | Fix vocab mask update bug (#1376)
6c7cb90365 | 2024-09-11 04:27:03 +10:00 | Lianmin Zheng       | [Minor] improve kill scripts and torchao import (#1375)
dff2860a69 | 2024-09-11 02:35:03 +10:00 | josephrocca         | Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy (#1373)
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
e72275cf7f | 2024-09-10 19:57:52 +10:00 | William             | Support MiniCPM3 (#1371)
fec2d1223c | 2024-09-10 01:17:37 -07:00 | wangchao            | [Fix] fix bug of undefined is_single in meth create_abort_task (#1370)
8d1095dbf0 | 2024-09-09 20:48:28 -07:00 | Lianmin Zheng       | [Docs] Improve documentations (#1368)
743007e1ce | 2024-09-09 19:09:13 -07:00 | Chayenne            | Adding Documentation for installation (#1300)
    Co-authored-by: zhaochen20 <zhaochenyang20@gmail.com>
9144ed1067 | 2024-09-09 19:08:25 -07:00 | zifeitong           | Support OpenAI API json_schema response format (#1363)
69b3bb9ae1 | 2024-09-09 13:49:29 -07:00 | Liangsheng Yin      | Unify forward mode (#1360)
689ff588ec | 2024-09-09 13:05:13 -07:00 | Ying Sheng          | [CI] Return output logprobs in unit test (#1361)
a7c47e0f02 | 2024-09-09 05:32:41 -07:00 | Jerry Zhang         | Add torchao quant (int4/int8/fp8) to llama models (#1341)
    Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
e4d68afcf0 | 2024-09-09 04:14:11 -07:00 | Lianmin Zheng       | [Minor] Many cleanup (#1357)
c9b75917d5 | 2024-09-09 02:14:25 -07:00 | Kai-Hsun Chen       | [server] Passing model_override_args to launch_server via the CLI. (#1298)
    Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
662ecd9368 | 2024-09-09 02:07:34 -07:00 | Kaichen Zhang - NTU | [Feat] Add modalities for vision server when handling pixel values for llava (#1346)
8e6bdf851c | 2024-09-09 01:30:24 -07:00 | Byron Hsu           | [triton] Support head_dim not 2^n in triton extend and decode attention (#1281)
05bea6883c | 2024-09-07 20:46:27 -07:00 | Liangsheng Yin      | Fix some online scheduling delay (#1345)
ab4a83b259 | 2024-09-05 14:30:26 -07:00 | Liangsheng Yin      | Optimize schedule (#1339)
62f15eea5a | 2024-09-06 04:25:14 +10:00 | Yineng Zhang        | docs: add conclusion (#1340)
79794af52d | 2024-09-06 00:00:06 +10:00 | Yineng Zhang        | docs: highlight ttft itl and throughput (#1337)
3494b32c3a | 2024-09-05 23:39:44 +10:00 | Yineng Zhang        | docs: update README (#1336)
eda7c09048 | 2024-09-04 05:37:32 -07:00 | Lianmin Zheng       | Remove useless fields in global_config.py (#1328)
5ab9418f5b | 2024-09-04 04:21:21 -07:00 | Yineng Zhang        | [Doc] update news (#1327)
843e63d809 | 2024-09-04 04:15:11 -07:00 | Lianmin Zheng       | Fix the flaky test test_moe_eval_accuracy_large.py (#1326)
a63c8275c6 | 2024-09-04 04:32:18 +08:00 | Yineng Zhang        | chore: bump v0.3.0 (#1320)
dc67d97693 | 2024-09-04 04:29:53 +10:00 | Yineng Zhang        | misc: speedup load safetensors (#1319)
    Co-authored-by: ispobock <ISPObaoke@163.com>
1e495e0847 | 2024-09-03 06:31:45 -07:00 | Lianmin Zheng       | [Fix] Fix select by ensuring each request has at least one token (#1318)
12cb115d38 | 2024-09-03 05:32:14 -07:00 | Lianmin Zheng       | Fix llama2 weight loader (#1317)
c500f96bb1 | 2024-09-03 01:43:08 -07:00 | Lianmin Zheng       | Update README.md for llava-onevision instructions (#1313)
474317f2b6 | 2024-09-02 21:49:40 -07:00 | Jani Monoses        | Support Phi3 mini and medium (#1299)
f64eae3a29 | 2024-09-02 21:44:45 -07:00 | Lianmin Zheng       | [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308)
a5a134f39f | 2024-09-02 23:18:48 +00:00 | Liangsheng Yin      | Fix bugs in sampler with CUDA graph / torch.compile (#1306)
2561ed012c | 2024-09-03 01:18:41 +10:00 | Yineng Zhang        | feat: update nightly gsm8k eval (#1304)
9999442756 | 2024-09-01 22:22:38 -07:00 | Lianmin Zheng       | Release v0.2.15 (#1295)
6def9b018c | 2024-09-01 21:56:33 -07:00 | Max Shawabkeh       | Fix hang when doing s += None. (#1297)
    Co-authored-by: max99x <mshawabkeh@jamandtea.studio>
47f20da223 | 2024-09-01 21:50:58 -07:00 | Liangsheng Yin      | Fix regex mask (#1296)