| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Liangsheng Yin | 36078fb247 | fix schedule bug (#1450) | 2024-09-17 16:33:53 -07:00 |
| Ke Bao | b3710d2c93 | Fix attention backend (#1448) | 2024-09-17 14:07:53 +00:00 |
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |
| Lianmin Zheng | 90a26be31c | Release 0.3.1.post1 (#1445) | 2024-09-17 01:47:31 -07:00 |
| Jani Monoses | 1f4b5f770d | Add OLMoE model (#1444) | 2024-09-17 01:14:53 -07:00 |
| Ke Bao | 76524b70d1 | Fix torch compile for deepseek-v2 (#1442) | 2024-09-17 00:52:08 -07:00 |
| HAI | 3a6e04185b | [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1420) | 2024-09-17 07:43:52 +00:00 |
| Lianmin Zheng | 2fa5cec775 | Simplify sampler and its error handling (#1441) | 2024-09-16 21:23:31 -07:00 |
| Lianmin Zheng | 27b557aea7 | Clean up model loader (#1440) | 2024-09-16 18:16:27 -07:00 |
| zifeitong | 93dffd699b | Add constrained_json_whitespace_pattern to ServerArgs (#1438) | 2024-09-16 13:29:18 -07:00 |
| Ying Sheng | 37963394aa | [Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433) | 2024-09-15 12:46:04 -07:00 |
| Lianmin Zheng | 899cf5c438 | Remove deprecated configs (#1431) | 2024-09-15 08:52:18 -07:00 |
| Lianmin Zheng | e79f6cd73d | Release v0.3.1 (#1430) | 2024-09-15 23:03:16 +09:00 |
| Lianmin Zheng | 9ba1f09760 | [Fix] Fix logprob and normalized_logprob (#1428) | 2024-09-15 06:36:06 -07:00 |
| Lianmin Zheng | 9463bc1385 | Enable torch.compile for triton backend (#1422) | 2024-09-14 15:38:37 -07:00 |
| Jerry Zhang | 30b404ce72 | Add torchao quant for mixtral and qwen_moe (#1418) | 2024-09-14 06:46:55 +00:00 |
| Liangsheng Yin | 70b6802982 | Optimize conflicts between CUDA graph and vocab mask tensors (#1392) | 2024-09-13 20:27:53 -07:00 |
| Lianmin Zheng | ad0ff62a4c | Balance test in CI (#1411) | 2024-09-12 23:29:44 -07:00 |
| Lianmin Zheng | 68be2f6d3b | [CI] Include triton backend and online serving benchmark into CI (#1408) | 2024-09-12 21:36:41 -07:00 |
| Lianmin Zheng | b912de11b0 | Make stop reason a dict instead of str (#1407) | 2024-09-12 20:47:31 -07:00 |
| Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00 |
| hxer7963 | c33d82a211 | Add Support for XVERSE Models (Dense and MoE) to sglang (#1397) (Co-authored-by: will he <hexin@xverse.cn>, root <root@localhost.localdomain>, Lianmin Zheng <lianminzheng@gmail.com>) | 2024-09-12 01:47:52 -07:00 |
| Kaichen Zhang - NTU | 8234e663e9 | [Minor Fix] Fix llava modalities issue for single-image (#1402) | 2024-09-12 01:10:26 -07:00 |
| Zihao Ye | debbdb5178 | kernel: use tensor cores for flashinfer gqa kernels (#1403) | 2024-09-12 00:38:18 -07:00 |
| Lianmin Zheng | 3efa798116 | Support cuda graph in the triton attention backend (#1401) | 2024-09-12 00:36:55 -07:00 |
| Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00 |
| Lianmin Zheng | c03cece42f | Improve error reporting during server launch (#1390) | 2024-09-11 04:50:04 -07:00 |
| Lianmin Zheng | 15c75e4146 | [Fix] Fix --disable-flashinfer (#1389) | 2024-09-11 04:36:21 -07:00 |
| Vectory | 224200e3c2 | BaiChuan2 Model (#1367) (Co-authored-by: wanpenghan <wanpenghan@sohu-inc.com>) | 2024-09-11 03:55:24 -07:00 |
| Byron Hsu | 8c0efa514d | remove assertion in triton attention and add an unit test (#1385) | 2024-09-11 03:22:07 -07:00 |
| Liangsheng Yin | 144bc70fcc | Organize flashinfer indices update (#1378) | 2024-09-10 17:38:59 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Lianmin Zheng | 3a6e8b6d78 | [Minor] move triton attention kernels into a separate folder (#1379) | 2024-09-10 15:15:08 -07:00 |
| Liangsheng Yin | fbb4754cb8 | Fix vocab mask update bug (#1376) | 2024-09-10 13:10:36 -07:00 |
| Lianmin Zheng | 6c7cb90365 | [Minor] improve kill scripts and torchao import (#1375) | 2024-09-11 04:27:03 +10:00 |
| josephrocca | dff2860a69 | Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy (#1373) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-09-11 02:35:03 +10:00 |
| William | e72275cf7f | Support MiniCPM3 (#1371) | 2024-09-10 19:57:52 +10:00 |
| wangchao | fec2d1223c | [Fix] fix bug of undefined is_single in meth create_abort_task (#1370) | 2024-09-10 01:17:37 -07:00 |
| zifeitong | 9144ed1067 | Support OpenAI API json_schema response format (#1363) | 2024-09-09 19:08:25 -07:00 |
| Liangsheng Yin | 69b3bb9ae1 | Unify forward mode (#1360) | 2024-09-09 13:49:29 -07:00 |
| Ying Sheng | 689ff588ec | [CI] Return output logprobs in unit test (#1361) | 2024-09-09 13:05:13 -07:00 |
| Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>) | 2024-09-09 05:32:41 -07:00 |
| Lianmin Zheng | e4d68afcf0 | [Minor] Many cleanup (#1357) | 2024-09-09 04:14:11 -07:00 |
| Kai-Hsun Chen | c9b75917d5 | [server] Passing model_override_args to launch_server via the CLI. (#1298) (Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>) | 2024-09-09 02:14:25 -07:00 |
| Kaichen Zhang - NTU | 662ecd9368 | [Feat] Add modalities for vision server when handling pixel values for llava (#1346) | 2024-09-09 02:07:34 -07:00 |
| Byron Hsu | 8e6bdf851c | [triton] Support head_dim not 2^n in triton extend and decode attention (#1281) | 2024-09-09 01:30:24 -07:00 |
| Liangsheng Yin | 05bea6883c | Fix some online scheduling delay (#1345) | 2024-09-07 20:46:27 -07:00 |
| Liangsheng Yin | ab4a83b259 | Optimize schedule (#1339) | 2024-09-05 14:30:26 -07:00 |
| Lianmin Zheng | eda7c09048 | Remove useless fields in global_config.py (#1328) | 2024-09-04 05:37:32 -07:00 |
| Yineng Zhang | a63c8275c6 | chore: bump v0.3.0 (#1320) | 2024-09-04 04:32:18 +08:00 |