Commit Graph

115 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| havetc | ecb8bad276 | Returning a per request metric for number of cached_tokens read (#1599) | 2024-10-16 11:49:22 -07:00 |
| Lianmin Zheng | 9116b2896f | Add a new event loop (#1677) | 2024-10-16 01:33:20 -07:00 |
| Shuo Yang | 061e546313 | Support double sparsity (#1459) | 2024-10-14 02:00:41 -07:00 |
| Lianmin Zheng | 7ee6c259ff | Simplify the event loop and expose --num-continuous-decode-steps as an argument (#1652) | 2024-10-12 21:35:30 -07:00 |
| Lianmin Zheng | 9da5a60b18 | Add an option to disable penalizer (#1651) | 2024-10-12 17:53:23 -07:00 |
| Zhang, Liangang | 5d638c92f5 | [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480) | 2024-10-12 18:10:32 +00:00 |
| Lianmin Zheng | 23cc66f7b6 | Add back data parallelism (#1635) | 2024-10-11 07:22:48 -07:00 |
| glen-amd | 58093b868f | Nit about the decorator of PortArgs.init_new (#1611) | 2024-10-11 02:17:47 -07:00 |
| Zhang, Liangang | 8275049ce3 | Add device support (#1607) | 2024-10-11 02:05:58 -07:00 |
| Jani Monoses | 3ff641132e | Remove references to squeezellm (#1603) | 2024-10-07 11:30:41 -07:00 |
| Lianmin Zheng | 6a5b352aaf | Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) (Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>) | 2024-10-06 22:54:05 -07:00 |
| Lianmin Zheng | 114bbc8651 | Use ipc instead of tcp in zmq (#1566) | 2024-10-04 00:45:52 -07:00 |
| Lianmin Zheng | 32eb6e96f2 | Organize sampling batch info better (#1562) | 2024-10-03 18:29:49 -07:00 |
| Ying Sheng | 0f4fb19bc8 | [Fix, LoRA] fix LoRA with updates in main (#1545) | 2024-09-30 10:06:08 -07:00 |
| Xinyu Yang | acaffd233f | [Fix] fix ipv6 url when warm up model (#1537) | 2024-09-29 11:02:40 -07:00 |
| Lianmin Zheng | 048685430d | Improve process creation (#1534) | 2024-09-29 02:36:12 -07:00 |
| Lianmin Zheng | 5e62a6b706 | Add bench_server_latency.py (#1452) | 2024-09-18 00:56:06 -07:00 |
| Ke Bao | b3710d2c93 | Fix attention backend (#1448) | 2024-09-17 14:07:53 +00:00 |
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |
| Ke Bao | 76524b70d1 | Fix torch compile for deepseek-v2 (#1442) | 2024-09-17 00:52:08 -07:00 |
| HAI | 3a6e04185b | [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1420) | 2024-09-17 07:43:52 +00:00 |
| zifeitong | 93dffd699b | Add constrained_json_whitespace_pattern to ServerArgs (#1438) | 2024-09-16 13:29:18 -07:00 |
| Ying Sheng | 37963394aa | [Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433) | 2024-09-15 12:46:04 -07:00 |
| Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00 |
| Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00 |
| Lianmin Zheng | 15c75e4146 | [Fix] Fix --disable-flashinfer (#1389) | 2024-09-11 04:36:21 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>) | 2024-09-09 05:32:41 -07:00 |
| Lianmin Zheng | e4d68afcf0 | [Minor] Many cleanup (#1357) | 2024-09-09 04:14:11 -07:00 |
| Kai-Hsun Chen | c9b75917d5 | [server] Passing model_override_args to launch_server via the CLI. (#1298) (Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>) | 2024-09-09 02:14:25 -07:00 |
| Ke Bao | 2c615d120f | [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-25 17:38:11 -07:00 |
| Lianmin Zheng | 902278008a | [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) | 2024-08-25 14:46:34 -07:00 |
| Chayenne | 30b4f771b0 | Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-08-25 10:29:12 -07:00 |
| Lianmin Zheng | f6af3a6561 | Cleanup readme, llava examples, usage examples and nccl init (#1194) | 2024-08-24 08:02:23 -07:00 |
| Lianmin Zheng | 5623826f73 | [Minor] Improve logging and rename the health check endpoint name (#1180) | 2024-08-21 19:24:36 -07:00 |
| Lianmin Zheng | bea2bb9eea | Improve multi-node stability (#1171) | 2024-08-20 22:35:05 -07:00 |
| Xu-Chen | ff2cfdb1a2 | [Feature] add disable-custom-all-reduce (#1148) (Co-authored-by: chenxu02 <chenxu02@zhihu.com>, Yineng Zhang <me@zhyncs.com>) | 2024-08-20 08:44:12 -07:00 |
| Liangsheng Yin | 3694f8f996 | Mixed style of chunked prefill (#1013) | 2024-08-16 09:13:00 +00:00 |
| Ying Sheng | 93d4e354d8 | [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) | 2024-08-15 10:33:20 -07:00 |
| Lianmin Zheng | e86b1ccbf0 | Enable chunked prefill by default (#1040) | 2024-08-14 21:56:20 -07:00 |
| Ying Sheng | 96a2093ef0 | [Fix] Compatibility of window attention and cuda graph (#1090) | 2024-08-14 10:37:01 -07:00 |
| Ying Sheng | 0909bb0d2f | [Feat] Add window attention for gemma-2 (#1056) | 2024-08-13 17:01:26 -07:00 |
| Lianmin Zheng | d84c5e70f7 | Test the case when max_new_tokens is very large (#1038) | 2024-08-11 16:41:03 -07:00 |
| Lianmin Zheng | a97df79124 | Clean up readme and arguments of chunked prefill (#1022) | 2024-08-11 01:18:52 -07:00 |
| gryffindor-rr | 9cf0a5bada | Add skip_tokenizer_init args. (#959) (Co-authored-by: lzhang <zhanglei@modelbest.cn>) | 2024-08-09 12:14:13 -07:00 |
| yichuan~ | ffb15744b5 | Support multiple args options (#941) | 2024-08-06 04:12:53 +10:00 |
| Ying Sheng | 0d4f3a9fcd | Make API Key OpenAI-compatible (#917) | 2024-08-04 13:35:44 -07:00 |
| Ke Bao | e1eae1fd15 | Support MLA for DeepSeek-V2 with Triton - step 1 (#905) | 2024-08-05 03:40:33 +10:00 |
| 任嘉 | 4013a4e1b0 | Implement served_model_name to customize model id when use local mode… (#749) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-08-01 17:13:51 -07:00 |
| Liangsheng Yin | c020f9ceda | Support chunked prefill when radix cache is disabled (#811) | 2024-08-01 00:29:01 -07:00 |
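Several commits above introduce or rename server launch flags (`--attention-backend` in #1380, `--served-model-name` in #749, `--api-key` in #917) and make the API OpenAI-compatible. Below is a minimal sketch of how a client might call such a server, assuming it was launched locally on SGLang's default port 30000 with `--served-model-name my-model --api-key secret-key`; the model alias, key, and prompt are placeholders, and flag spellings are taken from the commit messages, so check `python -m sglang.launch_server --help` for your version.

```python
# Hedged sketch: query an SGLang server through its OpenAI-compatible
# /v1/completions endpoint. Assumes the server was started with
# --served-model-name my-model --api-key secret-key (placeholders).
import requests

resp = requests.post(
    "http://localhost:30000/v1/completions",
    # The API key set by --api-key (#917) is passed as a Bearer token.
    headers={"Authorization": "Bearer secret-key"},
    json={
        # "model" matches the alias set by --served-model-name (#749),
        # not the local checkpoint path.
        "model": "my-model",
        "prompt": "Hello",
        "max_tokens": 16,
    },
)
print(resp.json())
```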