Commit Graph

132 Commits

Author SHA1 Message Date
Lianmin Zheng
751e5ca273 [minor] clean up docs and eos id (#2622) 2024-12-27 11:23:46 -08:00
Yang Zheng
7a7ac6bea1 [FIX] Update EOS from config (#2475) 2024-12-27 10:59:56 -08:00
fzyzcjy
3169e66c23 Fix duplicated handling of GetWeightsByNameReqInput (#2565) 2024-12-26 06:49:32 -08:00
Lianmin Zheng
773951548d Fix logprob_start_len for multi modal models (#2597) 2024-12-26 06:27:45 -08:00
Co-authored-by: libra <lihu723@gmail.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com>
Adarsh Shirawalmath
acb340728c [Feature] Support new parameter - EBNF in xgrammar (#2526) 2024-12-26 05:12:41 -08:00
Liangsheng Yin
e7ebecf82e Fix cache hit rate when chunked prefill (#2555) 2024-12-26 03:14:28 -08:00
Lianmin Zheng
8496701934 [Misc] Fix metrics, weight update lock, request logging (#2543) 2024-12-22 06:27:22 -08:00
SangBin Cho
9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) 2024-12-11 12:51:50 -08:00
Lianmin Zheng
0ce091a82d [Minor] Improve code style (#2419) 2024-12-09 03:05:59 -08:00
Lianmin Zheng
a6ca736c8e Simplify stream_output (#2398) 2024-12-08 12:27:13 -08:00
Lianmin Zheng
cc858953a0 Fix recv_requests (#2405) 2024-12-08 04:08:04 -08:00
Lianmin Zheng
a2486eb58f Fix a bug with logprob streaming + chunked prefill (#2403) 2024-12-08 03:55:27 -08:00
SangBin Cho
1f09e84b9a nit: Remove busy waiting on scheduler (#2382) 2024-12-08 01:06:15 -08:00
Lianmin Zheng
0e7409adb6 Fix the overlap for xgrammar (#2377) 2024-12-06 05:49:29 -08:00
Qun Yang
37ee906f61 Add more support for intel Gaudi accelerators (#2357) 2024-12-06 01:16:33 -08:00
Yineng Zhang
85e1a6f3aa Update model_loader deps and qqq quantization deps (#2220) (#2318) 2024-12-02 23:22:13 +08:00
Co-authored-by: HandH1998 <1335248067@qq.com>
Lianmin Zheng
3c79ad35ca [Fix] Fix the padded hash value for image tokens (#2309) 2024-12-01 23:36:28 -08:00
Chayenne
983bfcf386 Online weight updates from torch.distributed (#2279) 2024-12-01 23:23:18 -08:00
Liangsheng Yin
5f12f0e7af Fix chunked prefill when ignore eos (#2290) 2024-12-01 00:37:53 -08:00
Chayenne
7d1485d376 Add get weights by parameter name for llama (#2266) 2024-11-29 23:36:38 -08:00
Chayenne
7d5d1d3d29 Update weights from disk (#2265) 2024-11-30 01:17:00 +00:00
Lianmin Zheng
94e167ea5a Fix the default chunked prefill size (#2268) 2024-11-29 16:03:32 -08:00
Lianmin Zheng
afe1e46586 [Minor] fix the style for multimodal models (#2257) 2024-11-29 04:24:20 -08:00
Lianmin Zheng
f50a6cf443 Fix hash collision for multi modal models (#2256) 2024-11-29 03:15:58 -08:00
Lianmin Zheng
fe97a2d40f Simplify tokenizer manager (#2254) 2024-11-29 02:18:51 -08:00
Lianmin Zheng
b2ccf36d4d Fix memory leak during abort (#2238) 2024-11-28 02:22:15 -08:00
Lianmin Zheng
d4fc1a70e3 Crash the server correctly during error (#2231) 2024-11-28 00:22:39 -08:00
Lianmin Zheng
fb915bd1a2 Disable overlap scheduler for multimodal models (#2235) 2024-11-27 23:44:33 -08:00
Lianmin Zheng
2a02185c5f Rename DP_RANK to SGLANG_DP_RANK (#2218) 2024-11-27 09:36:36 -08:00
Lianmin Zheng
fb6e04a0c2 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2222) 2024-11-27 02:52:46 -08:00
Lianmin Zheng
6997e28f6e Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" (#2221) 2024-11-27 02:02:01 -08:00
Lianmin Zheng
a0e58740a8 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2217) 2024-11-27 01:13:41 -08:00
Ying Sheng
37c8a5761f [feat] Support session control for vision language models (#2210) 2024-11-27 00:03:29 -08:00
Lianmin Zheng
1605ae121e [CI] Minor fix for CI (#2187) 2024-11-25 16:38:43 -08:00
Rin Intachuen
1aea19f64b Input_embeds support (#2052) 2024-11-25 16:35:04 -08:00
HAI
10189d08dd [Performance]: Process affinity to CPU cores with multiple sockets support (#2171) 2024-11-25 14:57:32 -08:00
Ying Sheng
e1e595d702 [feat] Refactor session control interface and add CI (#2173) 2024-11-25 12:32:51 -08:00
Lianmin Zheng
8e1adb8441 Allow overwrite flashinfer use_tensorcore (#2169) 2024-11-24 20:58:17 -08:00
Lianmin Zheng
731146f6cb Fix mixed chunked prefill in overlap mode (#2158) 2024-11-24 07:17:37 -08:00
Lianmin Zheng
5652c56535 Update CI threshold & Improve code style (#2159) 2024-11-24 06:29:38 -08:00
Lianmin Zheng
c211e7b669 Simplify batch update (#2154) 2024-11-24 04:47:10 -08:00
Byron Hsu
52f58fc42a fix dp_rank env (#2144) 2024-11-23 11:46:21 -08:00
Lianmin Zheng
751c3a037c Fix dp print message (#2138) 2024-11-23 01:22:26 -08:00
Lianmin Zheng
66d4859acf Revert "Only stream output on tp rank 0" (#2130) 2024-11-22 15:46:16 -08:00
Lianmin Zheng
e1b63624d7 Only stream output on tp rank 0 (#2124) 2024-11-22 15:13:44 -08:00
Henry Hyeonmok Ko
c35cd1f8c7 Expose max total num tokens from Runtime & Engine API (#2092) 2024-11-22 15:10:10 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Jake Poznanski
8048c28c11 Fix #2037 - Context length check does not take into account pad tokens for visual models (#2106) 2024-11-21 19:05:41 -08:00
Byron Hsu
30af7dfb34 [router] add base_gpu_id server args & merged radix tree python reference (#2115) 2024-11-21 17:13:33 -08:00
Lianmin Zheng
722530fa01 Enable overlap scheduler by default for the triton attention backend (#2105) 2024-11-20 02:58:35 -08:00