Commit Graph

119 Commits

Author SHA1 Message Date
Lianmin Zheng
0e7409adb6 Fix the overlap for xgrammar (#2377) 2024-12-06 05:49:29 -08:00
Qun Yang
37ee906f61 Add more support for intel Gaudi accelerators (#2357) 2024-12-06 01:16:33 -08:00
Yineng Zhang
85e1a6f3aa Update model_loader deps and qqq quantization deps (#2220) (#2318) 2024-12-02 23:22:13 +08:00
Co-authored-by: HandH1998 <1335248067@qq.com>
Lianmin Zheng
3c79ad35ca [Fix] Fix the padded hash value for image tokens (#2309) 2024-12-01 23:36:28 -08:00
Chayenne
983bfcf386 Online weight updates from torch.distributed (#2279) 2024-12-01 23:23:18 -08:00
Liangsheng Yin
5f12f0e7af Fix chunked prefill when ignore eos (#2290) 2024-12-01 00:37:53 -08:00
Chayenne
7d1485d376 Add get weights by parameter name for llama (#2266) 2024-11-29 23:36:38 -08:00
Chayenne
7d5d1d3d29 udate weights from disk (#2265) 2024-11-30 01:17:00 +00:00
Lianmin Zheng
94e167ea5a Fix the default chunked prefill size (#2268) 2024-11-29 16:03:32 -08:00
Lianmin Zheng
afe1e46586 [Minor] fix the style for multimodal models (#2257) 2024-11-29 04:24:20 -08:00
Lianmin Zheng
f50a6cf443 Fix hash collision for multi modal models (#2256) 2024-11-29 03:15:58 -08:00
Lianmin Zheng
fe97a2d40f Simplify tokenizer manager (#2254) 2024-11-29 02:18:51 -08:00
Lianmin Zheng
b2ccf36d4d Fix memory leak during abort (#2238) 2024-11-28 02:22:15 -08:00
Lianmin Zheng
d4fc1a70e3 Crash the server correctly during error (#2231) 2024-11-28 00:22:39 -08:00
Lianmin Zheng
fb915bd1a2 Disable overlap scheduler for multimodal models (#2235) 2024-11-27 23:44:33 -08:00
Lianmin Zheng
2a02185c5f Rename DP_RANK to SGLANG_DP_RANK (#2218) 2024-11-27 09:36:36 -08:00
Lianmin Zheng
fb6e04a0c2 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2222) 2024-11-27 02:52:46 -08:00
Lianmin Zheng
6997e28f6e Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" (#2221) 2024-11-27 02:02:01 -08:00
Lianmin Zheng
a0e58740a8 Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default (#2217) 2024-11-27 01:13:41 -08:00
Ying Sheng
37c8a5761f [feat] Support session control for vision language models (#2210) 2024-11-27 00:03:29 -08:00
Lianmin Zheng
1605ae121e [CI] Minor fix for CI (#2187) 2024-11-25 16:38:43 -08:00
Rin Intachuen
1aea19f64b Input_embeds support (#2052) 2024-11-25 16:35:04 -08:00
HAI
10189d08dd [Performance]: Process affinity to CPU cores with multiple sockets support (#2171) 2024-11-25 14:57:32 -08:00
Ying Sheng
e1e595d702 [feat] Refactor session control interface and add CI (#2173) 2024-11-25 12:32:51 -08:00
Lianmin Zheng
8e1adb8441 Allow overwrite flashinfer use_tensorcore (#2169) 2024-11-24 20:58:17 -08:00
Lianmin Zheng
731146f6cb Fix mixed chunked prefill in overlap mode (#2158) 2024-11-24 07:17:37 -08:00
Lianmin Zheng
5652c56535 Update CI threshold & Improve code style (#2159) 2024-11-24 06:29:38 -08:00
Lianmin Zheng
c211e7b669 Simplify batch update (#2154) 2024-11-24 04:47:10 -08:00
Byron Hsu
52f58fc42a fix dp_rank env (#2144) 2024-11-23 11:46:21 -08:00
Lianmin Zheng
751c3a037c Fix dp print message (#2138) 2024-11-23 01:22:26 -08:00
Lianmin Zheng
66d4859acf Revert "Only stream output on tp rank 0" (#2130) 2024-11-22 15:46:16 -08:00
Lianmin Zheng
e1b63624d7 Only stream output on tp rank 0 (#2124) 2024-11-22 15:13:44 -08:00
Henry Hyeonmok Ko
c35cd1f8c7 Expose max total num tokens from Runtime & Engine API (#2092) 2024-11-22 15:10:10 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Jake Poznanski
8048c28c11 Fix #2037 - Context length check does not take into out pad tokens for visual models (#2106) 2024-11-21 19:05:41 -08:00
Byron Hsu
30af7dfb34 [router] add base_gpu_id server args & merged radix tree python reference (#2115) 2024-11-21 17:13:33 -08:00
Lianmin Zheng
722530fa01 Enable overlap scheduler by default for the triton attention backend (#2105) 2024-11-20 02:58:35 -08:00
Ying Sheng
5942dfc00a [feat] Add session control (#2073) 2024-11-20 00:36:53 -08:00
Lianmin Zheng
7d671e4ad2 Enable overlap by default (#2067) 2024-11-19 22:07:58 -08:00
Lianmin Zheng
ffd20fcd03 Make constrained decoding work for overlap scheduler (#2095) 2024-11-19 15:04:43 -08:00
Lianmin Zheng
b7a065eae3 Use cuda event wait and synchronization instead of busy waiting (#2089) 2024-11-19 00:21:46 -08:00
Lianmin Zheng
b110453802 Simplify logits penalizer (#2086) 2024-11-18 17:48:28 -08:00
Lianmin Zheng
df7fe4521a Crash the CI jobs on model import errors (#2072) 2024-11-17 22:18:11 -08:00
Lianmin Zheng
116685337e Fix cuda illegal memory access in overlap mode (#2070) 2024-11-17 21:29:30 -08:00
Lianmin Zheng
a9e90b4bce [Minor] Fix styles for overlap mode (#2068) 2024-11-17 19:49:20 -08:00
Ke Bao
62832bb272 Support cuda graph for DP attention (#2061) 2024-11-17 16:29:20 -08:00
Lianmin Zheng
38625e2139 Remove monkey_patch_vllm_dummy_weight_loader (#2064) 2024-11-17 15:48:12 -08:00
Ke Bao
976bc302e5 Support DP MLA (#1970) 2024-11-16 09:01:43 +00:00
Lianmin Zheng
2558d6a675 Fix the default arguments of bench_offline_throughput.py & simplify detokenizer manager (#2042) 2024-11-15 05:02:44 -08:00
Lianmin Zheng
54479d6f30 Fix grammar backend for tensor parallelism (#2020) 2024-11-13 01:49:45 -08:00