Commit Graph

160 Commits

Author SHA1 Message Date
Lianmin Zheng
cd493b5afc Improve metrics, logging, and importing orders (#2992) 2025-01-19 18:36:59 -08:00
Lianmin Zheng
61f42b5732 Move sgl.Runtime under sglang/lang (#2990) 2025-01-19 17:10:29 -08:00
Hongpeng Guo
e403d23757 [Feature] Add sampler custom logits processor (#2396)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
2025-01-19 14:46:53 -08:00
giorgiopiatti-dfinity
8b6a4486ec fix missing revision arg when loading tokenizer (#2982) 2025-01-19 11:36:07 -08:00
fzyzcjy
81d27c8e31 Refactor to add TypeBasedDispatcher to simplify dispatching (#2958) 2025-01-18 20:13:27 -08:00
Chang Su
4d4cdb3fe7 Frontend: better error message handling for FINISH_ABORT in scheduler.py (#2956) 2025-01-18 19:37:30 -08:00
Mick
3d93f84a00 [Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-01-18 14:14:19 -08:00
Zhiqiang Xie
8af7048dcf Query remaining memory dynamically for PrefillAdder (#2941) 2025-01-17 20:20:26 -08:00
Chunyuan WU
63051738a9 Enable CPU device on SGLang (#2806) 2025-01-16 21:22:53 -08:00
Chang Su
a8ccacc8b8 [Frontend] Fix request length check and add option to disallow auto truncation in scheduler (#2876) 2025-01-16 14:51:19 -08:00
Lianmin Zheng
0427416b59 Fix zmq binding (#2930)
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
2025-01-16 14:36:07 -08:00
Lianmin Zheng
bc6915e3b9 Improve type annotation and styles (#2926) 2025-01-16 12:51:11 -08:00
Lianmin Zheng
8b6ce52e92 Support multi-node DP attention (#2925)
Co-authored-by: dhou-xai <dhou@x.ai>
2025-01-16 11:15:00 -08:00
Lianmin Zheng
8f2c522aba Improve benchmark scripts and error message printing (#2922) 2025-01-16 06:24:31 -08:00
Lianmin Zheng
f65c13b559 Remove normalized_prompt_logprobs from the engine to make code easier to maintain (#2902) 2025-01-15 04:54:14 -08:00
Lianmin Zheng
46d4431889 Add a new api configure_logging to allow dumping the requests (#2875) 2025-01-13 14:24:00 -08:00
fzyzcjy
923f518337 CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) 2025-01-13 11:38:51 -08:00
Lianmin Zheng
72c7776355 Fix linear.py and improve weight loading (#2851)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-13 01:39:14 -08:00
Shi Shuai
c4f9707e16 Improve: Token-In Token-Out Usage for RLHF (#2843) 2025-01-11 15:14:26 -08:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00
Lianmin Zheng
b8574f6953 Clean up eagle code (#2756) 2025-01-06 14:54:18 -08:00
Lianmin Zheng
0f9cc6d8d3 Fix package loss for small models (#2717)
Co-authored-by: sdli1995 <mmlmonkey@163.com>
2025-01-02 18:25:26 -08:00
Lianmin Zheng
ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 02:09:08 -08:00
Lianmin Zheng
9c6ba2484f Refactor logprob computation to return the real logprob used in sampling (#2664) 2024-12-30 04:51:38 -08:00
Shi Shuai
35bdb48557 [Feature] Get Token IDs with Engine.generate() (#2636)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2024-12-29 12:28:27 -08:00
Ying Sheng
e0e09fceeb [Session] Update session control interface (#2635) 2024-12-29 02:10:27 -08:00
Lianmin Zheng
3815b23ccb Clean up wrapper in flashinfer backend (#2638) 2024-12-29 00:45:57 -08:00
fzyzcjy
fd28640dc5 Add update_weights_from_tensor (#2631) 2024-12-28 13:30:27 -08:00
Lianmin Zheng
751e5ca273 [minor] clean up docs and eos id (#2622) 2024-12-27 11:23:46 -08:00
Yang Zheng
7a7ac6bea1 [FIX] Update EOS from config (#2475) 2024-12-27 10:59:56 -08:00
fzyzcjy
3169e66c23 Fix duplicated handling of GetWeightsByNameReqInput (#2565) 2024-12-26 06:49:32 -08:00
Lianmin Zheng
773951548d Fix logprob_start_len for multi modal models (#2597)
Co-authored-by: libra <lihu723@gmail.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com>
2024-12-26 06:27:45 -08:00
Adarsh Shirawalmath
acb340728c [Feature] Support new parameter - EBNF in xgrammar (#2526) 2024-12-26 05:12:41 -08:00
Liangsheng Yin
e7ebecf82e Fix cache hit rate when chunked prefill (#2555) 2024-12-26 03:14:28 -08:00
Lianmin Zheng
8496701934 [Misc] Fix metrics, weight update lock, request logging (#2543) 2024-12-22 06:27:22 -08:00
SangBin Cho
9208618b3e [Core] in batch prefix caching by delay scheduling (#2442) 2024-12-11 12:51:50 -08:00
Lianmin Zheng
0ce091a82d [Minor] Improve code style (#2419) 2024-12-09 03:05:59 -08:00
Lianmin Zheng
a6ca736c8e Simplify stream_output (#2398) 2024-12-08 12:27:13 -08:00
Lianmin Zheng
cc858953a0 Fix recv_requests (#2405) 2024-12-08 04:08:04 -08:00
Lianmin Zheng
a2486eb58f Fix a bug with logprob streaming + chunked prefill (#2403) 2024-12-08 03:55:27 -08:00
SangBin Cho
1f09e84b9a nit: Remove busy waiting on scheduler (#2382) 2024-12-08 01:06:15 -08:00
Lianmin Zheng
0e7409adb6 Fix the overlap for xgrammar (#2377) 2024-12-06 05:49:29 -08:00
Qun Yang
37ee906f61 Add more support for intel Gaudi accelerators (#2357) 2024-12-06 01:16:33 -08:00
Yineng Zhang
85e1a6f3aa Update model_loader deps and qqq quantization deps (#2220) (#2318)
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-02 23:22:13 +08:00
Lianmin Zheng
3c79ad35ca [Fix] Fix the padded hash value for image tokens (#2309) 2024-12-01 23:36:28 -08:00
Chayenne
983bfcf386 Online weight updates from torch.distributed (#2279) 2024-12-01 23:23:18 -08:00
Liangsheng Yin
5f12f0e7af Fix chunked prefill when ignore eos (#2290) 2024-12-01 00:37:53 -08:00
Chayenne
7d1485d376 Add get weights by parameter name for llama (#2266) 2024-11-29 23:36:38 -08:00
Chayenne
7d5d1d3d29 update weights from disk (#2265) 2024-11-30 01:17:00 +00:00
Lianmin Zheng
94e167ea5a Fix the default chunked prefill size (#2268) 2024-11-29 16:03:32 -08:00