401 Commits

Author SHA1 Message Date
Mick
bcc213df61 Model: Support Qwen 2.5 vl (#3258) 2025-02-16 00:58:53 -08:00
Yineng Zhang
6718b10996 fix eagle unit test (#3591) 2025-02-15 23:10:48 +08:00
Yineng Zhang
e0b9a423c8 chore: bump v0.4.3 (#3556) 2025-02-14 09:43:14 +08:00
Yineng Zhang
70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) 2025-02-14 08:50:14 +08:00
Ke Bao
7e6d5fc694 Support Eagle cuda graph for Triton backend (#3500) 2025-02-12 02:27:45 +08:00
Jackmin801
5f0e7de339 [Feat] Return hidden states (experimental) (#3364)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-10 15:54:37 -08:00
Ke Bao
2d61132374 Support Eagle2 for Triton backend (#3466) 2025-02-10 20:00:42 +08:00
Baizhou Zhang
c45cab1c00 [Fix] Fix accuracy bug and refactor codes for lora (#3413) 2025-02-10 13:29:00 +08:00
Yineng Zhang
60abdb3e7c minor: cleanup test_eagle_infer (#3415) 2025-02-09 09:34:30 +08:00
Ying Sheng
7b4e61fff3 [Fix] Fix eagle with disable cuda graph (#3411) 2025-02-09 08:40:00 +08:00
Yineng Zhang
6222e1c228 add disable cuda graph unit test for eagle 2 (#3412) 2025-02-09 08:02:56 +08:00
Yineng Zhang
2b1808cec4 update unit test in AMD CI (#3366) 2025-02-07 17:25:16 +08:00
Ke Bao
a322051e31 Support custom mask for Triton attention (#3317) 2025-02-06 01:16:02 +08:00
Ke Bao
de5533341e Update Triton extend backend interface (#3309) 2025-02-05 18:12:22 +08:00
Ke Bao
a07364ccc5 Update Triton decode backend interface (#3292) 2025-02-04 23:26:04 +08:00
Yineng Zhang
d39899e85c upgrade flashinfer v0.2.0.post2 (#3288)
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-04 21:41:40 +08:00
Baizhou Zhang
70817a7eae [Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-03 22:09:13 -08:00
Xiaoyu Zhang
3c8ac78dc1 optimize test_fused_moe style (#3268) 2025-02-03 18:56:18 +08:00
Ke Bao
5317902670 Add test for fp8 torch compile (#3246) 2025-02-01 16:07:54 +08:00
Byron Hsu
734daedd8f [fix] Clamp logprob with dtype min to prevent -inf (#3224) 2025-01-31 17:04:04 +08:00
Byron Hsu
20453cef62 [test] Lower number of top logprobs to get rid of -inf (#3212) 2025-01-30 18:01:23 +08:00
Mick
9f635ea50d [Fix] Address remaining issues of supporting MiniCPMV (#2977) 2025-01-28 00:22:13 -08:00
Byron Hsu
988d0a4bfc [kernel] Use sgl_kernel rope (#3169)
Co-authored-by: zhyncs <me@zhyncs.com>
2025-01-28 14:33:11 +08:00
Byron Hsu
27aeb4b7d8 [test] deduplicate test_session_control (#3183) 2025-01-28 13:17:06 +08:00
Lianmin Zheng
f8ca66fb49 Update thresholds in test_nightly_gsm8k_eval.py (#3176) 2025-01-27 03:02:09 -08:00
Lianmin Zheng
52c03f16b9 Add activation parameters to fused_moe (#3170) 2025-01-27 00:23:37 -08:00
yizhang2077
1e3e521544 add unit test for block wise fp8 (#3156) 2025-01-27 15:32:04 +08:00
Lianmin Zheng
af02f99b7c Add more logprob tests (#3162) 2025-01-26 22:24:55 -08:00
YAMY
b045841bae Feature/function calling update (#2700)
Co-authored-by: Mingyuan Ma <mamingyuan2001@berkeley.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
2025-01-26 09:57:51 -08:00
Lianmin Zheng
f4a92f4b56 Temporarily skip the openai frontend tests (#3151) 2025-01-26 04:17:35 -08:00
Lianmin Zheng
d1a0863251 Add a test case for cached_tokens (#3145) 2025-01-26 01:39:28 -08:00
Lianmin Zheng
da6f8081f6 Fix CI tests (#3132) 2025-01-25 17:43:39 -08:00
Lianmin Zheng
a4331cd260 Add accuracy and latency tests of eagle into CI (#3027) 2025-01-21 02:55:14 -08:00
Lianmin Zheng
287d07a669 Misc fixes for eagle (flush_cache, CPU overhead) (#3014) 2025-01-20 20:27:38 -08:00
Hongpeng Guo
583697cd71 [Enhancement] Custom Logit Processor Improvement (#2998)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
2025-01-20 02:00:35 -08:00
Lianmin Zheng
51e87f6f21 Skip flaky custom_logit_processor tests (#3004) 2025-01-20 00:28:47 -08:00
Lianmin Zheng
03464890e0 Separate two entry points: Engine and HTTP server (#2996)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-01-19 22:09:24 -08:00
Lianmin Zheng
cd493b5afc Improve metrics, logging, and importing orders (#2992) 2025-01-19 18:36:59 -08:00
Lianmin Zheng
61f42b5732 Move sgl.Runtime under sglang/lang (#2990) 2025-01-19 17:10:29 -08:00
Hongpeng Guo
e403d23757 [Feature] Add sampler custom logits processor (#2396)
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
2025-01-19 14:46:53 -08:00
Enrique Shockwave
3bcf5ecea7 support regex in xgrammar backend (#2983) 2025-01-20 04:34:41 +08:00
Chang Su
4d4cdb3fe7 Frontend: better error message handling for FINISH_ABORT in scheduler.py (#2956) 2025-01-18 19:37:30 -08:00
Mick
3d93f84a00 [Feature] Support minicpmv v2.6 (#2785)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-01-18 14:14:19 -08:00
bjmsong
d3024f4fc8 support e4m3 kvcache in qwen2 & add kv scaling facotr json (#2894)
Co-authored-by: bjmsong <bjmsong@126.com>
2025-01-18 11:43:22 +08:00
Ke Bao
d47c5101f1 Add ut for qwen model (#2947) 2025-01-18 00:03:54 +08:00
Chang Su
a8ccacc8b8 [Frontend] Fix request length check and add option to disallow auto truncation in scheduler (#2876) 2025-01-16 14:51:19 -08:00
Lianmin Zheng
8f2c522aba Improve benchmark scripts and error message printing (#2922) 2025-01-16 06:24:31 -08:00
yizhang2077
767c9dec03 adapt custom allreduce for tensorrt llm (#2511) 2025-01-16 04:57:35 +08:00
Ke Bao
bfbda62c8b Add ut for w8a8 int8 quantization (#2897) 2025-01-15 18:29:14 +08:00
fzyzcjy
923f518337 CUDA-graph-compatible releasing and resuming KV cache and model weight memory (#2630) 2025-01-13 11:38:51 -08:00