432 Commits

Author SHA1 Message Date
Lianmin Zheng
bc1534ff32 Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134)
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-06 06:13:59 -08:00
Lianmin Zheng
fcc2e37f69 Split the __init__ of scheduler as smaller functions. Improve the eagle tests (#4128) 2025-03-06 00:13:20 -08:00
Lianmin Zheng
286e6540a6 Remove prefill-only-one-req (#4117) 2025-03-05 20:58:48 -08:00
Jhin
70b3c6eeb1 Add update_weights_from_disk endpoint to Engine (#4102)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-05 12:25:18 -08:00
Ying Sheng
d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
2025-03-05 08:06:07 -08:00
Qubitium-ModelCloud
56a724eba3 [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790)
Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
2025-03-05 01:11:00 -08:00
Mick
583d6af71b example: add vlm to token in & out example (#3941)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-03-04 22:18:26 -08:00
Lianmin Zheng
77a3954bf7 Simplify eagle tests and TP sync in grammar backend (#4066) 2025-03-04 13:40:40 -08:00
Ke Bao
03b0364f76 Update nextn ci test (#4071) 2025-03-04 13:01:24 -08:00
William
0d4e3228cf [Feature] Add test for speculative_token_map (#4016) 2025-03-04 04:26:24 -08:00
Xihuai Wang
95575aa76a Reasoning parser (#4000)
Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
2025-03-03 21:16:36 -08:00
Lianmin Zheng
935cda944b Misc clean up; Remove the support of jump forward (#4032) 2025-03-03 07:02:14 -08:00
Lianmin Zheng
ac2387279e Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
2025-03-03 00:12:04 -08:00
Ke Bao
d3fe9bae56 Add accuracy test for TP torch compile (#3994) 2025-03-02 13:18:18 -08:00
Qiaolin Yu
40782f05d7 Refactor: Move return_hidden_states to the generate input (#3985)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
2025-03-01 17:51:29 -08:00
Baizhou Zhang
90a4b7d98a [Feature]Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-28 18:13:56 -08:00
fzyzcjy
e3e0bc50a9 [Feature] SPMD for SGLang + Verl (#3852) 2025-02-28 09:53:10 -08:00
Chang Su
eec3f6d1eb [Bugfix] Fix tokenizer_manager not getting 400 when req is too long (#3678)
Co-authored-by: voidxb <unkown>
2025-02-27 22:59:43 -08:00
KCFindstr
bc20e93f2d [feat] Add Vertex AI compatible prediction route for /generate (#3866) 2025-02-27 19:42:15 -08:00
Qiaolin Yu
d6898dd253 Add return hidden state in the native API (#3897)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-26 22:06:54 -08:00
JC1DA
7551498a69 [Feature] Support llguidance for constrained decoding (#3298) 2025-02-26 10:41:49 -08:00
Chaitanya Sri Krishna Lolla
6ce9dbe828 [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 (#3237)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-24 18:14:31 -08:00
Lianmin Zheng
d7934cde45 Fix CI and install docs (#3821) 2025-02-24 16:17:38 -08:00
laixin
1a6e97577a Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730)
Co-authored-by: HandH1998 <1335248067@qq.com>
2025-02-24 05:43:35 -08:00
Lianmin Zheng
27a46317b6 Fix dependency (#3813) 2025-02-24 03:50:58 -08:00
Zhiyu
c66b2c9cf1 Add support for nvidia modelopt fp8 kv cache (#3223) 2025-02-22 07:04:58 +08:00
Andrew Smith
1df6eabd5d feat: Add SageMaker support (#3740) 2025-02-21 19:31:09 +08:00
aoshen524
e79f7420be [Fix] Fix bugs and refactor codes in lora for better scalability. (#3652)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2025-02-20 11:51:57 -08:00
Yineng Zhang
e782eb7e6a chore: bump v0.4.3.post1 (#3638) 2025-02-17 21:58:19 +08:00
Yineng Zhang
e319153be8 update unit test (#3636) 2025-02-17 21:06:10 +08:00
Yineng Zhang
32b44d2fca add mtp unit test (#3634) 2025-02-17 19:04:07 +08:00
Mick
bcc213df61 Model: Support Qwen 2.5 vl (#3258) 2025-02-16 00:58:53 -08:00
Yineng Zhang
6718b10996 fix eagle unit test (#3591) 2025-02-15 23:10:48 +08:00
Yineng Zhang
e0b9a423c8 chore: bump v0.4.3 (#3556) 2025-02-14 09:43:14 +08:00
Yineng Zhang
70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) 2025-02-14 08:50:14 +08:00
Ke Bao
7e6d5fc694 Support Eagle cuda graph for Triton backend (#3500) 2025-02-12 02:27:45 +08:00
Jackmin801
5f0e7de339 [Feat] Return hidden states (experimental) (#3364)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-10 15:54:37 -08:00
Ke Bao
2d61132374 Support Eagle2 for Triton backend (#3466) 2025-02-10 20:00:42 +08:00
Baizhou Zhang
c45cab1c00 [Fix] Fix accuracy bug and refactor codes for lora (#3413) 2025-02-10 13:29:00 +08:00
Yineng Zhang
60abdb3e7c minor: cleanup test_eagle_infer (#3415) 2025-02-09 09:34:30 +08:00
Ying Sheng
7b4e61fff3 [Fix] Fix eagle with disable cuda graph (#3411) 2025-02-09 08:40:00 +08:00
Yineng Zhang
6222e1c228 add disable cuda graph unit test for eagle 2 (#3412) 2025-02-09 08:02:56 +08:00
Yineng Zhang
2b1808cec4 update unit test in AMD CI (#3366) 2025-02-07 17:25:16 +08:00
Ke Bao
a322051e31 Support custom mask for Triton attention (#3317) 2025-02-06 01:16:02 +08:00
Ke Bao
de5533341e Update Triton extend backend interface (#3309) 2025-02-05 18:12:22 +08:00
Ke Bao
a07364ccc5 Update Triton decode backend interface (#3292) 2025-02-04 23:26:04 +08:00
Yineng Zhang
d39899e85c upgrade flashinfer v0.2.0.post2 (#3288)
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-04 21:41:40 +08:00
Baizhou Zhang
70817a7eae [Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-03 22:09:13 -08:00
Xiaoyu Zhang
3c8ac78dc1 optimize test_fused_moe style (#3268) 2025-02-03 18:56:18 +08:00
Ke Bao
5317902670 Add test for fp8 torch compile (#3246) 2025-02-01 16:07:54 +08:00