Lianmin Zheng
|
bc1534ff32
|
Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134)
Co-authored-by: Sehoon Kim <kssteven418@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Sehoon Kim <sehoon@x.ai>
|
2025-03-06 06:13:59 -08:00 |
|
Lianmin Zheng
|
fcc2e37f69
|
Split the __init__ of scheduler as smaller functions. Improve the eagle tests (#4128)
|
2025-03-06 00:13:20 -08:00 |
|
Lianmin Zheng
|
286e6540a6
|
Remove prefill-only-one-req (#4117)
|
2025-03-05 20:58:48 -08:00 |
|
Jhin
|
70b3c6eeb1
|
Add update_weights_from_disk endpoint to Engine (#4102)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
|
2025-03-05 12:25:18 -08:00 |
|
Ying Sheng
|
d3d4d76758
|
[Eagle] Refactor eagle speculative decoding (#3986)
Co-authored-by: Ke Bao <ISPObaoke@163.com>
|
2025-03-05 08:06:07 -08:00 |
|
Qubitium-ModelCloud
|
56a724eba3
|
[QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790)
Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
|
2025-03-05 01:11:00 -08:00 |
|
Mick
|
583d6af71b
|
example: add vlm to token in & out example (#3941)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
|
2025-03-04 22:18:26 -08:00 |
|
Lianmin Zheng
|
77a3954bf7
|
Simplify eagle tests and TP sync in grammar backend (#4066)
|
2025-03-04 13:40:40 -08:00 |
|
Ke Bao
|
03b0364f76
|
Update nextn ci test (#4071)
|
2025-03-04 13:01:24 -08:00 |
|
William
|
0d4e3228cf
|
[Feature] Add test for speculative_token_map (#4016)
|
2025-03-04 04:26:24 -08:00 |
|
Xihuai Wang
|
95575aa76a
|
Reasoning parser (#4000)
Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
|
2025-03-03 21:16:36 -08:00 |
|
Lianmin Zheng
|
935cda944b
|
Misc clean up; Remove the support of jump forward (#4032)
|
2025-03-03 07:02:14 -08:00 |
|
Lianmin Zheng
|
ac2387279e
|
Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
|
2025-03-03 00:12:04 -08:00 |
|
Ke Bao
|
d3fe9bae56
|
Add accuracy test for TP torch compile (#3994)
|
2025-03-02 13:18:18 -08:00 |
|
Qiaolin Yu
|
40782f05d7
|
Refactor: Move return_hidden_states to the generate input (#3985)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
|
2025-03-01 17:51:29 -08:00 |
|
Baizhou Zhang
|
90a4b7d98a
|
[Feature] Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
|
2025-02-28 18:13:56 -08:00 |
|
fzyzcjy
|
e3e0bc50a9
|
[Feature] SPMD for SGLang + Verl (#3852)
|
2025-02-28 09:53:10 -08:00 |
|
Chang Su
|
eec3f6d1eb
|
[Bugfix] Fix tokenizer_manager not getting 400 when req is too long (#3678)
Co-authored-by: voidxb <unkown>
|
2025-02-27 22:59:43 -08:00 |
|
KCFindstr
|
bc20e93f2d
|
[feat] Add Vertex AI compatible prediction route for /generate (#3866)
|
2025-02-27 19:42:15 -08:00 |
|
Qiaolin Yu
|
d6898dd253
|
Add return hidden state in the native API (#3897)
Co-authored-by: Beichen-Ma <mabeichen12@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
|
2025-02-26 22:06:54 -08:00 |
|
JC1DA
|
7551498a69
|
[Feature] Support llguidance for constrained decoding (#3298)
|
2025-02-26 10:41:49 -08:00 |
|
Chaitanya Sri Krishna Lolla
|
6ce9dbe828
|
[ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 (#3237)
Co-authored-by: HAI <hixiao@gmail.com>
|
2025-02-24 18:14:31 -08:00 |
|
Lianmin Zheng
|
d7934cde45
|
Fix CI and install docs (#3821)
|
2025-02-24 16:17:38 -08:00 |
|
laixin
|
1a6e97577a
|
Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730)
Co-authored-by: HandH1998 <1335248067@qq.com>
|
2025-02-24 05:43:35 -08:00 |
|
Lianmin Zheng
|
27a46317b6
|
Fix dependency (#3813)
|
2025-02-24 03:50:58 -08:00 |
|
Zhiyu
|
c66b2c9cf1
|
Add support for nvidia modelopt fp8 kv cache (#3223)
|
2025-02-22 07:04:58 +08:00 |
|
Andrew Smith
|
1df6eabd5d
|
feat: Add SageMaker support (#3740)
|
2025-02-21 19:31:09 +08:00 |
|
aoshen524
|
e79f7420be
|
[Fix] Fix bugs and refactor codes in lora for better scalability. (#3652)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
|
2025-02-20 11:51:57 -08:00 |
|
Yineng Zhang
|
e782eb7e6a
|
chore: bump v0.4.3.post1 (#3638)
|
2025-02-17 21:58:19 +08:00 |
|
Yineng Zhang
|
e319153be8
|
update unit test (#3636)
|
2025-02-17 21:06:10 +08:00 |
|
Yineng Zhang
|
32b44d2fca
|
add mtp unit test (#3634)
|
2025-02-17 19:04:07 +08:00 |
|
Mick
|
bcc213df61
|
Model: Support Qwen 2.5 vl (#3258)
|
2025-02-16 00:58:53 -08:00 |
|
Yineng Zhang
|
6718b10996
|
fix eagle unit test (#3591)
|
2025-02-15 23:10:48 +08:00 |
|
Yineng Zhang
|
e0b9a423c8
|
chore: bump v0.4.3 (#3556)
|
2025-02-14 09:43:14 +08:00 |
|
Yineng Zhang
|
70f894b810
|
feat: support flashinfer mla attention for deepseek v3 (#3550)
|
2025-02-14 08:50:14 +08:00 |
|
Ke Bao
|
7e6d5fc694
|
Support Eagle cuda graph for Triton backend (#3500)
|
2025-02-12 02:27:45 +08:00 |
|
Jackmin801
|
5f0e7de339
|
[Feat] Return hidden states (experimental) (#3364)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
|
2025-02-10 15:54:37 -08:00 |
|
Ke Bao
|
2d61132374
|
Support Eagle2 for Triton backend (#3466)
|
2025-02-10 20:00:42 +08:00 |
|
Baizhou Zhang
|
c45cab1c00
|
[Fix] Fix accuracy bug and refactor codes for lora (#3413)
|
2025-02-10 13:29:00 +08:00 |
|
Yineng Zhang
|
60abdb3e7c
|
minor: cleanup test_eagle_infer (#3415)
|
2025-02-09 09:34:30 +08:00 |
|
Ying Sheng
|
7b4e61fff3
|
[Fix] Fix eagle with disable cuda graph (#3411)
|
2025-02-09 08:40:00 +08:00 |
|
Yineng Zhang
|
6222e1c228
|
add disable cuda graph unit test for eagle 2 (#3412)
|
2025-02-09 08:02:56 +08:00 |
|
Yineng Zhang
|
2b1808cec4
|
update unit test in AMD CI (#3366)
|
2025-02-07 17:25:16 +08:00 |
|
Ke Bao
|
a322051e31
|
Support custom mask for Triton attention (#3317)
|
2025-02-06 01:16:02 +08:00 |
|
Ke Bao
|
de5533341e
|
Update Triton extend backend interface (#3309)
|
2025-02-05 18:12:22 +08:00 |
|
Ke Bao
|
a07364ccc5
|
Update Triton decode backend interface (#3292)
|
2025-02-04 23:26:04 +08:00 |
|
Yineng Zhang
|
d39899e85c
|
upgrade flashinfer v0.2.0.post2 (#3288)
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
|
2025-02-04 21:41:40 +08:00 |
|
Baizhou Zhang
|
70817a7eae
|
[Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
|
2025-02-03 22:09:13 -08:00 |
|
Xiaoyu Zhang
|
3c8ac78dc1
|
optimize test_fused_moe style (#3268)
|
2025-02-03 18:56:18 +08:00 |
|
Ke Bao
|
5317902670
|
Add test for fp8 torch compile (#3246)
|
2025-02-01 16:07:54 +08:00 |
|