Scott Lee
|
b6fb5d7666
|
Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441)
|
2025-10-13 11:24:27 -07:00 |
|
Liangsheng Yin
|
bfadb5ea5f
|
Adjust overlap event loop (#11507)
|
2025-10-14 00:33:19 +08:00 |
|
Liangsheng Yin
|
516738b096
|
Deprecate global_server_args_dict (#11528)
|
2025-10-13 19:34:43 +08:00 |
|
Yi Zhang
|
a55cf5304a
|
[Feature] Support mamba radix cache v0 (#11214)
Co-authored-by: hanming-lu <hanming@x.ai>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: thalahors <ericalcaide1@gmail.com>
|
2025-10-12 20:57:15 -07:00 |
|
Cheng Wan
|
1bdd010291
|
Revert "Deprecate global_server_args_dict" (#11520)
|
2025-10-12 17:40:40 -07:00 |
|
Liangsheng Yin
|
1083e7e3df
|
Deprecate global_server_args_dict (#11331)
|
2025-10-13 01:20:47 +08:00 |
|
Liangsheng Yin
|
f49419061d
|
Move args from global_config to environ (#11332)
|
2025-10-12 21:29:31 +08:00 |
|
Liangsheng Yin
|
20a6c0a63d
|
Beta spec-overlap for EAGLE (#11398)
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
|
2025-10-12 11:02:22 +08:00 |
|
Glen Liu
|
47c606d3dc
|
[Feature] support regex strings as a stopping condition (#10635)
|
2025-10-12 10:53:15 +08:00 |
|
ybyang
|
5061b8fd3e
|
fix stop when stream (#11462)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
|
2025-10-11 22:06:31 +08:00 |
|
cctry
|
b36afed4a7
|
Separate allocation logic from scheduler (#11313)
|
2025-10-10 17:38:54 -07:00 |
|
Scott Lee
|
55b14656e6
|
Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (#11433)
|
2025-10-10 12:54:57 -07:00 |
|
Cheng Wan
|
52fcbbb8bd
|
Revert "perf: optimize qwen-vl with symm mem allreduce" (#11436)
|
2025-10-10 12:30:05 -07:00 |
|
Yuan Luo
|
3b9d97f335
|
perf: optimize qwen-vl with symm mem allreduce (#11381)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-10-10 22:24:45 +08:00 |
|
Scott Lee
|
0babd48736
|
Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11144)
|
2025-10-10 00:46:44 -07:00 |
|
Sundara Raman Ramachandran
|
53bd00d975
|
[Generative Score API] Multi-Item scoring with custom attention mask. (#10979)
|
2025-10-08 18:47:32 -07:00 |
|
cctry
|
f3764c26a3
|
Clean match_prefix and prepare_for_extend for mem cache V2 (#11200)
|
2025-10-07 17:54:18 -07:00 |
|
Liangsheng Yin
|
501dfa6b42
|
Remove sampling info events and overlap thread file (#11300)
|
2025-10-07 21:34:25 +08:00 |
|
Liangsheng Yin
|
1519a89cfd
|
Remove overlap thread (#11210)
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
|
2025-10-07 20:12:12 +08:00 |
|
Lianmin Zheng
|
708f4ff490
|
Rename max_micro_batch_size -> pp_max_micro_batch_size (#11279)
|
2025-10-06 15:50:56 -07:00 |
|
fzyzcjy
|
efbc687c28
|
Support DeepSeek V3.2 Exp (#11061)
Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
|
2025-10-06 00:24:15 -07:00 |
|
Liangsheng Yin
|
4cb5a5235e
|
Tiny skip_sample adjust (#11225)
|
2025-10-05 23:41:04 +08:00 |
|
Liangsheng Yin
|
458611de77
|
Unify forward output datastructure (#11124)
|
2025-10-03 00:28:57 +08:00 |
|
Liangsheng Yin
|
25e7dbe8af
|
Fix ngram spec with page size > 1 (#11135)
|
2025-10-02 12:34:23 +08:00 |
|
Zhang Junda
|
0b2aa8a70c
|
Introduce cpu tensor as metadata to avoid blocking gpu kernel launch (#10720)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
|
2025-10-02 10:51:25 +08:00 |
|
Lianmin Zheng
|
2d62af6be5
|
Fix metrics and request tracing (TimeStats) (#11123)
|
2025-10-01 13:03:07 -07:00 |
|
Liangsheng Yin
|
73d4a5f879
|
Organize spec-related data structures (#10735)
|
2025-10-01 09:45:30 +08:00 |
|
Ke Bao
|
91847e382a
|
Fix eagle radix cache (#10846)
|
2025-09-30 22:59:20 +08:00 |
|
Ke Bao
|
424591d53d
|
Fix spec filter batch when target extend (#10991)
|
2025-09-30 14:44:02 +08:00 |
|
narutolhy
|
d17986f8c6
|
Enable optional FP32 compute for LM Head (#10729)
Thanks to MiniMax Team and Chenyang Zhao's support.
|
2025-09-29 20:45:17 -07:00 |
|
Lianmin Zheng
|
dda34c2f93
|
Fix mem fraction static for nightly tests (#11076)
|
2025-09-29 12:57:41 -07:00 |
|
Zhihao Zhang
|
24f7cb1ece
|
[speculative decoding] rename lookahead to ngram (#11010)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
|
2025-09-28 21:06:59 -07:00 |
|
Shangming Cai
|
e23e280e16
|
Add support for topk metadata transferring for PD (#10616)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
2025-09-28 00:09:38 +08:00 |
|
Xinyuan Tong
|
71f24ef8f6
|
feat: add cache_salt support to request (#10718)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2025-09-23 23:30:25 -07:00 |
|
Qiaolin Yu
|
e2ac7888b8
|
[2/2] Support deterministic inference for temperature > 0 (#10678)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
|
2025-09-21 19:36:08 -07:00 |
|
Xinyuan Tong
|
12d6cf18f0
|
Refactors radix cache for extra key support (#10317)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2025-09-22 02:16:16 +08:00 |
|
Baizhou Zhang
|
8ecef73f12
|
[1/2] Support deterministic inference with flashinfer attention backend (#10645)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
|
2025-09-19 23:34:29 -07:00 |
|
Zhihao Zhang
|
e7bc600304
|
[Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
|
2025-09-18 16:42:41 -07:00 |
|
harrisonlimh
|
14fdd52740
|
feat: add priority based scheduling with priority based request acceptance and preemption (#8746)
|
2025-09-16 17:10:10 -07:00 |
|
Yingchun Lai
|
b1721edbac
|
[PD metrics] Add latency Histogram metrics of each stage for generate requests (#8710)
|
2025-09-16 01:52:49 +08:00 |
|
Cheng Wan
|
4844fac91d
|
Refactor TopK to ensure readability and extensibility (#9338)
|
2025-09-14 19:16:25 -07:00 |
|
Liangsheng Yin
|
6897e06b69
|
Remove repeatedly lists adding in init_incremental_detokenization (#10412)
|
2025-09-14 10:05:52 +08:00 |
|
Sundara Raman Ramachandran
|
a360511d7b
|
[Generative Score API] Scoring(Prefill-only) optimizations. (#9748)
|
2025-09-14 01:57:06 +08:00 |
|
Yi Zhang
|
30c6e1f569
|
Qwen3-Next support (#10233)
Co-authored-by: cao1zhg <114661107+cao1zhg@users.noreply.github.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
|
2025-09-11 04:11:49 -07:00 |
|
DarkSharpness
|
948b01a04c
|
[Refactor] Remove Hicache Load & Write threads (#10127)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-09-08 22:18:50 -07:00 |
|
Baizhou Zhang
|
8ad700f735
|
Cleaning codes for speculative attention mode (#10149)
|
2025-09-08 17:38:06 -07:00 |
|
cicirori
|
8c5930f08a
|
Add speculator attention backend switch (#9981)
|
2025-09-07 21:44:36 -07:00 |
|
Zhiqiang Xie
|
3b99f23c44
|
[Bugfix] Retract not releasing enough memory when page size > 1 (#9989)
|
2025-09-07 21:41:50 -07:00 |
|
Qiaolin Yu
|
8cda5a622c
|
Standalone speculative decoding (#10090)
|
2025-09-07 20:55:09 -07:00 |
|
Cheng Wan
|
3fa62da78c
|
[7/N] MoE Refactor: the implementation of new framework (#9269)
|
2025-09-05 21:09:09 -07:00 |
|