Commit Graph

346 Commits

Author SHA1 Message Date
Scott Lee
b6fb5d7666 Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441) 2025-10-13 11:24:27 -07:00
Liangsheng Yin
bfadb5ea5f Adjust overlap event loop (#11507) 2025-10-14 00:33:19 +08:00
Liangsheng Yin
516738b096 Deprecate global_server_args_dict (#11528) 2025-10-13 19:34:43 +08:00
Yi Zhang
a55cf5304a [Feature] Support mamba radix cache v0 (#11214)
Co-authored-by: hanming-lu <hanming@x.ai>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: thalahors <ericalcaide1@gmail.com>
2025-10-12 20:57:15 -07:00
Cheng Wan
1bdd010291 Revert "Deprecate global_server_args_dict" (#11520) 2025-10-12 17:40:40 -07:00
Liangsheng Yin
1083e7e3df Deprecate global_server_args_dict (#11331) 2025-10-13 01:20:47 +08:00
Liangsheng Yin
f49419061d Move args from global_config to environ (#11332) 2025-10-12 21:29:31 +08:00
Liangsheng Yin
20a6c0a63d Beta spec-overlap for EAGLE (#11398)
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2025-10-12 11:02:22 +08:00
Glen Liu
47c606d3dc [Feature] support regex strings as a stopping condition (#10635) 2025-10-12 10:53:15 +08:00
ybyang
5061b8fd3e fix stop when stream (#11462)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2025-10-11 22:06:31 +08:00
cctry
b36afed4a7 Separate allocation logic from scheduler (#11313) 2025-10-10 17:38:54 -07:00
Scott Lee
55b14656e6 Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (#11433) 2025-10-10 12:54:57 -07:00
Cheng Wan
52fcbbb8bd Revert "perf: optimize qwen-vl with symm mem allreduce" (#11436) 2025-10-10 12:30:05 -07:00
Yuan Luo
3b9d97f335 perf: optimize qwen-vl with symm mem allreduce (#11381)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-10 22:24:45 +08:00
Scott Lee
0babd48736 Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11144) 2025-10-10 00:46:44 -07:00
Sundara Raman Ramachandran
53bd00d975 [Generative Score API] Multi-Item scoring with custom attention mask. (#10979) 2025-10-08 18:47:32 -07:00
cctry
f3764c26a3 Clean match_prefix and prepare_for_extend for mem cache V2 (#11200) 2025-10-07 17:54:18 -07:00
Liangsheng Yin
501dfa6b42 Remove sampling info events and overlap thread file (#11300) 2025-10-07 21:34:25 +08:00
Liangsheng Yin
1519a89cfd Remove overlap thread (#11210)
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2025-10-07 20:12:12 +08:00
Lianmin Zheng
708f4ff490 Rename max_micro_batch_size -> pp_max_micro_batch_size (#11279) 2025-10-06 15:50:56 -07:00
fzyzcjy
efbc687c28 Support DeepSeek V3.2 Exp (#11061)
Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-06 00:24:15 -07:00
Liangsheng Yin
4cb5a5235e Tiny skip_sample adjust (#11225) 2025-10-05 23:41:04 +08:00
Liangsheng Yin
458611de77 Unify forward output datastructure (#11124) 2025-10-03 00:28:57 +08:00
Liangsheng Yin
25e7dbe8af Fix ngram spec with page size > 1 (#11135) 2025-10-02 12:34:23 +08:00
Zhang Junda
0b2aa8a70c Introduce cpu tensor as metadata to avoid blocking gpu kernel launch (#10720)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2025-10-02 10:51:25 +08:00
Lianmin Zheng
2d62af6be5 Fix metrics and request tracing (TimeStats) (#11123) 2025-10-01 13:03:07 -07:00
Liangsheng Yin
73d4a5f879 Organize spec-related data structures (#10735) 2025-10-01 09:45:30 +08:00
Ke Bao
91847e382a Fix eagle radix cache (#10846) 2025-09-30 22:59:20 +08:00
Ke Bao
424591d53d Fix spec filter batch when target extend (#10991) 2025-09-30 14:44:02 +08:00
narutolhy
d17986f8c6 Enable optional FP32 compute for LM Head (#10729)
Thanks to MiniMax Team and Chenyang Zhao's support.
2025-09-29 20:45:17 -07:00
Lianmin Zheng
dda34c2f93 Fix mem fraction static for nightly tests (#11076) 2025-09-29 12:57:41 -07:00
Zhihao Zhang
24f7cb1ece [speculative decoding] rename lookahead to ngram (#11010)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
2025-09-28 21:06:59 -07:00
Shangming Cai
e23e280e16 Add support for topk metadata transferring for PD (#10616)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-09-28 00:09:38 +08:00
Xinyuan Tong
71f24ef8f6 feat: add cache_salt support to request (#10718)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-09-23 23:30:25 -07:00
Qiaolin Yu
e2ac7888b8 [2/2] Support deterministic inference for temperature > 0 (#10678)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-09-21 19:36:08 -07:00
Xinyuan Tong
12d6cf18f0 Refactors radix cache for extra key support (#10317)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-09-22 02:16:16 +08:00
Baizhou Zhang
8ecef73f12 [1/2] Support deterministic inference with flashinfer attention backend (#10645)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-19 23:34:29 -07:00
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-18 16:42:41 -07:00
harrisonlimh
14fdd52740 feat: add priority based scheduling with priority based request acceptance and preemption (#8746) 2025-09-16 17:10:10 -07:00
Yingchun Lai
b1721edbac [PD metrics] Add latency Histogram metrics of each stage for generate requests (#8710) 2025-09-16 01:52:49 +08:00
Cheng Wan
4844fac91d Refactor TopK to ensure readability and extensibility (#9338) 2025-09-14 19:16:25 -07:00
Liangsheng Yin
6897e06b69 Remove repeatedly lists adding in init_incremental_detokenization (#10412) 2025-09-14 10:05:52 +08:00
Sundara Raman Ramachandran
a360511d7b [Generative Score API] Scoring(Prefill-only) optimizations. (#9748) 2025-09-14 01:57:06 +08:00
Yi Zhang
30c6e1f569 Qwen3-Next support (#10233)
Co-authored-by: cao1zhg <114661107+cao1zhg@users.noreply.github.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
2025-09-11 04:11:49 -07:00
DarkSharpness
948b01a04c [Refactor] Remove Hicache Load & Write threads (#10127)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-08 22:18:50 -07:00
Baizhou Zhang
8ad700f735 Cleaning codes for speculative attention mode (#10149) 2025-09-08 17:38:06 -07:00
cicirori
8c5930f08a Add speculator attention backend switch (#9981) 2025-09-07 21:44:36 -07:00
Zhiqiang Xie
3b99f23c44 [Bugfix] Retract not releasing enough memory when page size > 1 (#9989) 2025-09-07 21:41:50 -07:00
Qiaolin Yu
8cda5a622c Standalone speculative decoding (#10090) 2025-09-07 20:55:09 -07:00
Cheng Wan
3fa62da78c [7/N] MoE Refactor: the implementation of new framework (#9269) 2025-09-05 21:09:09 -07:00