Commit Graph

3491 Commits

Author SHA1 Message Date
Rain H
71fc7b7fad [Fix] KV-cache eviction mismatch across PP ranks in DeepSeek V3/R1 (#10214) 2025-09-09 02:07:13 -07:00
shaharmor98
9ab72f9895 add variable TP Decode > Prefill size support (#9960)
Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-09-09 16:47:26 +08:00
Lianmin Zheng
71133a0426 [Auto Sync] Update sampling_batch_info.py (20250909) (#10212)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: cctry <shiyang@x.ai>
2025-09-09 01:29:52 -07:00
Shangming Cai
f5f6b3b4b5 Refactor fused_add_rmsnorm import logic (#10207)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-09-09 00:23:58 -07:00
Yineng Zhang
94fb4e9e54 feat: support fa cute in sgl-kernel (#10205)
Co-authored-by: cicirori <32845984+cicirori@users.noreply.github.com>
2025-09-09 00:14:39 -07:00
blzheng
d1d4074c4e [CPU] Add gelu_and_mul kernel in sgl-kernel and add ut (#9300) 2025-09-08 23:23:13 -07:00
DarkSharpness
948b01a04c [Refactor] Remove Hicache Load & Write threads (#10127)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-08 22:18:50 -07:00
wenhuipeng
16ff3d4b05 Support opt model (#10165) 2025-09-09 12:45:00 +08:00
Liangsheng Yin
83d55ac51f [1/N]DP refactor: Improve dp rank scheduling in PD disaggregation mode. (#10169) 2025-09-09 12:27:55 +08:00
blzheng
97fff98c68 [CPU] Fix phi4-mm prompt issue in bench_serving (#9900) 2025-09-08 20:12:32 -07:00
Caproni
96784a65fd [Fix] Orphan process in data parallel (#7995)
Signed-off-by: Capronir <839972205@qq.com>
2025-09-09 11:09:09 +08:00
Rain Jiang
df5407fb53 Revert "feat: add fused moe config for Qwen3-30B-A3B on B200" (#10185) 2025-09-08 18:11:15 -07:00
Baizhou Zhang
8ad700f735 Cleaning codes for speculative attention mode (#10149) 2025-09-08 17:38:06 -07:00
Rain Jiang
7a40e4f4a6 fix the cutlass moe tests (#10182) 2025-09-08 16:24:55 -07:00
Yineng Zhang
19d64f2b72 fix: resolve lint issue (#10181) 2025-09-08 15:09:55 -07:00
Teng Ma
a02071a12c [Bench] feat: mooncake trace integration (#9839)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Signed-off-by: Teng Ma <sima.mt@alibaba-inc.com>
Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
2025-09-09 02:50:54 +08:00
Yineng Zhang
45b3a6a256 Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176) 2025-09-08 11:28:15 -07:00
LukasBluebaum
9a18aa54c2 [fix] Relax white space rules in EBNFComposer (#9595) 2025-09-08 10:47:19 -07:00
Zhiy-Zhang
91f0fd95a4 pref: Add H20 fp8 fused MoE kernel configs for Qwen3 (#10166)
Co-authored-by: qiufan.zzy <qiufan.zzy@antgroup.com>
2025-09-08 09:57:21 -07:00
alanhe151220037
8085aca791 [Bug fix] Fix ascend mla in aclgraph (#9925) 2025-09-08 09:49:43 -07:00
Liangsheng Yin
72f9fc5f11 Monkey patch uvicorn multi worker is_alive timeout (#10159)
Co-authored-by: Huang Long <121648372+llll114@users.noreply.github.com>
2025-09-08 17:43:23 +08:00
hzh0425
ec99668ab7 [Hicache]: Add E2E CI For 3FS-KVStore (#10131) 2025-09-08 01:54:50 -07:00
Liangsheng Yin
78f139812a [1/N] DP-Refactor: move communicators into tokenizer_communicator_mixin (#10028) 2025-09-08 16:27:37 +08:00
Swipe4057
bfd7a18d8d update xgrammar 0.1.24 and transformers 4.56.1 (#10155) 2025-09-08 01:20:31 -07:00
ssshinigami
5dd8c6444b [Bug fix] Fix Gemma 2 and fix Gemma 3 multimodal with bs > 1 on NPU (#9871)
Co-authored-by: Maksim <makcum888e@mail.ru>
2025-09-08 01:19:40 -07:00
Huaiyu, Zheng
ee21817c6b enable llama3.1-8B on xpu (#9434) 2025-09-07 22:34:20 -07:00
Yineng Zhang
b7d1f17b8d Revert "enable auto-round quantization model (#6226)" (#10148) 2025-09-07 22:31:11 -07:00
Weiwei
c8295d2353 enable auto-round quantization model (#6226)
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
2025-09-07 22:05:35 -07:00
Even Zhou
b67c277f86 [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph (#10013) 2025-09-07 21:50:49 -07:00
Xinyuan Tong
8116804e4f Fix: (glm4v) Add missing field (#10147) 2025-09-07 21:47:14 -07:00
cicirori
8c5930f08a Add speculator attention backend switch (#9981) 2025-09-07 21:44:36 -07:00
Zhiqiang Xie
3b99f23c44 [Bugfix] Retract not releasing enough memory when page size > 1 (#9989) 2025-09-07 21:41:50 -07:00
Cao E
7577f0e40f Add graph runner support with torch compile on CPU (#7843) 2025-09-07 21:33:58 -07:00
Qiaolin Yu
8cda5a622c Standalone speculative decoding (#10090) 2025-09-07 20:55:09 -07:00
kk
400d3b97ae Fix run time error in dsv3-fp8 model on mi35x (#10104)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-09-07 20:45:17 -07:00
Lzhang-hub
37d83c6e6d Qwen2.5-VL eagle3 infer (#8801) 2025-09-07 20:44:34 -07:00
Rain Jiang
7802586cab fix the fp8 topk_config.correction_bias is none bug (#10040) 2025-09-07 20:28:14 -07:00
fzyzcjy
bc5fc332f7 Fix slow fused add RMSNorm (#10141) 2025-09-07 20:20:39 -07:00
Xinyuan Tong
f3440adcb5 vlm: enable GLM4.1V server testing & fix video processing (#10095)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
2025-09-08 03:53:08 +01:00
Cheng Wan
5a7e10fe4c [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 (#10144) 2025-09-07 19:43:59 -07:00
Shisong Ma
33467c05a4 [BUG FIX] add fail check when get fail in case wait complete block (#9971)
Co-authored-by: mashisong <mashisong@bytedance.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-07 18:34:04 -07:00
Lianmin Zheng
76a2c86b88 Fix flashinfer version in sgl-kernel (#10135) 2025-09-07 12:54:07 -07:00
Liangsheng Yin
e719bb0e84 [1/2] Refactor multi-tokenizer manager (#10074) 2025-09-07 19:13:34 +08:00
DarkSharpness
067246830d [Minor] fix lint in main (#10128) 2025-09-07 17:36:46 +08:00
Lianmin Zheng
617aa2b248 [Auto Sync] Update parallel_state.py (20250907) (#10126)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: jzhou-xai <jzhou@x.ai>
2025-09-07 02:12:32 -07:00
miter
111b137964 add dataset_path for bench_one_batch_server.py (#10113)
Signed-off-by: linhuang <linhuang@ruijie.com.cn>
Co-authored-by: linhuang <linhuang@ruijie.com.cn>
2025-09-07 14:07:09 +08:00
Teng Ma
41628dc1b1 [HiCache] fix: check clear() method for storage backend (#10096)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2025-09-06 22:59:58 -07:00
Ben Barsdell
a12061df4c Fix cuda graph mode in flashinfer attn backend (#10056) 2025-09-06 22:59:48 -07:00
Yuwei An
9a7ced4e4d [Feature] LMCache Connector Integration (#9741)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-06 20:14:55 -07:00
Yuan Luo
cb3918a091 Optimize moe_sum_reduce_kernel (#9477)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-09-07 09:16:18 +08:00