3477 Commits

Author SHA1 Message Date
Yineng Zhang
19d64f2b72 fix: resolve lint issue (#10181) 2025-09-08 15:09:55 -07:00
Teng Ma
a02071a12c [Bench] feat: mooncake trace integration (#9839)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Signed-off-by: Teng Ma <sima.mt@alibaba-inc.com>
Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
2025-09-09 02:50:54 +08:00
Yineng Zhang
45b3a6a256 Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176) 2025-09-08 11:28:15 -07:00
LukasBluebaum
9a18aa54c2 [fix] Relax white space rules in EBNFComposer (#9595) 2025-09-08 10:47:19 -07:00
Zhiy-Zhang
91f0fd95a4 pref: Add H20 fp8 fused MoE kernel configs for Qwen3 (#10166)
Co-authored-by: qiufan.zzy <qiufan.zzy@antgroup.com>
2025-09-08 09:57:21 -07:00
alanhe151220037
8085aca791 [Bug fix] Fix ascend mla in aclgraph (#9925) 2025-09-08 09:49:43 -07:00
Liangsheng Yin
72f9fc5f11 Monkey patch uvicorn multi worker is_alive timeout (#10159)
Co-authored-by: Huang Long <121648372+llll114@users.noreply.github.com>
2025-09-08 17:43:23 +08:00
hzh0425
ec99668ab7 [Hicache]: Add E2E CI For 3FS-KVStore (#10131) 2025-09-08 01:54:50 -07:00
Liangsheng Yin
78f139812a [1/N] DP-Refactor: move communicators into tokenizer_communicator_mixin (#10028) 2025-09-08 16:27:37 +08:00
Swipe4057
bfd7a18d8d update xgrammar 0.1.24 and transformers 4.56.1 (#10155) 2025-09-08 01:20:31 -07:00
ssshinigami
5dd8c6444b [Bug fix] Fix Gemma 2 and fix Gemma 3 multimodal with bs > 1 on NPU (#9871)
Co-authored-by: Maksim <makcum888e@mail.ru>
2025-09-08 01:19:40 -07:00
Huaiyu, Zheng
ee21817c6b enable llama3.1-8B on xpu (#9434) 2025-09-07 22:34:20 -07:00
Yineng Zhang
b7d1f17b8d Revert "enable auto-round quantization model (#6226)" (#10148) 2025-09-07 22:31:11 -07:00
Weiwei
c8295d2353 enable auto-round quantization model (#6226)
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
2025-09-07 22:05:35 -07:00
Even Zhou
b67c277f86 [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph (#10013) 2025-09-07 21:50:49 -07:00
Xinyuan Tong
8116804e4f Fix: (glm4v) Add missing field (#10147) 2025-09-07 21:47:14 -07:00
cicirori
8c5930f08a Add speculator attention backend switch (#9981) 2025-09-07 21:44:36 -07:00
Zhiqiang Xie
3b99f23c44 [Bugfix] Retract not releasing enough memory when page size > 1 (#9989) 2025-09-07 21:41:50 -07:00
Cao E
7577f0e40f Add graph runner support with torch compile on CPU (#7843) 2025-09-07 21:33:58 -07:00
Qiaolin Yu
8cda5a622c Standalone speculative decoding (#10090) 2025-09-07 20:55:09 -07:00
kk
400d3b97ae Fix run time error in dsv3-fp8 model on mi35x (#10104)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-09-07 20:45:17 -07:00
Lzhang-hub
37d83c6e6d Qwen2.5-VL eagle3 infer (#8801) 2025-09-07 20:44:34 -07:00
Rain Jiang
7802586cab fix the fp8 topk_config.correction_bias is none bug (#10040) 2025-09-07 20:28:14 -07:00
fzyzcjy
bc5fc332f7 Fix slow fused add RMSNorm (#10141) 2025-09-07 20:20:39 -07:00
Xinyuan Tong
f3440adcb5 vlm: enable GLM4.1V server testing & fix video processing (#10095)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
2025-09-08 03:53:08 +01:00
Cheng Wan
5a7e10fe4c [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 (#10144) 2025-09-07 19:43:59 -07:00
Shisong Ma
33467c05a4 [BUG FIX] add fail check when get fail in case wait complete block (#9971)
Co-authored-by: mashisong <mashisong@bytedance.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-07 18:34:04 -07:00
Lianmin Zheng
76a2c86b88 Fix flashinfer version in sgl-kernel (#10135) 2025-09-07 12:54:07 -07:00
Liangsheng Yin
e719bb0e84 [1/2] Refactor multi-tokenizer manager (#10074) 2025-09-07 19:13:34 +08:00
DarkSharpness
067246830d [Minor] fix lint in main (#10128) 2025-09-07 17:36:46 +08:00
Lianmin Zheng
617aa2b248 [Auto Sync] Update parallel_state.py (20250907) (#10126)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: jzhou-xai <jzhou@x.ai>
2025-09-07 02:12:32 -07:00
miter
111b137964 add dataset_path for bench_one_batch_server.py (#10113)
Signed-off-by: linhuang <linhuang@ruijie.com.cn>
Co-authored-by: linhuang <linhuang@ruijie.com.cn>
2025-09-07 14:07:09 +08:00
Teng Ma
41628dc1b1 [HiCache] fix: check clear() method for storage backend (#10096)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2025-09-06 22:59:58 -07:00
Ben Barsdell
a12061df4c Fix cuda graph mode in flashinfer attn backend (#10056) 2025-09-06 22:59:48 -07:00
Yuwei An
9a7ced4e4d [Feature] LMCache Connector Integration (#9741)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-06 20:14:55 -07:00
Yuan Luo
cb3918a091 Optimize moe_sum_reduce_kernel (#9477)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-09-07 09:16:18 +08:00
Lianmin Zheng
f3b6760213 [Auto Sync] Update server_args.py (20250906) (#10117)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2025-09-06 16:59:36 -07:00
Shangming Cai
00974e4f6e [CI] Refactor disaggregation tests (#10068)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-09-06 22:14:46 +08:00
Cheng Wan
a5a03209e9 Fix circular import (#10107) 2025-09-06 01:34:17 -07:00
Cheng Wan
21af5c0404 [Fix] Compatibility between DP attention and pipeline parallelism (#10100)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-06 01:34:10 -07:00
Jinyang Yuan
012584ecd5 perf: Avoid unnecessary data type conversions for DeepSeek-V3 on Blackwell (#9834)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-09-06 14:06:46 +08:00
Kaixi Hou
90dfe3de4c [NVIDIA] disable chunked prefix cache when dp and blackwell is used (#9861) 2025-09-06 14:05:16 +08:00
Kaixi Hou
9a719b7afc [NVIDIA] Remove unused get_fused_moe_impl_class function (#9764) 2025-09-05 22:41:22 -07:00
Cheng Wan
3fa62da78c [7/N] MoE Refactor: the implementation of new framework (#9269) 2025-09-05 21:09:09 -07:00
Chi-Chih Chang
ad26f298e2 fix double sparsity initialization (#6905) 2025-09-06 11:45:24 +08:00
sogalin
8d114f254b Fix RMSNorm API CALL mismatch issue. (#10032)
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
2025-09-05 20:45:13 -07:00
Zhiqiang Xie
0b8c5721f1 [HiStorage] Remove delete and clear as necessary methods (#10039) 2025-09-06 10:27:26 +08:00
gongwei-130
ab62b135c1 support Llama4 with non uniformed intermediate size across layers for… (#10047) 2025-09-05 17:28:15 -07:00
Xinyuan Tong
273b28344b [Minor] Refactors KV memory pool (#9842) 2025-09-05 17:06:08 -07:00
pansicheng
f84db115b1 Add storage read/write bandwidth logs to monitor kvcache performance (#9965)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-05 16:52:55 -07:00