Commit Graph

947 Commits

Author SHA1 Message Date
EduardDurech
46d8fb1c98 model: support Apertus (#9774) 2025-09-11 20:49:10 -07:00
Shu Wang
3df05f4d6a [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199) 2025-09-11 20:18:43 -07:00
Minglei Zhu
46ccbed2cd update GLM nightly test threshold (#10331) 2025-09-11 14:54:58 -07:00
Zaili Wang
ef959d7b85 [CPU] fix OOM when mem-fraction is not set (#9090) 2025-09-10 23:52:22 -07:00
Even Zhou
5b64f006ec [Feature] Support DeepEP normal & Redundant Experts on NPU (#9881) 2025-09-10 20:35:26 -07:00
Xinyuan Tong
f3b5db6ee8 Feat: support disable tool parser (#10184) 2025-09-10 14:03:55 -07:00
Hubert Lu
91b3555d2d Add tests to AMD CI for MI35x (#9662)
Co-authored-by: Sai Enduri <saimanas.enduri@amd.com>
2025-09-10 12:50:05 -07:00
Lifu Huang
e903f695c8 Fix potential flakiness in test_lora_qwen3 (#10250) 2025-09-10 08:04:39 +00:00
ryang
dccf52f9c8 [UT for RL] Add UT to cover release/resume memory case for moe model (#8803) 2025-09-09 19:25:12 -07:00
blzheng
d1d4074c4e [CPU] Add gelu_and_mul kernel in sgl-kernel and add ut (#9300) 2025-09-08 23:23:13 -07:00
wenhuipeng
16ff3d4b05 Support opt model (#10165) 2025-09-09 12:45:00 +08:00
Baizhou Zhang
8ad700f735 Cleaning codes for speculative attention mode (#10149) 2025-09-08 17:38:06 -07:00
LukasBluebaum
9a18aa54c2 [fix] Relax white space rules in EBNFComposer (#9595) 2025-09-08 10:47:19 -07:00
Liangsheng Yin
2c2b19b18b [CI] fix ambiguous argument in testing hybrid attentions. (#10161) 2025-09-08 18:16:52 +08:00
hzh0425
ec99668ab7 [Hicache]: Add E2E CI For 3FS-KVStore (#10131) 2025-09-08 01:54:50 -07:00
Yineng Zhang
b7d1f17b8d Revert "enable auto-round quantization model (#6226)" (#10148) 2025-09-07 22:31:11 -07:00
Weiwei
c8295d2353 enable auto-round quantization model (#6226)
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
2025-09-07 22:05:35 -07:00
Even Zhou
b67c277f86 [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph (#10013) 2025-09-07 21:50:49 -07:00
cicirori
8c5930f08a Add speculator attention backend switch (#9981) 2025-09-07 21:44:36 -07:00
Cao E
7577f0e40f Add graph runner support with torch compile on CPU (#7843) 2025-09-07 21:33:58 -07:00
Qiaolin Yu
8cda5a622c Standalone speculative decoding (#10090) 2025-09-07 20:55:09 -07:00
Xinyuan Tong
f3440adcb5 vlm: enable GLM4.1V server testing & fix video processing (#10095)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
2025-09-08 03:53:08 +01:00
Shangming Cai
00974e4f6e [CI] Refactor disaggregation tests (#10068)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-09-06 22:14:46 +08:00
Cheng Wan
21af5c0404 [Fix] Compatibility between DP attention and pipeline parallelism (#10100)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-06 01:34:10 -07:00
Cheng Wan
3fa62da78c [7/N] MoE Refactor: the implementation of new framework (#9269) 2025-09-05 21:09:09 -07:00
gongwei-130
ab62b135c1 support Llama4 with non-uniform intermediate size across layers for… (#10047) 2025-09-05 17:28:15 -07:00
Xinyuan Tong
273b28344b [Minor] Refactors KV memory pool (#9842) 2025-09-05 17:06:08 -07:00
DevashishLal-CB
13705dae06 [Fix] Add speculative_draft_model_revision to server_args (#5255)
Signed-off-by: Devashish Lal <devashish@rivosinc.com>
2025-09-05 19:45:46 +08:00
Liangsheng Yin
6e95f5e5bd Simplify Router arguments passing and build it in docker image (#9964) 2025-09-05 12:13:55 +08:00
Yingchun Lai
b32ab0705e metrics: support custom buckets for prompt/generation_tokens_histogram (#9634) 2025-09-04 22:22:08 +08:00
hzh0425
106c2b31fb feat(hicache): Add generic hicache ci e2e test and benchmark test (#9846)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-04 20:43:46 +08:00
Yineng Zhang
de9217334b feat: add gpt oss b200 ci (#9988) 2025-09-03 17:26:38 -07:00
Lianmin Zheng
60e37f8028 Move parsers under a single folder (#9912) 2025-09-02 18:25:04 -07:00
tc-mb
03dbf1aa8e [model] support MiniCPM-V 4.0 (#8747)
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-09-02 15:33:03 -07:00
ybyang
5f77e1292d Support Multi Process Tokenizer Manager (#6555) (#8964)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-09-01 01:00:13 -07:00
Baizhou Zhang
7de2ce45b2 Disable radix cache in test_lora_update.py for better stability (#9852) 2025-08-31 22:28:22 -07:00
narutolhy
839c93bd2d feat: add original logprobs to response (#8375)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
2025-08-29 11:43:57 -07:00
gongwei-130
3fd1431df2 support enable in the reasoning field to enable thinking for thinkin… (#9715) 2025-08-29 10:57:32 -07:00
gongwei-130
9a7c8842ba accommodate json schema in the "schema" field, not in "json_schema" field of response_format (#9786) 2025-08-28 23:51:50 -07:00
Hubert Lu
711390a971 [AMD] Support Hierarchical Caching on AMD GPUs (#8236) 2025-08-28 15:27:07 -07:00
Qiaolin Yu
4a4772ae03 Support speculative decoding in hybrid attention backend (#9573) 2025-08-28 01:11:42 -07:00
cicirori
b6c14ec0b4 add response_format support for completion API (#9665) 2025-08-26 15:01:29 -07:00
Xiaotong Jiang
0936c766ed Fix kimi k2 function calling format (#9606) 2025-08-26 00:50:59 -07:00
Netanel Haber
4cd08dc592 model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 (#9301) 2025-08-26 15:33:40 +08:00
ZhengdQin
f92b729d52 [new feat] ascend backend support fia fusion kernel (#8328)
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
2025-08-25 23:13:08 -07:00
Jonas
a0a77d937b Fix Harmony reasoning parser for and auto-separation for gpt-oss models (#9190)
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: minleminzui <2969413251@qq.com>
Co-authored-by: maocheng23 <maocheng@berkeley.edu>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-25 15:26:26 -07:00
Yineng Zhang
ebd9dbe71b fix: revert #8593 (#9581) 2025-08-25 01:29:06 -07:00
Pavani Majety
3cc3d9b950 Add Support for Page Size greater than 1 for Flashinfer MLA Backend (#8593)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-08-21 18:15:06 -07:00
DiweiSun
029e0af31d ci: enhance xeon ci (#9395) 2025-08-21 03:35:17 -07:00
VDV1985
2c4b4b786b [feature] Ascend NPU graph support (#9399)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com>
Co-authored-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>
2025-08-20 21:13:27 -07:00
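A listing in the shape above (author line, then short SHA, subject, and date) can be regenerated from any local checkout with plain `git log` format placeholders. This is a minimal standalone sketch: the temporary repository, author name, and email are placeholders, and the commit subject is borrowed from the first entry above for illustration.

```shell
# Build a throwaway repo so the snippet runs standalone.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git -c user.name="Example Author" -c user.email="author@example.com" \
    commit -q --allow-empty -m "model: support Apertus (#9774)"
# %an = author name, %h = abbreviated SHA, %s = subject, %ci = ISO commit date;
# %n puts the author on its own line, matching the layout of the listing.
git log --pretty=format:'%an%n%h %s %ci'
```

Against a real clone, dropping the setup lines and running only the final `git log` command over the actual history produces the same two-line-per-commit layout.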