Commit Graph

3908 Commits

Author SHA1 Message Date
Binyao Jiang
451d15c44b [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton (#11450) 2025-10-10 23:13:46 -07:00
Liu-congo
c80a96dae9 [BugFix] test_mla_fp8.py fails on Cublas 12.9 (#11360)
Signed-off-by: Liu-congo <1502632128@qq.com>
2025-10-10 21:14:24 -07:00
Stefan He
eae9a9fb9d Fix batch invariant ops (#11368) 2025-10-10 20:49:08 -07:00
wxsm
2674c1d280 fix: Change dsv32 hack temporary path to use system temp directory (#11445) 2025-10-10 19:59:41 -07:00
Lianmin Zheng
61055cb309 Reorder PD disagg CI tests (#11438) 2025-10-10 17:56:49 -07:00
Simo Lin
c495833186 [router] leverage RAII to actively cancel request during client disconnect (#11399) 2025-10-10 20:43:38 -04:00
cctry
b36afed4a7 Separate allocation logic from scheduler (#11313) 2025-10-10 17:38:54 -07:00
JinYan Su
9aa4502d11 feat(mooncake): support GB suffix for global_segment_size (#10745)
Signed-off-by: Jinyang Su <751080330@qq.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
2025-10-10 17:38:25 -07:00
Scott Lee
55b14656e6 Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (#11433) 2025-10-10 12:54:57 -07:00
Lianmin Zheng
b4408e6098 Revert "fix: fix video input for qwen3-vl" (#11437) 2025-10-10 12:44:40 -07:00
Cheng Wan
52fcbbb8bd Revert "perf: optimize qwen-vl with symm mem allreduce" (#11436) 2025-10-10 12:30:05 -07:00
Teng Ma
9082a7d323 [HiCache] feat: add multi tenant with prefix tag (#9256) 2025-10-11 00:23:28 +08:00
Yuan Luo
3b9d97f335 perf: optimize qwen-vl with symm mem allreduce (#11381)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-10 22:24:45 +08:00
Mick
a1a20b4c7c fix: fix video input for qwen3-vl (#11361) 2025-10-10 04:35:35 -07:00
Yineng Zhang
4299aebdbb chore: update pyproject (#11420) 2025-10-10 00:56:30 -07:00
Scott Lee
0babd48736 Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11144) 2025-10-10 00:46:44 -07:00
Zaili Wang
f19613e6c3 Dedicated toml files for CPU/XPU (#10734) 2025-10-10 00:44:55 -07:00
ziruiliu
8df4945559 fix file and object naming scheme in HiCacheNixl to avoid data corruption (#10969)
Signed-off-by: Zirui Liu <ziliu@ddn.com>
2025-10-10 00:23:10 -07:00
hzh0425
ee3bd8a1c8 feat(hicache): Support passing prefix keys for l3 store. (#9045)
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-10-10 00:22:05 -07:00
Yuan Luo
b5044fbf12 Replace pad with cat for better performance (#11388)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-10 12:03:17 +08:00
Glen Liu
9a7e7a6576 [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded (#11365) 2025-10-09 18:43:03 -07:00
Yingchun Lai
0fe87213bb fix: fix gpu-proc affinity set incorrectly when pp_size > 1 (#11389) 2025-10-09 18:40:05 -07:00
Xinyuan Tong
1f106ee365 [grammar] Avoid server crash when grammar backend is None (#11401) 2025-10-09 18:38:10 -07:00
Lianmin Zheng
9b8ebb2798 move more files under srt/utils (#11285) 2025-10-09 16:46:15 -07:00
sglang-bot
758b887ad1 chore: bump SGLang version to 0.5.3.post1 (#11324) 2025-10-09 15:19:59 -07:00
Yineng Zhang
44cb060785 chore: upgrade flashinfer 0.4.0 (#11364) 2025-10-09 14:17:54 -07:00
Chang Su
b520958ec8 [router][grpc] Replace fake health check with correct ones (#11387) 2025-10-09 09:13:57 -07:00
shaharmor98
fa7e2c3049 fix bench_serving mishandling of internal states (#11376)
Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-10-09 19:24:50 +08:00
shaharmor98
8f2cd177af add code pp support for nixl (#11375)
Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-10-09 19:24:32 +08:00
Trevor Morris
a4b424c632 [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (#11309) 2025-10-08 23:59:46 -07:00
Simo Lin
368fd20622 [router][grpc] disable health check generation and increase timeout (#11353) 2025-10-08 19:23:08 -07:00
Sundara Raman Ramachandran
53bd00d975 [Generative Score API] Multi-Item scoring with custom attention mask. (#10979) 2025-10-08 18:47:32 -07:00
Yineng Zhang
e22b13c569 [Auto Sync] Update scheduler.py (20251009) (#11350)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Junxiong Wang <junxiong@together.ai>
2025-10-08 17:39:04 -07:00
Chang Su
a65ca73911 [router][grpc] Cleanup debug logs in grpc_server and grpc_router (#11340) 2025-10-08 13:26:19 -07:00
Netanel Haber
d6837aea4d model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (#10909)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-10-09 00:37:38 +08:00
Liangsheng Yin
c882b5ae75 [CI] improve disaggregation CI. (#11264)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-10-08 21:40:56 +08:00
Kevin Xiang Li
e3bb7f5ae6 benchmark: enhance configurable multimodal benchmarking in bench_serving (#9812)
Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-10-08 01:31:36 -07:00
Lifu Huang
92473e2e34 Support LoRA in bench_serving oai interface (#11318) 2025-10-08 01:28:58 -07:00
JinYan Su
6c0bb32711 fix(decode): adjust ServerArgs import to explicit module path (#11007) 2025-10-08 01:27:50 -07:00
Lifu Huang
edefab0c64 [2/2] Support MHA prefill with FlashAttention 4. (#10937)
Co-authored-by: Hieu Pham <hyhieu@gmail.com>
2025-10-08 00:54:20 -07:00
Cheng Wan
97cd38e58d Skip weight loading in deepgemm compilation (#11312) 2025-10-07 21:52:46 -07:00
Cheng Wan
3c06b673af [8/N] MoE Refactor: deprecate EPMoE (#11211) 2025-10-07 21:51:41 -07:00
Adarsh Shirawalmath
7c3f07dbcb [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (#9545) 2025-10-08 12:38:48 +08:00
Liangsheng Yin
4b4dc132fa Rename ngram_utils -> ngram_info (#11316) 2025-10-08 11:49:46 +08:00
YAMY
5a9170d993 Optimize copy_kv_cache for spec decoding (#11126)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-10-08 10:43:30 +08:00
DarkSharpness
832c84fba9 [Chore] Update xgrammar 0.1.24 -> 0.1.25 (#10710) 2025-10-07 18:22:28 -07:00
Mick
64d1505c0a ci: unify the model launch method of nightly ci (#11230) 2025-10-07 18:13:14 -07:00
cctry
f3764c26a3 Clean match_prefix and prepare_for_extend for mem cache V2 (#11200) 2025-10-07 17:54:18 -07:00
Chang Su
7ba3de0e92 [oai serving chat] Add argument --sampling-defaults and fix ChatCompletionRequest defaults (#11304) 2025-10-08 00:36:05 +00:00
Chang Su
f094e0a490 [router][grpc] Fix request_id extraction when n > 1 (#11311) 2025-10-07 19:27:56 -04:00