Commit Graph

404 Commits

Author SHA1 Message Date
Zhiyu
80b2b3207a Enable native ModelOpt quantization support (3/3) (#10154) 2025-10-21 21:44:29 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Kai-Hsun Chen
c61b0b294c [quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n (#11702) 2025-10-21 21:25:28 +08:00
    Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Meng, Hengyu
b113c72e7a Init attention backend for Intel XPU (#10656) 2025-10-21 11:41:28 +08:00
    Co-authored-by: guangyey <guangye.yu@intel.com>
    Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
harrisonlimh
c726d44cc7 Recapture cuda graph after model weight update to resolve IMA error (#11780) 2025-10-20 10:50:03 +08:00
Liangsheng Yin
57e25de756 Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" (#11827) 2025-10-19 19:44:06 +08:00
YAMY
80407b0493 Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads (#10788) 2025-10-19 11:37:43 +08:00
b8zhong
f9a7d9b3dc support server arg override KV cache to bf16 to avoid slow cases (#11749) 2025-10-19 02:49:48 +08:00
Lianmin Zheng
67e34c56d7 Fix install instructions and pyproject.tomls (#11781) 2025-10-18 01:08:01 -07:00
Cheng Wan
5b214b50b6 [Refactor] move deep_gemm_wrapper out of quantization (#11784) 2025-10-17 18:57:54 -07:00
Chang Su
627974405d [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) 2025-10-17 16:49:46 -07:00
ykcombat
f440baa136 [Feature] Reuse flashinfer workspace for PD-Multiplexing. (#11540) 2025-10-18 02:35:06 +08:00
Shangming Cai
868403f642 [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912) 2025-10-15 18:59:14 -07:00
    Signed-off-by: Shangming Cai <csmthu@gmail.com>
    Co-authored-by: hzh0425 <hzh0425@apache.org>
    Co-authored-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Lianmin Zheng
cd7e1bd591 Sync code and test CI; rename some env vars (#11686) 2025-10-15 18:37:03 -07:00
Yineng Zhang
91fc5bb5a9 feat: add add_chunked_prefix_cache_attention_backend (#11636) 2025-10-14 21:48:13 -07:00
Xun Sun
a40229f6f8 [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423) 2025-10-14 19:40:54 -07:00
    Co-authored-by: Hank Han <hanhan7630@outlook.com>
    Co-authored-by: Shangming Cai <csmthu@gmail.com>
Baizhou Zhang
c224a4c6cc Fix log for chunked prefix cache (#11624) 2025-10-14 11:49:33 -07:00
Lianmin Zheng
5e3f7e7fa9 Minor: improve sampler & remove unused fields from model_config.py (#11531) 2025-10-13 11:04:44 -07:00
Mick
f35f120d70 fix: fix video input for qwen3-vl (#11442) 2025-10-13 09:30:43 -07:00
Liangsheng Yin
516738b096 Deprecate global_server_args_dict (#11528) 2025-10-13 19:34:43 +08:00
Yi Zhang
a55cf5304a [Feature] Support mamba radix cache v0 (#11214) 2025-10-12 20:57:15 -07:00
    Co-authored-by: hanming-lu <hanming@x.ai>
    Co-authored-by: hzh0425 <hzh0425@apache.org>
    Co-authored-by: thalahors <ericalcaide1@gmail.com>
Cheng Wan
1bdd010291 Revert "Deprecate global_server_args_dict" (#11520) 2025-10-12 17:40:40 -07:00
Liangsheng Yin
1083e7e3df Deprecate global_server_args_dict (#11331) 2025-10-13 01:20:47 +08:00
Yuwei An
4ac8e09df0 Piecewise CUDA Graph Support & Torch Compile Backend (#10062) 2025-10-12 11:55:57 +08:00
    Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Lianmin Zheng
b4408e6098 Revert "fix: fix video input for qwen3-vl" (#11437) 2025-10-10 12:44:40 -07:00
Mick
a1a20b4c7c fix: fix video input for qwen3-vl (#11361) 2025-10-10 04:35:35 -07:00
Lianmin Zheng
9b8ebb2798 move more files under srt/utils (#11285) 2025-10-09 16:46:15 -07:00
Trevor Morris
a4b424c632 [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (#11309) 2025-10-08 23:59:46 -07:00
Netanel Haber
d6837aea4d model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (#10909) 2025-10-09 00:37:38 +08:00
    Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Lifu Huang
edefab0c64 [2/2] Support MHA prefill with FlashAttention 4. (#10937) 2025-10-08 00:54:20 -07:00
    Co-authored-by: Hieu Pham <hyhieu@gmail.com>
YAMY
5a9170d993 Optimize copy_kv_cache for spec decoding (#11126) 2025-10-08 10:43:30 +08:00
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Liangsheng Yin
501dfa6b42 Remove sampling info events and overlap thread file (#11300) 2025-10-07 21:34:25 +08:00
Zhiyu
155cbb51f0 Enable native ModelOpt quantization support (1/3) (#7149) 2025-10-06 13:24:15 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
fzyzcjy
efbc687c28 Support DeepSeek V3.2 Exp (#11061) 2025-10-06 00:24:15 -07:00
    Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com>
    Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com>
    Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com>
    Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
    Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com>
    Co-authored-by: DarkSharpness <2040703891@qq.com>
    Co-authored-by: hnyls2002 <lsyincs@gmail.com>
    Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
    Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
    Co-authored-by: HAI <hixiao@gmail.com>
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Yuan Luo
590f2da052 [Feat] Support Torch Symm Mem AllReduce (#10571) 2025-10-05 13:55:19 -07:00
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
fzyzcjy
fdc4e1e570 Tiny move files to utils folder (#11166) 2025-10-03 22:40:06 +08:00
Matt Nappo
8c57490210 [Feature] Option to save model weights to CPU when memory saver mode is enabled (#10873) 2025-10-03 16:48:19 +08:00
    Co-authored-by: molocule <34072934+molocule@users.noreply.github.com>
jacky.cheng
b00a0c786f [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (#11161) 2025-10-02 21:19:30 -07:00
Dom Brown
e810077488 Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (#11138) 2025-10-02 16:04:58 -07:00
ilyasch2
083629c235 [model] Add mamba2 and Falcon-H1 support. (#10988) 2025-10-02 19:15:36 +08:00
    Co-authored-by: Younes Belkada <younes.belkada@tii.ae>
    Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
fzyzcjy
2ac453b07f Tiny detect slow ranks (#10508) 2025-10-02 18:00:33 +08:00
li-kesen
2bc61dd194 Remove hybrid_linear_attn attention backend and refactor attention registry (#10816) 2025-09-30 10:16:16 +08:00
    Co-authored-by: Yi Zhang <1109276519@qq.com>
Lianmin Zheng
dda34c2f93 Fix mem fraction static for nightly tests (#11076) 2025-09-29 12:57:41 -07:00
amysaq2023
2bdaf482f9 refactor loading weights from remote instance coding format (#10941) 2025-09-26 15:25:39 -07:00
    Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
ronnie_zheng
e22f3a5ec9 [Ascend] optimize Qwen3 on Ascend (#10574) 2025-09-22 17:18:36 -07:00
    Co-authored-by: c30031083 <chenxu140@huawei.com>
Qiaolin Yu
e2ac7888b8 [2/2] Support deterministic inference for temperature > 0 (#10678) 2025-09-21 19:36:08 -07:00
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Stefan He
86527a4799 [deterministic inference] Move batch invariant pkg to sglang (#10695) 2025-09-21 19:35:14 -07:00
Lifu Huang
08ecd0aa2a [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization (#10592) 2025-09-20 22:47:48 -07:00
Baizhou Zhang
8ecef73f12 [1/2] Support deterministic inference with flashinfer attention backend (#10645) 2025-09-19 23:34:29 -07:00
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Yineng Zhang
60e2a7cead [Auto Sync] Update model_runner.py (20250920) (#10679) 2025-09-19 18:26:54 -07:00
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873) 2025-09-18 16:42:41 -07:00
    Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>