Author | Commit | Subject | Date
Zhiyu | 80b2b3207a | Enable native ModelOpt quantization support (3/3) (#10154) | 2025-10-21 21:44:29 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Kai-Hsun Chen | c61b0b294c | [quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n (#11702) | 2025-10-21 21:25:28 +08:00
    Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Meng, Hengyu | b113c72e7a | Init attention backend for Intel XPU (#10656) | 2025-10-21 11:41:28 +08:00
    Co-authored-by: guangyey <guangye.yu@intel.com>
    Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
harrisonlimh | c726d44cc7 | Recapture cuda graph after model weight update to resolve IMA error (#11780) | 2025-10-20 10:50:03 +08:00
Liangsheng Yin | 57e25de756 | Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" (#11827) | 2025-10-19 19:44:06 +08:00
YAMY | 80407b0493 | Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads (#10788) | 2025-10-19 11:37:43 +08:00
b8zhong | f9a7d9b3dc | support server arg override KV cache to bf16 to avoid slow cases (#11749) | 2025-10-19 02:49:48 +08:00
Lianmin Zheng | 67e34c56d7 | Fix install instructions and pyproject.tomls (#11781) | 2025-10-18 01:08:01 -07:00
Cheng Wan | 5b214b50b6 | [Refactor] move deep_gemm_wrapper out of quantization (#11784) | 2025-10-17 18:57:54 -07:00
Chang Su | 627974405d | [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) | 2025-10-17 16:49:46 -07:00
ykcombat | f440baa136 | [Feature] Reuse flashinfer workspace for PD-Multiplexing. (#11540) | 2025-10-18 02:35:06 +08:00
Shangming Cai | 868403f642 | [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912) | 2025-10-15 18:59:14 -07:00
    Signed-off-by: Shangming Cai <csmthu@gmail.com>
    Co-authored-by: hzh0425 <hzh0425@apache.org>
    Co-authored-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Lianmin Zheng | cd7e1bd591 | Sync code and test CI; rename some env vars (#11686) | 2025-10-15 18:37:03 -07:00
Yineng Zhang | 91fc5bb5a9 | feat: add add_chunked_prefix_cache_attention_backend (#11636) | 2025-10-14 21:48:13 -07:00
Xun Sun | a40229f6f8 | [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423) | 2025-10-14 19:40:54 -07:00
    Co-authored-by: Hank Han <hanhan7630@outlook.com>
    Co-authored-by: Shangming Cai <csmthu@gmail.com>
Baizhou Zhang | c224a4c6cc | Fix log for chunked prefix cache (#11624) | 2025-10-14 11:49:33 -07:00
Lianmin Zheng | 5e3f7e7fa9 | Minor: improve sampler & remove unused fields from model_config.py (#11531) | 2025-10-13 11:04:44 -07:00
Mick | f35f120d70 | fix: fix video input for qwen3-vl (#11442) | 2025-10-13 09:30:43 -07:00
Liangsheng Yin | 516738b096 | Depreate global_server_args_dict (#11528) | 2025-10-13 19:34:43 +08:00
Yi Zhang | a55cf5304a | [Feature] Support mamba radix cache v0 (#11214) | 2025-10-12 20:57:15 -07:00
    Co-authored-by: hanming-lu <hanming@x.ai>
    Co-authored-by: hzh0425 <hzh0425@apache.org>
    Co-authored-by: thalahors <ericalcaide1@gmail.com>
Cheng Wan | 1bdd010291 | Revert "Deprecate global_server_args_dict" (#11520) | 2025-10-12 17:40:40 -07:00
Liangsheng Yin | 1083e7e3df | Deprecate global_server_args_dict (#11331) | 2025-10-13 01:20:47 +08:00
Yuwei An | 4ac8e09df0 | Piecewise CUDA Graph Support & Torch Compile Backend (#10062) | 2025-10-12 11:55:57 +08:00
    Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Lianmin Zheng | b4408e6098 | Revert "fix: fix video input for qwen3-vl" (#11437) | 2025-10-10 12:44:40 -07:00
Mick | a1a20b4c7c | fix: fix video input for qwen3-vl (#11361) | 2025-10-10 04:35:35 -07:00
Lianmin Zheng | 9b8ebb2798 | move more files under srt/utils (#11285) | 2025-10-09 16:46:15 -07:00
Trevor Morris | a4b424c632 | [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (#11309) | 2025-10-08 23:59:46 -07:00
Netanel Haber | d6837aea4d | model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (#10909) | 2025-10-09 00:37:38 +08:00
    Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Lifu Huang | edefab0c64 | [2/2] Support MHA prefill with FlashAttention 4. (#10937) | 2025-10-08 00:54:20 -07:00
    Co-authored-by: Hieu Pham <hyhieu@gmail.com>
YAMY | 5a9170d993 | Optimize copy_kv_cache for spec decoding (#11126) | 2025-10-08 10:43:30 +08:00
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Liangsheng Yin | 501dfa6b42 | Remove sampling info events and overlap thread file (#11300) | 2025-10-07 21:34:25 +08:00
Zhiyu | 155cbb51f0 | Enable native ModelOpt quantization support (1/3) (#7149) | 2025-10-06 13:24:15 -07:00
    Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
fzyzcjy | efbc687c28 | Support DeepSeek V3.2 Exp (#11061) | 2025-10-06 00:24:15 -07:00
    Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com>
    Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com>
    Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com>
    Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
    Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com>
    Co-authored-by: DarkSharpness <2040703891@qq.com>
    Co-authored-by: hnyls2002 <lsyincs@gmail.com>
    Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
    Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
    Co-authored-by: HAI <hixiao@gmail.com>
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Yuan Luo | 590f2da052 | [Feat] Support Torch Symm Mem AllReduce (#10571) | 2025-10-05 13:55:19 -07:00
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
fzyzcjy | fdc4e1e570 | Tiny move files to utils folder (#11166) | 2025-10-03 22:40:06 +08:00
Matt Nappo | 8c57490210 | [Feature] Option to save model weights to CPU when memory saver mode is enabled (#10873) | 2025-10-03 16:48:19 +08:00
    Co-authored-by: molocule <34072934+molocule@users.noreply.github.com>
jacky.cheng | b00a0c786f | [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (#11161) | 2025-10-02 21:19:30 -07:00
Dom Brown | e810077488 | Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (#11138) | 2025-10-02 16:04:58 -07:00
ilyasch2 | 083629c235 | [model] Add mamba2 and Falcon-H1 support. (#10988) | 2025-10-02 19:15:36 +08:00
    Co-authored-by: Younes Belkada <younes.belkada@tii.ae>
    Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
fzyzcjy | 2ac453b07f | Tiny detect slow ranks (#10508) | 2025-10-02 18:00:33 +08:00
li-kesen | 2bc61dd194 | Remove hybrid_linear_attn attention backend and refactor attention registry (#10816) | 2025-09-30 10:16:16 +08:00
    Co-authored-by: Yi Zhang <1109276519@qq.com>
Lianmin Zheng | dda34c2f93 | Fix mem fraction static for nightly tests (#11076) | 2025-09-29 12:57:41 -07:00
amysaq2023 | 2bdaf482f9 | refactor loading weights from remote instance coding format (#10941) | 2025-09-26 15:25:39 -07:00
    Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
ronnie_zheng | e22f3a5ec9 | [Ascend] optimize Qwen3 on Ascend (#10574) | 2025-09-22 17:18:36 -07:00
    Co-authored-by: c30031083 <chenxu140@huawei.com>
Qiaolin Yu | e2ac7888b8 | [2/2] Support deterministic inference for temperature > 0 (#10678) | 2025-09-21 19:36:08 -07:00
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Stefan He | 86527a4799 | [deterministic inference] Move batch invariant pkg to sglang (#10695) | 2025-09-21 19:35:14 -07:00
Lifu Huang | 08ecd0aa2a | [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization (#10592) | 2025-09-20 22:47:48 -07:00
Baizhou Zhang | 8ecef73f12 | [1/2] Support deterministic inference with flashinfer attention backend (#10645) | 2025-09-19 23:34:29 -07:00
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Yineng Zhang | 60e2a7cead | [Auto Sync] Update model_runner.py (20250920) (#10679) | 2025-09-19 18:26:54 -07:00
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Zhihao Zhang | e7bc600304 | [Feature] Speculative decoding support lookahead (#9873) | 2025-09-18 16:42:41 -07:00
    Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>