Liangsheng Yin
|
5ea96ac7cc
|
Reduce one step decode for draft model. (#11561)
|
2025-10-14 23:52:04 +08:00 |
|
Liangsheng Yin
|
5a33c3aae7
|
Optimize Triton Draft Backend (#11556)
|
2025-10-14 20:08:32 +08:00 |
|
Qiaolin Yu
|
e4358a4585
|
Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json (#11587)
|
2025-10-14 13:24:43 +08:00 |
|
fzyzcjy
|
cb8ed2c09a
|
Make DeepEP combine recv do not overlap (#11535)
|
2025-10-13 18:40:42 -07:00 |
|
Trevor Morris
|
384733639a
|
[DSv32] Use torch.compile for _get_logits_head_gate (#11565)
|
2025-10-13 18:38:39 -07:00 |
|
Yuwei An
|
932e263725
|
Compilation Folder Reset (#11539)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
|
2025-10-14 09:19:12 +08:00 |
|
fzyzcjy
|
065ce81574
|
Tiny cleanup fp4 gemm calls (#11537)
|
2025-10-13 14:48:22 -07:00 |
|
Lianmin Zheng
|
5e3f7e7fa9
|
Minor: improve sampler & remove unused fields from model_config.py (#11531)
|
2025-10-13 11:04:44 -07:00 |
|
Liangsheng Yin
|
acc2327bbd
|
Move deep gemm related arguments to sglang.srt.environ (#11547)
|
2025-10-14 00:34:35 +08:00 |
|
Mick
|
f35f120d70
|
fix: fix video input for qwen3-vl (#11442)
|
2025-10-13 09:30:43 -07:00 |
|
Mohammad Miadh Angkad
|
c7867b6702
|
[Fix] Add per_channel_quant parameter to MoE config functions (#11201)
|
2025-10-13 21:26:06 +08:00 |
|
Liangsheng Yin
|
516738b096
|
Deprecate global_server_args_dict (#11528)
|
2025-10-13 19:34:43 +08:00 |
|
Yuan Luo
|
0b6f535f66
|
[Reland] perf: optimize qwen-vl with symm mem allreduce (#11457)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-10-13 17:51:25 +08:00 |
|
Cheng Wan
|
1bdd010291
|
Revert "Deprecate global_server_args_dict" (#11520)
|
2025-10-12 17:40:40 -07:00 |
|
Lianmin Zheng
|
2ac46e94ef
|
Sync changes on io_struct.py and deterministic ops (#11498)
|
2025-10-12 16:03:10 -07:00 |
|
Liangsheng Yin
|
1083e7e3df
|
Deprecate global_server_args_dict (#11331)
|
2025-10-13 01:20:47 +08:00 |
|
Yi Zhang
|
4b15fa00f0
|
move fla env check position (#11500)
|
2025-10-12 06:40:45 -07:00 |
|
Liangsheng Yin
|
f49419061d
|
Move args from global_config to environ (#11332)
|
2025-10-12 21:29:31 +08:00 |
|
Liangsheng Yin
|
01e59e8247
|
Fix CI break by express-laned PRs. (#11499)
|
2025-10-12 21:06:06 +08:00 |
|
Mahmoud Ashraf
|
7b064f04f8
|
[bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends (#10172)
|
2025-10-12 20:28:16 +08:00 |
|
Kai-Hsun Chen
|
43190becfa
|
[chore][1/N] Avoid using default mutable parameters (#11478)
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
|
2025-10-12 20:26:39 +08:00 |
|
Yuwei An
|
4ac8e09df0
|
Piecewise CUDA Graph Support & Torch Compile Backend (#10062)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
|
2025-10-12 11:55:57 +08:00 |
|
Liangsheng Yin
|
20a6c0a63d
|
Beta spec-overlap for EAGLE (#11398)
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
|
2025-10-12 11:02:22 +08:00 |
|
ykcombat
|
c8452551ce
|
[Fix] Fix split prefill with fa3. (#11428)
|
2025-10-11 22:03:28 +08:00 |
|
fzyzcjy
|
bf3e7149be
|
Fix enable_v2 in int8 quant (#11470)
|
2025-10-11 21:56:30 +08:00 |
|
fzyzcjy
|
21337b22b9
|
Reland [1/2] Optimizations and refactors about quant kernel (#10312)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2025-10-11 15:59:03 +08:00 |
|
Binyao Jiang
|
451d15c44b
|
[DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton (#11450)
|
2025-10-10 23:13:46 -07:00 |
|
Lianmin Zheng
|
b4408e6098
|
Revert "fix: fix video input for qwen3-vl" (#11437)
|
2025-10-10 12:44:40 -07:00 |
|
Cheng Wan
|
52fcbbb8bd
|
Revert "perf: optimize qwen-vl with symm mem allreduce" (#11436)
|
2025-10-10 12:30:05 -07:00 |
|
Yuan Luo
|
3b9d97f335
|
perf: optimize qwen-vl with symm mem allreduce (#11381)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-10-10 22:24:45 +08:00 |
|
Mick
|
a1a20b4c7c
|
fix: fix video input for qwen3-vl (#11361)
|
2025-10-10 04:35:35 -07:00 |
|
Lianmin Zheng
|
9b8ebb2798
|
move more files under srt/utils (#11285)
|
2025-10-09 16:46:15 -07:00 |
|
Yineng Zhang
|
44cb060785
|
chore: upgrade flashinfer 0.4.0 (#11364)
|
2025-10-09 14:17:54 -07:00 |
|
Sundara Raman Ramachandran
|
53bd00d975
|
[Generative Score API] Multi-Item scoring with custom attention mask. (#10979)
|
2025-10-08 18:47:32 -07:00 |
|
Netanel Haber
|
d6837aea4d
|
model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (#10909)
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
|
2025-10-09 00:37:38 +08:00 |
|
Lifu Huang
|
edefab0c64
|
[2/2] Support MHA prefill with FlashAttention 4. (#10937)
Co-authored-by: Hieu Pham <hyhieu@gmail.com>
|
2025-10-08 00:54:20 -07:00 |
|
Cheng Wan
|
3c06b673af
|
[8/N] MoE Refactor: deprecate EPMoE (#11211)
|
2025-10-07 21:51:41 -07:00 |
|
Bowen Bao
|
cd4b39a900
|
[quantization] Properly ignore quantization for layers excluded in quant_config (#11205)
|
2025-10-07 14:06:05 -07:00 |
|
Zhiyu
|
155cbb51f0
|
Enable native ModelOpt quantization support (1/3) (#7149)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
|
2025-10-06 13:24:15 -07:00 |
|
fzyzcjy
|
efbc687c28
|
Support DeepSeek V3.2 Exp (#11061)
Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
|
2025-10-06 00:24:15 -07:00 |
|
Bowen Bao
|
baee08601b
|
[quantization] Enable aiter mxfp4 fused_moe for Quark (#10048)
Co-authored-by: HaiShaw <hixiao@gmail.com>
|
2025-10-05 19:51:34 -07:00 |
|
Yuan Luo
|
590f2da052
|
[Feat] Support Torch Symm Mem AllReduce (#10571)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-10-05 13:55:19 -07:00 |
|
Hank Han
|
666da3d59f
|
[fix]enable flashmla when using draft model P/D attention select (#11012)
|
2025-10-04 20:59:34 +08:00 |
|
b8zhong
|
a2faf8940c
|
[1/n] Enable DCA CUDA graph capture (#9537)
|
2025-10-03 11:30:00 +08:00 |
|
Dom Brown
|
e810077488
|
Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (#11138)
|
2025-10-02 16:04:58 -07:00 |
|
fzyzcjy
|
12d6818380
|
Tiny fix ep_gather behavior different in CI (#11130)
|
2025-10-02 21:55:53 +08:00 |
|
fzyzcjy
|
b65db0287b
|
Tiny cleanup deepseek_v2.py (#11163)
|
2025-10-02 21:54:52 +08:00 |
|
ilyasch2
|
083629c235
|
[model] Add mamba2 and Falcon-H1 support. (#10988)
Co-authored-by: Younes Belkada <younes.belkada@tii.ae>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
|
2025-10-02 19:15:36 +08:00 |
|
fzyzcjy
|
5e786cca3a
|
Support single batch overlap (#10422)
|
2025-10-02 18:04:36 +08:00 |
|
fzyzcjy
|
0b9dfba787
|
Support dispatch low latency (#10263)
Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com>
|
2025-10-02 18:02:19 +08:00 |
|