Cheng Wan | 5b214b50b6 | [Refactor] move deep_gemm_wrapper out of quantization (#11784) | 2025-10-17 18:57:54 -07:00
fzyzcjy | 33e9bbec35 | Make single-batch overlap compatible with offloading (#11614) | 2025-10-18 08:45:54 +08:00
fzyzcjy | dcb8f090ad | Super tiny fix CI (#11788) | 2025-10-17 17:41:58 -07:00
fzyzcjy | 8af8491298 | Support casting bf16 NextN moe to fp8 (#11613) | 2025-10-18 08:02:15 +08:00
fzyzcjy | 505329cab0 | Support shared experts overlap in cutlass moe (#11611) | 2025-10-18 07:59:40 +08:00
Chang Su | 627974405d | [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files (#11685) | 2025-10-17 16:49:46 -07:00
Even Zhou | 3cceaa381a | [Bugfix] Fix Qwen3/DSV3/DSV3.2 model support (#11510) | 2025-10-16 15:14:09 +08:00
Xun Sun | a40229f6f8 | [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423); Co-authored-by: Hank Han <hanhan7630@outlook.com>; Co-authored-by: Shangming Cai <csmthu@gmail.com> | 2025-10-14 19:40:54 -07:00
Liangsheng Yin | 516738b096 | Depreate global_server_args_dict (#11528) | 2025-10-13 19:34:43 +08:00
Cheng Wan | 1bdd010291 | Revert "Deprecate global_server_args_dict" (#11520) | 2025-10-12 17:40:40 -07:00
Liangsheng Yin | 1083e7e3df | Deprecate global_server_args_dict (#11331) | 2025-10-13 01:20:47 +08:00
Liu-congo | c80a96dae9 | [BugFix] test_mla_fp8.py fails on Cublas 12.9 (#11360); Signed-off-by: Liu-congo <1502632128@qq.com> | 2025-10-10 21:14:24 -07:00
fzyzcjy | efbc687c28 | Support DeepSeek V3.2 Exp (#11061); Co-authored-by: Stefan He <11166516+hebiao064@users.noreply.github.com>; Co-authored-by: Liangsheng Yin <95566987+hnyls2002@users.noreply.github.com>; Co-authored-by: Baizhou Zhang <56809903+fridge003@users.noreply.github.com>; Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>; Co-authored-by: ZhengdQin <46387172+zhengdqin@users.noreply.github.com>; Co-authored-by: DarkSharpness <2040703891@qq.com>; Co-authored-by: hnyls2002 <lsyincs@gmail.com>; Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>; Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>; Co-authored-by: HAI <hixiao@gmail.com>; Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2025-10-06 00:24:15 -07:00
fzyzcjy | b65db0287b | Tiny cleanup deepseek_v2.py (#11163) | 2025-10-02 21:54:52 +08:00
fzyzcjy | 5e786cca3a | Support single batch overlap (#10422) | 2025-10-02 18:04:36 +08:00
fzyzcjy | 0b9dfba787 | Support dispatch low latency (#10263); Co-authored-by: Kaixi Hou <4001424+kaixih@users.noreply.github.com> | 2025-10-02 18:02:19 +08:00
fzyzcjy | f35def8652 | Fuse quantize and rope in trtllm_mla MTP (#10779) | 2025-10-02 17:59:37 +08:00
fzyzcjy | 44b1fbe258 | Fix DeepSeek chunked prefill memory issue (#11149) | 2025-10-01 23:56:59 -07:00
Even Zhou | d27a6f7092 | [Feature] Add MLAProcess for DeepSeek MLA on NPU (#10130) | 2025-09-22 17:17:48 -07:00
Yineng Zhang | f67d1f45bc | [Auto Sync] Update deepseek_v2.py (20250922) (#10717); Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>; Co-authored-by: Michael Granado <mgranado@together.ai> | 2025-09-21 17:43:50 -07:00
Yineng Zhang | 7c876de7f5 | fix: remove awq_dequantize deps (#10686) | 2025-09-20 01:47:01 -07:00
Yineng Zhang | b17e67df36 | [Auto Sync] Update deepseek_v2.py (20250920) (#10683); Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> | 2025-09-19 23:43:31 -07:00
Shu Wang | 124097fc5b | enable prefix cache with dp (#10459) | 2025-09-16 18:26:58 -07:00
cicirori | a2f7218a2e | support using fa4 on deepseek on blackwell (#9928) | 2025-09-16 16:16:06 -07:00
fzyzcjy | 059c13de5c | Fix trtllm_moe wrong correction bias (#10440) | 2025-09-15 01:02:05 -07:00
Cheng Wan | 4844fac91d | Refactor TopK to ensure readability and extensibility (#9338) | 2025-09-14 19:16:25 -07:00
fzyzcjy | 258d02c86d | Fix correction bias undefined behavior for nvfp4 models (#10426) | 2025-09-14 18:41:09 -07:00
fzyzcjy | fa46e2bd40 | Support offloading in fp8 (#9948) | 2025-09-14 01:14:28 -07:00
fzyzcjy | b047b553c2 | [2/2] Speed up prefill mla attention concat (#10157) | 2025-09-14 01:12:04 -07:00
Shu Wang | 36acd2ff16 | Fix chunked prefix cache for nvfp4 (#10180); Co-authored-by: Elfie Guo <elfieg@nvidia.com> | 2025-09-12 03:20:30 -07:00
Shu Wang | 3df05f4d6a | [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#9199) | 2025-09-11 20:18:43 -07:00
Hubert Lu | 91b3555d2d | Add tests to AMD CI for MI35x (#9662); Co-authored-by: Sai Enduri <saimanas.enduri@amd.com> | 2025-09-10 12:50:05 -07:00
Lianmin Zheng | 4582931ac3 | Revert "Revert the changes on NCCL symmetric memory" (#10238) | 2025-09-09 12:11:49 -07:00
Lianmin Zheng | d352c29aa0 | Revert the changes on NCCL symmetric memory (#10210); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2025-09-09 11:01:33 -07:00
Baizhou Zhang | 8ad700f735 | Cleaning codes for speculative attention mode (#10149) | 2025-09-08 17:38:06 -07:00
cicirori | 8c5930f08a | Add speculator attention backend switch (#9981) | 2025-09-07 21:44:36 -07:00
kk | 400d3b97ae | Fix run time error in dsv3-fp8 model on mi35x (#10104); Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: HaiShaw <hixiao@gmail.com>; Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> | 2025-09-07 20:45:17 -07:00
Jinyang Yuan | 012584ecd5 | perf: Avoid unnecessary data type conversions for DeepSeek-V3 on Blackwell (#9834); Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> | 2025-09-06 14:06:46 +08:00
Elfie Guo | bebd0576e5 | Integrate trtllm ragged attention for prefill self-attention (#9801) | 2025-09-05 17:18:00 +08:00
kk | 918e3d4c27 | Fix accuracy drop of dsv3 run in dp enablement (#8677); Co-authored-by: wunhuang <wunhuang@amd.com> | 2025-09-04 16:51:16 -07:00
kk | e96973742c | Optimized deepseek-v3/r1 model performance on mxfp4 run (#10008); Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: HAI <hixiao@gmail.com>; Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> | 2025-09-04 15:11:22 -07:00
Elfie Guo | 66d5d0425c | Minor update regarding issue #9704 (#9733) | 2025-09-03 16:52:07 -07:00
Yineng Zhang | 1b2ff4fb7f | Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" (#9959) | 2025-09-03 00:50:04 -07:00
kk | 0dfd54d11d | Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671); Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: wghuang <wghuang@amd.com> | 2025-09-02 22:26:28 -07:00
chenxj | d4a938417d | [feat] Support tp mode for DeepSeek-R1-W4AFP8 (#8118); Co-authored-by: yuhyao <827623970@qq.com> | 2025-09-01 22:17:26 -07:00
Yineng Zhang | 8abe8deae6 | fix: dsv3 lite q_lora_rank none (#9815) | 2025-08-29 23:24:14 -07:00
chenxu140 | 74dd4249ac | [Feature] Support NPUGraph for DeepSeek on Ascend NPU (#9355); Co-authored-by: Even Zhou <even.y.zhou@outlook.com> | 2025-08-28 16:06:24 -07:00
Lianmin Zheng | fd71b11b1d | move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py (#9679) | 2025-08-27 03:34:29 -07:00
ZhengdQin | f92b729d52 | [new feat] ascend backend support fia fusion kernel (#8328); Co-authored-by: Even Zhou <even.y.zhou@outlook.com> | 2025-08-25 23:13:08 -07:00
fzyzcjy | 2600fc0d47 | Overlapped weight offload (#8034) | 2025-08-23 02:06:46 -07:00