| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Xiaoyu Zhang | bf86c5e990 | restruct compressed_tensors_w8a8_fp8 (#5475) | 2025-04-19 04:52:15 -07:00 |
| fzyzcjy | 1e0806f30b | Fix DeepGEMM masked cannot be run on groups not being multiple or 4 (#5340) | 2025-04-18 22:38:07 -07:00 |
| Yineng Zhang | 2c11f9c2eb | chore: upgrade sgl-kernel 0.0.9.post2 (#5540) | 2025-04-18 21:17:23 -07:00 |
| Yineng Zhang | a6f892e5d0 | Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) | 2025-04-18 16:50:21 -07:00 |
| Yineng Zhang | 08b518d51f | fix util import (#5542) | 2025-04-18 15:06:46 -07:00 |
| yhyang201 | 4db463b1ad | [Model] Adding Qwen3 and Qwen3MoE (#4693) | 2025-04-18 09:51:29 -07:00 |
| Wenxuan Tan | bfa3922451 | Avoid computing lse in Ragged Prefill when there's no prefix. (#5476) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2025-04-18 01:13:57 -07:00 |
| liwenju0 | e465b08ddb | fix bug of VLLM_AVAILABLE not defined (#5497) | 2025-04-18 00:59:03 -07:00 |
| Xiaoyu Zhang | bed05878f6 | fix kimi vl running bug after rebase main (#5461) | 2025-04-18 00:17:34 -07:00 |
| strgrb | b2a189dd11 | use sglang_per_token_group_quant_fp8 from sgl-kernel instead of trion kernel (#5473) Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> | 2025-04-18 00:05:24 -07:00 |
| Chang Su | c776234b45 | Enable local attention during decode (#5479) | 2025-04-17 02:07:43 -07:00 |
| woodx | 3bface15e6 | Feat/support encoder model (like bert) (#4887) | 2025-04-17 01:50:48 -07:00 |
| Baizhou Zhang | 4fb05583ef | Deprecate disable-mla (#5481) | 2025-04-17 01:43:14 -07:00 |
| eigen | 8f783c1943 | [Model Support] unsloth/Phi-4-mini bnb model (#4982) Co-authored-by: yhyang201 <yhyang201@gmail.com>; Liangsheng Yin <hnyls2002@gmail.com>; Chayenne <zhaochen20@outlook.com>; Yineng Zhang <me@zhyncs.com> | 2025-04-16 19:58:20 -07:00 |
| Lianmin Zheng | 177320a582 | Clean up imports (#5467) | 2025-04-16 15:26:49 -07:00 |
| Baizhou Zhang | a42736bbb8 | Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) | 2025-04-15 22:01:22 -07:00 |
| Yineng Zhang | fa909dc3c4 | feat: update model_specific_adjustment (#5344) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> | 2025-04-15 14:45:15 -07:00 |
| fzyzcjy | 15e91d721b | Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (#5406) | 2025-04-15 01:33:47 -07:00 |
| Ximingwang-09 | 2dd6489468 | Add H20 dtype fp8_w8a8 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (#5291) Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> | 2025-04-14 18:40:31 -07:00 |
| lambert0312 | 61e7c4dd21 | Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (#5368) | 2025-04-14 18:39:44 -07:00 |
| Baizhou Zhang | f6772f1497 | [Fix] Turn off DeepGEMM by default (#5263) | 2025-04-14 17:45:44 -07:00 |
| Xiaoyu Zhang | 38076dea84 | apply fused moe gate in ds v3/r1 (#5371) Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2025-04-14 16:24:26 -07:00 |
| Ke Bao | 5e0a9b0981 | Apply deepseek cuda rope (#5385) Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2025-04-14 15:22:43 -07:00 |
| JieXin Liang | bdde237562 | [perf] experimental enhance fp8 per-tensor quant (#5370) | 2025-04-14 12:35:43 -07:00 |
| fzyzcjy | defede5073 | Fix DeepSeek DP Attention + torch compile (#5367) Co-authored-by: ispobock <ispobaoke@163.com> | 2025-04-14 01:07:58 -07:00 |
| Yineng Zhang | f58b929a51 | chore: upgrade sgl-kernel 0.0.8.post3 (#5342) | 2025-04-13 00:45:59 -07:00 |
| Xiaoyu Zhang | 3e4794aad8 | refine fused_moe tuning docs (#5294) | 2025-04-12 10:01:13 -07:00 |
| Xiaoyu Zhang | 690ec20587 | Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… (#5321) | 2025-04-12 10:00:03 -07:00 |
| Yineng Zhang | 57de7c6b5f | feat: use fa3 mla by default on hopper (#5210) Co-authored-by: yundai424 <yundai424@gmail.com>; hebiao064 <hebiaobuaa@gmail.com> | 2025-04-12 01:09:25 -07:00 |
| Qingquan Song | aea98512a8 | Fix fa3 window size setup (#5316) | 2025-04-11 23:37:52 -07:00 |
| Yineng Zhang | 611720919d | fix: use deepgemm only on hopper (#5310) | 2025-04-11 20:48:24 -07:00 |
| Xiaoyu Zhang | 60bcbf2a35 | remove moe_align_block_size torch.zeros in small batch/expert mode (#5298) | 2025-04-11 12:13:55 -07:00 |
| Mick | e53a0b3d5b | [fix] fix mrope positions not picked up (#5265) | 2025-04-11 01:29:45 -07:00 |
| Chang Su | aee62d744b | Optimize GPU memory usage in FlashAttentionBackend's strided indexing (#5262) Co-authored-by: ch-wan <cwan39@gatech.edu> | 2025-04-11 00:34:17 -07:00 |
| HAI | 8879944800 | ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (#5228) | 2025-04-10 18:19:57 -07:00 |
| Xiaoyu Zhang | f730362ee2 | reduce moe_align_block_size_kernel small batch mode overhead (#5086) | 2025-04-09 17:59:35 -07:00 |
| Zhaoyang Hao | 456b008bd8 | Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (#5196) | 2025-04-09 11:54:36 -07:00 |
| HandH1998 | 4065248214 | Support Llama4 fp8 inference (#5194) Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>; sleepcoo <sleepcoo@gmail.com>; zhyncs <me@zhyncs.com> | 2025-04-09 20:14:34 +08:00 |
| kk | 92823069c4 | Fix ci test "test_eval_fp8_accuracy" failed (#5185) Co-authored-by: wunhuang <wunhuang@amd.com> | 2025-04-09 02:44:05 -07:00 |
| Cheng Wan | 76c48a0913 | [DeepEP] fix: import buffer error (#5179) | 2025-04-08 22:12:14 -07:00 |
| Yineng Zhang | 6669d12707 | feat: add DeepGEMM build warning (#5176) Co-authored-by: grimoire <streetyao@live.com> | 2025-04-08 21:16:23 -07:00 |
| Jinyan Chen | bc3f6db2dd | [Fix] DeepEP Compatibility with Low Latency (#5068) Co-authored-by: ch-wan <cwan39@gatech.edu> | 2025-04-08 20:31:31 -07:00 |
| Chang Su | aac531c53b | [Bugfix] Fix index out of bounds in local attention with large sequences (#5173) | 2025-04-08 18:43:13 -07:00 |
| Trevor Morris | 11d760d56a | FP4 weight loading and inference (2/2) (#3972) | 2025-04-08 17:26:21 -07:00 |
| Yun Dai | 2695ab0537 | Fix loading KV quantization scale; Enable modelopt kv cache (#4686) Co-authored-by: qingquansong <ustcsqq@gmail.com> | 2025-04-08 09:11:35 -07:00 |
| kk | 88d6fd9a11 | Fix torch compile errors (#5158) | 2025-04-08 15:04:37 +00:00 |
| Yubo Wang | 804d9f2e4c | Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 (#4760) | 2025-04-07 23:20:51 -07:00 |
| Chunan Zeng | a7c3f74bec | [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103) | 2025-04-07 22:58:08 -07:00 |
| kk | 5a144a8ab9 | Fix run time error in ROCm platform (#5147) Co-authored-by: wunhuang <wunhuang@amd.com>; root <root@dell300x-pla-t10-17.pla.dcgpu> | 2025-04-07 22:49:40 -07:00 |
| Hubert Lu | afb752bcbe | [AMD] Fix missing per_token_group_quant_fp8 for ROCm (#5140) | 2025-04-07 22:38:25 -07:00 |