huangtingwei | 5fbafbb8f8 | 2025-04-13 12:37:26 -07:00
fix MLATokenToKVPoolHost get_size_per_token bug (#5161)
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>

Byron Hsu | a9499885e9 | 2025-04-14 01:39:39 +08:00
[PD] Add transfer backend abstraction (#5328)

Liangsheng Yin | f765579046 | 2025-04-14 01:25:30 +08:00
Fix typo: infight -> inflight (#5357)

Yineng Zhang | f58b929a51 | 2025-04-13 00:45:59 -07:00
chore: upgrade sgl-kernel 0.0.8.post3 (#5342)

mlmz | 8311b07fb9 | 2025-04-12 22:50:37 -07:00
Fix: Ensure tensors for dist.broadcast match NCCL backend device (#5322)

Yineng Zhang | 7d3b7c87f5 | 2025-04-12 19:59:13 -07:00
fix: determine if flashinfer is installed (#5336)

tianlian yi | bc92107b03 | 2025-04-12 10:07:52 -07:00
Support server based rollout in Verlengine (#4848)
Co-authored-by: Jin Pan <jpan236@wisc.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com>

Xiaoyu Zhang | 3e4794aad8 | 2025-04-12 10:01:13 -07:00
refine fused_moe tuning docs (#5294)

Xiaoyu Zhang | 690ec20587 | 2025-04-12 10:00:03 -07:00
Delete python/sglang/srt/layers/moe/fused_moe_triton/configs/E=257,N=… (#5321)

Yineng Zhang | 57de7c6b5f | 2025-04-12 01:09:25 -07:00
feat: use fa3 mla by default on hopper (#5210)
Co-authored-by: yundai424 <yundai424@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>

Qingquan Song | aea98512a8 | 2025-04-11 23:37:52 -07:00
Fix fa3 window size setup (#5316)

lambert0312 | 1b1b47a949 | 2025-04-11 23:33:51 -07:00
Fix w8a8_int8 model shared experts fusion load weights error (#5120)

Mick | 34ef6c8135 | 2025-04-11 21:46:58 -07:00
[VLM] Adopt fast image processor by default (#5065)

Yineng Zhang | 611720919d | 2025-04-11 20:48:24 -07:00
fix: use deepgemm only on hopper (#5310)

Yineng Zhang | f774a0d275 | 2025-04-11 13:08:53 -07:00
feat: add blackwell Dockerfile (#5302)

Xiaoyu Zhang | 60bcbf2a35 | 2025-04-11 12:13:55 -07:00
remove moe_align_block_size torch.zeros in small batch/expert mode (#5298)

Yusong Gao | c35dcfdb30 | 2025-04-11 23:03:07 +08:00
[PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout (#5292)

Mick | e53a0b3d5b | 2025-04-11 01:29:45 -07:00
[fix] fix mrope positions not picked up (#5265)

Cheng Wan | 038bc5d521 | 2025-04-11 01:24:14 -07:00
Support --enable-llama4-multimodal (#5254)

Chang Su | aee62d744b | 2025-04-11 00:34:17 -07:00
Optimize GPU memory usage in FlashAttentionBackend's strided indexing (#5262)
Co-authored-by: ch-wan <cwan39@gatech.edu>

fzyzcjy | cd7e32e2cb | 2025-04-11 00:32:41 -07:00
Optimize attention in llama4 (#5127)

HAI | 8879944800 | 2025-04-10 18:19:57 -07:00
ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations (#5228)

Richard Zou | a879811c4b | 2025-04-10 18:08:45 -07:00
Fix torch.compile cacheing (#5259)
Co-authored-by: zhyncs <me@zhyncs.com>

Ke Bao | 1078396f47 | 2025-04-10 09:12:44 -07:00
Update deps for mllama4 (#5215)

Teng Ma | 7e4f72dd8c | 2025-04-10 20:05:34 +08:00
[PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (#5204)

Teng Ma | 4c31ae9f6d | 2025-04-10 14:23:23 +08:00
[PD] Support KV transfer with mooncake (#4880)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>
Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: shangmingc <csmthu@gmail.com>

Xiaoyu Zhang | f730362ee2 | 2025-04-09 17:59:35 -07:00
reduce moe_align_block_size_kernel small batch mode overhead (#5086)

fzyzcjy | e3c4bd3153 | 2025-04-09 17:43:22 -07:00
Fix DeepSeek error when using DeepEP mode (#5190)

Stefan He | 5db37c8626 | 2025-04-09 17:19:27 -07:00
[metrics] Add in queue metrics (#4444)

Yineng Zhang | 4cb53ecd0c | 2025-04-09 14:16:13 -07:00
fix: log warning when disable cuda graph (#5209)

Zhaoyang Hao | 456b008bd8 | 2025-04-09 11:54:36 -07:00
Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 (#5196)

saienduri | 7f875f1293 | 2025-04-09 11:09:47 -07:00
update grok test (#5171)

Mick | fbebcb7aa4 | 2025-04-09 09:28:44 -07:00
model: support mllama4 (#5144)

HandH1998 | 4065248214 | 2025-04-09 20:14:34 +08:00
Support Llama4 fp8 inference (#5194)
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: zhyncs <me@zhyncs.com>

fzyzcjy | 86a876d883 | 2025-04-09 02:50:22 -07:00
Optimize topk operation in llama4 (#5128)

kk | 92823069c4 | 2025-04-09 02:44:05 -07:00
Fix ci test "test_eval_fp8_accuracy" failed (#5185)
Co-authored-by: wunhuang <wunhuang@amd.com>

fzyzcjy | 61970b08d8 | 2025-04-08 23:44:25 -07:00
Let bench_one_batch support enable_dp_attention (#4058)

Cheng Wan | 76c48a0913 | 2025-04-08 22:12:14 -07:00
[DeepEP] fix: import buffer error (#5179)

Yineng Zhang | 90caf06c00 | 2025-04-08 21:56:53 -07:00
fix: use DeepEPDispatcher on CUDA (#5180)

Yineng Zhang | 6669d12707 | 2025-04-08 21:16:23 -07:00
feat: add DeepGEMM build warning (#5176)
Co-authored-by: grimoire <streetyao@live.com>

Jinyan Chen | bc3f6db2dd | 2025-04-08 20:31:31 -07:00
[Fix] DeepEP Compatibility with Low Latency (#5068)
Co-authored-by: ch-wan <cwan39@gatech.edu>

Chang Su | aac531c53b | 2025-04-08 18:43:13 -07:00
[Bugfix] Fix index out of bounds in local attention with large sequences (#5173)

fzyzcjy | 466899e69c | 2025-04-08 18:42:26 -07:00
Fix multimodal hashing error (#5174)

Trevor Morris | 11d760d56a | 2025-04-08 17:26:21 -07:00
FP4 weight loading and inference (2/2) (#3972)

fzyzcjy | 5039d54772 | 2025-04-08 14:55:14 -07:00
Support 2x8xH100 for Llama 4 (#5159)

XinyuanTong | d09a51f1f6 | 2025-04-08 14:48:07 -07:00
[feat&refactor] Enhance multimodal input support with refactor io_struct (#4938)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Byron Hsu | 6d3b35fae9 | 2025-04-08 09:42:34 -07:00
[PD] Simplify mini LB (#4911)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>

shangmingc | 89a554181f | 2025-04-08 09:15:06 -07:00
[PD] Fix unclosed prefill connection warning of mini_lb (#5155)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

Yun Dai | 2695ab0537 | 2025-04-08 09:11:35 -07:00
Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
Co-authored-by: qingquansong <ustcsqq@gmail.com>

kk | 88d6fd9a11 | 2025-04-08 15:04:37 +00:00
Fix torch compile errors (#5158)