Yineng Zhang
|
7e0976133c
|
update sgl-kernel version for srt (#3150)
|
2025-01-26 20:22:34 +08:00 |
|
Lianmin Zheng
|
f4a92f4b56
|
Temporarily skip the openai frontend tests (#3151)
|
2025-01-26 04:17:35 -08:00 |
|
Yineng Zhang
|
318260c0fa
|
chore: bump 0.0.2.post18 for sgl-kernel (#3149)
|
2025-01-26 19:00:34 +08:00 |
|
Lianmin Zheng
|
4a61253123
|
Do not load OPENAI_KEY from secrets (#3147)
|
2025-01-26 01:54:03 -08:00 |
|
Lianmin Zheng
|
d1a0863251
|
Add a test case for cached_tokens (#3145)
|
2025-01-26 01:39:28 -08:00 |
|
Hubert Lu
|
f8b28e461a
|
Add CPU affinity setting to latency benchmark (#3085)
|
2025-01-25 23:52:05 -08:00 |
|
HandH1998
|
82392da830
|
support w8a8 fp8 kernel with CUTLASS (#3047)
Co-authored-by: yych0745 <1398089567@qq.com>
|
2025-01-26 15:46:51 +08:00 |
|
Yineng Zhang
|
95f789adb0
|
minor: cleanup sgl-kernel (#3143)
|
2025-01-26 14:29:58 +08:00 |
|
Lianmin Zheng
|
4f118a39d7
|
Fix repetition penalty (#3139)
|
2025-01-25 21:48:58 -08:00 |
|
yigex
|
66283dbc0c
|
[Fix] Not skip NVML Check on AMD Platform (#3135)
|
2025-01-25 21:33:51 -08:00 |
|
Yineng Zhang
|
822bae8c00
|
feat: cross python wheel for sgl-kernel (#3138)
|
2025-01-26 13:21:34 +08:00 |
|
Hui Liu
|
8e48ca8cc1
|
enable kv_scale for Gemma2 (#3113)
|
2025-01-25 18:29:14 -08:00 |
|
Lianmin Zheng
|
27acf63bbd
|
Use torch.compile for scaling penalty (#3133)
|
2025-01-25 18:27:33 -08:00 |
|
Lianmin Zheng
|
da6f8081f6
|
Fix CI tests (#3132)
|
2025-01-25 17:43:39 -08:00 |
|
yinfan98
|
9286740eff
|
feat: refactor sgl-kernel and use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#3130)
Co-authored-by: yinfan.1024 <yinfan.1024@bytedance.com>
Co-authored-by: yinfan98 <1106110035@qq.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2025-01-26 02:55:08 +08:00 |
|
Yineng Zhang
|
896c07441e
|
update installation doc for sgl-kernel (#3129)
|
2025-01-26 00:00:13 +08:00 |
|
Ke Bao
|
c23d5706f4
|
Update whl index path (#3128)
|
2025-01-25 23:57:09 +08:00 |
|
Ke Bao
|
67ad4338e1
|
Update tag name for whl release (#3127)
|
2025-01-25 23:14:35 +08:00 |
|
Yineng Zhang
|
3cab5f71ea
|
speedup pr test for sgl-kernel (#3126)
|
2025-01-25 21:37:48 +08:00 |
|
Yineng Zhang
|
14e754a868
|
chore: bump v0.0.2.post17 for sgl-kernel (#3125)
|
2025-01-25 20:43:02 +08:00 |
|
yizhang2077
|
98522149ff
|
mirror fix for custom allreduce (#3124)
|
2025-01-25 18:26:41 +08:00 |
|
Xiaoyu Zhang
|
5d9d15e70f
|
support fp32 in sampling_scaling_penalties kernel (#3121)
|
2025-01-25 16:52:17 +08:00 |
|
Ke Bao
|
665e5e85f6
|
Add step to update sgl-kernel whl index (#3110)
|
2025-01-25 02:03:01 +08:00 |
|
Ke Bao
|
a22f60a313
|
Add workflow for sgl-kernel cu118 release (#3109)
|
2025-01-24 22:30:30 +08:00 |
|
Yineng Zhang
|
04f0b4cbef
|
minor: update sgl-kernel setup (#3107)
|
2025-01-24 20:10:35 +08:00 |
|
Adarsh Shirawalmath
|
4505a43614
|
[Docs] minor update for phi-3 and phi-4 (#3096)
|
2025-01-24 04:00:20 -08:00 |
|
Trevor Morris
|
685a5738a7
|
Allow local cutlass directory to be used in sgl-kernel build (#3037)
|
2025-01-24 03:59:47 -08:00 |
|
Yineng Zhang
|
153b414e83
|
minor: sync flashinfer and add turbomind as 3rdparty (#3105)
|
2025-01-24 19:22:39 +08:00 |
|
Ke Bao
|
6619f48e18
|
Fix cu118 group gemm compile issue (#3097)
|
2025-01-24 15:19:09 +08:00 |
|
Byron Hsu
|
3ed0a547b2
|
[router] Fix twine uploading (#3095)
|
2025-01-23 21:01:01 -08:00 |
|
Byron Hsu
|
8d8ef8497e
|
bump router to 0.1.4 (#3094)
|
2025-01-23 20:32:43 -08:00 |
|
Byron Hsu
|
9a0cc2e90e
|
[router] Forward all request headers from router to workers (#3070)
|
2025-01-23 20:30:31 -08:00 |
|
Ke Bao
|
7bad7e75bf
|
Add shapes for int8 gemm benchmark (#3093)
|
2025-01-24 12:27:30 +08:00 |
|
simveit
|
1c4e0d2445
|
Docs: Update doc for server arguments (#2742)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2025-01-23 11:32:05 -08:00 |
|
Yineng Zhang
|
54bac8af0b
|
chore: bump sgl-kernel 0.0.2.post16 (#3087)
|
2025-01-24 01:57:48 +08:00 |
|
Yineng Zhang
|
5de4051bcf
|
feat: integrate sampling kernels into sgl-kernel (#3086)
Co-authored-by: Zihao Ye <expye@outlook.com>
|
2025-01-24 01:54:47 +08:00 |
|
Xiaoyu Zhang
|
e0cd65c2b6
|
[hotfix] fix test_sampling_scaling_penalties.py ci test (#3084)
|
2025-01-24 00:33:59 +08:00 |
|
Xiaoyu Zhang
|
f1b6861828
|
use flashinfer vec_dtypes in sgl_kernel (#3083)
|
2025-01-23 22:19:04 +08:00 |
|
Yineng Zhang
|
0da0989ad4
|
sync flashinfer and update sgl-kernel tests (#3081)
|
2025-01-23 21:13:55 +08:00 |
|
Yineng Zhang
|
07a22cbba3
|
use env variable to control the build conf on the CPU build node (#3080)
|
2025-01-23 20:46:49 +08:00 |
|
Yineng Zhang
|
3d0bfa3e17
|
update version setup for sgl-kernel (#3079)
|
2025-01-23 19:45:25 +08:00 |
|
Yineng Zhang
|
1f6cf0d4b9
|
fix build error for sgl-kernel (#3078)
|
2025-01-23 19:16:35 +08:00 |
|
Lianmin Zheng
|
553f5a3ffe
|
Remove torch dependency in sgl-kernel (#3074)
|
2025-01-23 17:23:37 +08:00 |
|
Xiaoyu Zhang
|
ac2dc35d0e
|
support lightning_attention_decode in sgl-kernel for MiniMax-Text-01 (#3030)
|
2025-01-23 15:29:20 +08:00 |
|
Yineng Zhang
|
3e032c07cc
|
use v0.6.4.post1 for sgl-kernel ci (#3071)
|
2025-01-23 14:19:38 +08:00 |
|
Yineng Zhang
|
44e12ce463
|
docs: update developer guide for sgl-kernel (#3069)
|
2025-01-23 14:08:25 +08:00 |
|
Yineng Zhang
|
a547aad61f
|
docs: add developer guide for sgl-kernel (#3068)
|
2025-01-23 13:47:53 +08:00 |
|
Lianmin Zheng
|
ea535dc574
|
Revert "disable custom allreduce on HIP" (#3067)
|
2025-01-22 21:33:35 -08:00 |
|
Ke Wen
|
862bcff833
|
Support loading of larger models with on-the-fly quantization (#3061)
|
2025-01-22 21:33:17 -08:00 |
|
Lianmin Zheng
|
8b84e69f25
|
Fix tp token sync for dp attention (#3062)
|
2025-01-22 18:51:40 -08:00 |
|