Hongbo Xu | a669bc2f74 | Replace sglang.srt.layers.quantization.scalar_types with sgl_kernel.scalar_type (#8951) | 2025-08-13 19:41:41 -07:00
Cheng Wan | a437aa9987 | [hotfix] fix mixtral with tensor-level compressed-tensor quantization (#8721) | 2025-08-02 22:59:25 -07:00
Xiaoyu Zhang | a167fd0bcb | [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (#8310) | 2025-07-24 14:38:30 +08:00
Hubert Lu | e50109f2ed | [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (#7484) | 2025-07-21 17:33:19 -07:00
Cheng Wan | 15ad6c9086 | [1/N] MoE Refactor: refactor select_experts (#7966) | 2025-07-19 00:51:15 -07:00
Enrique Shockwave | fd63b62eaa | fix compressed tensors WNA16 imports (#8142) | 2025-07-18 11:34:14 -07:00
Cheng Wan | 49b8777460 | Refactor: move all quantization-related code to srt/layer/quantization (#7989) | 2025-07-17 00:47:07 -07:00
narutolhy | 3e34e9004f | Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes (#7648) | 2025-06-30 21:51:01 -07:00
YanbingJiang | 094c116f7d | Update python API of activation, topk, norm and rope and remove vllm dependency (#6614) (Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>; jianan-gu <jianan.gu@intel.com>; sdp <sdp@gnr799219.jf.intel.com>) | 2025-06-17 22:11:50 -07:00
Yijie Zhu | a39d928782 | support qwen2 running on ascend npu device (#7022) (Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com>) | 2025-06-17 11:24:10 -07:00
Xiaoyu Zhang | 3712abfaf9 | Fuse routed scaling factor in deepseek (#6970) | 2025-06-08 15:24:24 -07:00
Yineng Zhang | 1fb76ebb93 | Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" (#6968) | 2025-06-07 21:02:49 -07:00
Xiaoyu Zhang | 515ef4facb | Fuse routed scaling factor in topk_reduce kernel (#6220) | 2025-06-07 11:06:50 -07:00
Cheng Wan | 81964328b7 | Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) | 2025-06-04 15:53:22 -07:00
Lianmin Zheng | e8e18dcdcc | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00
applesaucethebun | d738ab52f8 | fix some typos (#6209) (Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>) | 2025-05-13 01:42:38 +08:00
JieXin Liang | b70957fcf8 | [refactor] slightly tidy fp8 module (#5993) | 2025-05-07 17:28:24 -07:00
Juwan Yoo | 502524e2da | compressed_tensors: port w8a16 fp8 from vllm (#4852) | 2025-04-20 17:48:31 -07:00
Xiaoyu Zhang | d58e354472 | simplify the control logic for using shared experts fusion (#5504) | 2025-04-19 13:17:35 -07:00
Xiaoyu Zhang | bf86c5e990 | restruct compressed_tensors_w8a8_fp8 (#5475) | 2025-04-19 04:52:15 -07:00
liwenju0 | e465b08ddb | fix bug of VLLM_AVAILABLE not defined (#5497) | 2025-04-18 00:59:03 -07:00
Lianmin Zheng | 177320a582 | Clean up imports (#5467) | 2025-04-16 15:26:49 -07:00
HandH1998 | 4065248214 | Support Llama4 fp8 inference (#5194) (Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>; sleepcoo <sleepcoo@gmail.com>; zhyncs <me@zhyncs.com>) | 2025-04-09 20:14:34 +08:00
Xiaoyu Zhang | db452760e5 | [ci] fix llama4 ci error (#5126) | 2025-04-07 21:15:46 +08:00
Xiaoyu Zhang | 924ca7c92c | Add DeepSeek V3/R1 shared experts fusion (#4918) | 2025-04-04 01:59:29 -07:00
Xiaoyu Zhang | 04e3ff6975 | Support compressed tensors fp8w8a8 (#4743) | 2025-03-26 13:21:25 -07:00