Commit Graph

365 Commits

Author SHA1 Message Date
Ke Bao
799c4bb502 Fuse MLA set kv cache kernel (#5748) 2025-04-26 18:42:22 -07:00
ZXN
04d0123fd9 [Fix]: support deepseek-vl2-tiny model (#5552)
Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>
2025-04-26 17:52:53 +08:00
Ke Bao
c3948ba67e Reorder loop in shared expert weight loading (#5719) 2025-04-25 17:27:42 -07:00
Lianmin Zheng
5641a09458 Revert "[Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct)" (#5754) 2025-04-25 15:50:28 -07:00
Brayden Zhong
43fb95c2fa [Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct) (#5078)
Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
2025-04-25 15:24:09 +08:00
Baizhou Zhang
a14654dd68 Fix weight loading bug for Deepseek v3+nextn (#5684) 2025-04-24 21:29:56 +08:00
Yuhong Guo
5d93a950ee [BugFix] Fix combination of MTP and --n-share-experts-fusion with R1 (#5707) 2025-04-24 21:13:51 +08:00
Mick
c998d04b46 vlm: enable radix cache for qwen-vl models (#5349)
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
2025-04-23 20:35:05 -07:00
fzyzcjy
71d1785f2d Remove unnecessary torch.full in DeepSeek (#5601) 2025-04-22 21:24:29 -07:00
Baizhou Zhang
3f87f83116 Fuse q_a_proj and kv_a_proj (#5619) 2025-04-22 20:35:08 -07:00
Ke Bao
6b6e748775 Remove q concat in FA3 backend for DeepSeek decode (#5638) 2025-04-22 11:43:12 -07:00
lambert0312
76d17c7ecb Fix shared experts fusion error without quantization (#5632) 2025-04-22 09:22:26 -07:00
JieXin Liang
4418f599a5 Fix FA3 DeepSeek prefill performance regression (#5624)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-04-22 01:41:41 -07:00
Yineng Zhang
04f2abcb34 fix: gemma 3 not use softcap (#5622) 2025-04-22 01:16:08 -07:00
JieXin Liang
506be6b892 [fix] fix compile_deep_gemm missing kv_b_proj (#5620) 2025-04-22 00:06:36 -07:00
Ke Bao
11b23ae97b Remove extra copy in deepseek forward absorb (#5578)
Co-authored-by: saienduri <saimanas.enduri@amd.com>
2025-04-21 19:33:21 -07:00
JieXin Liang
c2942907d5 [feature] enable pre compile jit deep_gemm (#5580) 2025-04-21 16:52:53 -07:00
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
fzyzcjy
5239d79568 Speedup shared expert weight construction by avoiding cloning (#5188) 2025-04-20 18:12:01 -07:00
Brayden Zhong
b868526d94 Fix one more issue reported by torchfix (#4859) 2025-04-20 17:49:27 -07:00
Xiaoyu Zhang
d9dd529854 enable DeepSeek V3 shared_experts_fusion in sm90 (#5571) 2025-04-20 12:46:42 -07:00
fzyzcjy
0a0dd34e6a Fix BumpAllocator error when no input_ids (#5564) 2025-04-20 02:20:53 -07:00
JieXin Liang
99456bcacb [perf] introduce deep gemm group_gemm_masked as bmm (#5432) 2025-04-20 00:38:27 -07:00
fzyzcjy
613b197e57 Remove one kernel in per_tensor_quant_mla_fp8 (#5549) 2025-04-19 15:08:15 -07:00
Xiaoyu Zhang
d58e354472 simplify the control logic for using shared experts fusion (#5504) 2025-04-19 13:17:35 -07:00
yhyang201
4db463b1ad [Model] Adding Qwen3 and Qwen3MoE (#4693) 2025-04-18 09:51:29 -07:00
fzyzcjy
53dcf38876 Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (#4836) 2025-04-17 21:38:26 -07:00
Michael Feil
1effba4c70 Configuration qwen2_moe.py - qkv_bias now in transformers (#5512) 2025-04-17 21:23:22 -07:00
Xuchun Shang
06d0a3d92b [Bug fix] use correct func path in deepseek (#5496)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
2025-04-17 02:41:41 -07:00
fzyzcjy
8beb356f0d Refactor DeepSeek decoder layer branches (#5205) 2025-04-17 02:11:11 -07:00
woodx
3bface15e6 Feat/support encoder model (like bert) (#4887) 2025-04-17 01:50:48 -07:00
Baizhou Zhang
4fb05583ef Deprecate disable-mla (#5481) 2025-04-17 01:43:14 -07:00
eigen
8f783c1943 [Model Support] unsloth/Phi-4-mini bnb model (#4982)
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-04-16 19:58:20 -07:00
Lianmin Zheng
177320a582 Clean up imports (#5467) 2025-04-16 15:26:49 -07:00
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
ryang
bc24205b32 Support BNB quantization for llama/mllama (#5038)
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
2025-04-15 18:00:31 -07:00
Yangcheng Li
ee9d6ca677 [fix/misc] remove duplicate row in deepseek v2 model (#5279) 2025-04-14 18:41:24 -07:00
JieXin Liang
bdde237562 [perf] experimental enhance fp8 per-tensor quant (#5370) 2025-04-14 12:35:43 -07:00
yhyang201
072df75354 Support for Qwen2.5-VL Model in bitsandbytes Format (#5003) 2025-04-14 02:03:40 -07:00
yulei
adca585bfb [DeepEP] Reduce routed scaling overhead (#5277)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-04-13 16:03:09 -07:00
lambert0312
1b1b47a949 Fix w8a8_int8 model shared experts fusion load weights error (#5120) 2025-04-11 23:33:51 -07:00
Mick
34ef6c8135 [VLM] Adopt fast image processor by default (#5065) 2025-04-11 21:46:58 -07:00
Mick
e53a0b3d5b [fix] fix mrope positions not picked up (#5265) 2025-04-11 01:29:45 -07:00
fzyzcjy
cd7e32e2cb Optimize attention in llama4 (#5127) 2025-04-11 00:32:41 -07:00
fzyzcjy
e3c4bd3153 Fix DeepSeek error when using DeepEP mode (#5190) 2025-04-09 17:43:22 -07:00
Mick
fbebcb7aa4 model: support mllama4 (#5144) 2025-04-09 09:28:44 -07:00
HandH1998
4065248214 Support Llama4 fp8 inference (#5194)
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: zhyncs <me@zhyncs.com>
2025-04-09 20:14:34 +08:00
fzyzcjy
86a876d883 Optimize topk operation in llama4 (#5128) 2025-04-09 02:50:22 -07:00
Yineng Zhang
90caf06c00 fix: use DeepEPDispatcher on CUDA (#5180) 2025-04-08 21:56:53 -07:00
Jinyan Chen
bc3f6db2dd [Fix] DeepEP Compatibility with Low Latency (#5068)
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-04-08 20:31:31 -07:00