275 Commits

Author SHA1 Message Date
Mick
9d02bb3e2a Urgent model support: support gemma-3-it (#4424) 2025-03-16 17:37:32 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
Yineng Zhang
65b7c9b78f cleanup deps 2/n (#4464) 2025-03-15 23:06:17 -07:00
Michael Feil
1fd0cf8a7b Update comment in qwen2.py (#4447) 2025-03-15 21:14:29 -07:00
Lianmin Zheng
c6d7f8d370 Add some fused elementwise kernels for grok-1 (#4398)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2025-03-13 13:39:10 -07:00
Lianmin Zheng
8e66fbecee Improve DP attention (#4390)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-03-13 08:23:56 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086)
Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
2025-03-12 22:26:29 -07:00
Mick
01090e8ac3 model: Support Janus-pro (#3203) 2025-03-12 11:02:11 -07:00
yych0745
6f43a9b9f4 remove the unused readline dependency from the Qwen2 model implementa… (#4340) 2025-03-12 02:47:27 -07:00
lambert0312
481f608b8e Add INT8 support MTP NextN function (#3911) 2025-03-12 01:37:16 -07:00
Yineng Zhang
d1da58e275 unify is_cuda and is_hip (#4321) 2025-03-11 18:12:56 -07:00
Mick
ff2ce0b86f refactor: move image processors to separate files (#4229) 2025-03-11 12:35:35 -07:00
shimin
ac69885056 fix the input_ids is None error (#4144) 2025-03-10 01:38:37 -07:00
DavidChan
4455b26e76 [Bug fixed] fixed the crash when enable the dp-attention on the single card (#3958) 2025-03-10 00:50:34 -07:00
Baizhou Zhang
9fb48f951f Support nextn for flashinfer mla attention backend (#4218) 2025-03-09 00:01:54 -08:00
HandH1998
c7f254468f [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) (#3888)
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: b0urnee <2769086541@qq.com>
2025-03-06 20:54:52 -08:00
Qubitium-ModelCloud
56a724eba3 [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790)
Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
2025-03-05 01:11:00 -08:00
Lianmin Zheng
e074d84e5b [Minor] more code cleanup (#4077) 2025-03-04 21:23:47 -08:00
Xiuyu Li
9545bfb28a fix: support gelu_new activation function in gpt2 (#3712) 2025-03-04 04:09:52 -08:00
Ke Bao
9fafa62db7 Share target model embed and head weights for nextn (#4033) 2025-03-03 13:30:04 -08:00
Lianmin Zheng
1a8f995c46 remove cache configs in model definitions (#4031) 2025-03-03 05:00:50 -08:00
Zhousx
7fbab730bd [feat] add small vocab table for eagle's draft model[1]. (#3822)
Co-authored-by: Achazwl <323163497@qq.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-03-02 18:58:45 -08:00
Baizhou Zhang
90a4b7d98a [Feature]Support ragged prefill in flashinfer mla backend (#3967)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
2025-02-28 18:13:56 -08:00
fzyzcjy
e3e0bc50a9 [Feature] SPMD for SGLang + Verl (#3852) 2025-02-28 09:53:10 -08:00
Nicolas Castet
127998cc41 Fix allgather ops inside cuda graphs (#3709) 2025-02-25 08:39:10 -08:00
Yueyang Pan
7036d6fc67 [Bug]: Add missing clamp to llavavid (#3787) 2025-02-24 19:10:15 -08:00
Chaitanya Sri Krishna Lolla
6ce9dbe828 [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 (#3237)
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-24 18:14:31 -08:00
laixin
1a6e97577a Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730)
Co-authored-by: HandH1998 <1335248067@qq.com>
2025-02-24 05:43:35 -08:00
Baizhou Zhang
b110084654 Refactor flashinfer logic for deepseek v3 and fix accuracy bug (#3785) 2025-02-24 04:07:25 -08:00
fzyzcjy
a3339d8cac Bug: Fix weight loader error when LM head weights are tied (#3766) 2025-02-21 17:53:12 -08:00
Chayenne
14d90617b0 Bug: fix lm head weights in Qwen models (#3777) 2025-02-21 16:49:31 -08:00
fzyzcjy
d37f95511d Improve: Tiny fix Olmo2 (#3348) 2025-02-21 16:09:35 -08:00
Zhiyu
c66b2c9cf1 Add support for nvidia modelopt fp8 kv cache (#3223) 2025-02-22 07:04:58 +08:00
simveit
20b765a26e Model: Support Qwen 72B RM model. (#3772) 2025-02-21 14:38:21 -08:00
Mick
424848d26f fix: remove dependency on latest transformers impl (#3635) 2025-02-19 01:14:11 +08:00
Yineng Zhang
714f3e6362 feat: support flashinfer mla with prefix cache (#3643) 2025-02-18 02:06:43 +08:00
Mick
bcc213df61 Model: Support Qwen 2.5 vl (#3258) 2025-02-16 00:58:53 -08:00
Mick
7711ac6ed0 doc: emphasize and notify the usage of chat_template (#3589)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-15 00:10:32 -08:00
Ke Bao
862dd76c76 Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 (#3582) 2025-02-15 05:28:34 +08:00
Yineng Zhang
70f894b810 feat: support flashinfer mla attention for deepseek v3 (#3550) 2025-02-14 08:50:14 +08:00
Liangsheng Yin
8616357a97 Fix deepseek awq v3 (#3450) 2025-02-12 22:09:52 +08:00
Chayenne
40022d075a Feature: Fix the binding error in Llama (#3355) 2025-02-06 20:19:24 -08:00
Yineng Zhang
8db776f049 support QuickGELU (#3250) 2025-02-01 19:31:47 +08:00
Mick
9f635ea50d [Fix] Address remaining issues of supporting MiniCPMV (#2977) 2025-01-28 00:22:13 -08:00
Yineng Zhang
2f79f58873 feat: use sgl-kernel 0.0.3 in sglang (#3179) 2025-01-27 21:39:52 +08:00
Lianmin Zheng
52c03f16b9 Add activation parameters to fused_moe (#3170) 2025-01-27 00:23:37 -08:00
Hui Liu
8e48ca8cc1 enable kv_scale for Gemma2 (#3113) 2025-01-25 18:29:14 -08:00
Ke Wen
862bcff833 Support loading of larger models with on-the-fly quantization (#3061) 2025-01-22 21:33:17 -08:00
Hui Liu
d2571dd5c7 Enable Cohere2 Models (#3018) 2025-01-20 19:21:41 -08:00