adca585bfb  2025-04-13 16:03:09 -07:00  yulei
    [DeepEP] Reduce routed scaling overhead (#5277)
    Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

1b1b47a949  2025-04-11 23:33:51 -07:00  lambert0312
    Fix w8a8_int8 model shared experts fusion load weights error (#5120)

e3c4bd3153  2025-04-09 17:43:22 -07:00  fzyzcjy
    Fix DeepSeek error when using DeepEP mode (#5190)

4065248214  2025-04-09 20:14:34 +08:00  HandH1998
    Support Llama4 fp8 inference (#5194)
    Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
    Co-authored-by: sleepcoo <sleepcoo@gmail.com>
    Co-authored-by: zhyncs <me@zhyncs.com>

90caf06c00  2025-04-08 21:56:53 -07:00  Yineng Zhang
    fix: use DeepEPDispatcher on CUDA (#5180)

bc3f6db2dd  2025-04-08 20:31:31 -07:00  Jinyan Chen
    [Fix] DeepEP Compatibility with Low Latency (#5068)
    Co-authored-by: ch-wan <cwan39@gatech.edu>

2695ab0537  2025-04-08 09:11:35 -07:00  Yun Dai
    Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
    Co-authored-by: qingquansong <ustcsqq@gmail.com>

efbae697b3  2025-04-05 01:23:02 -07:00  Baizhou Zhang
    [Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052)

7ed77d6b9e  2025-04-04 15:22:37 -07:00  inkcherry
    fix dummy-load deepseekv2 (#4535)

924ca7c92c  2025-04-04 01:59:29 -07:00  Xiaoyu Zhang
    Add DeepSeek V3/R1 shared experts fusion (#4918)

febe21ce03  2025-04-04 00:24:18 -07:00  fzyzcjy
    Small refactor DeepEPDispatcher into subclasses (#4994)

74885a848b  2025-04-03 13:30:56 -07:00  Lianmin Zheng
    Revert "Replace enable_flashinfer_mla argument with attention_backend" (#5048)

8e10fec9a8  2025-04-03 02:56:44 -07:00  fzyzcjy
    Small refactor DeepEPMode to clean up code a bit (#4992)

e8999b13b7  2025-04-03 02:53:58 -07:00  Baizhou Zhang
    Replace enable_flashinfer_mla argument with attention_backend (#5005)

23c764b18a  2025-04-01 09:23:25 -07:00  Jinyan Chen
    [Feature] Support DeepEP Low Latency (#4767)
    Co-authored-by: sleepcoo <sleepcoo@gmail.com>
    Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
    Co-authored-by: ch-wan <cwan39@gatech.edu>

5cb552b1d4  2025-03-31 09:57:51 -07:00  Mick
    refactor: multimodal data (#4754)

a303325fdb  2025-03-30 20:10:21 -07:00  fzyzcjy
    Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (#4883)
    Co-authored-by: ch-wan <cwan39@gatech.edu>

9adf178cc2  2025-03-30 12:51:44 -07:00  Lianmin Zheng
    Fix 2-gpu CI test and suppress some warnings (#4930)

20c90be23d  2025-03-28 18:30:14 -07:00  Baizhou Zhang
    [Feature] Support FA3 backend for MLA (#4831)

74e0ac1dbd  2025-03-28 10:34:10 -07:00  Lianmin Zheng
    Clean up import vllm in quantization/__init__.py (#4834)

7f19e083c1  2025-03-27 17:09:35 -07:00  tarinkk
    Support (1 <= dp < tp) in the dp attention in DeepEP (#4770)
    Co-authored-by: Cheng Wan <cwan39@gatech.edu>

04e3ff6975  2025-03-26 13:21:25 -07:00  Xiaoyu Zhang
    Support compressed tensors fp8w8a8 (#4743)

199bb01d00  2025-03-24 21:34:19 -07:00  yuhsaun-t
    Add endpoints to dump selected expert ids (#4435)
    Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

ca75741e86  2025-03-22 22:39:56 -07:00  fzyzcjy
    Support async in DeepEP (#4610)
    Co-authored-by: Cheng Wan <cwan39@gatech.edu>

c6d549e773  2025-03-22 22:39:11 -07:00  fzyzcjy
    Multiple tiny code cleanups (#4608)

c2bd094d6e  2025-03-22 14:30:34 -07:00  xutizhou
    Optimize Permute Kernel in DeepEP (#4643)
    Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

df7014a8d2  2025-03-19 10:02:26 -07:00  strgrb
    avoid cudaStreamSynchronize in DeepSeekV2AttentionMLA (#4577)
    Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

f44db16c8e  2025-03-19 08:16:31 -07:00  Jinyan Chen
    [Feature] Integrate DeepEP into SGLang (#4232)
    Co-authored-by: Cheng Wan <cwan39@gatech.edu>
    Co-authored-by: Xuting Zhou <xutingz@nvidia.com>

3196999f63  2025-03-18 13:41:36 -07:00  Cheng Wan
    Reduce computation and communication in DP attention (#4521)

c16b33ccac  2025-03-18 00:11:36 -07:00  Yineng Zhang
    cleanup deps 3/n (#4541)

d6d21640d3  2025-03-16 23:07:59 -07:00  萝卜菜
    [Feature] Support Deepseek-VL2 (#2798)
    Co-authored-by: Edenzzzz <wtan45@wisc.edu>
    Co-authored-by: Chayenne <zhaochen20@outlook.com>
    Co-authored-by: Yi Zhang <1109276519@qq.com>

65b7c9b78f  2025-03-15 23:06:17 -07:00  Yineng Zhang
    cleanup deps 2/n (#4464)

8e66fbecee  2025-03-13 08:23:56 -07:00  Lianmin Zheng
    Improve DP attention (#4390)
    Co-authored-by: dhou-xai <dhou@x.ai>
    Co-authored-by: SangBin Cho <rkooo567@gmail.com>

45de89719c  2025-03-12 23:45:52 -07:00  Lianmin Zheng
    Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367)

71046fcd71  2025-03-12 22:26:29 -07:00  Meng, Hengyu
    [XPU][CPU] Enable the native path of DeepSeek (#4086)
    Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>

d1da58e275  2025-03-11 18:12:56 -07:00  Yineng Zhang
    unify is_cuda and is_hip (#4321)

4455b26e76  2025-03-10 00:50:34 -07:00  DavidChan
    [Bug fixed] fixed the crash when enable the dp-attention on the single card (#3958)

9fb48f951f  2025-03-09 00:01:54 -08:00  Baizhou Zhang
    Support nextn for flashinfer mla attention backend (#4218)

c7f254468f  2025-03-06 20:54:52 -08:00  HandH1998
    [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) (#3888)
    Co-authored-by: yych0745 <1398089567@qq.com>
    Co-authored-by: sleepcoo <sleepcoo@gmail.com>
    Co-authored-by: b0urnee <2769086541@qq.com>

56a724eba3  2025-03-05 01:11:00 -08:00  Qubitium-ModelCloud
    [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790)
    Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
    Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>

e074d84e5b  2025-03-04 21:23:47 -08:00  Lianmin Zheng
    [Minor] more code cleanup (#4077)

9fafa62db7  2025-03-03 13:30:04 -08:00  Ke Bao
    Share target model embed and head weights for nextn (#4033)

90a4b7d98a  2025-02-28 18:13:56 -08:00  Baizhou Zhang
    [Feature] Support ragged prefill in flashinfer mla backend (#3967)
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
    Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>

127998cc41  2025-02-25 08:39:10 -08:00  Nicolas Castet
    Fix allgather ops inside cuda graphs (#3709)

6ce9dbe828  2025-02-24 18:14:31 -08:00  Chaitanya Sri Krishna Lolla
    [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 (#3237)
    Co-authored-by: HAI <hixiao@gmail.com>

1a6e97577a  2025-02-24 05:43:35 -08:00  laixin
    Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730)
    Co-authored-by: HandH1998 <1335248067@qq.com>

b110084654  2025-02-24 04:07:25 -08:00  Baizhou Zhang
    Refactor flashinfer logic for deepseek v3 and fix accuracy bug (#3785)

714f3e6362  2025-02-18 02:06:43 +08:00  Yineng Zhang
    feat: support flashinfer mla with prefix cache (#3643)

862dd76c76  2025-02-15 05:28:34 +08:00  Ke Bao
    Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 (#3582)

70f894b810  2025-02-14 08:50:14 +08:00  Yineng Zhang
    feat: support flashinfer mla attention for deepseek v3 (#3550)