ac5010e0ba | 2025-06-22 20:35:58 -07:00 | Cheng Wan | Fix CUDA Graph Check under Deepep with DP FFN (#7451)
05c9bc8956 | 2025-06-22 12:37:18 +08:00 | Liangsheng Yin | [minor] simplify the TokenToKVPoolAllocator (#7414)
1998ce4046 | 2025-06-21 16:09:19 -07:00 | Lifu Huang | Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (#7412)
e879d8b7a8 | 2025-06-20 14:43:11 -07:00 | Cheng Wan | [Feature] Comprehensive Hybrid Parallelism Support (#6389)
3774f07825 | 2025-06-19 00:56:37 -07:00 | Stefan He | Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099)
094c116f7d | 2025-06-17 22:11:50 -07:00 | YanbingJiang | Update python API of activation, topk, norm and rope and remove vllm dependency (#6614)
             Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
             Co-authored-by: jianan-gu <jianan.gu@intel.com>
             Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>
10d60cd41b | 2025-06-17 00:33:28 -07:00 | u4lr451 | feat: mtp support dp-attention (#6081)
             Co-authored-by: austindeng <austindeng@tencent.com>
             Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
             Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
             Co-authored-by: ch-wan <cwan39@gatech.edu>
873ae12cee | 2025-06-16 16:28:15 -07:00 | KavioYu | support custom weight loader for model runner (#7122)
             Co-authored-by: kavioyu <kavioyu@tencent.com>
c64290dcb5 | 2025-06-16 15:57:07 -07:00 | Lianmin Zheng | Use seq_len_fill_value in the cuda graph runners (#7233)
e30ef368ab | 2025-06-16 10:50:01 -07:00 | woodx | Feat/support rerank (#6058)
b4c41f7276 | 2025-06-13 20:41:03 -07:00 | fzyzcjy | Refactor DeepGEMM integration (#7150)
6b12d6a8d5 | 2025-06-10 19:01:39 -07:00 | Lianmin Zheng | Simplify the heuristics for setting --mem-fraction-static (#7054)
b56de8f943 | 2025-06-10 14:37:29 -07:00 | kyle-pena-kuzco | Open AI API hidden states (#6716)
4a102a2b02 | 2025-06-10 06:32:41 -07:00 | Lianmin Zheng | Minor style fix in cuda_graph_runner.py (#7053)
6406408a70 | 2025-06-10 05:34:29 -07:00 | Lianmin Zheng | Clean up server_args.py (#7037)
c2b16795b5 | 2025-06-09 21:23:36 -07:00 | Byron Hsu | Add decode req pool (#6980)
6716b41786 | 2025-06-09 20:37:47 -07:00 | Baizhou Zhang | Update default settings for blackwell (#7023)
dc0705a504 | 2025-06-09 16:39:21 -07:00 | Lianmin Zheng | Simplify prepare_extend_after_decode (#6987)
971a0dfa32 | 2025-06-08 05:13:22 -07:00 | Baizhou Zhang | Extend cuda graph capture bs for B200 (#6937)
2fc1299562 | 2025-06-08 05:09:17 -07:00 | fzyzcjy | Remove unnecessary kernels of num_token_non_padded (#6965)
20d3ad3b58 | 2025-06-08 05:06:46 -07:00 | Lianmin Zheng | Fix CI and triton moe Configs (#6974)
f5599ef124 | 2025-06-07 03:10:35 -07:00 | fzyzcjy | Refactor global_server_args_dict (#6866)
c499591ac8 | 2025-06-07 03:09:33 -07:00 | fzyzcjy | Add canary for EPLB rebalancing (#6895)
60fdad7cf3 | 2025-06-06 18:23:52 -07:00 | Lianmin Zheng | Sync the changes on cuda graph runners (#6932)
b819381fec | 2025-06-05 23:00:18 -07:00 | HAI | AITER backend extension and workload optimizations (#6838)
             Co-authored-by: wunhuang <wunhuang@amd.com>
             Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
0de5e7d40f | 2025-06-05 00:05:52 -07:00 | fzyzcjy | Support layerwise rebalancing experts (#6851)
8e3797be1c | 2025-06-04 22:11:24 -07:00 | zyksir | support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277)
81964328b7 | 2025-06-04 15:53:22 -07:00 | Cheng Wan | Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736)
8a5480528d | 2025-06-03 17:48:24 -07:00 | Cheng Wan | [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735)
b6d0ce9f78 | 2025-06-02 23:59:11 -07:00 | fzyzcjy | Minor add metrics to expert location updater (#6816)
e05e29d178 | 2025-06-02 10:27:36 -07:00 | fzyzcjy | Refactor CustomOp to avoid confusing bugs (#5382)
20fd53b8f6 | 2025-06-01 19:00:07 -07:00 | Lianmin Zheng | Correctly abort the failed grammar requests & Improve the handling of abort (#6803)
2d72fc47cf | 2025-05-31 15:53:55 -07:00 | Lianmin Zheng | Improve profiler and integrate profiler in bench_one_batch_server (#6787)
888cb175a6 | 2025-05-30 21:37:42 -07:00 | YanbingJiang | Add intel_amx backend for Radix Attention for CPU (#6408)
             Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
             Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
22630ca242 | 2025-05-30 01:11:53 -07:00 | Jianan Ji | Support sliding window in triton backend (#6509)
f2bd3515fb | 2025-05-29 00:03:22 -07:00 | Baizhou Zhang | Tune memory arguments on B200 (#6718)
ae6a5b2950 | 2025-05-28 15:54:17 -07:00 | fzyzcjy | Minor refactor two-batch overlap (#6682)
673ff668f7 | 2025-05-27 10:00:09 -07:00 | fzyzcjy | Speed up expert location update (#6661)
447be24228 | 2025-05-27 09:59:53 -07:00 | fzyzcjy | Fix OOM when updating expert locations (#6660)
14d1075f2c | 2025-05-26 02:15:21 -07:00 | Yi Zhang | fix qwen3moe eplb prefill bug (#6617)
5ccf8fe1a0 | 2025-05-25 09:13:17 -07:00 | fzyzcjy | Hint users when weight update timeouts (#6570)
0d47788025 | 2025-05-24 17:39:07 -07:00 | fzyzcjy | Support overlapping two batches (#4068)
7a5e6ce1cb | 2025-05-24 16:38:39 -07:00 | kk | Fix GPU OOM (#6564)
             Co-authored-by: michael <michael.zhang@amd.com>
c4831e2fcf | 2025-05-24 00:27:10 -07:00 | fzyzcjy | Fix accuracy is zero when enabling moe-dense-tp-size as in large scale EP (#6567)
4685fbb888 | 2025-05-22 20:32:41 -07:00 | Chang Su | [VLM] Support chunk prefill for VLM (#6355)
             Co-authored-by: yizhang2077 <1109276519@qq.com>
a6ae3af15e | 2025-05-22 14:14:49 -07:00 | ryang | Support XiaomiMiMo inference with mtp (#6059)
7a80f56513 | 2025-05-21 23:13:21 -07:00 | fzyzcjy | Support dynamically rebalancing experts using EPLB (#6469)
fc992a09f9 | 2025-05-21 21:59:33 -07:00 | fzyzcjy | Support updating expert locations dynamically (#6388)
d4c038daed | 2025-05-21 11:11:20 -07:00 | Baizhou Zhang | [Fix] Fix capture fail bug for DeepSeek (#6275)
ccfe5c009d | 2025-05-21 02:05:53 -07:00 | fzyzcjy | Support redundant experts in expert parallel (#6461)