| Ke Bao | 6b6e748775 | Remove q concat in FA3 backend for DeepSeek decode (#5638) | 2025-04-22 11:43:12 -07:00 |
| JieXin Liang | 917324862e | [fix] reduce dp capture bs (#5634); Co-authored-by: alcanerian <alcanerian@gmail.com> | 2025-04-22 11:08:45 -07:00 |
| lukec | 2ed96c7a8a | fix flashmla bug (#5272) | 2025-04-22 10:36:23 -07:00 |
| saltyfish66 | 2aa3f5e2d0 | [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (#5641); Co-authored-by: yuethe <yuethe@tencent.com> | 2025-04-22 09:33:13 -07:00 |
| lambert0312 | 76d17c7ecb | Fix shared experts fusion error without quantization (#5632) | 2025-04-22 09:22:26 -07:00 |
| Connector Switch | 70d040f904 | [NFC] Remove duplicate compressed-tensors (#5640) | 2025-04-22 09:10:25 -07:00 |
| JieXin Liang | 4418f599a5 | Fix FA3 DeepSeek prefill performance regression (#5624); Co-authored-by: ispobock <ispobaoke@gmail.com> | 2025-04-22 01:41:41 -07:00 |
| Yineng Zhang | 04f2abcb34 | fix: gemma 3 not use softcap (#5622) | 2025-04-22 01:16:08 -07:00 |
| JieXin Liang | 506be6b892 | [fix] fix compile_deep_gemm missing kv_b_proj (#5620) | 2025-04-22 00:06:36 -07:00 |
| JieXin Liang | 2343d8df7d | [fix] force use deepgemm in compile_deep_gemm (#5618) | 2025-04-21 21:36:02 -07:00 |
| Ke Bao | 11b23ae97b | Remove extra copy in deepseek forward absorb (#5578); Co-authored-by: saienduri <saimanas.enduri@amd.com> | 2025-04-21 19:33:21 -07:00 |
| Yineng Zhang | b9c87e781d | chore: bump v0.4.5.post3 (#5611) | 2025-04-21 18:16:20 -07:00 |
| michael-amd | 968ef51562 | Support aiter RMSNorm in AMD (#5510); Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> | 2025-04-21 17:40:39 -07:00 |
| Lianmin Zheng | 1343200299 | Clean up mem settings (#5610) | 2025-04-21 17:19:00 -07:00 |
| JieXin Liang | c2942907d5 | [feature] enable pre compile jit deep_gemm (#5580) | 2025-04-21 16:52:53 -07:00 |
| Liangsheng Yin | e69a219074 | Enhance GPU memory settings (#5604) | 2025-04-21 15:15:00 -07:00 |
| Byron Hsu | bf98d2e377 | [PD] Support prefill overlap + Ensure no race condition (#5609) | 2025-04-21 12:12:56 -07:00 |
| Byron Hsu | e65b9f21e3 | [PD] Support decode overlap schedule (#5608) | 2025-04-21 12:06:16 -07:00 |
| Trevor Morris | 4dce1cc608 | [PD] Add NIXL transfer backend (#5477) | 2025-04-22 01:36:12 +08:00 |
| Byron Hsu | deded17f38 | [PD] Fix edge case and simplify large page size + chunked prefill (#5589) | 2025-04-21 10:27:02 -07:00 |
| shangmingc | f29a718f63 | [PD] Fix generate endpoint of min_lb for PD (#5598); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-04-21 21:39:18 +08:00 |
| Yongtong Wu | 3f57b00a59 | Support PD bootstrap fields on /v1/chat/completions endpoint (#5488) | 2025-04-21 01:10:58 -07:00 |
| fzyzcjy | 453d412cdb | Tiny update error hint (#5037) | 2025-04-21 00:47:47 -07:00 |
| fzyzcjy | dc86f25a57 | Tiny remove duplicated code (#5021) | 2025-04-21 00:47:32 -07:00 |
| Chuyue Sun | 08289eaa3e | Support o1 model on Azure (#4980); Co-authored-by: Shan Yu <shanyu1@g.ucla.edu> | 2025-04-21 00:46:09 -07:00 |
| Lucius | 3b6d539f63 | [Fix] Enhance DP Attention for IPv6 Compatibility (#4937) | 2025-04-21 00:44:11 -07:00 |
| lambert0312 | c44f2869c9 | Modify metrics service endpoint (#3443) | 2025-04-21 00:35:38 -07:00 |
| fzyzcjy | 685d8980c3 | Tiny add warning when cannot recognize bool env var (#5348) | 2025-04-20 23:11:29 -07:00 |
| Zhiqiang Xie | 70645f4d7d | upstream hicache fixes (#5570) | 2025-04-20 23:08:30 -07:00 |
| Qingquan Song | 188f0955fa | Add Speculative Decoding Eagle3 topk > 1 (#5318); Co-authored-by: Stefan He <hebiaobuaa@gmail.com>; Co-authored-by: Yubo Wang <yubowang2019@gmail.com> | 2025-04-20 22:58:28 -07:00 |
| Lianmin Zheng | eef9433b46 | Fix flush cache (#5590) | 2025-04-20 22:56:40 -07:00 |
| JieXin Liang | 97cb762bb6 | [misc] remove is_cuda_available (#5319) | 2025-04-20 18:16:51 -07:00 |
| fzyzcjy | 1195182040 | Tiny add Engine.flush_cache API (#5241) | 2025-04-20 18:15:03 -07:00 |
| fzyzcjy | 5239d79568 | Speedup shared expert weight construction by avoid cloning (#5188) | 2025-04-20 18:12:01 -07:00 |
| Sundara Raman Ramachandran | f08154193c | Perform Batch Tokenization. (#5141) | 2025-04-20 18:10:37 -07:00 |
| fzyzcjy | 5fc4b6004e | Add sanity check for max_running_requests (#5016) | 2025-04-20 17:56:49 -07:00 |
| Brayden Zhong | b868526d94 | Fix one more issue reported by torchfix (#4859) | 2025-04-20 17:49:27 -07:00 |
| Juwan Yoo | 502524e2da | compressed_tensors: port w8a16 fp8 from vllm (#4852) | 2025-04-20 17:48:31 -07:00 |
| Enrique Shockwave | 4c7640079c | check marlin format before attempting conversion (#4675) | 2025-04-20 17:47:09 -07:00 |
| kyle-pena-kuzco | 9f3bd2ad39 | Feat: Implement JSON Mode (response_format.type="json_object") (#4733); Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net> | 2025-04-20 17:41:22 -07:00 |
| Yi Zhou | fac17acf08 | add function call parser for DeepSeek V3 (#5224) | 2025-04-20 17:38:08 -07:00 |
| Adarsh Shirawalmath | 8b39274e34 | [Feature] Prefill assistant response - add continue_final_message parameter (#4226); Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-04-20 17:37:18 -07:00 |
| Byron Hsu | c951d312ed | [PD] Fix large page size + chunk prefill (#5588) | 2025-04-20 17:21:54 -07:00 |
| AmadeusW | dcb8232596 | Fix ChatCompletionMessageGenericParam to allow for None content (#5452) | 2025-04-20 17:15:38 -07:00 |
| Yineng Zhang | 66c0ff9e31 | fix: use fa3 for gemma2 (#5586) | 2025-04-20 17:02:09 -07:00 |
| tarinkk | 9a7e83e899 | Fix enable chunked prefill for Llama4 (#5575) | 2025-04-20 17:01:30 -07:00 |
| lukec | 417b44eba8 | [Feat] upgrade pytorch2.6 (#5417) | 2025-04-20 16:06:34 -07:00 |
| fzyzcjy | 475e2e378a | [PD] Fix server crash when using batch requests (#5531) | 2025-04-20 16:02:23 -07:00 |
| fzyzcjy | fba86b6b54 | Tiny improve error message (#5526) | 2025-04-20 16:00:15 -07:00 |
| fzyzcjy | fa2f677e18 | Fix torch memory saver not enabled in DP scenario (#5560) | 2025-04-20 14:20:52 -07:00 |