1975 Commits

Author SHA1 Message Date
Ke Bao
6b6e748775 Remove q concat in FA3 backend for DeepSeek decode (#5638) 2025-04-22 11:43:12 -07:00
JieXin Liang
917324862e [fix] reduce dp capture bs (#5634)
Co-authored-by: alcanerian <alcanerian@gmail.com>
2025-04-22 11:08:45 -07:00
lukec
2ed96c7a8a fix flashmla bug (#5272) 2025-04-22 10:36:23 -07:00
saltyfish66
2aa3f5e2d0 [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (#5641)
Co-authored-by: yuethe <yuethe@tencent.com>
2025-04-22 09:33:13 -07:00
lambert0312
76d17c7ecb Fix shared experts fusion error without quantization (#5632) 2025-04-22 09:22:26 -07:00
Connector Switch
70d040f904 [NFC] Remove duplicate compressed-tensors (#5640) 2025-04-22 09:10:25 -07:00
JieXin Liang
4418f599a5 Fix FA3 DeepSeek prefill performance regression (#5624)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-04-22 01:41:41 -07:00
Yineng Zhang
04f2abcb34 fix: gemma 3 not use softcap (#5622) 2025-04-22 01:16:08 -07:00
JieXin Liang
506be6b892 [fix] fix compile_deep_gemm missing kv_b_proj (#5620) 2025-04-22 00:06:36 -07:00
JieXin Liang
2343d8df7d [fix] force use deepgemm in compile_deep_gemm (#5618) 2025-04-21 21:36:02 -07:00
Ke Bao
11b23ae97b Remove extra copy in deepseek forward absorb (#5578)
Co-authored-by: saienduri <saimanas.enduri@amd.com>
2025-04-21 19:33:21 -07:00
Yineng Zhang
b9c87e781d chore: bump v0.4.5.post3 (#5611) 2025-04-21 18:16:20 -07:00
michael-amd
968ef51562 Support aiter RMSNorm in AMD (#5510)
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
2025-04-21 17:40:39 -07:00
Lianmin Zheng
1343200299 Clean up mem settings (#5610) 2025-04-21 17:19:00 -07:00
JieXin Liang
c2942907d5 [feature] enable pre compile jit deep_gemm (#5580) 2025-04-21 16:52:53 -07:00
Liangsheng Yin
e69a219074 Enhance GPU memory settings (#5604) 2025-04-21 15:15:00 -07:00
Byron Hsu
bf98d2e377 [PD] Support prefill overlap + Ensure no race condition (#5609) 2025-04-21 12:12:56 -07:00
Byron Hsu
e65b9f21e3 [PD] Support decode overlap schedule (#5608) 2025-04-21 12:06:16 -07:00
Trevor Morris
4dce1cc608 [PD] Add NIXL transfer backend (#5477) 2025-04-22 01:36:12 +08:00
Byron Hsu
deded17f38 [PD] Fix edge case and simplify large page size + chunked prefill (#5589) 2025-04-21 10:27:02 -07:00
shangmingc
f29a718f63 [PD] Fix generate endpoint of min_lb for PD (#5598)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-04-21 21:39:18 +08:00
Yongtong Wu
3f57b00a59 Support PD bootstrap fields on /v1/chat/completions endpoint (#5488) 2025-04-21 01:10:58 -07:00
fzyzcjy
453d412cdb Tiny update error hint (#5037) 2025-04-21 00:47:47 -07:00
fzyzcjy
dc86f25a57 Tiny remove duplicated code (#5021) 2025-04-21 00:47:32 -07:00
Chuyue Sun
08289eaa3e Support o1 model on Azure (#4980)
Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>
2025-04-21 00:46:09 -07:00
Lucius
3b6d539f63 [Fix] Enhance DP Attention for IPv6 Compatibility (#4937) 2025-04-21 00:44:11 -07:00
lambert0312
c44f2869c9 Modify metrics service endpoint (#3443) 2025-04-21 00:35:38 -07:00
fzyzcjy
685d8980c3 Tiny add warning when cannot recognize bool env var (#5348) 2025-04-20 23:11:29 -07:00
Zhiqiang Xie
70645f4d7d upstream hicache fixes (#5570) 2025-04-20 23:08:30 -07:00
Qingquan Song
188f0955fa Add Speculative Decoding Eagle3 topk > 1 (#5318)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
2025-04-20 22:58:28 -07:00
Lianmin Zheng
eef9433b46 Fix flush cache (#5590) 2025-04-20 22:56:40 -07:00
JieXin Liang
97cb762bb6 [misc] remove is_cuda_available (#5319) 2025-04-20 18:16:51 -07:00
fzyzcjy
1195182040 Tiny add Engine.flush_cache API (#5241) 2025-04-20 18:15:03 -07:00
fzyzcjy
5239d79568 Speedup shared expert weight construction by avoid cloning (#5188) 2025-04-20 18:12:01 -07:00
Sundara Raman Ramachandran
f08154193c Perform Batch Tokenization. (#5141) 2025-04-20 18:10:37 -07:00
fzyzcjy
5fc4b6004e Add sanity check for max_running_requests (#5016) 2025-04-20 17:56:49 -07:00
Brayden Zhong
b868526d94 Fix one more issue reported by torchfix (#4859) 2025-04-20 17:49:27 -07:00
Juwan Yoo
502524e2da compressed_tensors: port w8a16 fp8 from vllm (#4852) 2025-04-20 17:48:31 -07:00
Enrique Shockwave
4c7640079c check marlin format before attempting conversion (#4675) 2025-04-20 17:47:09 -07:00
kyle-pena-kuzco
9f3bd2ad39 Feat: Implement JSON Mode (response_format.type="json_object") (#4733)
Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>
2025-04-20 17:41:22 -07:00
Yi Zhou
fac17acf08 add function call parser for DeepSeek V3 (#5224) 2025-04-20 17:38:08 -07:00
Adarsh Shirawalmath
8b39274e34 [Feature] Prefill assistant response - add continue_final_message parameter (#4226)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-04-20 17:37:18 -07:00
Byron Hsu
c951d312ed [PD] Fix large page size + chunk prefill (#5588) 2025-04-20 17:21:54 -07:00
AmadeusW
dcb8232596 Fix ChatCompletionMessageGenericParam to allow for None content (#5452) 2025-04-20 17:15:38 -07:00
Yineng Zhang
66c0ff9e31 fix: use fa3 for gemma2 (#5586) 2025-04-20 17:02:09 -07:00
tarinkk
9a7e83e899 Fix enable chunked prefill for Llama4 (#5575) 2025-04-20 17:01:30 -07:00
lukec
417b44eba8 [Feat] upgrade pytorch2.6 (#5417) 2025-04-20 16:06:34 -07:00
fzyzcjy
475e2e378a [PD] Fix server crash when using batch requests (#5531) 2025-04-20 16:02:23 -07:00
fzyzcjy
fba86b6b54 Tiny improve error message (#5526) 2025-04-20 16:00:15 -07:00
fzyzcjy
fa2f677e18 Fix torch memory saver not enabled in DP scenario (#5560) 2025-04-20 14:20:52 -07:00