2209 Commits

Author SHA1 Message Date
wangxiyu191
155214952b refactor: Extract repeated member variables in KVCache subclasses to base class. (#6323) 2025-05-18 15:28:15 -07:00
Chang Su
ebe58d545d [Misc] Implement RankZeroFilter for rank-specific logging in model_runner.py (#6333) 2025-05-18 15:27:13 -07:00
Chang Su
066cf44546 [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397) 2025-05-18 13:05:38 -07:00
JieXin Liang
1f30c05d4a [fix] fix fa3 forward_decode with spec_decode (#6395)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-05-18 12:50:15 -07:00
doujiang24
9d24c3ffb0 chore: tiny remove duplicated code (#6392)
Signed-off-by: doujiang24 <doujiang24@gmail.com>
2025-05-18 02:17:32 -07:00
Yury Sulsky
24161c5913 The Gemma template is missing a newline after the user role. (#6331)
Co-authored-by: Yury Sulsky <ysulsky@tesla.com>
2025-05-18 01:57:27 -07:00
libra
11553c1a37 Add pipeline parallelism for Qwen2 and Qwen3 Model (#6250) 2025-05-18 00:42:55 -07:00
Mick
01dd39bac1 refactor: minor refactors regarding multimodal processing (#6187) 2025-05-17 22:53:20 -07:00
Lianmin Zheng
b3f3d610fd Do not use FA3 for mistral (#6379) 2025-05-17 19:47:34 -07:00
Yineng Zhang
f07c6a009b chore: upgrade sgl-kernel v0.1.3 (#6377) 2025-05-17 19:47:05 -07:00
Lianmin Zheng
4bb816d444 Fix CI tests (#6362) 2025-05-17 19:16:45 -07:00
ybyang
c250939ecb [Fix Chat API] add request id for chat/completion for tracing (#6364) 2025-05-17 18:58:22 -07:00
ishandhanani
b6909aa223 fix: allow launch_dummy_health_check_server to start inside of running asyncio loop (#6330) 2025-05-17 18:32:41 -07:00
fzyzcjy
f87283573e Add expert distribution APIs for engine (#6290) 2025-05-17 18:31:51 -07:00
fzyzcjy
73187152a4 Reland tiny refactor DefaultModelLoader.Source (#6041) 2025-05-17 17:11:20 -07:00
fzyzcjy
4086566516 Fix expert distribution recorder and profiler command stuck forever (#6284) 2025-05-17 17:10:44 -07:00
fzyzcjy
fd08c04821 Support custom DeepEP tuning config (#6257) 2025-05-17 17:09:42 -07:00
fzyzcjy
26ebb849eb Tiny refactor bench_serving to extract RequestFuncOutput.init_new (#6108) 2025-05-17 17:08:52 -07:00
fzyzcjy
02973cd9a4 Tiny refactor bench_serving to improve extensibility (#6134) 2025-05-17 17:07:58 -07:00
fzyzcjy
6d95a35abf Support outputing details for bench_serving (#6107) 2025-05-17 17:06:52 -07:00
fzyzcjy
01d2838c0f Fix stop_profile does not wait for finishing (#4741) 2025-05-17 17:06:15 -07:00
xutizhou
e3b8a72291 [fix] illegal memory in _fwd_kernel_ep_scatter_2 and _fwd_kernel_ep_gather (#6348) 2025-05-17 17:01:42 -07:00
Lifu Huang
3cf1473a09 Use monotonic clock for interval measurement (#6211)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-17 16:49:18 -07:00
fzyzcjy
2716830802 Speed up when having padding tokens in DeepEP (#6175) 2025-05-17 16:44:05 -07:00
Chang Su
205d5cb407 perf: Optimize local attention memory allocation in FlashAttentionBackend (#6356) 2025-05-17 01:45:46 -07:00
fzyzcjy
2df9d40aa6 Minor code cleanup refactor for DeepSeek models (#6324) 2025-05-16 19:06:03 -07:00
fzyzcjy
8dc191f237 Fix one wasted kernel in DeepSeek and minor refactor (#6316) 2025-05-16 19:05:33 -07:00
Kiv Chen
64825b8395 model(vlm): mistral 3.1 (#5099)
Co-authored-by: KivenChen <sleigh-queue-0y@icloud.com>
2025-05-16 18:36:18 -07:00
Lianmin Zheng
dcc0a45618 Fix amd ci (#6360) 2025-05-16 15:33:10 -07:00
Lianmin Zheng
c2b7ddca49 [Minor] cleanup unused imports (#6358) 2025-05-16 14:52:38 -07:00
Fr4nk1in
4bd2952a37 feat: add dp attention support for Qwen 2/3 MoE models, fixes #6088 (#6121)
Co-authored-by: King.Zevin <zevin@mail.ustc.edu.cn>
Co-authored-by: Yi Zhang <1109276519@qq.com>
2025-05-16 14:44:10 -07:00
Elfie Guo
6fc9357503 [2/2] Add python wrapper for CUTLASS FP8 Blockscale MoE Kernel. (#5694) 2025-05-16 13:14:07 -07:00
Baizhou Zhang
839fb31e5f [Fix] Improve dependencies for Blackwell image (#6334) 2025-05-16 12:38:22 -07:00
Yury Sulsky
f19a9204cd Support precomputed multimodal features for Qwen-VL and Gemma3 models. (#6136)
Co-authored-by: Yury Sulsky <ysulsky@tesla.com>
2025-05-16 12:26:15 -07:00
Lianmin Zheng
e07a6977e7 Minor improvements of TokenizerManager / health check (#6327) 2025-05-15 15:29:25 -07:00
Qiaolin Yu
cd8d4b9dfc Fix lora bench (#6302) 2025-05-15 10:09:55 -07:00
fzyzcjy
f194e14fb7 Reduce MoE memory usage (#6147) 2025-05-15 09:38:28 -07:00
Yi Liu
cfc9f9ab8d Fix gpu mem check on CPU (#6317)
Signed-off-by: yiliu30 <yi4.liu@intel.com>
2025-05-15 09:37:45 -07:00
JieXin Liang
9a405274e2 [misc] remove redundant platform codes (#6298) 2025-05-15 00:51:30 -07:00
quinnrong94
2e4babdb0a [Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109)
Co-authored-by: Yingyi <yingyihuang2000@outlook.com>
Co-authored-by: neiltian <neiltian@tencent.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: kexueyu <kexueyu@tencent.com>
Co-authored-by: vincentmeng <vincentmeng@tencent.com>
Co-authored-by: pengmeng <pengmeng@tencent.com>
2025-05-15 00:48:09 -07:00
Zilin Zhu
44a3783d13 [fix][RL] Remove the incorrect barrier in init_weights_update_group (#5914) 2025-05-14 19:15:21 -07:00
Junrong Lin
f3bf611054 feat: add flush cache to EngineBase and HttpServerEngineAdapter (#6009) 2025-05-14 19:15:02 -07:00
Hubert Lu
198b9056d1 [AMD] Fix Llama 4 Scout and Maverick accuracy issues on MI300X (#6274) 2025-05-14 22:07:29 +00:00
Lifu Huang
3e350a931e [Bug] Fix accidental logger override caused by internVL. (#6282) 2025-05-13 23:29:25 -07:00
Ying Sheng
fb71725c98 Fix a bug in schedule_policy (#6276) 2025-05-13 18:04:00 -07:00
Chang Su
912788c095 perf: optimize local_block_table memory allocation (#6273) 2025-05-13 17:18:38 -07:00
Yineng Zhang
16267d4fa7 chore: bump v0.4.6.post4 (#6245) 2025-05-13 01:57:51 -07:00
JieXin Liang
17299f088a [misc] deep_gemm fallback to NVRTC when NVCC not found (#6252) 2025-05-13 01:41:35 -07:00
Kiv Chen
5380cd7ea3 model(vlm): pixtral (#5084) 2025-05-13 00:16:10 -07:00
Cheng Wan
b2e95f62b4 Fix two issues related to --moe-dense-tp-size=1 (#5657)
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>
2025-05-12 23:51:39 -07:00