Commit Graph

  • c2b16795b5 Add decode req pool (#6980) Byron Hsu 2025-06-09 21:23:36 -07:00
  • f6ebba537a Support both approximate and exact expert distribution collection (#6964) fzyzcjy 2025-06-10 11:56:17 +08:00
  • 6716b41786 Update default settings for blackwell (#7023) Baizhou Zhang 2025-06-09 20:37:47 -07:00
  • 1c8b42c84c chore: update pr test xeon (#7018) Yineng Zhang 2025-06-09 17:36:25 -07:00
  • f20f70003d Fix torch version in blackwell dockerfile (#7017) Qiaolin Yu 2025-06-09 20:06:50 -04:00
  • f40942ad63 Migrate to assertEqual (#6741) Emmanuel Ferdman 2025-06-10 02:47:39 +03:00
  • dc0705a504 Simplify prepare_extend_after_decode (#6987) Lianmin Zheng 2025-06-09 16:39:21 -07:00
  • a968c888c0 Fix torchvision version for Blackwell (#7015) Wenxuan Tan 2025-06-09 17:50:19 -05:00
  • a979daac3b Fallback to lower triton version for unfound fused moe configs (#7013) Baizhou Zhang 2025-06-09 15:41:03 -07:00
  • f1569876d5 feat: add direct routing strategy to DP worker (#6884) ishandhanani 2025-06-09 11:44:05 -07:00
  • 3465d7ae78 Update amd nightly models CI. (#6992) Sai Enduri 2025-06-09 10:54:08 -07:00
  • e58423b2b9 Fix cutlass MLA gets almost zero accuracy (#6998) fzyzcjy 2025-06-10 01:16:29 +08:00
  • 7059ae16fb chore: update pr test xeon (#7008) Yineng Zhang 2025-06-09 10:08:44 -07:00
  • 51d9a597f9 cleanup tmp dir (#7007) Yineng Zhang 2025-06-09 09:26:04 -07:00
  • 56ccd3c22c chore: upgrade flashinfer v0.2.6.post1 jit (#6958) Yineng Zhang 2025-06-09 09:22:39 -07:00
  • 98c00a2df1 Fix torch profiler bugs for bench_offline_throughput.py (#6557) Yueyang Pan 2025-06-09 20:33:41 +08:00
  • 451ffe74d9 support qwen3 embedding (#6990) Pan Lyu 2025-06-09 16:32:49 +08:00
  • b1e5a33ae3 Eliminate stream sync to speed up LoRA batch init (#6960) Lifu Huang 2025-06-09 00:22:45 -07:00
  • 9d5fa68b90 Use torch.compile to fuse flash attention decode metadata preparation (#6973) Lianmin Zheng 2025-06-08 23:05:40 -07:00
  • 2c18642502 Enable more unit tests for AMD CI. (#6983) Sai Enduri 2025-06-08 19:41:55 -07:00
  • 18efb5e8e0 [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 (#6929) JieXin Liang 2025-06-09 10:37:34 +08:00
  • de1350ea20 Minor remove one kernel for DeepSeek (#6977) fzyzcjy 2025-06-09 08:41:35 +08:00
  • 86fe943bc3 Fix expert distribution dumping causes OOM (#6967) fzyzcjy 2025-06-09 08:41:14 +08:00
  • 9ecb18568b Fix triton sliding window test case (#6981) Lianmin Zheng 2025-06-08 17:20:46 -07:00
  • cc74499d51 Fix draft extend ut stability with flush cache (#6979) Ke Bao 2025-06-09 08:09:32 +08:00
  • 0c1f03a23d Sync cuda graph runners (#6976) Lianmin Zheng 2025-06-08 16:12:25 -07:00
  • 3712abfaf9 Fuse routed scaling factor in deepseek (#6970) Xiaoyu Zhang 2025-06-09 06:24:24 +08:00
  • 971a0dfa32 Extend cuda graph capture bs for B200 (#6937) Baizhou Zhang 2025-06-08 05:13:22 -07:00
  • 2fc1299562 Remove unnecessary kernels of num_token_non_padded (#6965) fzyzcjy 2025-06-08 20:09:17 +08:00
  • 20d3ad3b58 Fix CI and triton moe Configs (#6974) Lianmin Zheng 2025-06-08 05:06:46 -07:00
  • fa3592cfeb rebase h20 fused_moe config (#6966) Xiaoyu Zhang 2025-06-08 20:01:34 +08:00
  • 608668e143 Slightly improve the sampler to skip unnecessary steps (#6956) Lianmin Zheng 2025-06-08 03:18:54 -07:00
  • 6c0a48282a chore: bump sgl-kernel v0.1.7 (#6963) Yineng Zhang 2025-06-08 02:43:15 -07:00
  • 4740288303 [AMD] Add more tests to per-commit-amd (#6926) Hubert Lu 2025-06-08 01:08:37 -07:00
  • 1fb76ebb93 Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" (#6968) Yineng Zhang 2025-06-07 21:02:49 -07:00
  • c2c4f57f63 [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (#6853) Pavani Majety 2025-06-07 17:24:35 -07:00
  • 23881fa60c chore: upgrade sgl-kernel v0.1.6.post1 (#6957) Yineng Zhang 2025-06-07 17:18:55 -07:00
  • 8db3ac55a9 chore: bump sgl-kernel v0.1.6.post1 (#6955) Yineng Zhang 2025-06-07 15:25:46 -07:00
  • 3e56f557fd Add a CUDA kernel for fusing mapping and weighted sum for MoE. (#6916) Elfie Guo 2025-06-07 15:24:39 -07:00
  • 62fec60d81 Add H20 fused MoE kernel tuning configs for DeepSeek-R1/V3 (#6885) Xu Wenqing 2025-06-08 06:17:34 +08:00
  • e7759778e5 [misc] add is_cpu() (#6950) JieXin Liang 2025-06-08 06:13:45 +08:00
  • 77e928d00e Update server timeout time in AMD CI. (#6953) Sai Enduri 2025-06-07 15:10:27 -07:00
  • 515ef4facb Fuse routed scaling factor in topk_reduce kernel (#6220) Xiaoyu Zhang 2025-06-08 02:06:50 +08:00
  • f5599ef124 Refactor global_server_args_dict (#6866) fzyzcjy 2025-06-07 18:10:35 +08:00
  • c499591ac8 Add canary for EPLB rebalancing (#6895) fzyzcjy 2025-06-07 18:09:33 +08:00
  • e1ce44cdb1 Disabling mixed chunked prefill when eagle is enabled (#6874) Swipe4057 2025-06-07 13:06:58 +03:00
  • f1114e7ff3 [Docker] Add docker file for SGL Router (#6915) Arthur Cheng 2025-06-07 02:58:43 -07:00
  • bae4fdc7ab add fbgemm moe grouped gemm kernel benchmark (#6924) Xiaoyu Zhang 2025-06-07 17:57:30 +08:00
  • 6153f2ff6e chore: upgrade sgl-kernel v0.1.6 (#6945) JieXin Liang 2025-06-07 17:53:26 +08:00
  • 8b5f83ed3b reduce torch.zeros overhead in moe align block size kernel (#6369) Xiaoyu Zhang 2025-06-07 17:47:36 +08:00
  • 2a413829f4 Add triton version as a fused_moe_triton config search key to avoid performance decrease in different Triton version (#5955) Xiaoyu Zhang 2025-06-07 17:43:50 +08:00
  • d5c097a2f9 Tiny re-introduce profile id logging (#6912) fzyzcjy 2025-06-07 17:32:50 +08:00
  • 9736cd3b7d [Bugfix] pipeline parallelism and Eagle Qwen2 (#6910) Swipe4057 2025-06-07 11:58:50 +03:00
  • 2f715f51cc Minor compile fused topk (#6944) fzyzcjy 2025-06-07 16:40:38 +08:00
  • d664ca18f2 chore: bump sgl-kernel v0.1.6 (#6943) Yineng Zhang 2025-06-07 00:25:22 -07:00
  • 22fe787852 [sgl-kernel] update deepgemm (#6942) JieXin Liang 2025-06-07 14:24:41 +08:00
  • c4ffbeca19 Add triton fused moe kernel config for E=257 on B200 (#6939) Baizhou Zhang 2025-06-06 23:15:01 -07:00
  • f8eaaab817 [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel. (#6767) miter 2025-06-07 12:32:33 +08:00
  • 697b0f71f0 [Refactor] image data process in bench_serving (#6879) Xinyuan Tong 2025-06-06 21:11:17 -07:00
  • 132dad874d [PD] Optimize transfer queue forward logic for dummy rank (#6922) shangmingc 2025-06-07 09:26:14 +08:00
  • 60fdad7cf3 Sync the changes on cuda graph runners (#6932) Lianmin Zheng 2025-06-06 18:23:52 -07:00
  • 61ce91ed28 Tiny support customize DeepEP max dispatch tokens per rank (#6934) fzyzcjy 2025-06-07 08:18:35 +08:00
  • e6b7053b60 Fix a bug in abort & Improve docstrings for abort (#6931) Lianmin Zheng 2025-06-06 14:35:45 -07:00
  • 5f91c82526 [Feature] Support Flashinfer fmha on Blackwell (#6930) Jianan Ji 2025-06-06 15:57:50 -04:00
  • b819381fec AITER backend extension and workload optimizations (#6838) HAI 2025-06-05 23:00:18 -07:00
  • 562f279a2d [CPU] enable CI for PRs, add Dockerfile and auto build task (#6458) Zaili Wang 2025-06-06 04:43:54 +08:00
  • 8b2474898b bugfix(OAI): Fix image_data processing for jinja chat templates (#6877) Chang Su 2025-06-05 13:37:01 -07:00
  • 0df6765c83 [CUTLASS-FP4-MOE] Introduce CutlassMoEParams class for easy initialization of Cutlass Grouped Gems Metadata (#6887) Pavani Majety 2025-06-05 13:13:14 -07:00
  • 35b65cf0ca Use deepgemm instead of triton for fused_qkv_a_proj_with_mqa (#6890) fzyzcjy 2025-06-06 02:37:05 +08:00
  • dd1012fcbe [PD] Fix potential perf spike caused by tracker gc and optimize doc (#6764) shangmingc 2025-06-06 01:56:02 +08:00
  • 44aab7f91c oai: fix openAI client error with single request via batch api (#6170) Ravi Theja 2025-06-05 15:51:47 +05:30
  • 43baba649e [EP] Add cuda kernel for moe_ep_post_reorder (#6837) Yuan Luo 2025-06-05 15:33:47 +08:00
  • 0166403c20 Support Blackwell DeepEP docker images (#6868) fzyzcjy 2025-06-05 15:07:53 +08:00
  • bcf66ef3e1 Tiny allow profiler API to auto create directory (#6865) fzyzcjy 2025-06-05 15:07:03 +08:00
  • 0de5e7d40f Support layerwise rebalancing experts (#6851) fzyzcjy 2025-06-05 15:05:52 +08:00
  • 72a110f664 Tiny update error hints (#6846) fzyzcjy 2025-06-05 15:05:28 +08:00
  • 5aff1e9392 Fix Qwen3MoE missing token padding optimization (#6820) fzyzcjy 2025-06-05 15:04:59 +08:00
  • 8e3797be1c support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277) zyksir 2025-06-05 13:11:24 +08:00
  • 4474eaf552 Support LoRA in TestOpenAIVisionServer and fix fused kv_proj loading bug. (#6861) Lifu Huang 2025-06-04 22:08:30 -07:00
  • 499f5e620c Fix one missing arg in DeepEP (#6878) Cheng Wan 2025-06-04 19:14:47 -07:00
  • 81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) Cheng Wan 2025-06-04 15:53:22 -07:00
  • f0f84975f4 feat: add dp-rank to KV events (#6852) ishandhanani 2025-06-04 15:29:34 -07:00
  • 3f1e433903 Decoder-only Scoring API (#6460) Chanh Nguyen 2025-06-04 14:14:54 -07:00
  • cf9815ba69 [Refactor] Multimodal data processing for VLM (#6659) Xinyuan Tong 2025-06-04 11:22:33 -07:00
  • bd75690f4e fix ep_moe_reorder kernel bugs (#6858) Xiaoyu Zhang 2025-06-04 19:13:59 +08:00
  • 180ff5eecc [fix] recover auto-dispatch for rmsnorm and rope (#6745) JieXin Liang 2025-06-04 12:44:20 +08:00
  • 37f1547587 [FEAT] Add transformers backend support (#5929) Marc Sun 2025-06-04 06:05:29 +02:00
  • 8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) Cheng Wan 2025-06-03 17:48:24 -07:00
  • b6d0ce9f78 Minor add metrics to expert location updater (#6816) fzyzcjy 2025-06-03 14:59:11 +08:00
  • 0ea330ca34 Fix wrong weight reference in dynamic EPLB (#6818) fzyzcjy 2025-06-03 14:26:04 +08:00
  • 27e327b415 fix new_page_count_next_decode (#6671) pansicheng 2025-06-03 13:48:52 +08:00
  • ff00895c46 Add CPU optimized kernels for topk and rope fusions (#6456) jianan-gu 2025-06-03 08:37:34 +08:00
  • ff91474825 [Router] Fix k8s Service Discovery (#6766) Arthur Cheng 2025-06-02 16:57:23 -07:00
  • eb38c7d1ca [1/2] Add Kernel support for Cutlass based Fused FP4 MoE (#6093) Pavani Majety 2025-06-02 13:48:03 -07:00
  • df7f61ee7d Speed up rebalancing when using non-static dispatch algorithms (#6812) fzyzcjy 2025-06-03 02:18:17 +08:00
  • ef21729c1d Fix profiles do not have consistent names (#6811) fzyzcjy 2025-06-03 02:17:22 +08:00
  • f5159315b2 Add simple utility to dump tensors for debugging (#6815) fzyzcjy 2025-06-03 02:15:31 +08:00
  • 6d7b6696d4 Tiny fix EPLB assertion about rebalancing period and recorder window size (#6813) fzyzcjy 2025-06-03 02:13:33 +08:00
  • 6376b632eb Tiny log prefill time (#6780) fzyzcjy 2025-06-03 01:28:27 +08:00
  • e05e29d178 Refactor CustomOp to avoid confusing bugs (#5382) fzyzcjy 2025-06-03 01:27:36 +08:00