Commit Graph

  • a2cb5913a0 Add draft extend CUDA graph for flashinfer backend (#6805) Ke Bao 2025-06-02 16:51:26 +08:00
  • 55444ed667 [EP] Add cuda kernel for moe_ep_pre_reorder (#6699) Yuan Luo 2025-06-02 11:49:01 +08:00
  • 20fd53b8f6 Correctly abort the failed grammar requests & Improve the handling of abort (#6803) Lianmin Zheng 2025-06-01 19:00:07 -07:00
  • 6a47b73024 Remove contiguous before Flashinfer groupwise fp8 gemm (#6804) Baizhou Zhang 2025-06-01 18:30:54 -07:00
  • c429919def misc: cache is_hopper_arch (#6799) Wenxuan Tan 2025-06-01 17:28:31 -05:00
  • 1da8d23051 chore: update blackwell docker (#6800) Yineng Zhang 2025-06-01 13:37:40 -07:00
  • 2f7420bc84 [Feat] Enable PDL automatically on Hopper architecture (#5981) Huapeng Zhou 2025-06-01 12:30:17 -07:00
  • c6a0cacc35 Update CI tests for Llama4 models (#6421) Ravi Theja 2025-06-01 09:22:15 +05:30
  • 0a9bfc20ab [Minor] Always append newline after image token when parsing chat message (#6797) Lifu Huang 2025-05-31 20:50:33 -07:00
  • 34c63731fc chore: upgrade sgl-kernel v0.1.5 (#6795) Yineng Zhang 2025-05-31 18:32:00 -07:00
  • 2d72fc47cf Improve profiler and integrate profiler in bench_one_batch_server (#6787) Lianmin Zheng 2025-05-31 15:53:55 -07:00
  • b520d02888 chore: bump sgl-kernel v0.1.5 (#6794) Yineng Zhang 2025-05-31 14:54:00 -07:00
  • 7dc0e39442 Bump torch to 2.7.0 (#6788) Qiaolin Yu 2025-05-31 17:43:12 -04:00
  • fb507b7b10 [FIX] mmmu bench serving result display error (#6525) (#6791) Yikai Zhang 2025-06-01 04:48:06 +08:00
  • f90945c45a fix(PD-disaggregation): Can not get local ip (#6792) storyicon 2025-06-01 04:47:14 +08:00
  • 094fbdacd5 Fix incorrect LoRA weight loading for fused gate_up_proj (#6734) Lifu Huang 2025-05-31 13:41:44 -07:00
  • 888cb175a6 Add intel_amx backend for Radix Attention for CPU (#6408) YanbingJiang 2025-05-31 12:37:42 +08:00
  • e39bca0756 ci: relax test_function_call_required (#6786) Chang Su 2025-05-30 19:18:42 -07:00
  • a2bb856543 Temporarily lower mmlu threshold for triton sliding window backend (#6785) Jianan Ji 2025-05-30 21:40:50 -04:00
  • ced3c07afe Support token-level quantization for EP MoE (#6782) Cheng Wan 2025-05-30 17:26:30 -07:00
  • f18b068f15 feat(tool call): Enhance Llama32Detector for improved JSON parsing in non-stream (#6784) Chang Su 2025-05-30 17:05:17 -07:00
  • 4fac524b14 update llama4 chat template and pythonic parser (#6679) Chao Yang 2025-05-30 17:01:22 -07:00
  • b581b22504 Fix one bug in the grouped-gemm triton kernel (#6772) Cheng Wan 2025-05-30 01:42:08 -07:00
  • 69dd878b51 Fix shared experts fusion error (#6289) Li Hui 2025-05-30 16:16:11 +08:00
  • 22630ca242 Support sliding window in triton backend (#6509) Jianan Ji 2025-05-30 04:11:53 -04:00
  • d279d4990c Fix aiohttp 'Chunk too big' in bench_serving (#6737) Yuhong Guo 2025-05-30 15:50:36 +08:00
  • 6cb00c6398 [PD] Optimize time out logic and add env var doc for mooncake (#6761) shangmingc 2025-05-30 15:45:02 +08:00
  • 62cac2c43a Update DeepSeek-R1-0528 function call chat template (#6765) Xu Wenqing 2025-05-30 15:42:57 +08:00
  • 2c3b71d678 Improve EPLB logical to physical dispatch map (#6727) fzyzcjy 2025-05-30 10:23:54 +08:00
  • 51cdd81f97 [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (#6265) Zilin Zhu 2025-05-30 07:28:10 +08:00
  • 73def253b5 Fix mem_fraction_static for AMD CI (#6748) Baizhou Zhang 2025-05-29 12:37:30 -07:00
  • d9d35def3d [test] add ut and bm for get_last_loc (#6746) JieXin Liang 2025-05-30 02:47:21 +08:00
  • 6df81e8a39 Support tuning DeepEP configs (#6742) fzyzcjy 2025-05-29 23:12:22 +08:00
  • 3ab7d9b55e Support picking variants of EPLB algorithms (#6728) fzyzcjy 2025-05-29 23:12:01 +08:00
  • 7e5071c92a Super tiny enable sole usage of expert distribution metrics and update doc (#6680) fzyzcjy 2025-05-29 23:11:38 +08:00
  • 78689d3393 PD Rust LB (PO2) (#6437) Liangsheng Yin 2025-05-29 20:50:10 +08:00
  • 1dc6864f17 [PD] Support completion endpoint (#6729) shangmingc 2025-05-29 16:26:18 +08:00
  • 485a023bd8 refactor apply_w8a8_block_fp8_linear in fp (#6545) ChangyiYang 2025-05-29 00:15:11 -07:00
  • 7e41290082 Add draft extend CUDA graph for Triton backend (#6705) Ke Bao 2025-05-29 15:13:07 +08:00
  • c673727e0e refactor(tool call): Fix BaseFormatDetector tool_index issue and refactor parse_streaming_increment (#6715) Chang Su 2025-05-29 00:08:45 -07:00
  • f4d4f93928 Add DeepSeek-R1-0528 function call chat template (#6725) Xu Wenqing 2025-05-29 15:05:07 +08:00
  • f2bd3515fb Tune memory arguments on B200 (#6718) Baizhou Zhang 2025-05-29 00:03:22 -07:00
  • c459536b0f [PD] bug fix: Update status if nixl receiver send a dummy req. (#6720) dongmao zhang 2025-05-29 00:01:56 -07:00
  • 535c838674 [fix] more mem for draft_extend cuda_graph (#6726) JieXin Liang 2025-05-29 14:25:18 +08:00
  • 2163586e63 [feat] triton kernel for get_last_loc (#6676) JieXin Liang 2025-05-29 14:10:28 +08:00
  • e06b076105 Fix PP for Qwen3 MoE (#6709) iLeGend 2025-05-29 14:06:18 +08:00
  • 844a8f42c7 Fix LoRA bench (#6719) Wenxuan Tan 2025-05-28 18:38:55 -05:00
  • 791b3bfabb [Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell (#6479) Baizhou Zhang 2025-05-28 16:03:43 -07:00
  • 31589e177e Speed up when having padding tokens two-batch overlap (#6668) fzyzcjy 2025-05-29 07:00:58 +08:00
  • ae6a5b2950 Minor refactor two-batch overlap (#6682) fzyzcjy 2025-05-29 06:54:17 +08:00
  • 4839999b76 Overlap two kernels in DeepSeek with communication (#6711) fzyzcjy 2025-05-29 06:53:51 +08:00
  • 541a985f85 Fuse routed_scaling_factor in DeepSeek (#6710) fzyzcjy 2025-05-29 06:53:37 +08:00
  • 5170b010a6 [PD] Remove Unnecessary Exception Handling for FastQueue.get() (#6712) Hongbo Xu 2025-05-29 02:18:24 +08:00
  • d63e76f735 [CI] Fix setup of disaggregation with different tp (#6706) shangmingc 2025-05-29 02:17:27 +08:00
  • e9fd11c0d1 [Bugfix] Fix ChatCompletion endpoint of mini_lb when stream is set (#6703) shangmingc 2025-05-28 21:33:36 +08:00
  • c7588d593e [Bugfix] Fix slice operation when chunk size mismatch (#6697) shangmingc 2025-05-28 21:15:00 +08:00
  • 6b231325b9 [PD Perf] replace Queue to FastQueue (#6649) ybyang 2025-05-28 16:37:51 +08:00
  • b1c8d4e9f3 [PD] Abort unbootstrapped prefill requests through timeout (#6685) shangmingc 2025-05-28 15:40:54 +08:00
  • c25231c679 [CI] Fix flaky pp single node test (#6689) shangmingc 2025-05-28 15:40:26 +08:00
  • fba03b29e3 [Bugfix] Fix missing abort finish reason for PD with ChatCompletion (#6693) shangmingc 2025-05-28 15:39:46 +08:00
  • 461a730280 fix(deepseekv3): Fix DeepSeekV3Detector tool_index assignment and multi-tool call streaming support (#6655) Chang Su 2025-05-28 00:22:53 -07:00
  • 076103535c fix log_info_on_rank0 error when run benchmark (#6260) Xiaoyu Zhang 2025-05-28 15:20:01 +08:00
  • c087ddd686 Refine pre_reorder_triton_kernel slightly to improve performance (#6627) Yuan Luo 2025-05-28 15:15:23 +08:00
  • f4a8987f69 Update amd docker and nightly models. (#6687) Sai Enduri 2025-05-28 00:08:08 -07:00
  • 41ba767f0c feat: Add warnings for invalid tool_choice and UTs (#6582) Chang Su 2025-05-27 16:53:19 -07:00
  • f127355a30 Add batch test for draft extend (#6672) Ke Bao 2025-05-28 07:32:05 +08:00
  • bdb962d755 fix(tool call): Fix tool_index in PythonicDetector and issues with mixed output in non-streaming (#6678) Chang Su 2025-05-27 16:18:42 -07:00
  • 0b9557fcd7 Disable compiling arch below sm_90 in aarch64 by default (#6380) Qiaolin Yu 2025-05-27 18:50:02 -04:00
  • 87068b5cc7 Support gathering expert distribution details (#6665) fzyzcjy 2025-05-28 06:32:59 +08:00
  • a564e001b5 Fix DeepEP error in Qwen 3 MoE models (#6673) fzyzcjy 2025-05-28 06:12:54 +08:00
  • 2103b80607 [CI] update verlengine ci to 4-gpu test (#6007) Junrong Lin 2025-05-28 05:32:23 +08:00
  • e806f708c9 [PD] Make bootstrap code common between NIXL and Mooncake (#6473) Trevor Morris 2025-05-27 12:47:38 -07:00
  • fa6723f08f Revert "fix communicator for non-dp lm head (#6662)" (#6677) Yineng Zhang 2025-05-27 12:22:59 -07:00
  • 673ff668f7 Speed up expert location update (#6661) fzyzcjy 2025-05-28 01:00:09 +08:00
  • 447be24228 Fix OOM when updating expert locations (#6660) fzyzcjy 2025-05-28 00:59:53 +08:00
  • 183d9f969c DeepSeek: enable none block-quant FP8 quantizations (#6638) HAI 2025-05-27 09:06:40 -07:00
  • 631950280a Support EAGLE draft extend CUDA graph (#6606) Ke Bao 2025-05-27 17:35:17 +08:00
  • a3d7f4b673 fix communicator for non-dp lm head (#6662) Cheng Wan 2025-05-27 02:31:12 -07:00
  • b18416fbf8 Fix qwen3 tbo/dp-lm-head (#6652) Yi Zhang 2025-05-27 15:38:27 +08:00
  • ce9d690ef4 fix: fix nightly test from updating transformers (#6658) Mick 2025-05-27 15:28:11 +08:00
  • bdaefbbfbd Add environment flag for disabling message queue broadcaster (#6403) Baizhou Zhang 2025-05-26 22:32:41 -07:00
  • 45a31a82e4 docs: Update documentation to reflect xgrammar as default grammar backend (#6601) Vincent Zhong 2025-05-27 01:29:13 -04:00
  • 1aa0fbf416 Add note to add supported model to documentation (#6640) Brayden Zhong 2025-05-27 01:18:46 -04:00
  • 7a0bbe6a64 update toc for doc and dockerfile code style format (#6450) linzhuo 2025-05-27 13:05:11 +08:00
  • ae33584235 [Bugfix]: Fix call for function_call_parser.multi_format_detector in adapter.py (#6650) Chang Su 2025-05-26 21:57:10 -07:00
  • 477a101cbd Refactor LoRA handling to support adapter tensors in fused format (#6585) Lifu Huang 2025-05-26 21:51:54 -07:00
  • 1a8f5f6836 Super tiny rename environment variable (#6648) fzyzcjy 2025-05-27 12:01:16 +08:00
  • 32cd707002 Support TP in attention for two batch overlap (#6634) fzyzcjy 2025-05-27 11:28:12 +08:00
  • ebd1ed49d4 Tiny refactor communicator (#6646) fzyzcjy 2025-05-27 11:24:17 +08:00
  • f77da69964 chore: upgrade mooncake-transfer-engine (#6643) Yineng Zhang 2025-05-26 20:01:30 -07:00
  • d6864ce6d6 [New Model] Devstral support (#6547) Xinyuan Tong 2025-05-26 19:27:48 -07:00
  • 755a36614b fix: added "\n" to qwen25 tool parser structural tags (#6631) Shi Shuai 2025-05-27 10:25:45 +08:00
  • 79a39ac0cc follow-up: move Idefics2 to a shared location to eliminate unexpected dependency. (#6603) Lifu Huang 2025-05-26 19:23:59 -07:00
  • 3ce94f71f9 [PD] Handle P/D failure and reconnect without affecting other instances (#6263) shangmingc 2025-05-27 10:21:01 +08:00
  • ca95556c76 Tiny fix sampler error when prob is not contiguous (#6639) fzyzcjy 2025-05-27 10:19:08 +08:00
  • eb8f02dd87 Update nightly thresholds and dependencies. (#6635) Sai Enduri 2025-05-26 11:44:13 -07:00
  • 0ca3e56802 Tiny fix missing expert location dispatch info (#6620) fzyzcjy 2025-05-26 23:58:31 +08:00
  • 5c7aa00976 Fix EPLB algorithm fail to run when using 3 nodes for prefill (#6629) fzyzcjy 2025-05-26 23:43:24 +08:00
  • fe386acae6 Automatically configure for EPLB-related args (#6628) fzyzcjy 2025-05-26 23:42:49 +08:00
  • 14d1075f2c fix qwen3moe eplb prefill bug (#6617) Yi Zhang 2025-05-26 17:15:21 +08:00