Commit Graph

  • 69276f619a doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (#6199) mlmz 2025-05-11 23:22:11 +08:00
  • 41a645f556 Handle empty input string for embedding models (#5621) Ravi Theja 2025-05-11 20:47:15 +05:30
  • 230106304d chore: upgrade sgl-kernel v0.1.2.post1 (#6196) Yineng Zhang 2025-05-11 07:41:37 -07:00
  • 45b4dcf037 chore: bump sgl-kernel v0.1.2.post1 (#6195) Yineng Zhang 2025-05-11 02:24:10 -07:00
  • 213e8c7dd5 chore: upgrade deepgemm (#6073) Yineng Zhang 2025-05-11 02:17:24 -07:00
  • 41273fd71f fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (#6169) Yusong Gao 2025-05-11 15:28:21 +08:00
  • e9bebafb19 [fix] remove mixtral from is_fa3_default_architecture (#6191) JieXin Liang 2025-05-11 15:15:54 +08:00
  • 4d1c9db66c feat: support loogle eval (#6190) Yineng Zhang 2025-05-10 23:52:44 -07:00
  • 17c36c5511 [CI] Disabled deepep tests temporarily because it takes too much time. (#6186) Lianmin Zheng 2025-05-10 23:40:50 -07:00
  • 31d1f6e7f4 [PD] Add simple unit test for disaggregation feature (#5654) shangmingc 2025-05-11 13:35:27 +08:00
  • a823c6e834 Remove duplicate IO Struct test (#6180) Emmanuel Ferdman 2025-05-11 07:57:26 +03:00
  • 2ce8793519 Add typo checker in pre-commit (#6179) applesaucethebun 2025-05-11 00:55:00 -04:00
  • de167cf5fa Fix request abortion (#6184) Lianmin Zheng 2025-05-10 21:54:46 -07:00
  • 4319978c73 Fix data parallel perf regression (#6183) Lianmin Zheng 2025-05-10 19:18:35 -07:00
  • 03dd785cd0 Added async_encode method to Engine (#4701) Steven Shimizu 2025-05-10 18:58:40 -07:00
  • 66fc63d6b1 Revert "feat: add thinking_budget (#6089)" (#6181) Yineng Zhang 2025-05-10 16:07:45 -07:00
  • 921e4a8185 [Docs]Delete duplicate content (#6146) Ximingwang-09 2025-05-11 06:02:15 +08:00
  • 9d8ec2e67e Fix and Clean up chat-template requirement for VLM (#6114) XinyuanTong 2025-05-10 09:14:09 -07:00
  • c178abdabc [fix] fix determine_n_share_experts_fusion (#6118) JieXin Liang 2025-05-10 16:19:09 +08:00
  • b29a026e14 KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (#6016) Simon (Jiyou) Li 2025-05-10 06:50:06 +08:00
  • 678d8cc987 chore: bump v0.4.6.post3 (#6165) Yineng Zhang 2025-05-09 15:38:47 -07:00
  • d2cb3024f2 fix bug that gpu0 occupies more memory when hicache is turned on (#5778) huangtingwei 2025-05-10 06:36:08 +08:00
  • 1940cdec61 [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (#6162) Chang Su 2025-05-09 15:33:02 -07:00
  • 63484f9fd6 feat: add thinking_budget (#6089) thyecust 2025-05-09 23:22:09 +08:00
  • dff0ab92eb Update amd nightly concurrency. (#6141) Sai Enduri 2025-05-09 00:02:14 -07:00
  • e30c273bc9 opt flashinfer mla cat (#5822) xu-yfei 2025-05-09 14:17:14 +08:00
  • 0ab3f437ab Cutlass MLA: Disable split kv due to https://github.com/NVIDIA/cutlass/issues/2274 (#6101) Trevor Morris 2025-05-08 18:44:30 -07:00
  • cec98f1034 [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (#5745) yhyang201 2025-05-09 08:52:26 +08:00
  • 8dc4efd0ab docs: update README (#6132) Yineng Zhang 2025-05-08 15:18:48 -07:00
  • 6578cf27de chore: bump sgl-kernel 0.1.2 (#6131) Yineng Zhang 2025-05-08 15:16:28 -07:00
  • 087751a8f2 Remove unnecessary is_fa3_supported check (#6112) Stefan He 2025-05-08 14:45:33 -07:00
  • 911f3ba6f4 upgrade xgrammar to 0.1.19 (#6129) Yixin Dong 2025-05-08 17:42:02 -04:00
  • f6f96b0521 [sgl-kernel] fix: fix cu118 compile error (#6123) PGFLMG 2025-05-09 05:26:51 +08:00
  • 2a936a841e [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (#6097) Hubert Lu 2025-05-08 13:46:58 -07:00
  • 5e02330137 [perf] dsv3 bmm fallback to bf16 (#5662) JieXin Liang 2025-05-09 02:43:39 +08:00
  • fa7d7fd9e5 [Feature] Add FlashAttention3 as a backend for VisionAttention (#5764) Zhu Chen 2025-05-09 01:01:19 +08:00
  • f1ff736d68 [fix] fix pyproject.toml dependencies (#6119) JieXin Liang 2025-05-08 17:14:36 +08:00
  • acc816d8a2 DeepEP normal support deepgemm-contiguous (#5626) lukec 2025-05-08 16:20:32 +08:00
  • a05bd83a94 Change AMD test threshold (#6091) fzyzcjy 2025-05-08 16:05:52 +08:00
  • cef91b1ed7 [PD] Add control to slow down a server (#5572) fzyzcjy 2025-05-08 16:03:08 +08:00
  • 6450c1228c Tiny refactor weight loading logic (#5232) fzyzcjy 2025-05-08 16:02:56 +08:00
  • b6cf3532b5 Tiny refactor ModelConfig.from_server_args (#5219) fzyzcjy 2025-05-08 16:02:43 +08:00
  • 3b2680a44d Overlap shared expert and routed expert computations (#5121) fzyzcjy 2025-05-08 16:02:32 +08:00
  • 79961afa82 optimize pad operations in fa3 to accelerate 100+us (#6077) Minglei Zhu 2025-05-07 23:40:08 -07:00
  • cfca4e0ed2 adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (#6111) Baizhou Zhang 2025-05-07 23:39:10 -07:00
  • e88dd482ed [CI]Add performance CI for VLM (#6038) XinyuanTong 2025-05-07 19:20:03 -07:00
  • 73600673bb Clean logs for DeepSeek-V3 launching (#6079) Baizhou Zhang 2025-05-07 18:54:50 -07:00
  • 8f508cc77f Update doc for MLA attention backends (#6034) Baizhou Zhang 2025-05-07 18:51:05 -07:00
  • 9bddf1c82d Deferring 8 GPU test (#6102) Cheng Wan 2025-05-07 21:49:58 -04:00
  • 24c13ca950 Clean up fa3 test from 8 gpus (#6105) Stefan He 2025-05-07 18:38:40 -07:00
  • b70957fcf8 [refactor] slightly tidy fp8 module (#5993) JieXin Liang 2025-05-08 08:28:24 +08:00
  • e444c13fb4 feat(engine): add bootstrap parameters to generate methods (dynamo) (#6075) ishandhanani 2025-05-07 10:33:58 -07:00
  • fee37d9e8d [Doc]Fix description for dp_size argument (#6063) Baizhou Zhang 2025-05-07 09:04:22 -07:00
  • c68de47915 Super tiny fix doc (#5233) fzyzcjy 2025-05-07 22:41:50 +08:00
  • 4c7b42424c Hint users DeepEP normal mode is incompatible with CUDA Graph (#5014) fzyzcjy 2025-05-07 22:40:59 +08:00
  • 38053c3372 Fix the timeout for 8 gpu tests (#6084) Lianmin Zheng 2025-05-07 03:13:12 -07:00
  • 00c2c1f08b [Feature] Support for Ascend NPU backend (#3853) Song Zhang 2025-05-07 11:32:53 +08:00
  • cb69194562 feat: add release workflow for SGLang kernels on aarch64 (#6010) Johnny 2025-05-07 04:42:07 +02:00
  • d25398cbc8 fix custom_allreduce namespace (#6039) Xiaoyu Zhang 2025-05-07 10:13:06 +08:00
  • 8a828666a3 Add DeepEP to CI PR Test (#5655) Jinyan Chen 2025-05-07 08:36:03 +08:00
  • aff584fa54 Fix sgl-kernel build on aarch64 platforms (#6062) Qiaolin Yu 2025-05-06 19:10:57 -04:00
  • 6f56614734 chore: upgrade cutlass 3.9.2 (#6004) Yineng Zhang 2025-05-06 13:34:08 -07:00
  • bdd17998e6 [Fix] Fix and rename flashmla CI test (#6045) Baizhou Zhang 2025-05-06 13:25:15 -07:00
  • c9abd7be01 fix: deepep dockerfile, use pip install deepep. (#5885) Hank Han 2025-05-07 01:55:26 +08:00
  • a3e4e9bf9e Better PD initialization (#5751) Liangsheng Yin 2025-05-07 01:12:57 +08:00
  • 6d4d3bc81d Fix not "import os" (#6057) Liangsheng Yin 2025-05-06 22:06:41 +08:00
  • 5f300141b7 docs: add new blog (#6048) Yineng Zhang 2025-05-06 00:48:10 -07:00
  • 1c05425bcb docs: add Google Cloud Vertex AI in Adoption and Sponsorship (#6047) Yineng Zhang 2025-05-06 00:31:48 -07:00
  • b26cb1c55a Fix problem of large page size with chunked prefill (#6046) Zhiqiang Xie 2025-05-06 00:19:47 -07:00
  • f8e460930a Fix prefill OOM error in the case of large page size (#5081) Zhiqiang Xie 2025-05-05 16:02:55 -07:00
  • 683707c314 [Security][Bug] Prevent binding to all TCP interfaces (#5752) Adarsh Shirawalmath 2025-05-06 00:51:45 +05:30
  • a68ed76682 feat: append more comprehensive fields in messages instead of merely role and content (#5996) mlmz 2025-05-06 02:43:34 +08:00
  • 82653f6622 feat: Add a unified merge_state API (#5428) DefTruth 2025-05-06 01:32:33 +08:00
  • 22da3d978f Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (#5555) Wenxuan Tan 2025-05-05 12:32:17 -05:00
  • b8559764f6 [Test] Add flashmla attention backend test (#5587) Huapeng Zhou 2025-05-06 01:32:02 +08:00
  • 56f6589ecb [PD] Optimize disaggregation ib device help info (#5781) shangmingc 2025-05-05 13:47:37 +08:00
  • 1232f7e8b7 Update dev container config to support live code sync and improve docker setup guide (#6018) Lifu Huang 2025-05-04 22:33:46 -07:00
  • 3008db9c1a [PD] Allow customizing reserved tokens to avoid KV cache waste (#6002) fzyzcjy 2025-05-05 11:23:15 +08:00
  • 357fb2dba5 fix: fix broadcast_pyobj breaking VerlEngine (#5997) Junrong Lin 2025-05-05 04:15:53 +08:00
  • 95c231e50d Tool Call: Add chat_template_kwargs documentation (#5679) vzed 2025-05-04 16:12:40 -04:00
  • 3042f1da61 Fix flaky issues of lora and add multi batch tests (#5957) Qiaolin Yu 2025-05-04 16:11:40 -04:00
  • 2b63798c7d [Minor] Fix duplicate method definitions in conversation.py (#6012) Lifu Huang 2025-05-04 13:02:53 -07:00
  • bf203cb7a2 [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (#5992) Baizhou Zhang 2025-05-04 09:49:13 -07:00
  • 8ebde73f7d [perf] H100 DeepSeek-V3 fused moe tuned config (#5998) JieXin Liang 2025-05-04 05:02:26 +08:00
  • 6b0fae797a Fix Phi3 serving which was broken by earlier change (#5991) Stefan He 2025-05-03 00:28:47 -07:00
  • 141a459644 fix: only upgrade nccl for cu128 (#5986) Yineng Zhang 2025-05-02 11:31:29 -07:00
  • d8ab60117f Overlap qk norm with two streams (#5977) Ke Bao 2025-05-03 00:26:30 +08:00
  • 6579cd7daf Fix set kv cache multi-stream (#5975) Ke Bao 2025-05-03 00:26:00 +08:00
  • 97ac42b634 [PD] NIXL backend Prefill TP & Decode TP+DP (#5681) Yongtong Wu 2025-05-02 22:14:03 +08:00
  • 1acca3a2c6 FA3 speed up: skip len operation and get batch size directly from forward batch (#5969) Lifu Huang 2025-05-02 00:26:12 -07:00
  • 6ea1e6ac6e Support MMMU benchmark for InternVL (#5968) XinyuanTong 2025-05-02 00:17:21 -07:00
  • 3409aaab32 Support InternVL3 (#5350) xm:D 2025-05-02 13:38:59 +08:00
  • 73dcf2b326 Remove token in token out in Native API (#5967) Chayenne 2025-05-01 21:59:43 -07:00
  • 170d1f218a feat: Refactor DeepSeekV3 function call (#5908) Chang Su 2025-05-01 21:28:57 -07:00
  • 73bc1d00fc Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (#5960) Sai Enduri 2025-05-01 20:56:59 -07:00
  • c5645e928f feat: add concurrency evaluation logic in mmmu benchmark (#5782) XinyuanTong 2025-05-01 18:20:08 -07:00
  • d33955d28a Properly return error response in vertex_generate HTTP endpoint (#5956) KCFindstr 2025-05-01 11:48:58 -07:00
  • 6fc175968c Optimize a pad operation to accelerate 25us (#5945) Stefan He 2025-05-01 10:48:55 -07:00
  • ad506a4e6b docs: Fix Qwen model typo (#5944) 江家瑋 2025-05-02 01:23:00 +08:00
  • ebaba85655 Update ci test and doc for MTP api change (#5952) Ke Bao 2025-05-02 00:30:27 +08:00