Commit Graph

  • 61bb285827 chore: upgrade xgrammar 0.1.21 (#7962) Yineng Zhang 2025-07-11 19:26:52 -07:00
  • 880221bd3b Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (#7236)" (#7968) fzyzcjy 2025-07-12 10:03:01 +08:00
  • 8f3173d0b0 chore: bump sgl-kernel v0.2.5 (#7964) Yineng Zhang 2025-07-11 18:24:20 -07:00
  • 26118a133d [fix] Update unit test for fp8_blockwise_scaled_grouped_mm kernel (#7932) Qi Yuhang 2025-07-12 05:29:13 +08:00
  • 475a249bb8 temporarily disable deepep-8-gpu and activate two small tests (#7961) Cheng Wan 2025-07-11 14:22:05 -07:00
  • 191d836ff6 fix: minor fix for modelopt weight load compatibility (#7953) Peng Zhang 2025-07-12 05:20:58 +08:00
  • 86044712c6 [feature] kv transfer support of ascend npu (#7795) ronnie_zheng 2025-07-11 10:07:51 +03:00
  • 615553079d Support Kimi K2 (#7940) Atream 2025-07-11 15:02:21 +08:00
  • 49a5915f53 [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (#7775) Xiaoyu Zhang 2025-07-11 06:12:39 +08:00
  • 766392c6bd [feature] Ascend quantization support (#7791) ronnie_zheng 2025-07-10 19:17:37 +03:00
  • 4a0d19198b Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen (#6449) likesen-alibaba 2025-07-10 16:19:56 +08:00
  • 5748241549 add sentencepiece as dependency explicitly (#7922) Zaili Wang 2025-07-10 16:06:27 +08:00
  • 2d54d4bb64 Feat: Support Phi-3.5-MoE in SGLang (#7907) Binyao Jiang 2025-07-09 23:51:33 -07:00
  • b5e3d6031c vlm: support video as an input modality (#5888) Mick 2025-07-10 14:48:35 +08:00
  • 4ed57807c2 [bugfix] add pd router policy validation (#7904) Simo Lin 2025-07-09 21:52:06 -07:00
  • dd445a41f5 [feature] Add start step profile argument in /start_profile (#7608) kyleliang-nv 2025-07-09 18:42:15 -07:00
  • 7590f5224b [router] Update metrics when request completes (#7899) Shuaiyi Zhang 2025-07-09 23:18:09 +08:00
  • f9df11ae86 Remove unused imports (#7898) almaslof 2025-07-09 17:36:48 +03:00
  • d389bedf72 [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding (#7838) jianan-gu 2025-07-09 17:04:21 +08:00
  • ac80f4da57 [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt (#7885) Chunyuan WU 2025-07-09 16:53:53 +08:00
  • d487555f84 [CI] Add deepep tests to CI (#7872) Cheng Wan 2025-07-09 01:49:47 -07:00
  • e5888eddda Fixes typo in assertion message (#7895) Xinyuan Tong 2025-07-09 01:47:14 -07:00
  • 066f4ec91f chore: bump v0.4.9.post1 (#7882) Yineng Zhang 2025-07-09 00:28:17 -07:00
  • b6b6268ccf Revert "Embedding parallel by attn_tp (#7623)" (#7880) Yineng Zhang 2025-07-08 22:03:09 -07:00
  • 0870232195 Update native_api doc to match the change in the get_model_info endpoint (#7660) Yikai Zhang 2025-07-09 12:05:58 +08:00
  • 64c5907e12 [PD] Add guidance for prefill bootstrap timeout (#7846) Shangming Cai 2025-07-09 12:00:34 +08:00
  • 128f16a817 [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack (#7818) Chunyuan WU 2025-07-09 10:27:24 +08:00
  • 4986104618 Bump xgrammar's version to 0.1.20 (#7866) ybyang 2025-07-09 08:55:30 +08:00
  • a37e1247c1 [Multimodal][Perf] Use pybase64 instead of base64 (#7724) Brayden Zhong 2025-07-08 17:00:58 -04:00
  • 136c6e0431 fix: Handles input_embeds in GenerateReqInput when n>1 (#7830) Xinyuan Tong 2025-07-08 14:00:42 -07:00
  • 43e20c0647 Support Mimo-VL (#7579) Xinyuan Tong 2025-07-08 14:00:25 -07:00
  • 4bab50a6b5 Fix llama4 vision (#7840) Xinyuan Tong 2025-07-08 14:00:03 -07:00
  • 2e7ab862e3 Fix illegal memory in trtllm allreduce fusion (#7864) Xiaoyu Zhang 2025-07-09 02:47:17 +08:00
  • 51ae40306a [router] forward stream_options in request (#7860) Shuaiyi Zhang 2025-07-08 23:03:38 +08:00
  • 653b873b91 Fix cache modules of triton import error (#7832) kk 2025-07-08 17:50:09 +08:00
  • d379bda4fa [Bugfix] Fix two batch overlap with auto DeepEP Dispatch (#7853) Shangming Cai 2025-07-08 17:49:32 +08:00
  • 2b0e1d1ce0 [Minor] Fix sporadic CI timeout caused by underestimated tests. (#7850) Lifu Huang 2025-07-08 01:01:49 -07:00
  • 659907e32b Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (#7129) Zhiyu 2025-07-08 00:19:50 -07:00
  • cb9d91ea8a feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (#7762) SijiaYang 2025-07-08 05:47:21 +08:00
  • 6a6e0bb7fd docs: update README (#7821) Yineng Zhang 2025-07-07 02:47:04 -07:00
  • 076313bd09 [AMD] Fail gracefully when AITER is unavailable on gfx90a GPUs (#7187) Haohui Mai 2025-07-07 02:09:58 -07:00
  • 9abe1163ac fix duplicate args in schedule_batch (#7816) Ziming Huang 2025-07-07 16:31:03 +08:00
  • 3646f6bb3e [misc] release new router version (#7798) Simo Lin 2025-07-06 22:54:17 -07:00
  • 35724aa182 [docs] update router readme (#7797) Simo Lin 2025-07-06 22:54:11 -07:00
  • 2fc824b84c Kernels for efficient KV cache IO (#7313) Zhiqiang Xie 2025-07-06 22:53:36 -07:00
  • 253454de9b Integrate triton moe kernel (#7689) Yuan Luo 2025-07-07 11:05:49 +08:00
  • ea3e7ffec7 [bugfix] Fix sgl-router get_server_info endpoint compatibility issue (#7813) Simo Lin 2025-07-06 19:52:57 -07:00
  • 8d4a01cbd7 Log the timestamps of each prefill/decode iteration (#6094) yuhsuan-t 2025-07-06 18:57:27 -07:00
  • a3398d8478 Optimize moe align block size kernel (#7794) Ke Bao 2025-07-07 09:20:30 +08:00
  • ba69c153f6 [RL]: Fix error tagging in multi-stage wake up (#7812) Nan Jiang 2025-07-06 16:51:29 -07:00
  • 3589aa79b0 [RL] Fix illegal memory for _import_static_state (#7733) Stefan He 2025-07-06 16:25:21 -07:00
  • e00715eb66 [AMD] Add test_fused_moe.py and test_rope_rocm.py to AMD CI (#5246) Hubert Lu 2025-07-06 01:47:16 -07:00
  • ea4bf12286 Fix division-by-zero bug in LoRA triton kernels. (#7785) Lifu Huang 2025-07-06 00:45:29 -07:00
  • a291439a59 Support logprobs in two-batch overlap (#7709) fzyzcjy 2025-07-06 10:05:32 +08:00
  • 54411f6afa fix: disable dsv3_router_gemm in dsv3_nextn (#7793) JieXin Liang 2025-07-06 10:01:01 +08:00
  • 625018d259 fix: free disk space (#7803) Yineng Zhang 2025-07-05 18:52:25 -07:00
  • 5732d904cc [misc] remove pdlb rust (#7796) Simo Lin 2025-07-05 17:44:51 -07:00
  • ec5f9c6269 chore: bump v0.4.9 (#7802) Yineng Zhang 2025-07-05 17:40:29 -07:00
  • 62f5522ffe chore: upgrade sgl-kernel v0.2.4 (#7801) Yineng Zhang 2025-07-05 17:37:40 -07:00
  • 01f9873048 Fix CI test OOM issue. (#7799) Lifu Huang 2025-07-05 15:11:02 -07:00
  • 199d621845 ci: fix port args (#7792) Mick 2025-07-06 06:06:42 +08:00
  • f200af0d8c chore: bump sgl-kernel v0.2.4 (#7800) Yineng Zhang 2025-07-05 15:03:31 -07:00
  • 5589b75024 Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (#7756) Lianmin Zheng 2025-07-05 12:17:05 -07:00
  • c04a8a820b [fix] fix misusing of is_cuda (#7790) JieXin Liang 2025-07-05 19:02:14 +08:00
  • 6c903611ca Fix incorrect spec_num_draft_tokens in draft_extend (#7757) Cheng Wan 2025-07-05 02:18:16 -07:00
  • 77cfea689d chore: upgrade sgl-kernel v0.2.3 (#7786) Yineng Zhang 2025-07-05 01:55:55 -07:00
  • 8fc910db03 DP Attention with Auto DeepEP Dispatch (#7222) Cheng Wan 2025-07-05 01:54:24 -07:00
  • 75354d9ae9 fix: use nvidia-nccl-cu12 2.27.5 (#7787) Yineng Zhang 2025-07-05 01:28:21 -07:00
  • 4fece12be9 chore: bump sgl-kernel v0.2.3 (#7784) Yineng Zhang 2025-07-05 00:05:45 -07:00
  • c797322280 fix: fix apply_shuffle_mul_sum (#7444) Mick 2025-07-05 14:23:30 +08:00
  • ef8a29c429 Embedding parallel by attn_tp (#7623) Gang Chen 2025-07-05 14:21:56 +08:00
  • 8e9fb43d82 Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario (#7782) Qi Yuhang 2025-07-05 13:25:49 +08:00
  • 8364608930 add model: qwen2-audio (#7596) Leng Yue 2025-07-04 21:13:10 -07:00
  • da3890e82a [1/n]: add cutlass W4A8 moe kernel for hopper architecture (#7772) SijiaYang 2025-07-05 11:50:12 +08:00
  • cb432f1770 saving hidden_states.clone() (#7705) Cheng Wan 2025-07-04 20:07:42 -07:00
  • 1964c325de [feat] Support EAGLE3 for Qwen (#7745) Ximingwang-09 2025-07-05 10:50:28 +08:00
  • af5647748a [Fix] Alloc return type error (#7778) Caproni 2025-07-05 10:00:40 +08:00
  • af46f299f9 [RL] add pause and continue generation for async rl training (#7419) Zilin Zhu 2025-07-05 09:49:49 +08:00
  • 16a6b1d83a [RL] Add --nccl-port to prevent port conflict (#7418) Zilin Zhu 2025-07-05 09:48:57 +08:00
  • 14229ccf8f Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups (#7748) Lianmin Zheng 2025-07-04 16:33:33 -07:00
  • 975a5ec69c [fix] update bench_speculative.py for compatibility (#7764) Kay Yan 2025-07-04 16:32:54 +08:00
  • 1e3e3add3d fix(docs): fix the broken link in docs/references/production_metrics.md (#7741) Yuchen Cheng 2025-07-04 14:46:07 +08:00
  • 8c298031d5 refactor llama4 dp attention logic (#7729) Yi Zhang 2025-07-04 13:48:11 +08:00
  • 4de0395343 Add V2-lite model test (#7390) YanbingJiang 2025-07-04 13:25:50 +08:00
  • 8b1942c6cc Remove type conversion and fix id map in topk (#7759) Ke Bao 2025-07-04 09:13:32 +08:00
  • 489934be0a fuse renormal into moe topk softmax kernel python code (#7751) Yi Zhang 2025-07-04 07:22:14 +08:00
  • 43f93f632c fix CI: update native api ipynb (#7754) Xinyuan Tong 2025-07-03 15:25:00 -07:00
  • aca1101a13 chore: bump sgl-kernel 0.2.2 (#7755) Yineng Zhang 2025-07-03 12:49:10 -07:00
  • 2998c4bdf4 [optimize] fuse renormalize into moe_topk_softmax (#7744) Yi Zhang 2025-07-04 03:42:44 +08:00
  • 6840a7bbb2 [fix] put cpu in the first priority in get_device() (#7752) JieXin Liang 2025-07-04 02:49:32 +08:00
  • c01a1df588 [Bug] add flashinfer bool check for fusedmoe in Qwen moe models (#7723) yilian49 2025-07-03 11:32:11 -07:00
  • 0099172327 feat: use D2D instead of H2H in pp (#7673) TianyuZhang1214 2025-07-04 01:58:50 +08:00
  • 264dc6e744 [optimize] add two stream norm for qwen3 (#7740) Yi Zhang 2025-07-04 00:59:17 +08:00
  • 646cef2e2e support qwen3 dense model dp attention (#7681) Yi Zhang 2025-07-04 00:58:20 +08:00
  • 1dce6c480f [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (#6771) Chunyuan WU 2025-07-04 00:51:38 +08:00
  • 9fcc9a80e7 [CPU] refine CPU integration code (#7647) Chunyuan WU 2025-07-04 00:51:09 +08:00
  • ac49dac009 [fix] fix dsv3_router_gemm filter (#7750) JieXin Liang 2025-07-04 00:25:32 +08:00
  • 1e0e549766 Ascend attention backend(PA&MLA) (#7722) ronnie_zheng 2025-07-03 19:23:19 +03:00
  • b58226510f fix dsv3 fused proj check (#7738) AniZpZ 2025-07-03 16:52:44 +08:00
  • 2c4feaf308 Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (#7278) ayrnb 2025-07-03 14:27:03 +08:00