Commit Graph

  • 282eb59ff3 Add bf16 output option for dsv3_router_gemm kernel (#7999) Baizhou Zhang 2025-07-19 18:49:37 -07:00
  • 4540a4666a [Feature] Simple Improvement of the Health Check Mechanism for Production-Grade Stability (#8115) ybyang 2025-07-20 09:10:00 +08:00
  • abda2542d5 Fix tuning_fused_moe_triton.py (#8175) Cheng Wan 2025-07-19 17:33:50 -07:00
  • 8cddfa56a1 Clean warning logs for gate_proj loading in Lora (#8172) Baizhou Zhang 2025-07-19 15:56:50 -07:00
  • 4e3defe5a7 Support starting up the LoRA server without initial adapters (#8019) Lifu Huang 2025-07-19 15:38:09 -07:00
  • 60468da4e2 bugfix: fix sglang crash in NVIDIA MIG container (#8167) Garry Fang 2025-07-20 05:41:27 +08:00
  • 41d33e4736 [router] add ut for worker and errors (#8170) Simo Lin 2025-07-19 14:38:33 -07:00
  • bfdd226f35 Fix Dockerfile.gb200 (#8169) kyleliang-nv 2025-07-19 14:37:53 -07:00
  • 3de617a75b Fix LoRA buffer contamination during adapter eviction (#8103) Lifu Huang 2025-07-19 13:14:08 -07:00
  • bb0e8a32b5 Clean up server args (#8161) Lianmin Zheng 2025-07-19 11:32:52 -07:00
  • 1b427dae02 Update README.md (#8171) Lianmin Zheng 2025-07-19 11:04:19 -07:00
  • f3d9736156 Fix suffix mismatch for the metrics. (#8168) Charles Chen 2025-07-20 01:11:24 +08:00
  • 561dd7b2ce chore: upgrade sgl-kernel 0.2.6 (#8166) Yineng Zhang 2025-07-19 03:17:08 -07:00
  • f98e88b9fb chore: bump sgl-kernel v0.2.6 (#8165) Yineng Zhang 2025-07-19 00:56:18 -07:00
  • 15ad6c9086 [1/N] MoE Refactor: refactor select_experts (#7966) Cheng Wan 2025-07-19 00:51:15 -07:00
  • cfab0ff6e2 Add GB200 wide-EP docker (#8157) kyleliang-nv 2025-07-18 22:34:29 -07:00
  • b763cf7e8e [router] allow router to have empty workers (#8160) Simo Lin 2025-07-18 22:09:54 -07:00
  • 8fcc55cfa1 [router] router metrics cleanup (#8158) Simo Lin 2025-07-18 22:09:17 -07:00
  • 610381b75e [health_generate] fix: fix the bug where /health_generate always succeeds (#8028) Yingchun Lai 2025-07-19 13:08:46 +08:00
  • 1403ea5694 [PD] Support non-MLA models PD different TP with DP attention (#7931) Shangming Cai 2025-07-19 13:00:49 +08:00
  • b7e951a6db Feat: Support audio in Phi4-mm model (#8048) Binyao Jiang 2025-07-18 21:03:53 -07:00
  • d918ab7985 Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#7302) Haohui Mai 2025-07-18 19:59:39 -07:00
  • 3964b352c3 chore: tune mem fraction static for vlm (#6881) Mick 2025-07-19 08:19:27 +08:00
  • 9c7a46180c [Doc] Steps to add a new attention backend (#8155) Lianmin Zheng 2025-07-18 16:38:26 -07:00
  • 7750b91ca8 [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (#7661) Hubert Lu 2025-07-18 14:27:25 -07:00
  • c8f31042a8 [router] Refactor router and policy traits with dependency injection (#7987) Simo Lin 2025-07-18 14:24:24 -07:00
  • 1f76fc8747 [3/n] chore: decouple AWQ implementation from vLLM dependency (#8113) Hongbo Xu 2025-07-19 02:45:22 +08:00
  • 6737671c82 [Bugfix] Fix w8a8_int8 import error on NPU (#8147) Even Zhou 2025-07-19 02:34:55 +08:00
  • fd63b62eaa fix compressed tensors WNA16 imports (#8142) Enrique Shockwave 2025-07-18 19:34:14 +01:00
  • 719b29f218 feat: enhance green context stream creation robustness with backward compatibility (#8136) Peng Zhang 2025-07-18 17:45:03 +08:00
  • d0510f08fe Revert "Fix different device type adjustment in PP" (#8141) Sai Enduri 2025-07-18 01:12:11 -07:00
  • 9d33fcfb8e Hicache Storage Layer Prototype (#7704) Zhiqiang Xie 2025-07-18 00:20:19 -07:00
  • 7891bac16b [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (#7820) jianan-gu 2025-07-18 13:03:56 +08:00
  • 48c1fa7bb6 [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input" (#7889) jianan-gu 2025-07-18 12:43:25 +08:00
  • 8aa5ae6b04 load draft model fix (#7506) yilian49 2025-07-18 00:41:32 -04:00
  • 8a32355704 Feat: Support Granite 3.0 MoE in SGLang (#7959) Minglei Zhu 2025-07-17 20:56:03 -07:00
  • 6e92da8fca [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (#8127) Qi Yuhang 2025-07-18 11:49:36 +08:00
  • e1020dc588 refactor: simplify MultimodalTokens logic (#7924) Mick 2025-07-18 08:59:15 +08:00
  • 3586b4cef2 feat: add production metric for retracted requests due to insufficient kvcache (#7030) Zhao Chen 2025-07-18 02:59:05 +08:00
  • 4296021499 [Hunyuan]: Fix Dense Model Support (#8117) Asher 2025-07-18 01:00:11 +08:00
  • 01857fab61 fix: update HostKVCache init to report correct msg when available memory is not enough (#8102) Ziqi Fan 2025-07-17 06:24:34 -07:00
  • 519ff5c8e6 Super tiny fix typo (#8046) fzyzcjy 2025-07-17 21:15:51 +08:00
  • af1cc8fe2d [kernel] opt moe align block kernel by block/warp scan algorithm (#7884) Yuan Luo 2025-07-17 19:33:02 +08:00
  • 49b8777460 Refactor: move all quantization-related code to srt/layer/quantization (#7989) Cheng Wan 2025-07-17 00:47:07 -07:00
  • 02404a1e35 [ci] recover 8-gpu deepep test (#8105) Cheng Wan 2025-07-17 00:46:40 -07:00
  • 5c08a36cbf [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (#8110) hzh0425 2025-07-17 12:33:29 +08:00
  • 9069884b51 [ci] disable memory imbalance check for draft worker (#8108) Cheng Wan 2025-07-16 20:41:47 -07:00
  • 8a7a7770e5 [ci] limit cmake build nproc (#8100) Simo Lin 2025-07-16 18:09:28 -07:00
  • 795668dc73 feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (#7597) Yingchun Lai 2025-07-17 08:55:59 +08:00
  • 4395c87a9b refactor: unify names of the feature field of MultimodalDataItem (#8075) Mick 2025-07-17 08:52:38 +08:00
  • c28ad1990d [1/n] chore: decouple quantization implementation from vLLM dependency (#7992) Peng Zhang 2025-07-17 06:56:26 +08:00
  • 570d33437b [Feature] Layer-wise Prefill (#7634) Xiaoze Fan 2025-07-17 01:57:46 +08:00
  • d9eb5efc71 [misc] update nvshmem and pin deepEP commit hash (#8098) Simo Lin 2025-07-16 08:54:55 -07:00
  • 6dc4af4937 fix greenctx stream compatibility (#8090) Peng Zhang 2025-07-16 22:08:46 +08:00
  • b188a89a5d Fix CI xeon test with triton 3.3.1 (#8086) YanbingJiang 2025-07-16 17:12:23 +08:00
  • 497efe747d Revert "feat: replace Decord with video_reader-rs" (#8077) Mick 2025-07-16 11:04:56 +08:00
  • 69f453e5a4 Use device_group for all_gather when disabling overlap scheduling (#8001) Qiaolin Yu 2025-07-15 19:38:58 -07:00
  • 3bc43c683e Fix different device type adjustment in PP (#7760) Qiaolin Yu 2025-07-15 19:37:14 -07:00
  • 7498522f7d update transformers to 4.53.2 (#8029) Xinyuan Tong 2025-07-15 18:24:39 -07:00
  • 194841e329 remove kv_a.contiguous in DeepseekV2AttentionMLA (#8058) strgrb 2025-07-16 09:20:41 +08:00
  • ebff5fcb06 feat: replace Decord with video_reader-rs (#5163) kozo 2025-07-16 09:17:34 +08:00
  • f06bd210c0 Update amd docker image. (#8045) Sai Enduri 2025-07-15 15:09:56 -07:00
  • 14f1f1514b H20 tune config for Kimi (#8047) yhang 2025-07-16 04:48:31 +08:00
  • 38216cf049 concurrently load weights of DeepseekV2ForCausalLM (#7943) Albert 2025-07-16 04:41:19 +08:00
  • 4a8837950a fix: resolve arm build issue (#8052) Yineng Zhang 2025-07-15 01:19:34 -07:00
  • f1f1d1d40d Fix the input tools format and history tool_calls in OpenAI API (#6556) jiawei 2025-07-15 15:58:55 +08:00
  • 9120e83d03 fix: remove redundant rotary embedding cache recomputation in MiniCPM (#8022) Xinyuan Tong 2025-07-15 00:12:45 -07:00
  • 6e923dbd30 feat: update multimodal data handling in engine entrypoint (#8002) Xinyuan Tong 2025-07-15 00:12:22 -07:00
  • c268c11c71 [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (#8023) Qi Yuhang 2025-07-15 15:02:44 +08:00
  • e6d5988442 Update CODEOWNERS (#8044) Chang Su 2025-07-14 23:38:15 -07:00
  • 9b560c3e1c fix: modality length mismatch with image_data (#7887) 杨睿 2025-07-15 14:27:54 +08:00
  • 5dc5866e8e Setup workflow for releasing mi300x and mi350x dockers. (#8035) Sai Enduri 2025-07-14 21:51:43 -07:00
  • 64e78bb31b prevent server crash from potential invalid grammar (#7897) ehuaa 2025-07-15 11:21:45 +08:00
  • 7c39e8a198 Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (#7982) hzh0425 2025-07-15 05:57:10 +08:00
  • d969504d9a Fix flaky CI: test_vlm_models (#8006) Lifu Huang 2025-07-14 14:56:41 -07:00
  • 1ebec1a8b0 [Feature] CUDA Green Context Support (#7649) ykcombat 2025-07-15 02:49:16 +08:00
  • d4d0c7c367 [Feature]TP Group Switching for PD-Multiplexing (#7653) ykcombat 2025-07-15 02:35:46 +08:00
  • 8d2cf38c79 [Minor] Remove redundant print (#8005) Lianmin Zheng 2025-07-14 10:55:13 -07:00
  • 2117f82def [ci] CI supports use cached models (#7874) Hank Han 2025-07-14 19:42:21 +08:00
  • c07f647c9f perf: add kimi k2 fused_moe tuning config for h30_3e (#8021) Yusong Gao 2025-07-14 17:56:11 +08:00
  • 07452cbe8e [CPU] fix no attribute 'can_fuse_mlp_allreduce' error (#8010) Chunyuan WU 2025-07-14 16:32:43 +08:00
  • a562c8a35c [Dockerfile] Multi-arch support for ROCm (#7902) mqhc2020 2025-07-14 14:13:09 +08:00
  • cb736df854 Support for Phi-1.5 & Phi-2 models (#7862) Praneth Paruchuri 2025-07-14 07:13:40 +05:30
  • e2ed9d049a Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (#7844) Lifu Huang 2025-07-13 18:36:01 -07:00
  • b5dd5e8741 chore: remove unnecessary limits on quantization methods in test script (#7997) Peng Zhang 2025-07-14 07:11:49 +08:00
  • 9379da77de SWA Prefix Cache (#7367) Hanming Lu 2025-07-13 12:31:07 -07:00
  • 0c55cbcfc5 [BugFix] add verify logit_bias to avoid crash because of IndexError (#7749) ehuaa 2025-07-14 02:44:12 +08:00
  • c46e069d34 Tiny fix mooncake log warning wrong output (#7952) fzyzcjy 2025-07-13 12:22:44 +08:00
  • 42fc44100a [minor] Add server_args check for Llama4 with hybrid (#7988) Ying Sheng 2025-07-12 20:13:40 -07:00
  • 5f6756b038 [BugFix] fix pre_reorder_triton_kernel default int32 issue (#7814) Morpheus Guo 2025-07-13 04:42:36 +08:00
  • 98aa836bbf Overlap the gating function with shared experts in DeepSeek (#7978) Cheng Wan 2025-07-12 13:41:50 -07:00
  • 22bd857cb5 docs: update README (#7985) Yineng Zhang 2025-07-12 13:31:11 -07:00
  • ccfa084125 [script] update loogle test (#7975) Ying Sheng 2025-07-12 00:06:17 -07:00
  • bcc5ba94b4 [minor fix] SWA missing methods (#7972) Ying Sheng 2025-07-11 23:57:02 -07:00
  • cee9f329c4 [minor fix] llama4 hybrid memory (#7950) Ying Sheng 2025-07-11 23:11:36 -07:00
  • eb118d88c4 chore: bump v0.4.9.post2 (#7963) Yineng Zhang 2025-07-11 21:11:20 -07:00
  • 732fc8e405 chore: upgrade sgl-kernel 0.2.5 (#7971) Yineng Zhang 2025-07-11 20:35:06 -07:00
  • f2d5c4920e [router] add worker abstraction (#7960) Simo Lin 2025-07-11 20:17:48 -07:00
  • 2a2d3478af Fix wrong gemm branch cause 250us slower (#7969) fzyzcjy 2025-07-12 10:45:09 +08:00
  • aa2056091a delete useless code caused by fused allreduce+add_rmsnorm PR (#7970) Xiaoyu Zhang 2025-07-12 10:43:38 +08:00