Commit Graph

  • f19613e6c3 Dedicated toml files for CPU/XPU (#10734) Zaili Wang 2025-10-10 15:44:55 +08:00
  • 8df4945559 fix file and object naming scheme in HiCacheNixl to avoid data corruption (#10969) ziruiliu 2025-10-10 15:23:10 +08:00
  • ee3bd8a1c8 feat(hicache): Support passing prefix keys for l3 store. (#9045) hzh0425 2025-10-10 15:22:05 +08:00
  • d8467db727 fix: reinstall torch in deps install (#11414) Yineng Zhang 2025-10-09 22:58:18 -07:00
  • b5044fbf12 Replace pad with cat for better performance (#11388) Yuan Luo 2025-10-10 12:03:17 +08:00
  • 70fbb3adf6 [CI] Refactor PD disaggregation test suite (#11363) Shangming Cai 2025-10-10 09:50:39 +08:00
  • 9a7e7a6576 [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded (#11365) Glen Liu 2025-10-09 21:43:03 -04:00
  • 0fe87213bb fix: fix gpu-proc affinity set incorrectly when pp_size > 1 (#11389) Yingchun Lai 2025-10-10 09:40:05 +08:00
  • 1f106ee365 [grammar] Avoid server crash when grammar backend is None (#11401) Xinyuan Tong 2025-10-09 18:38:10 -07:00
  • 9b8ebb2798 move more files under srt/utils (#11285) Lianmin Zheng 2025-10-09 16:46:15 -07:00
  • 758b887ad1 chore: bump SGLang version to 0.5.3.post1 (#11324) sglang-bot 2025-10-09 15:19:59 -07:00
  • eb7d9261c0 [router] conversation item API: create, retrieve and delete (#11369) Keyang Ru 2025-10-09 14:43:16 -07:00
  • 44cb060785 chore: upgrade flashinfer 0.4.0 (#11364) Yineng Zhang 2025-10-09 14:17:54 -07:00
  • 88bb627d0d [router] change grpc client from mutable to clone (#11394) Simo Lin 2025-10-09 14:00:24 -04:00
  • b520958ec8 [router][grpc] Replace fake health check with correct ones (#11387) Chang Su 2025-10-09 09:13:57 -07:00
  • fa7e2c3049 fix bench_serving mishandling of internal states (#11376) shaharmor98 2025-10-09 14:24:50 +03:00
  • 8f2cd177af add code pp support for nixl (#11375) shaharmor98 2025-10-09 14:24:32 +03:00
  • ab926dd697 [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (#11373) Chang Su 2025-10-09 03:53:23 -07:00
  • a4b424c632 [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (#11309) Trevor Morris 2025-10-08 23:59:46 -07:00
  • a0557642ea [router][lint] Add unused_qualifications to cargo lint warnings (#11366) Chang Su 2025-10-08 22:17:11 -07:00
  • 84768d1017 [router] Refactor OpenAI router: split monolithic file and move location (#11359) Keyang Ru 2025-10-08 21:46:39 -07:00
  • 368fd20622 [router][grpc] disable health check generation and increase timeout (#11353) Simo Lin 2025-10-08 22:23:08 -04:00
  • 53bd00d975 [Generative Score API] Multi-Item scoring with custom attention mask. (#10979) Sundara Raman Ramachandran 2025-10-08 18:47:32 -07:00
  • e22b13c569 [Auto Sync] Update scheduler.py (20251009) (#11350) Yineng Zhang 2025-10-08 17:39:04 -07:00
  • a3c2ea4451 fix: fix revision for sgl-flash-attn in sgl-kernel (#11327) Mick 2025-10-09 06:50:44 +08:00
  • fccac7d126 [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (#11342) Chang Su 2025-10-08 15:38:37 -07:00
  • 7ac6b900f4 [router] Support history management using conversation (#11339) Keyang Ru 2025-10-08 15:24:02 -07:00
  • a1080b72a0 [router] Fix all unused_qualifications (#11341) Chang Su 2025-10-08 13:55:27 -07:00
  • a65ca73911 [router][grpc] Cleanup debug logs in grpc_server and grpc_router (#11340) Chang Su 2025-10-08 13:26:19 -07:00
  • 677aa0e25f [router] improve reasoning parser lock and reduce req cloning (#11336) Simo Lin 2025-10-08 14:18:15 -04:00
  • 01c9ee1ab4 [router] refactor generate to use new pipeline arch (#11323) Simo Lin 2025-10-08 12:38:50 -04:00
  • d6837aea4d model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (#10909) Netanel Haber 2025-10-08 19:37:38 +03:00
  • c882b5ae75 [CI] improve disaggregation CI. (#11264) Liangsheng Yin 2025-10-08 21:40:56 +08:00
  • e3bb7f5ae6 benchmark: enhance configurable multimodal benchmarking in bench_serving (#9812) Kevin Xiang Li 2025-10-08 01:31:36 -07:00
  • 92473e2e34 Support LoRA in bench_serving oai interface (#11318) Lifu Huang 2025-10-08 01:28:58 -07:00
  • 6c0bb32711 fix(decode): adjust ServerArgs import to explicit module path (#11007) JinYan Su 2025-10-08 16:27:50 +08:00
  • 0a7c4bded7 [Doc] Update mooncake nvlink transport doc for PD disaggregation (#11321) Shangming Cai 2025-10-08 15:59:29 +08:00
  • edefab0c64 [2/2] Support MHA prefill with FlashAttention 4. (#10937) Lifu Huang 2025-10-08 00:54:20 -07:00
  • 97cd38e58d Skip weight loading in deepgemm compilation (#11312) Cheng Wan 2025-10-07 21:52:46 -07:00
  • 3c06b673af [8/N] MoE Refactor: deprecate EPMoE (#11211) Cheng Wan 2025-10-07 21:51:41 -07:00
  • 7c3f07dbcb [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (#9545) Adarsh Shirawalmath 2025-10-08 10:08:48 +05:30
  • edd86b8853 [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (#11314) Chang Su 2025-10-07 20:50:20 -07:00
  • 4b4dc132fa Rename ngram_utils -> ngram_info (#11316) Liangsheng Yin 2025-10-08 11:49:46 +08:00
  • 5a9170d993 Optimize copy_kv_cache for spec decoding (#11126) YAMY 2025-10-08 02:43:30 +00:00
  • c4d77774e1 update sampling_params documentation with defaults (#11315) Xinyuan Tong 2025-10-07 18:36:26 -07:00
  • 832c84fba9 [Chore] Update xgrammar 0.1.24 -> 0.1.25 (#10710) DarkSharpness 2025-10-08 09:22:28 +08:00
  • 64d1505c0a ci: unify the model launch method of nightly ci (#11230) Mick 2025-10-08 09:13:14 +08:00
  • f3764c26a3 Clean match_prefix and prepare_for_extend for mem cache V2 (#11200) cctry 2025-10-07 17:54:18 -07:00
  • 7ba3de0e92 [oai serving chat] Add argument --sampling-defaults and fix ChatCompletionRequest defaults (#11304) Chang Su 2025-10-07 17:36:05 -07:00
  • fde9b96392 [router] cleanup worker health check to return early (#11310) Simo Lin 2025-10-07 19:53:10 -04:00
  • f094e0a490 [router][grpc] Fix request_id extraction when n > 1 (#11311) Chang Su 2025-10-07 16:27:56 -07:00
  • 4ed67c27e3 [router] support Openai router conversation API CRUD (#11297) Keyang Ru 2025-10-07 15:31:35 -07:00
  • cd4b39a900 [quantization] Properly ignore quantization for layers excluded in quant_config (#11205) Bowen Bao 2025-10-07 14:06:05 -07:00
  • 420c99acfe [router][grpc] Fix error message format in grpc chat handler (#11307) Chang Su 2025-10-07 13:54:02 -07:00
  • e3c7f09146 Update tool parser and related documentation (#11223) Xinyuan Tong 2025-10-07 11:03:40 -07:00
  • 6f1e03a456 [router][grpc] Fix sampling_params.stop_strs is None (#11306) Chang Su 2025-10-07 10:57:38 -07:00
  • f4affd4df5 [router] fix grpc connection conversion and add optimization (#11305) Simo Lin 2025-10-07 13:39:33 -04:00
  • df08bf9b9f [Doc]: Best Practice for HICache (#11001) hzh0425 2025-10-08 00:59:21 +08:00
  • 69efdd27bc [Doc] HiCache Design Documents (#11027) ykwd 2025-10-08 00:35:45 +08:00
  • 64582caa84 [router][grpc] Refactor chat template content format detection (#11288) Chang Su 2025-10-07 08:38:51 -07:00
  • 2fcd56eaf6 [router] add get server info and get model info in grpc server (#11303) Simo Lin 2025-10-07 11:36:52 -04:00
  • 0958a39704 [Docs] [Router] Update Observability and Common Issues Section (#11302) Wenyi Xu 2025-10-07 23:03:09 +08:00
  • 4f42c8cd3e [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (#11068) Yuan Luo 2025-10-07 22:31:11 +08:00
  • 3ddd7dc9f8 Introduce future indices (#11301) Liangsheng Yin 2025-10-07 22:24:02 +08:00
  • 501dfa6b42 Remove sampling info events and overlap thread file (#11300) Liangsheng Yin 2025-10-07 21:34:25 +08:00
  • 79d3495177 [router] add reasoning and tool parser argument in router (#11290) Simo Lin 2025-10-07 09:08:32 -04:00
  • 1519a89cfd Remove overlap thread (#11210) Liangsheng Yin 2025-10-07 20:12:12 +08:00
  • 24bc3fb0f9 EAGLE cache fix for SWARadixCache (#11231) Ke Bao 2025-10-07 18:21:37 +08:00
  • 8a8a608af9 [ci] fix pp test (#11294) Liangsheng Yin 2025-10-07 14:20:04 +08:00
  • 533e58a15d Feature/longbench v2 evaluation utils (#10949) Al-Ekram Elahee Hridoy 2025-10-07 00:17:31 -06:00
  • 9b4c449735 convert test_deterministic into unit tests (#11095) Alex Chi Z 2025-10-07 05:33:11 +02:00
  • a578d300ba [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (#11283) Chang Su 2025-10-06 18:54:51 -07:00
  • 8c9670375f chore: bump sgl-kernel version to 0.3.15 (#11281) sglang-bot 2025-10-06 18:17:51 -07:00
  • fb27d38305 docs: update sgl-kernel README (#11286) Yineng Zhang 2025-10-06 17:55:22 -07:00
  • fd8a0b29c0 fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration (#11282) Xinyuan Tong 2025-10-06 17:28:23 -07:00
  • afc35ccc5e Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components (#11261) Chenxi Li 2025-10-06 17:23:00 -07:00
  • a57f0e3d56 reverse the amd ci test back to 1200s and split the 8-gpu deepseek job into two. (#11238) sunxxuns 2025-10-06 19:27:57 -04:00
  • 708f4ff490 Rename max_micro_batch_size -> pp_max_micro_batch_size (#11279) Lianmin Zheng 2025-10-06 15:50:56 -07:00
  • e2daeb351c [Auto Sync] Update test_utils.py (20251006) (#11280) Lianmin Zheng 2025-10-06 15:49:57 -07:00
  • 0e7b353009 Fix code sync scripts (#11276) Lianmin Zheng 2025-10-06 15:35:01 -07:00
  • b07c9c76c5 [router][grpc] Refine streaming processes (#11277) Chang Su 2025-10-06 15:15:01 -07:00
  • 748f86f3de [Bug] Fix incorrect assertion in FA4 and add UT. (#11182) Lifu Huang 2025-10-06 14:58:39 -07:00
  • 73ea484af1 docker: add manifest to versioned docker releases (#11268) ishandhanani 2025-10-06 14:53:40 -07:00
  • 4aeb193fbd disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 (#11274) gongwei-130 2025-10-06 14:48:31 -07:00
  • 466992b2d0 [router][tool call] Clean up redundant detect_format and has_tool_markers (#11270) Chang Su 2025-10-06 14:04:02 -07:00
  • 155cbb51f0 Enable native ModelOpt quantization support (1/3) (#7149) Zhiyu 2025-10-06 13:24:15 -07:00
  • eb30b888db Remove env var warnings for release (#11262) Lianmin Zheng 2025-10-06 10:09:17 -07:00
  • 5ee777c98f [router] add ipv6 support across all components (#11219) Simo Lin 2025-10-06 11:16:59 -04:00
  • a4a3d82393 chore: bump SGLang version to 0.5.3 (#11263) sglang-bot 2025-10-06 05:07:02 -07:00
  • 0b13cbb7c9 chore: bump SGLang version to 0.5.3rc2 (#11259) sglang-bot 2025-10-06 01:12:10 -07:00
  • efbc687c28 Support DeepSeek V3.2 Exp (#11061) fzyzcjy 2025-10-06 15:24:15 +08:00
  • 292a867ad9 Add flashmla and fast hadamard transform to Dockerfile (#11235) Baizhou Zhang 2025-10-05 21:31:28 -07:00
  • 8fd41eae93 Improve bot release workflow (#11240) Kangyan-Zhou 2025-10-05 21:28:27 -07:00
  • 0cd1996eae feat: add shortcut detection for multimodal templates in Jinja format (#11209) Xinyuan Tong 2025-10-05 21:13:17 -07:00
  • b6b4b56395 Update condition for sgl-kernel-benchmark-test (#11254) Lianmin Zheng 2025-10-05 20:55:02 -07:00
  • f8924ad74b update sgl kernel version to 0.3.14.post1 (#11242) Lianmin Zheng 2025-10-05 20:30:40 -07:00
  • 2f80bd9f0e Bump torch_memory_saver 0.0.9rc2 (#11252) fzyzcjy 2025-10-06 11:26:20 +08:00
  • 366a603e95 Use cu128 for torch audio to fix some CI tests (#11251) Lianmin Zheng 2025-10-05 19:52:32 -07:00
  • baee08601b [quantization] Enable aiter mxfp4 fused_moe for Quark (#10048) Bowen Bao 2025-10-05 19:51:34 -07:00
  • c7a104c12b [quantization] Fix scale remapping for mllama4 (#10042) Bowen Bao 2025-10-05 19:51:15 -07:00