Commit Graph

  • 7151194bb2 Remove cumsum_buffer initialization (#7439) Ke Bao 2025-06-24 17:22:16 +08:00
  • 2ed68d7a6c [PD Disaggregation] replace transfer with batch transfer for better performance (#7236) Francis 2025-06-24 17:12:04 +08:00
  • e984d5073b enable aiter_biased_grouped_topk kernel (#7423) valarLip 2025-06-24 17:09:42 +08:00
  • 755f314785 [AMD] add aiter fused moe in DeepEP path (#7268) Alex Sun 2025-06-24 17:05:47 +08:00
  • 7c3a12c000 chore: bump v0.4.8 (#7493) Yineng Zhang 2025-06-23 23:14:22 -07:00
  • e846d95ef6 chore: bump sgl-kernel v0.2.0 (#7490) Yineng Zhang 2025-06-23 22:29:50 -07:00
  • 15f3401343 Fix MTP with Deepseek R1 Fp4 (#7376) Charles Chen 2025-06-23 21:38:07 -07:00
  • d04163b3fa Fix RequestValidationError response format (#7487) Chang Su 2025-06-23 20:35:11 -07:00
  • 7732bbe458 bugfix: Prevent global mutation of conv.stop_str across requests (#7347) huangtingwei 2025-06-24 10:36:23 +08:00
  • ed0a0b692c Performance: Enable cuda graph for dp idle batch (#7269) u4lr451 2025-06-24 08:34:13 +08:00
  • fa42e41962 ci: Revert openai_server related tests in AMD suites (#7449) Chang Su 2025-06-23 15:28:22 -07:00
  • e5afb88b1c Support weight loading without mmap (#7469) Yuhong Guo 2025-06-24 06:13:59 +08:00
  • e5ddeb04d5 Quick fix for DeepGemm requant to also cover MTP. (#7378) Charles Chen 2025-06-23 12:08:54 -07:00
  • bdbb8d009a [perf] slightly improve DeepSeek-R1-FP4 TP8 (#7481) JieXin Liang 2025-06-24 03:05:30 +08:00
  • 34c3f9b2d3 kvcache io kernels and test case (#7382) Zhiqiang Xie 2025-06-23 11:58:59 -07:00
  • 76139bfba0 update mooncake in dockerfile (#7480) Liangsheng Yin 2025-06-24 02:29:30 +08:00
  • f8d48fd311 Fix dtype for idle input in spec decoding (#7456) Cheng Wan 2025-06-23 11:23:25 -07:00
  • 34b6b8426f feat(func_call): Add more check in BaseFormatDetector.parse_streaming_increment (#7479) Chang Su 2025-06-23 11:15:47 -07:00
  • 25549433e8 Fix prefill OOM due to wrong token calculation when page > 1 (#7397) Liangsheng Yin 2025-06-24 02:12:29 +08:00
  • d6dddc19ff [CI] Upgrade mooncake to 0.3.4.post1 to fix 8 gpu tests (#7472) Shangming Cai 2025-06-24 02:10:50 +08:00
  • 55e03b10c4 Fix a bug in BatchTokenIDOut & Misc style and dependency updates (#7457) Lianmin Zheng 2025-06-23 06:20:39 -07:00
  • 8aa68ed5c4 Solve docker build failed in the virtual machine (#7290) kk 2025-06-23 17:10:30 +08:00
  • 506c4928f5 feat: integrate deepgemm into EPMoE (#6821) xutizhou 2025-06-23 16:38:58 +08:00
  • 30ceccc74a Update hyperparameter_tuning.md (#7454) Lianmin Zheng 2025-06-22 22:42:55 -07:00
  • ac5010e0ba Fix CUDA Graph Check under Deepep with DP FFN (#7451) Cheng Wan 2025-06-22 20:35:58 -07:00
  • 3cee035e99 add fused moe config for qwen3 in triton3.3.1 (#7445) Yi Zhang 2025-06-23 09:29:25 +08:00
  • 30f2a44a96 [misc] Add PD service discovery support in router (#7361) Simo Lin 2025-06-22 17:54:14 -07:00
  • bd4f581896 Fix torch compile run (#7391) kk 2025-06-23 06:33:09 +08:00
  • 50f1b6d6b1 Remove copy after bmm (#7441) Ke Bao 2025-06-23 06:07:44 +08:00
  • 5962e70d8d FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327) Trevor Morris 2025-06-22 13:38:47 -07:00
  • edc21cc8ae Tiny add logging for GC (#7406) fzyzcjy 2025-06-22 12:40:02 +08:00
  • 05c9bc8956 [minor] simplify the TokenToKVPoolAllocator (#7414) Liangsheng Yin 2025-06-22 12:37:18 +08:00
  • b7a2df0a44 refactor(test): reorganize OpenAI test file structure (#7408) Chang Su 2025-06-21 19:37:48 -07:00
  • 1998ce4046 Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (#7412) Lifu Huang 2025-06-21 16:09:19 -07:00
  • 72676cd6c0 feat(oai refactor): Replace openai_api with entrypoints/openai (#7351) Chang Su 2025-06-21 13:21:06 -07:00
  • 02bf31ef29 [fix] PD disaggregation when enable mtp and tp!=dp (#7420) Atream 2025-06-22 03:03:11 +08:00
  • 5ea5d22170 Fix CPU offloading for MLA memory pool (#7409) Liangsheng Yin 2025-06-22 02:39:05 +08:00
  • fdfd5224bf fix: Fix CI test_function_call_parser.py (#7425) Chang Su 2025-06-21 09:25:08 -07:00
  • 7f3ee861ae fix overlap pagecount (#6984) pansicheng 2025-06-21 15:34:45 +08:00
  • bec5891083 [BugFix]fix qwen25 invoke function call streaming responses with curly braces as the starting indicator (#7394) ehuaa 2025-06-21 13:50:08 +08:00
  • 9edf6608c9 [BugFix]: fix EmbeddingReqInput single input error (#7396) woodx 2025-06-21 11:35:21 +08:00
  • ab74f8f09d Remove batches api in docs & example (#7400) Jinn 2025-06-20 21:46:31 -05:00
  • 5e7fdc79fa [OAI Server Refactor] [ChatCompletions & Completions] Support Return Hidden State (#7329) Keyang Ru 2025-06-20 19:18:53 -07:00
  • 4d8d9b8efd chore: upgrade mooncake-transfer-engine 0.3.4 (#7401) Yineng Zhang 2025-06-20 16:38:54 -07:00
  • 5041df2d01 Fix 7285 Merge Conflicts (#7403) Cheng Wan 2025-06-20 16:02:50 -07:00
  • 256801e973 Update usage_processor.py (#7402) Cheng Wan 2025-06-20 15:55:38 -07:00
  • 73b13e69b4 Optimize DP attn scheduling for speculative decoding (#7285) Cheng Wan 2025-06-20 15:06:41 -07:00
  • 8609e637a9 Fix All-Gather under world size one (#7219) Cheng Wan 2025-06-20 14:57:34 -07:00
  • dea2b84bc3 [OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo Processor (#7360) yhyang201 2025-06-21 05:51:21 +08:00
  • cfb2fb5afc [OAI refactor] Add rerank and score serving (#7399) woodx 2025-06-21 05:51:10 +08:00
  • 22bfed7509 [DeepSeekNextN] fix: residual of head norm can be None (#7398) Cheng Wan 2025-06-20 14:45:16 -07:00
  • e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) Cheng Wan 2025-06-20 14:43:11 -07:00
  • 0998808009 Refine OpenAI serving entrypoint to remove batch requests (#7372) Xinyuan Tong 2025-06-20 14:33:43 -07:00
  • 794be55af2 Support THUDM/GLM-4-0414 (GLM-Z1) Glm4ForCausalLM architecture. (#5485) Wenbo Yang 2025-06-21 01:07:43 +08:00
  • 187b85b7f3 [PD] Optimize custom mem pool usage and bump mooncake version (#7393) Shangming Cai 2025-06-21 00:50:39 +08:00
  • ceba0ce4f6 support return logprobs for pipeline (#7356) strgrb 2025-06-20 14:50:45 +08:00
  • 1ab6be1b26 Purge VerlEngine (#7326) Ata Fatahi 2025-06-19 23:47:21 -07:00
  • 4df5fc2156 Feat/refactor embedding server (#7322) woodx 2025-06-20 14:46:01 +08:00
  • a06912ad8b Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization (#7371) Li Hui 2025-06-20 12:58:00 +08:00
  • 97011abc8a [Doc] add embedding rerank doc (#7364) woodx 2025-06-20 12:53:54 +08:00
  • 1d6515ef2a [Bugfix]Fix hang bug using dp attention with HiRadixCache (#7159) Huang Long 2025-06-20 11:34:36 +08:00
  • dea8aa7ab8 Clean unused import for mimo mtp model (#7370) Li Hui 2025-06-20 10:16:14 +08:00
  • 906dbc34f1 [Docker] optimize dockerfile remove deepep and blackwell merge it to… (#7343) ybyang 2025-06-20 08:42:40 +08:00
  • fadf18fdd5 docs: update installation (#7366) Yineng Zhang 2025-06-19 12:00:19 -07:00
  • f88e70853e [Bugfix][PD] Set conclude state before clear when failure happens (#7362) Shangming Cai 2025-06-20 02:26:53 +08:00
  • 4f838c09cd [PD] Transfer hidden states for mtp when disaggregation (#7242) Atream 2025-06-20 02:22:47 +08:00
  • d20a073bc3 Put _normalize_rid before other normalization in io_struct (#7363) Chang Su 2025-06-19 11:21:25 -07:00
  • 47367b768d [Refactor] Clean up radix cache related API (#7303) DarkSharpness 2025-06-19 09:58:48 -07:00
  • 650127a173 Reintroduce tiny fix sampler error when prob is not contiguous (#7354) fzyzcjy 2025-06-20 00:39:34 +08:00
  • 3774f07825 Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099) Stefan He 2025-06-19 00:56:37 -07:00
  • 9179ea1595 add seed in CPU UTs to avoid flaky failure (#7333) Chunyuan WU 2025-06-19 10:12:14 +08:00
  • 20a503c7d1 fix: resolve blackwell deepep image issue (#7331) Yineng Zhang 2025-06-18 17:04:02 -07:00
  • ffd1a26e09 Add more refactored openai test & in CI (#7284) Jinn 2025-06-18 15:52:55 -05:00
  • 09ae5b20f3 Merge PDLB (Prefill-Decode Load Balancer) into SGLang Router (#7096) Simo Lin 2025-06-18 11:28:15 -07:00
  • 712bf9ec9b [pd] optimize dockerfile for pd disaggregation (#7319) ybyang 2025-06-19 02:26:41 +08:00
  • 9c6a0656a3 Fix profiler error when there are idle passes (#7003) fzyzcjy 2025-06-19 01:55:01 +08:00
  • 2ae809c5c1 Fix mini_lb for PD with long output: limit chunk size of decode response (#7301) ch-tiger1 2025-06-19 01:46:46 +08:00
  • 1de4db9bef update invalid link in doc (#7297) linzhuo 2025-06-18 16:37:36 +08:00
  • 31fccf5a4f chore: change logs from INFO to DEBUG for dp and add force quit for tokenizer manager (#7251) ishandhanani 2025-06-18 01:36:43 -07:00
  • b783c1cb82 Fix hicache benchmark script bug - some sampled input_request is [] (#7300) Binyao Jiang 2025-06-17 23:47:11 -07:00
  • 094c116f7d Update python API of activation, topk, norm and rope and remove vllm dependency (#6614) YanbingJiang 2025-06-18 13:11:50 +08:00
  • e56685ac1b Upstreaming hicache bug fixes (#7267) Zhiqiang Xie 2025-06-17 17:44:57 -07:00
  • c26d7349d3 [PD] Add custom memory pool option to support Mooncake PD with NVLink (#7264) shangmingc 2025-06-18 08:21:37 +08:00
  • ceaa85c9e6 [PD] Support get local ip from NIC for PD disaggregation (#7237) shangmingc 2025-06-18 08:19:26 +08:00
  • 0650e5176f fix: only enable flash_attn test on sm80 sm90 (#7289) Yineng Zhang 2025-06-17 16:56:41 -07:00
  • fc554105f6 ci: Fix test_ebnf_generate_all_optional_function_params (#7288) Chang Su 2025-06-17 16:39:42 -07:00
  • 4f204db57c fix: resolve b200 dsv3 mtp issue (#7286) Yineng Zhang 2025-06-17 16:22:46 -07:00
  • 3eb4a800e8 Fix AWQ Dequant and Weight Loading of deepseek v2 (#6842) AniZpZ 2025-06-18 04:45:10 +08:00
  • e726131523 bugfix(tool call ebnf): Fix EBNF generation for optional function parameters (#7283) Chang Su 2025-06-17 13:36:07 -07:00
  • 8c16da334e Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. (#7164) Charles Chen 2025-06-17 11:26:23 -07:00
  • a39d928782 support qwen2 running on ascend npu device (#7022) Yijie Zhu 2025-06-18 02:24:10 +08:00
  • 10d60cd41b feat: mtp support dp-attention (#6081) u4lr451 2025-06-17 15:33:28 +08:00
  • 8a10c4c3d9 update ci node for xeon (#7265) DiweiSun 2025-06-17 14:44:08 +08:00
  • 405780bcf0 [amd] Opt dsv3 moe (#7160) kk 2025-06-17 13:26:51 +08:00
  • 1dffee31ac OAI Server Skeleton & Core Utility Endpoints (#7179) yhyang201 2025-06-17 11:45:55 +08:00
  • 70c471a868 [Refactor] OAI Server components (#7167) Xinyuan Tong 2025-06-16 20:45:20 -07:00
  • 1a9c2c9214 Fix AMD speculative decoding (#7252) Lianmin Zheng 2025-06-16 17:01:33 -07:00
  • 873ae12cee support custom weight loader for model runner (#7122) KavioYu 2025-06-17 07:28:15 +08:00
  • c64290dcb5 Use seq_len_fill_value in the cuda graph runners (#7233) Lianmin Zheng 2025-06-16 15:57:07 -07:00
  • 8e2363dc15 fix amd EP MoE FP8 issue (#7125) Alex Sun 2025-06-17 06:29:45 +08:00