Commit Graph

  • 094891c01a fix: Use is not None instead of != None for None checks. (#5687) vzed 2025-04-26 22:26:57 -04:00
  • a21ef36352 support for the DeepSeek model by enabling streaming response parsing (#5592) Frankey_8080 2025-04-27 09:59:31 +08:00
  • 3c4dc38a9a [fix] fix bench_one_batch_server (#5607) JieXin Liang 2025-04-27 09:49:45 +08:00
  • d8fbc7c096 [feature] support for roberta embedding models (#5730) DavidBao 2025-04-27 09:47:06 +08:00
  • c5e1026f47 Update amd docker image to sglang:v0.4.5.post3-rocm630. (#5697) saienduri 2025-04-26 18:46:57 -07:00
  • 799c4bb502 Fuse MLA set kv cache kernel (#5748) Ke Bao 2025-04-27 09:42:22 +08:00
  • 02723e1b0d CI: rewrite test_vision_chunked_prefill to speedup (#5682) Mick 2025-04-27 10:33:13 +09:00
  • df2cf583ce we fix the non existent access of decrypted_config_file (#5685) vzed 2025-04-26 21:32:37 -04:00
  • 133ded039a perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (#5716) saltyfish66 2025-04-27 09:15:07 +08:00
  • f87a6ab359 Resolves the 404 Not Found error when running compile_deep_gemm.py in multi-node setups (#5720) Yuhong Guo 2025-04-27 09:13:13 +08:00
  • eebfdb9459 [fix] fix potential bumpy throughput with deepgemm (#5722) JieXin Liang 2025-04-27 09:12:48 +08:00
  • dfb322642f Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (#5728) Wenxuan Tan 2025-04-26 20:11:09 -05:00
  • 63c13a2c73 fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 (#5733) Kyungmin Lee 2025-04-27 10:10:23 +09:00
  • 4d1e52abea Add an assertion to enhance the robustness of the operator (#5736) liwenju0 2025-04-27 09:09:12 +08:00
  • 155890e4d1 [Minor] fix documentations (#5756) Lianmin Zheng 2025-04-26 17:48:43 -07:00
  • 1f963d7f64 Bugfix for minicpmo vision test (#5760) Yi Zhang 2025-04-26 23:18:02 +08:00
  • 04d0123fd9 [Fix]: support deepseek-vl2-tiny model (#5552) ZXN 2025-04-26 17:52:53 +08:00
  • feda9b11b3 fix: fix one more bug from merging mm_inputs (#5718) Mick 2025-04-26 09:28:33 +09:00
  • c3948ba67e Reorder loop in shared expert weight loading (#5719) Ke Bao 2025-04-26 08:27:42 +08:00
  • 269c457e05 [Docs] Update runtime/engine/readme.md (#5737) Michael Yao 2025-04-26 07:39:29 +08:00
  • 18ce468d56 update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. (#5740) Xiaoyu Zhang 2025-04-26 07:24:59 +08:00
  • 21514ff5bd Disable flaky eagle tests (#5753) Lianmin Zheng 2025-04-25 15:54:39 -07:00
  • 5641a09458 Revert "[Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct)" (#5754) Lianmin Zheng 2025-04-25 15:50:28 -07:00
  • 3dd3538c18 Pin torch audio to 2.6.0 (#5750) Lianmin Zheng 2025-04-25 15:06:28 -07:00
  • 93c6fb12c7 Fix: deepseek forward absorb (#5723) michael-amd 2025-04-25 13:48:55 -07:00
  • 11e27d0926 [PD]: Support Multi Prefill in one node (#5704) IAN 2025-04-26 00:30:47 +08:00
  • 50eda8398e [PD] Add kvargs table and thread pool for kvcache sender of mooncake (#5738) shangmingc 2025-04-25 18:15:01 +08:00
  • c55550cbf0 [PD] Better logs (#5715) Liangsheng Yin 2025-04-25 17:25:45 +08:00
  • 43fb95c2fa [Model] Support ArcticForCausalLM architecture (Snowflake/snowflake-arctic-instruct) (#5078) Brayden Zhong 2025-04-25 03:24:09 -04:00
  • 7d9679b74d Add MMMU benchmark results (#4491) Ravi Theja 2025-04-25 12:53:53 +05:30
  • b5be56944b [Doc] Fix a link to Weilin Zhao (#5706) Michael Yao 2025-04-25 02:02:27 +08:00
  • d2b8d0b8d8 Add example to use sgl engine with fastapi (#5648) Ravi Theja 2025-04-24 21:27:05 +05:30
  • a14654dd68 Fix weight loading bug for Deepseek v3+nextn (#5684) Baizhou Zhang 2025-04-24 06:29:56 -07:00
  • 5d93a950ee [BugFix] Fix combination of MTP and --n-share-experts-fusion with R1 (#5707) Yuhong Guo 2025-04-24 21:13:51 +08:00
  • c998d04b46 vlm: enable radix cache for qwen-vl models (#5349) Mick 2025-04-24 12:35:05 +09:00
  • 7d0edf3cae chore: bump sgl-kernel 0.1.0 (#5688) Yineng Zhang 2025-04-23 14:23:59 -07:00
  • ce4ecba477 fix: only compile ApplyTokenBitmaskInplace cu124+ (#5686) Yineng Zhang 2025-04-23 14:17:42 -07:00
  • b1f6d89b5f fix: update truss bench_serving (#5683) Yineng Zhang 2025-04-23 13:28:35 -07:00
  • 7c99103f4c [Doc] Fix two 404 links caused by sglang typo (#5667) Michael Yao 2025-04-23 23:21:55 +08:00
  • de071366cd tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py (#5677) Lianmin Zheng 2025-04-23 05:31:17 -07:00
  • e0673969b9 [PD] Add support for dp attention with mooncake (#5530) shangmingc 2025-04-23 17:20:27 +08:00
  • 127ff8982e fix torchvision::nms not exist (#5671) Yineng Zhang 2025-04-23 02:17:21 -07:00
  • 8777a1d24b fix gemma3 unit test (#5670) Yineng Zhang 2025-04-23 02:14:01 -07:00
  • 711efe7814 Integrating PD disaggregation with DP attention and DeepEP (#5435) Cheng Wan 2025-04-23 01:46:01 -07:00
  • fbb5f229d4 fix awq_dequantize import (#5669) Yineng Zhang 2025-04-23 01:36:26 -07:00
  • 15fabcc07f fix sgl-kernel unit tests (#5666) Yineng Zhang 2025-04-23 01:18:30 -07:00
  • e62c49557d [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell (#5281) Elfie Guo 2025-04-22 22:28:20 -07:00
  • 71d1785f2d Remove unnecessary torch.full in DeepSeek (#5601) fzyzcjy 2025-04-23 12:24:29 +08:00
  • 3f87f83116 Fuse q_a_proj and kv_a_proj (#5619) Baizhou Zhang 2025-04-22 20:35:08 -07:00
  • ce5412b62e Turn on DeepGemm By Default and Update Doc (#5628) Baizhou Zhang 2025-04-22 16:10:08 -07:00
  • 7282ab741a fix: update bench_speculative (#5649) Yineng Zhang 2025-04-22 16:08:15 -07:00
  • b0feda090c Revert "Support aiter RMSNorm in AMD" (#5646) HAI 2025-04-22 15:20:24 -07:00
  • 6b6e748775 Remove q concat in FA3 backend for DeepSeek decode (#5638) Ke Bao 2025-04-23 02:43:12 +08:00
  • 917324862e [fix] reduce dp capture bs (#5634) JieXin Liang 2025-04-23 02:08:45 +08:00
  • 2ed96c7a8a fix flashmla bug (#5272) lukec 2025-04-23 01:36:23 +08:00
  • 2aa3f5e2d0 [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (#5641) saltyfish66 2025-04-23 00:33:13 +08:00
  • 76d17c7ecb Fix shared experts fusion error without quantization (#5632) lambert0312 2025-04-23 00:22:26 +08:00
  • 70d040f904 [NFC] Remove duplicate compressed-tensors (#5640) Connector Switch 2025-04-23 00:10:25 +08:00
  • 4418f599a5 Fix FA3 DeepSeek prefill performance regression (#5624) JieXin Liang 2025-04-22 16:41:41 +08:00
  • 04f2abcb34 fix: gemma 3 not use softcap (#5622) Yineng Zhang 2025-04-22 01:16:08 -07:00
  • 506be6b892 [fix] fix compile_deep_gemm missing kv_b_proj (#5620) JieXin Liang 2025-04-22 15:06:36 +08:00
  • 2343d8df7d [fix] force use deepgemm in compile_deep_gemm (#5618) JieXin Liang 2025-04-22 12:36:02 +08:00
  • 92bb64bc86 [Doc] Fix a 404 link to llama-405b (#5615) Michael Yao 2025-04-22 11:39:37 +08:00
  • 11b23ae97b Remove extra copy in deepseek forward absorb (#5578) Ke Bao 2025-04-22 10:33:21 +08:00
  • b9c87e781d chore: bump v0.4.5.post3 (#5611) Yineng Zhang 2025-04-21 18:16:20 -07:00
  • 968ef51562 Support aiter RMSNorm in AMD (#5510) michael-amd 2025-04-21 17:40:39 -07:00
  • 1343200299 Clean up mem settings (#5610) Lianmin Zheng 2025-04-21 17:19:00 -07:00
  • c2942907d5 [feature] enable pre compile jit deep_gemm (#5580) JieXin Liang 2025-04-22 07:52:53 +08:00
  • e69a219074 Enhance GPU memory settings (#5604) Liangsheng Yin 2025-04-22 06:15:00 +08:00
  • bf98d2e377 [PD] Support prefill overlap + Ensure no race condition (#5609) Byron Hsu 2025-04-21 12:12:56 -07:00
  • e65b9f21e3 [PD] Support decode overlap schedule (#5608) Byron Hsu 2025-04-21 12:06:16 -07:00
  • 4dce1cc608 [PD] Add NIXL transfer backend (#5477) Trevor Morris 2025-04-21 10:36:12 -07:00
  • deded17f38 [PD] Fix edge case and simplify large page size + chunked prefill (#5589) Byron Hsu 2025-04-21 10:27:02 -07:00
  • f29a718f63 [PD] Fix generate endpoint of min_lb for PD (#5598) shangmingc 2025-04-21 21:39:18 +08:00
  • 3f57b00a59 Support PD bootstrap fields on /v1/chat/completions endpoint (#5488) Yongtong Wu 2025-04-21 16:10:58 +08:00
  • 453d412cdb Tiny update error hint (#5037) fzyzcjy 2025-04-21 15:47:47 +08:00
  • dc86f25a57 Tiny remove duplicated code (#5021) fzyzcjy 2025-04-21 15:47:32 +08:00
  • 08289eaa3e Support o1 model on Azure (#4980) Chuyue Sun 2025-04-21 00:46:09 -07:00
  • 3b6d539f63 [Fix] Enhance DP Attention for IPv6 Compatibility (#4937) Lucius 2025-04-21 15:44:11 +08:00
  • 57131dd955 [Feat.] Enable grafana to show metrics (#4718) Huapeng Zhou 2025-04-21 15:43:42 +08:00
  • a7591ecf69 Update protocol.py to fix #4589 (#4590) moontidef 2025-04-21 15:42:50 +08:00
  • c44f2869c9 Modify metrics service endpoint (#3443) lambert0312 2025-04-21 15:35:38 +08:00
  • 685d8980c3 Tiny add warning when cannot recognize bool env var (#5348) fzyzcjy 2025-04-21 14:11:29 +08:00
  • 70645f4d7d upstream hicache fixes (#5570) Zhiqiang Xie 2025-04-20 23:08:30 -07:00
  • 188f0955fa Add Speculative Decoding Eagle3 topk > 1 (#5318) Qingquan Song 2025-04-20 22:58:28 -07:00
  • eef9433b46 Fix flush cache (#5590) Lianmin Zheng 2025-04-20 22:56:40 -07:00
  • 97cb762bb6 [misc] remove is_cuda_available (#5319) JieXin Liang 2025-04-21 09:16:51 +08:00
  • 1195182040 Tiny add Engine.flush_cache API (#5241) fzyzcjy 2025-04-21 09:15:03 +08:00
  • 5239d79568 Speedup shared expert weight construction by avoid cloning (#5188) fzyzcjy 2025-04-21 09:12:01 +08:00
  • f08154193c Perform Batch Tokenization. (#5141) Sundara Raman Ramachandran 2025-04-20 18:10:37 -07:00
  • 2b3bdc938e Correct grafana heatmap. (#5019) mac0ne 2025-04-21 08:58:56 +08:00
  • 5fc4b6004e Add sanity check for max_running_requests (#5016) fzyzcjy 2025-04-21 08:56:49 +08:00
  • b868526d94 Fix one more issue reported by torchfix (#4859) Brayden Zhong 2025-04-20 20:49:27 -04:00
  • 502524e2da compressed_tensors: port w8a16 fp8 from vllm (#4852) Juwan Yoo 2025-04-21 09:48:31 +09:00
  • 4c7640079c check marlin format before attempting conversion (#4675) Enrique Shockwave 2025-04-21 01:47:09 +01:00
  • 9f3bd2ad39 Feat: Implement JSON Mode (response_format.type="json_object") (#4733) kyle-pena-kuzco 2025-04-20 20:41:22 -04:00
  • 8de53da989 smaller and non gated models for docs (#5378) simveit 2025-04-21 02:38:25 +02:00
  • fac17acf08 add function call parser for DeepSeek V3 (#5224) Yi Zhou 2025-04-21 08:38:08 +08:00
  • 8b39274e34 [Feature] Prefill assistant response - add continue_final_message parameter (#4226) Adarsh Shirawalmath 2025-04-21 06:07:18 +05:30
  • 5156d5a413 Add test config yamls for Deepseek v3 (#5433) Baizhou Zhang 2025-04-20 17:28:52 -07:00