Commit Graph

  • de2faef97e Remove extra contiguous (#5953) Ke Bao 2025-05-02 00:28:46 +08:00
  • 67b7d5b1df [PD] Vectorise group_concurrent_contiguous in NumPy (#5834) Yuan Luo 2025-05-01 22:42:37 +08:00
  • 4322c31e24 Support XiaomiMiMo/MiMo model inference (#5921) ryang 2025-05-01 22:41:13 +08:00
  • 9858113c33 chore: bump v0.4.6.post2 (#5939) Yineng Zhang 2025-04-30 22:04:40 -07:00
  • 8441baad6e fix: update model runner (#5934) Yineng Zhang 2025-04-30 19:49:26 -07:00
  • 256c4c2519 fix: correct stream response when enable_thinking is set to false (#5881) mlmz 2025-05-01 10:44:37 +08:00
  • 9f21e75453 add Thor & Spark (#5915) Johnny 2025-05-01 04:43:40 +02:00
  • 7bcd8b1cb2 Fix lora batch processing when input lora_path contains None (#5930) Qiaolin Yu 2025-04-30 22:42:42 -04:00
  • 11383cec3c [PP] Add pipeline parallelism (#5724) Ying Sheng 2025-04-30 18:18:07 -07:00
  • e97e57e699 Remove unused method calculate_num_image_tokens from qwen2_vl.py (#5783) XinyuanTong 2025-04-30 17:46:59 -07:00
  • 9a6ad8916d chore: upgrade sgl-kernel 0.1.1 (#5933) Yineng Zhang 2025-04-30 16:13:30 -07:00
  • d353d08b4e chore: bump sgl-kernel 0.1.1 (#5932) Yineng Zhang 2025-04-30 14:01:49 -07:00
  • 08acdb5c3d [Feat] Scale up fa3 kernel to sm8x arch (#5912) PGFLMG 2025-05-01 04:59:36 +08:00
  • 2afba1b1c1 Add TP2 MOE benchmarks for AMD. (#5909) Sai Enduri 2025-04-30 11:38:20 -07:00
  • e330f2b86c [qwen3] support qwen3 ep moe (#5917) laixin 2025-05-01 00:15:21 +08:00
  • 3ddf5b9d61 [Misc] use parallel build for cmake in sgl-kernel (#5919) PGFLMG 2025-04-30 23:56:46 +08:00
  • 3cff963335 [fix] kimi-vl test in test_vision_openai_server.py (#5910) JieXin Liang 2025-04-30 14:59:10 +08:00
  • d50e36a79d support vlm benchmark profile (#5905) Yi Zhang 2025-04-30 14:48:27 +08:00
  • 8fefdd32c7 [Feature] add support kimi vl model (#5383) liwenju0 2025-04-30 12:31:19 +08:00
  • 403b855a22 Add sm_120 for blackwell (#5903) zhjunqin 2025-04-30 11:45:24 +08:00
  • 1698e94e67 Add A800 fused moe config for qwen3 235b (#5900) lambert0312 2025-04-30 11:18:11 +08:00
  • 58195dd588 [Fix] Unload lora in HF_Runner if needed (#5899) Qiaolin Yu 2025-04-29 23:17:42 -04:00
  • 799789afed Bump Flashinfer to 0.2.5 (#5870) Baizhou Zhang 2025-04-29 19:50:57 -07:00
  • cc4a80caf6 [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels #134 (#5830) ybyang 2025-04-30 10:38:54 +08:00
  • 3c8a52311a Fix check_env script (#5901) lambert0312 2025-04-30 09:54:54 +08:00
  • a043f7f2ab chore: use torch 2.6 for sgl-kernel build (#5898) Yineng Zhang 2025-04-29 17:51:18 -07:00
  • e3a5304475 Add AMD MI300x Nightly Testing. (#5861) saienduri 2025-04-29 17:34:32 -07:00
  • 28b26dbf48 [Bugfix]: fix missing queue_time_start for requests from grammar_queue (#5696) Chang Su 2025-04-29 17:31:44 -07:00
  • 2b06484bd1 feat: support pythonic tool call and index in tool call streaming (#5725) Chang Su 2025-04-29 17:30:44 -07:00
  • e4b6133b78 [fix] relax mem_fraction_static for h200 (#5893) JieXin Liang 2025-04-30 08:01:12 +08:00
  • dd408ee481 Auto set draft model path for MTP (#5793) Ke Bao 2025-04-30 07:25:40 +08:00
  • 9419e75d60 [CI] Add test_function_calling.py to run_suite.py (#5896) Chang Su 2025-04-29 15:54:53 -07:00
  • 2c7dbb7cc2 [FEATURE] Enhance platform compatibility for ARM (#5746) Johnny 2025-04-30 00:06:16 +02:00
  • 9a62191ba7 chore: update CODEOWNERS (#5895) Yineng Zhang 2025-04-29 14:12:04 -07:00
  • ae523675e5 [Doc] Tables instead of bulletpoints for sampling doc (#5841) simveit 2025-04-29 22:49:39 +02:00
  • 5c08aa4958 [Docs] Update docs for Qwen3 and Qwen3MoE (#5836) Adarsh Shirawalmath 2025-04-30 02:18:30 +05:30
  • f4c191a712 chore: update Dockerfile (#5894) Yineng Zhang 2025-04-29 12:55:13 -07:00
  • 771669cbe0 [fix]: PyO3 macOS linking and consolidate on tracing for logging Simo Lin 2025-04-29 11:26:38 -07:00
  • 1468769bde [Misc] add service discovery for sgl router Simo Lin 2025-04-29 10:21:19 -07:00
  • 91dda4cd06 Add A800 fused moe config for qwen3 30b (#5880) lambert0312 2025-04-29 17:02:24 +08:00
  • 8e5a6d3441 [Fix] Fix a bug for flashmla to run R1 model (#5875) pengcuo 2025-04-29 16:03:13 +08:00
  • 8465f035d1 Add qwen3 30b fused moe config (#5859) XinyuanTong 2025-04-29 00:24:00 -07:00
  • 8c0cfca87d Feat: support cuda graph for LoRA (#4115) Qiaolin Yu 2025-04-29 02:30:44 -04:00
  • 2c3ea29476 [Feature] support auto chat template (#4949) woodx 2025-04-29 13:34:18 +08:00
  • 5bb0accbcf cutlass 3.9 supported to improve fp8_blockwise_gemm (#5820) Xiaoyu Zhang 2025-04-29 12:52:36 +08:00
  • 8d463fe351 Cutlass MLA decode - fix dtype error (#5868) Trevor Morris 2025-04-28 21:12:58 -07:00
  • 26fc32d168 [CI] tune the test order to warmup the server (#5860) Lianmin Zheng 2025-04-28 19:27:37 -07:00
  • 1cc326032d simplify fused_moe config logging (#5801) Xiaoyu Zhang 2025-04-29 08:04:54 +08:00
  • 05ee219286 Support max_completion_tokens for OpenAIChatCompletions (#5857) Chang Su 2025-04-28 13:50:13 -07:00
  • dcae1fb2cd chore: bump v0.4.6.post1 (#5845) Yineng Zhang 2025-04-28 12:57:08 -07:00
  • a0251a3fd6 add fused moe config for qwen3moe fp8/bf16 (#5849) Yi Zhang 2025-04-29 02:55:52 +08:00
  • 663037a7a0 feat: update is_fa3_default_architecture (#5854) Yineng Zhang 2025-04-28 11:53:22 -07:00
  • f4a9f60cbd [Fix] Missing bootstrap_port field (#5823) XTY 2025-04-29 02:13:04 +08:00
  • ee71ed8a41 [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (#5847) PGFLMG 2025-04-29 02:03:17 +08:00
  • d364b9b0f2 ROCm: update AITER (#5816) HAI 2025-04-28 11:01:20 -07:00
  • 849c83a0c0 [CI] test chunked prefill more (#5798) Lianmin Zheng 2025-04-28 10:57:17 -07:00
  • d73ddeb196 feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (#5850) JiLi 2025-04-29 01:49:33 +08:00
  • f48b007c1d [Doc] Recover history of server_arguments.md (#5851) Baizhou Zhang 2025-04-28 10:48:21 -07:00
  • 74cb12a878 [config] qwen3moe_tune_h20 fp8 tp4 (#5846) ybyang 2025-04-29 01:21:06 +08:00
  • c6c6264073 [PD] support pd fake transfer for warmup (#5726) ybyang 2025-04-29 00:33:20 +08:00
  • 92ab0a2055 feat: Add fused moe triton config for qwen3bf16 moe on h20 (#5839) yhyang201 2025-04-29 00:30:59 +08:00
  • e132cba2a8 fused moe triton tuning script support qwen3 (#5842) Xiaoyu Zhang 2025-04-29 00:13:04 +08:00
  • 0045f4b2af feat: Add fused moe triton config for qwen3 moe on h100 (#5833) XinyuanTong 2025-04-28 08:37:13 -07:00
  • 8601300beb fix: fix the error where the content is None when reasoning and tool … (#5838) mlmz 2025-04-28 23:36:08 +08:00
  • 6fa6f38ed3 Feat: add support for thinking mode via chat_template_kwargs.enable_t… (#5551) mlmz 2025-04-28 22:07:45 +08:00
  • 693723d1f7 Revert "Tiny refactor DefaultModelLoader.Source" (#5825) Lianmin Zheng 2025-04-28 01:18:57 -07:00
  • 966eb90865 [Docs] Replace lists with tables for cleanup and readability in server_arguments (#5276) Michael Yao 2025-04-28 15:36:10 +08:00
  • 644ed409d1 Tiny refactor DefaultModelLoader.Source (#5482) fzyzcjy 2025-04-28 15:35:51 +08:00
  • 3029889cb4 Turn on overlap scheduler for multimodal models (#5771) Lianmin Zheng 2025-04-27 23:45:09 -07:00
  • ef15dcda26 Add a doc to fix sgl-kernel build link error in py39 with ccache (#5809) Xiaoyu Zhang 2025-04-28 12:34:27 +08:00
  • ad4df30741 Dockerfile.dev pip scikit_build_core (#5807) Xiaoyu Zhang 2025-04-28 12:14:20 +08:00
  • 41ac0c6d48 chore: upgrade sgl-kernel 0.1.0 (#5690) Yineng Zhang 2025-04-27 21:00:50 -07:00
  • 84810da4ae Add Cutlass MLA attention backend (#5390) Trevor Morris 2025-04-27 20:58:53 -07:00
  • 40d9b8acce Improve overlap scheduling (#5788) Liangsheng Yin 2025-04-28 11:19:16 +08:00
  • f0365820e8 [Misc] add structure logging, write to file and log tracing for SGL Router Simo Lin 2025-04-27 16:54:10 -07:00
  • 86317c09e9 [Docs] update grafana setup guide in production metrics (#5643) Huapeng Zhou 2025-04-28 06:36:33 +08:00
  • daed453e84 [CI] Improve github summary & enable fa3 for more models (#5796) Lianmin Zheng 2025-04-27 15:29:46 -07:00
  • ded04b2e0a Update nightly-test.yml (#5797) Lianmin Zheng 2025-04-27 15:27:24 -07:00
  • 84022c0e56 Release v0.4.6 (#5795) Baizhou Zhang 2025-04-27 14:07:05 -07:00
  • f9fb33efc3 Add 8-GPU Test for Deepseek-V3 (#5691) Baizhou Zhang 2025-04-27 12:46:12 -07:00
  • a38f6932cc [CI] Fix test case (#5790) Lianmin Zheng 2025-04-27 08:55:35 -07:00
  • beb65c7433 [PD]Reduce kv transfer threads (#5791) Liangsheng Yin 2025-04-27 23:03:30 +08:00
  • 621e96bf9b [CI] Fix ci tests (#5769) Lianmin Zheng 2025-04-27 07:18:10 -07:00
  • 35ca04d2fa [CI] fix port conflicts (#5789) Lianmin Zheng 2025-04-27 05:17:44 -07:00
  • 3c4e0ee64d [CI] Tune threshold (#5787) Lianmin Zheng 2025-04-27 04:10:22 -07:00
  • 9c088829ee Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (#5786) Lianmin Zheng 2025-04-27 04:03:02 -07:00
  • 005aad32ad Revert "[fix] fix bench_one_batch_server" (#5785) Lianmin Zheng 2025-04-27 03:48:33 -07:00
  • 4d23ba08f5 Simplify FA3 tests (#5779) Lianmin Zheng 2025-04-27 01:30:17 -07:00
  • 6e313c1b8b Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (#5777) Lianmin Zheng 2025-04-27 01:04:15 -07:00
  • a45a4b239d Split local attention test from fa3 test (#5774) Baizhou Zhang 2025-04-27 01:03:31 -07:00
  • 981a2619d5 Fix eagle test case (#5776) Lianmin Zheng 2025-04-27 01:00:54 -07:00
  • 8ba313304d Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (#5772) Lianmin Zheng 2025-04-26 23:26:08 -07:00
  • 021020632a add switch to disable open api doc (#3744) zhanweidu 2025-04-27 14:18:47 +08:00
  • 7e944246c3 Add memory_saver check (#4986) Kebe 2025-04-27 11:20:05 +08:00
  • a086a11305 Use sgl-kernel sgl_per_token_group_quant_int8 (#4971) lambert0312 2025-04-27 11:19:42 +08:00
  • bdbe5f816b update llguidance to 0.7.11; adds StructTag (#4870) Michał Moskal 2025-04-26 20:13:57 -07:00
  • 9ad28f639e fix(srt): check if sample_indices is not None before usage. (#5633) aoshen524 2025-04-26 22:51:01 -04:00
  • d7b1ce65a5 Handle JSONDecodeError while processing request data (#5599) yan97ao 2025-04-27 10:50:50 +08:00
  • f55933e1cc [misc] more decode step log for batch_one_batch (#5565) JieXin Liang 2025-04-27 10:50:28 +08:00
  • 408ba02218 Add Llama 4 to FA3 test (#5509) Stefan He 2025-04-26 19:49:31 -07:00