600 Commits

Author SHA1 Message Date
Lianmin Zheng
2e8e7e353b Improve docs and developer guide (#9044) 2025-08-10 21:05:18 -07:00
Lianmin Zheng
2449a0afe2 Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00
Lifu Huang
f8a173bb50 Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops (#8940) 2025-08-10 01:04:45 -07:00
Binyao Jiang
f29aba8c6e Support glm4.1v and glm4.5v (#8798)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Chang Su <csu272@usc.edu>
2025-08-09 00:59:13 -07:00
Lianmin Zheng
706bd69cc5 Clean up server_args.py to have a dedicated function for model specific adjustments (#8983) 2025-08-08 19:56:50 -07:00
Yineng Zhang
9020f7fc32 chore: bump v0.5.0rc0 (#8959) 2025-08-08 09:16:18 -07:00
Wenbo Yang
1132547496 Add ernie4.py for ERNIE-4.5 (#7657) 2025-08-08 00:55:48 -07:00
Xinyuan Tong
3fa3c6cd6a Enables force reasoning based on chat template for Qwen3-Thinking (#8369)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Chang Su <csu272@usc.edu>
2025-08-06 20:02:47 -07:00
Lifu Huang
6210e2c4f0 Support GPU pinning for LoRA (#8697) 2025-08-06 19:39:45 -07:00
HouseWest
ca47e24f5d [Feature] improve TBO: two chunk overlap (#8144) 2025-08-05 21:11:01 -07:00
Praneth Paruchuri
d26ca84f39 Support bailing moe (#8680) 2025-08-05 20:40:34 -07:00
Yineng Zhang
8cd344586e chore: bump v0.4.10.post2 (#8727) 2025-08-03 03:43:29 -07:00
Guanhua Wang
f7b2853ff8 [feat] support minimum token load balance in dp attention (#7379) 2025-08-03 00:46:47 -07:00
Lifu Huang
8675bdf246 Support limiting max loaded loras in CPU. (#8650) 2025-08-03 00:02:23 -07:00
Nicolas Castet
82e6c3a65a Add support for NCCL symmetric memory for TP allreduces (#8238) 2025-08-01 23:30:55 +00:00
Zac
b17c5b0118 fix arg typo for --disaggregation-transfer-backend (#8664) 2025-08-01 10:00:47 -07:00
Cheng Wan
6c88f6c8d9 [5/N] MoE Refactor: Update MoE parallelism arguments (#8658) 2025-08-01 01:20:03 -07:00
Ke Bao
33f0de337d chore: bump v0.4.10.post1 (#8652) 2025-08-01 12:07:30 +08:00
Faraz
4b04998d38 TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632)
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-07-31 16:03:40 -07:00
Yineng Zhang
023288645b chore: bump v0.4.10 (#8608) 2025-07-31 20:50:17 +08:00
Chang Su
51c38163c1 model: support Step3V (#8583)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: nnnobody-code <nnnobody@foxmail.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Qiaolin-Yu <qy254@cornell.edu>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-07-31 02:41:00 -07:00
Adarsh Shirawalmath
ec5f944271 [Model] Add support for Arcee Foundational Model (#8154) 2025-07-30 10:45:25 -07:00
Rui Chen
a730ce8162 [feature] [sgl-router] Add a dp-aware routing strategy (#6869) 2025-07-30 05:58:48 -07:00
Yineng Zhang
6478831be9 chore: bump v0.4.9.post6 (#8517) 2025-07-29 02:30:07 -07:00
Kaixi Hou
134fa43e19 [NVIDIA] Change to use num_local_experts (#8453) 2025-07-28 10:38:19 -07:00
Yineng Zhang
45bc170b36 chore: bump v0.4.9.post5 (#8458) 2025-07-28 02:11:06 -07:00
Qiaolin Yu
484d0e021d doc: add bench_one_batch_server in the benchmark doc (#8441) 2025-07-27 23:07:54 -07:00
Qiaolin Yu
2810338401 [feat] Support different attention backends for prefill and decode (#6338)
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-07-28 11:42:29 +08:00
Kevin Xiang Li
44d600cd67 Support precomputed_embeddings for Llama 4 (#8156)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-07-27 01:14:49 -07:00
Yineng Zhang
2272c2a5b5 chore: bump v0.4.9.post4 (#8305) 2025-07-25 17:12:47 -07:00
Chang Su
d8ee15643b [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (#8363) 2025-07-25 14:59:42 -07:00
Xiaoyu Zhang
9045cc1eb8 [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering torch.compile in forward pass (#8353) 2025-07-25 21:17:47 +08:00
Zaili Wang
15d2759174 [CPU] Add tutorial docs for SGL on CPU (#8000) 2025-07-25 00:03:16 -07:00
Yineng Zhang
01c000043c chore: bump v0.4.9.post3 (#8265) 2025-07-22 15:55:48 -07:00
Xinyuan Tong
8430bfe3e9 [Refactor] simplify multimodal data processing (#8107)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
2025-07-20 21:43:09 -07:00
Praneth Paruchuri
83c104b188 Feat: Support for Persimmon Model (#7983) 2025-07-19 23:07:47 -07:00
Lifu Huang
4e3defe5a7 Support start up LoRA server without initial adapters (#8019) 2025-07-19 15:38:09 -07:00
Lianmin Zheng
bb0e8a32b5 Clean up server args (#8161) 2025-07-19 11:32:52 -07:00
Binyao Jiang
b7e951a6db Feat: Support audio in Phi4-mm model (#8048) 2025-07-18 21:03:53 -07:00
Lianmin Zheng
9c7a46180c [Doc] Steps to add a new attention backend (#8155) 2025-07-18 16:38:26 -07:00
Minglei Zhu
8a32355704 Feat: Support Granite 3.0 MoE in SGLang (#7959) 2025-07-17 20:56:03 -07:00
Praneth Paruchuri
cb736df854 Support for Phi-1.5 & Phi-2 models (#7862) 2025-07-13 18:43:40 -07:00
Lifu Huang
e2ed9d049a Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (#7844) 2025-07-13 18:36:01 -07:00
Yineng Zhang
22bd857cb5 docs: update README (#7985) 2025-07-12 13:31:11 -07:00
Yineng Zhang
eb118d88c4 chore: bump v0.4.9.post2 (#7963) 2025-07-11 21:11:20 -07:00
ronnie_zheng
86044712c6 [feature] kv transfer support of ascend npu (#7795)
Co-authored-by: liupeng <liupeng374@huawei.com>
2025-07-11 00:07:51 -07:00
Atream
615553079d Support Kimi K2 (#7940) 2025-07-11 00:02:21 -07:00
Binyao Jiang
2d54d4bb64 Feat: Support Phi-3.5-MoE in SGLang (#7907) 2025-07-09 23:51:33 -07:00
Yineng Zhang
066f4ec91f chore: bump v0.4.9.post1 (#7882) 2025-07-09 00:28:17 -07:00
Yikai Zhang
0870232195 Update native_api doc to match the change in the get_model_info endpoint (#7660)
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-07-08 21:05:58 -07:00