Commit Graph

226 Commits

SHA1        Message  (Author, Date)

3fa3c6cd6a  Enables force reasoning based on chat template for Qwen3-Thinking (#8369)  (Xinyuan Tong, 2025-08-06 20:02:47 -07:00)
    Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
    Co-authored-by: Chang Su <csu272@usc.edu>
6210e2c4f0  Support GPU pinning for LoRA (#8697)  (Lifu Huang, 2025-08-06 19:39:45 -07:00)
ca47e24f5d  [Feature] improve TBO: two chunk overlap (#8144)  (HouseWest, 2025-08-05 21:11:01 -07:00)
f7b2853ff8  [feat] support minimum token load balance in dp attention (#7379)  (Guanhua Wang, 2025-08-03 00:46:47 -07:00)
8675bdf246  Support limiting max loaded loras in CPU. (#8650)  (Lifu Huang, 2025-08-03 00:02:23 -07:00)
82e6c3a65a  Add support for NCCL symmetric memory for TP allreduces (#8238)  (Nicolas Castet, 2025-08-01 23:30:55 +00:00)
b17c5b0118  fix arg typo for --disaggregation-transfer-backend (#8664)  (Zac, 2025-08-01 10:00:47 -07:00)
6c88f6c8d9  [5/N] MoE Refactor: Update MoE parallelism arguments (#8658)  (Cheng Wan, 2025-08-01 01:20:03 -07:00)
4b04998d38  TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632)  (Faraz, 2025-07-31 16:03:40 -07:00)
    Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
51c38163c1  model: support Step3V (#8583)  (Chang Su, 2025-07-31 02:41:00 -07:00)
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
    Co-authored-by: nnnobody-code <nnnobody@foxmail.com>
    Co-authored-by: ispobock <ispobaoke@gmail.com>
    Co-authored-by: Qiaolin-Yu <qy254@cornell.edu>
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
    Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
134fa43e19  [NVIDIA] Change to use num_local_experts (#8453)  (Kaixi Hou, 2025-07-28 10:38:19 -07:00)
2810338401  [feat] Support different attention backends for prefill and decode (#6338)  (Qiaolin Yu, 2025-07-28 11:42:29 +08:00)
    Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
44d600cd67  Support precomputed_embeddings for Llama 4 (#8156)  (Kevin Xiang Li, 2025-07-27 01:14:49 -07:00)
    Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
    Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
    Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
d8ee15643b  [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (#8363)  (Chang Su, 2025-07-25 14:59:42 -07:00)
8430bfe3e9  [Refactor] simplify multimodal data processing (#8107)  (Xinyuan Tong, 2025-07-20 21:43:09 -07:00)
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
4e3defe5a7  Support start up LoRA server without initial adapters (#8019)  (Lifu Huang, 2025-07-19 15:38:09 -07:00)
bb0e8a32b5  Clean up server args (#8161)  (Lianmin Zheng, 2025-07-19 11:32:52 -07:00)
9c7a46180c  [Doc] Steps to add a new attention backend (#8155)  (Lianmin Zheng, 2025-07-18 16:38:26 -07:00)
e2ed9d049a  Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (#7844)  (Lifu Huang, 2025-07-13 18:36:01 -07:00)
86044712c6  [feature] kv transfer support of ascend npu (#7795)  (ronnie_zheng, 2025-07-11 00:07:51 -07:00)
    Co-authored-by: liupeng <liupeng374@huawei.com>
615553079d  Support Kimi K2 (#7940)  (Atream, 2025-07-11 00:02:21 -07:00)
0870232195  Update native_api doc to match the change in the get_model_info endpoint (#7660)  (Yikai Zhang, 2025-07-08 21:05:58 -07:00)
    Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
64c5907e12  [PD] Add guidance for prefill bootstrap timeout (#7846)  (Shangming Cai, 2025-07-08 21:00:34 -07:00)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
43f93f632c  fix CI: update native api ipynb (#7754)  (Xinyuan Tong, 2025-07-03 15:25:00 -07:00)
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
1e0e549766  Ascend attention backend(PA&MLA) (#7722)  (ronnie_zheng, 2025-07-03 09:23:19 -07:00)
    Co-authored-by: Maksim <makcum888e@mail.ru>
    Co-authored-by: VDV1985 <vladdv85@mail.ru>
22352d47a9  Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632)  (Lianmin Zheng, 2025-06-29 23:16:19 -07:00)
    Co-authored-by: Kan Wu <wukanustc@gmail.com>
5c2142579a  [PD] Raise error for incompatible mooncake version and some minor fixes (#7527)  (Shangming Cai, 2025-06-25 18:55:24 -07:00)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
30ceccc74a  Update hyperparameter_tuning.md (#7454)  (Lianmin Zheng, 2025-06-22 22:42:55 -07:00)
72676cd6c0  feat(oai refactor): Replace openai_api with entrypoints/openai (#7351)  (Chang Su, 2025-06-21 13:21:06 -07:00)
    Co-authored-by: Jin Pan <jpan236@wisc.edu>
ab74f8f09d  Remove batches api in docs & example (#7400)  (Jinn, 2025-06-20 19:46:31 -07:00)
97011abc8a  [Doc] add embedding rerank doc (#7364)  (woodx, 2025-06-19 21:53:54 -07:00)
a39d928782  support qwen2 running on ascend npu device (#7022)  (Yijie Zhu, 2025-06-17 11:24:10 -07:00)
    Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com>
21615cc3fe  Minor style and doc fix (#7228)  (Lianmin Zheng, 2025-06-16 01:03:13 -07:00)
bd7cfbd2f8  [Fix] Reduce busy polling when scheduler is idle (#6026)  (Povilas Kanapickas, 2025-06-12 14:58:22 -07:00)
dbdf76ca98  Clean up docs for server args and sampling parameters (generated by grok) (#7076)  (Lianmin Zheng, 2025-06-10 19:55:42 -07:00)
f2a75a66c4  update doc (#7046)  (Ximingwang-09, 2025-06-11 10:02:01 +08:00)
    Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
90bd3e32d6  Improve perf tuning docs (#7071)  (Lianmin Zheng, 2025-06-10 16:55:04 -07:00)
b56de8f943  Open AI API hidden states (#6716)  (kyle-pena-kuzco, 2025-06-10 14:37:29 -07:00)
dd1012fcbe  [PD] Fix potential perf spike caused by tracker gc and optimize doc (#6764)  (shangmingc, 2025-06-05 10:56:02 -07:00)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
8e3797be1c  support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277)  (zyksir, 2025-06-04 22:11:24 -07:00)
cf9815ba69  [Refactor] Multimodal data processing for VLM (#6659)  (Xinyuan Tong, 2025-06-04 11:22:33 -07:00)
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
37f1547587  [FEAT] Add transformers backend support (#5929)  (Marc Sun, 2025-06-03 21:05:29 -07:00)
2d72fc47cf  Improve profiler and integrate profiler in bench_one_batch_server (#6787)  (Lianmin Zheng, 2025-05-31 15:53:55 -07:00)
6cb00c6398  [PD] Optimize time out logic and add env var doc for mooncake (#6761)  (shangmingc, 2025-05-30 00:45:02 -07:00)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
e806f708c9  [PD] Make bootstrap code common between NIXL and Mooncake (#6473)  (Trevor Morris, 2025-05-27 12:47:38 -07:00)
45a31a82e4  docs: Update documentation to reflect xgrammar as default grammar backend (#6601)  (Vincent Zhong, 2025-05-27 13:29:13 +08:00)
    Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
7a0bbe6a64  update toc for doc and dockerfile code style format (#6450)  (linzhuo, 2025-05-27 13:05:11 +08:00)
    Co-authored-by: Chayenne <zhaochen20@outlook.com>
e235be16fe  Fix some issues with current docs. (#6588)  (simveit, 2025-05-26 01:04:34 +08:00)
ed0c3035cd  feat(Tool Calling): Support required and specific function mode (#6550)  (Chang Su, 2025-05-23 21:00:37 -07:00)
a6ae3af15e  Support XiaomiMiMo inference with mtp (#6059)  (ryang, 2025-05-22 14:14:49 -07:00)
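A listing in this shape (abbreviated SHA, subject, then author and date) can be reproduced locally from a clone of the repository with git's standard pretty-format placeholders `%h` (short SHA), `%s` (subject), `%an` (author name), and `%ad` (author date); this is a generic git sketch, not part of the page above:

```shell
# Inside a clone of the repository: print each commit as
#   <short SHA>  <subject>  (<author>, <ISO date>)
git log --date=iso-strict --pretty=format:'%h  %s  (%an, %ad)'
```

Appending `-n 20` limits the output to the 20 most recent commits, and `--first-parent` keeps the listing to the main branch's own history.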