640 Commits

Author SHA1 Message Date
Zhiqiang Xie
001f51940a [HiCache] change the default policy to write through (#9772) 2025-08-28 18:28:39 -07:00
Yineng Zhang
bc80dc4ce0 chore: bump v0.5.1.post3 (#9716) 2025-08-27 15:42:42 -07:00
yhyang201
a85363c199 [docs] Instructions for bench_serving.py (#9071)
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-08-26 18:30:57 -07:00
Xiaotong Jiang
1a0896e9c0 [doc] add kimik2 --tool-call-parser (#9647) 2025-08-26 10:39:40 -07:00
Netanel Haber
4cd08dc592 model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 (#9301) 2025-08-26 15:33:40 +08:00
Chayenne
9b08d975a0 [docs] Refactor, remove compiled results and add gpt-oss (#9613)
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
2025-08-25 15:27:06 -07:00
Yineng Zhang
e3e97a120b chore: bump v0.5.1.post2 (#9592) 2025-08-25 03:45:09 -07:00
Xinyuan Tong
ca4b86c564 fix: Update OpenAI client base URL in documentation (#9576) 2025-08-24 23:06:57 -07:00
Yineng Zhang
e0ab167db0 chore: bump v0.5.1.post1 (#9558) 2025-08-24 01:14:17 -07:00
Xiaotong Jiang
80425e59bb [doc] deepseekv31 support (#9544) 2025-08-23 16:54:58 -07:00
Lianmin Zheng
97a38ee85b Release 0.5.1 (#9533) 2025-08-23 07:09:26 -07:00
Xinyuan Tong
fedfe91c1a [Docs] Add doc and quick demo for gpt-oss responses api & buildin tools (#9497)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-21 23:51:52 -07:00
Xinyuan Tong
13ec8d427e [Docs]Update reasoning parser doc & fix outdated link (#9492)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-21 22:08:28 -07:00
Chayenne
05bd789791 [docs]: fix reasoning context in docs (#9483) 2025-08-21 20:04:12 -07:00
Xinyuan Tong
0b3a5b1151 Update reasoning parser doc (#9468)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-21 17:25:30 -07:00
Xinyuan Tong
e8449ab515 Add deepseek v3.1 thinking parser support and update docs (#9464)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-21 15:09:40 -07:00
Lifu Huang
b0980af89f Support pinning adapter via server args. (#9249) 2025-08-20 16:25:01 -07:00
Lianmin Zheng
1ec9769753 [Docs] Update contribution guide (#9383) 2025-08-19 23:37:45 -07:00
Lianmin Zheng
f20b6a3f2b [minor] Sync style changes (#9376) 2025-08-19 21:35:01 -07:00
Lianmin Zheng
ecc9f3e47a [Minor] Fix the style of sgl-kernel (#9332) 2025-08-18 23:45:00 -07:00
Yineng Zhang
7e8187e004 docs: fix spec (#9326) 2025-08-18 19:35:46 -07:00
Lianmin Zheng
c480a3f6ea Minor style fixes for sgl-kernel (#9289) 2025-08-18 09:38:35 -07:00
Netanel Haber
845d12a979 model: support nvidia/Llama-3_3-Nemotron-Super-49B-v1 (#9067)
Co-authored-by: Kyle Huang <kylhuang@nvidia.com>
2025-08-17 01:48:15 -07:00
Cheng Wan
295895120d [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) 2025-08-14 21:14:53 -07:00
Yineng Zhang
fab0f6e77d chore: bump v0.5.0rc2 (#9203) 2025-08-14 16:11:16 -07:00
Chengxing Xie
c1c7dc4534 feat: Add model version tracking with API endpoints and response metadata (#8795) 2025-08-14 12:13:46 -07:00
Lianmin Zheng
9e426466af Clean up allocators (#9134)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-13 13:56:04 -07:00
Zhihao Liu
65736dc524 [Model] Support Qwen3ForSequenceClassification for Qwen3-Embed Model (#7957) 2025-08-13 11:14:54 -07:00
jacky.cheng
25caa7a8a9 [AMD] Support Wave attention backend with AMD GPU optimizations (#8660)
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Co-authored-by: Harsh Menon <harsh@nod-labs.com>
Co-authored-by: Stanley Winata <stanley.winata@amd.com>
Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com>
Co-authored-by: Stanley Winata <stanley@nod-labs.com>
Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com>
Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com>
Co-authored-by: Ivan Butygin <ibutygin@amd.com>
2025-08-12 13:49:11 -07:00
Hangzhi
03d114496f Fix typos in supported models documentation (#9119) 2025-08-12 13:35:24 -07:00
li chaoran
2ecbd8b8bf [feat] add ascend readme and docker release (#8700)
Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
Signed-off-by: lichaoran <pkwarcraft@gmail.com>
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2025-08-12 13:25:42 -07:00
Simo Lin
1ce30dd13e [router] update router documentation (#9121) 2025-08-12 13:16:34 -07:00
Yichao Cheng
fcc11e5ed5 update support new models doc (#9096)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-12 01:21:02 -07:00
Zhiqiang Xie
0eec4cb6cc HiCache, add bench long context plus minor fixs (#9086)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-11 16:54:52 -07:00
Faraz
f508cd3cb7 TRTLLM-MLA FP8 path (#8638)
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-08-11 14:02:13 -07:00
Lianmin Zheng
8c07fabda7 Update hyperparameter_tuning.md (#9083) 2025-08-11 13:44:11 -07:00
Liangsheng Yin
f9afa7dceb Fix docs for clip max new tokens (#9082) 2025-08-11 13:15:21 -07:00
Jimmy
0d9e89ec69 [PD]decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated (#8866) 2025-08-11 13:08:11 -07:00
Hangzhi
3d64fda376 Fix broken Kimi models HuggingFace link (#9080) 2025-08-11 12:15:00 -07:00
Baizhou Zhang
75e6a7cde1 Support radix cache for Lora feature (#7216) 2025-08-11 10:14:11 -07:00
Lianmin Zheng
2e8e7e353b Improve docs and developer guide (#9044) 2025-08-10 21:05:18 -07:00
Lianmin Zheng
2449a0afe2 Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00
Lifu Huang
f8a173bb50 Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops (#8940) 2025-08-10 01:04:45 -07:00
Binyao Jiang
f29aba8c6e Support glm4.1v and glm4.5v (#8798)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Chang Su <csu272@usc.edu>
2025-08-09 00:59:13 -07:00
Lianmin Zheng
706bd69cc5 Clean up server_args.py to have a dedicated function for model specific adjustments (#8983) 2025-08-08 19:56:50 -07:00
Yineng Zhang
9020f7fc32 chore: bump v0.5.0rc0 (#8959) 2025-08-08 09:16:18 -07:00
Wenbo Yang
1132547496 Add ernie4.py for ERNIE-4.5 (#7657) 2025-08-08 00:55:48 -07:00
Xinyuan Tong
3fa3c6cd6a Enables force reasoning based on chat template for Qwen3-Thinking (#8369)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Chang Su <csu272@usc.edu>
2025-08-06 20:02:47 -07:00
Lifu Huang
6210e2c4f0 Support GPU pinning for LoRA (#8697) 2025-08-06 19:39:45 -07:00
HouseWest
ca47e24f5d [Feature] improve TBO: two chunk overlap (#8144) 2025-08-05 21:11:01 -07:00