b8zhong
|
6bc503af73
|
[Doc] Update support matrix for attn and hybrid attn (#11293)
|
2025-10-14 22:43:11 -07:00 |
|
Xun Sun
|
a40229f6f8
|
[1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423)
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2025-10-14 19:40:54 -07:00 |
|
Simo Lin
|
e0c2af2ac2
|
[router] update router doc to latest features (#11639)
|
2025-10-14 18:32:30 -07:00 |
|
Wenyi Xu
|
642fa966f2
|
[Docs] [Router]: Update sg-router doc on circuit breaker (#11449)
|
2025-10-14 02:18:14 -07:00 |
|
Chenxi Li
|
28f80b1244
|
Implement LRU eviction policy for LoRA adapters (#11041)
|
2025-10-13 20:18:25 -07:00 |
|
Xiaoyu Zhang
|
88a6f9dab5
|
bench_serving support PD Disaggregation (#11542)
|
2025-10-13 19:43:26 -07:00 |
|
hzh0425
|
318424e2c8
|
[HICache]: Support 3FS-Store with page_first_direct layout (#11460)
|
2025-10-13 15:47:22 +08:00 |
|
Jonah Bernard
|
8e776c78a1
|
docs(router): add token-bucket rate limiting to the docs (#11485)
|
2025-10-12 20:03:27 -07:00 |
|
Lianmin Zheng
|
2ac46e94ef
|
Sync changes on io_struct.py and deterministic ops (#11498)
|
2025-10-12 16:03:10 -07:00 |
|
ykcombat
|
f5754d1256
|
[Documentation][Configuration] Server args and documentation of PD-Multiplexing. (#11427)
|
2025-10-11 21:36:07 +08:00 |
|
Shangming Cai
|
0a7c4bded7
|
[Doc] Update mooncake nvlink transport doc for PD disaggregation (#11321)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
2025-10-08 00:59:29 -07:00 |
|
Cheng Wan
|
3c06b673af
|
[8/N] MoE Refactor: deprecate EPMoE (#11211)
|
2025-10-07 21:51:41 -07:00 |
|
Xinyuan Tong
|
e3c7f09146
|
Update tool parser and related documentation (#11223)
|
2025-10-07 11:03:40 -07:00 |
|
hzh0425
|
df08bf9b9f
|
[Doc]: Best Practice for HICache (#11001)
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-10-08 00:59:21 +08:00 |
|
ykwd
|
69efdd27bc
|
[Doc] HiCache Design Documents (#11027)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-10-08 00:35:45 +08:00 |
|
Wenyi Xu
|
0958a39704
|
[Docs] [Router] Update Observability and Common Issues Section (#11302)
|
2025-10-07 08:03:09 -07:00 |
|
Lianmin Zheng
|
708f4ff490
|
Rename max_micro_batch_size -> pp_max_micro_batch_size (#11279)
|
2025-10-06 15:50:56 -07:00 |
|
Matt Nappo
|
8c57490210
|
[Feature] Option to save model weights to CPU when memory saver mode is enabled (#10873)
Co-authored-by: molocule <34072934+molocule@users.noreply.github.com>
|
2025-10-03 16:48:19 +08:00 |
|
fzyzcjy
|
5e786cca3a
|
Support single batch overlap (#10422)
|
2025-10-02 18:04:36 +08:00 |
|
narutolhy
|
d17986f8c6
|
Enable optional FP32 compute for LM Head (#10729)
Thanks to MiniMax Team and Chenyang Zhao's support.
|
2025-09-29 20:45:17 -07:00 |
|
Lianmin Zheng
|
dda34c2f93
|
Fix mem fraction static for nightly tests (#11076)
|
2025-09-29 12:57:41 -07:00 |
|
Lianmin Zheng
|
f68dd998b9
|
Rename customer label -> custom label (#10899)
Co-authored-by: Yingchun Lai <laiyingchun@apache.org>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-09-25 16:19:53 -07:00 |
|
kushanam
|
d7b20dd65d
|
chore: Initial support for input config files (#10534)
Co-authored-by: root <root@umbriel-b200-017.ipp4a1.colossus.nvidia.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2025-09-24 14:45:52 -07:00 |
|
Lifu Huang
|
08ecd0aa2a
|
[3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization (#10592)
|
2025-09-20 22:47:48 -07:00 |
|
Philip Kiely - Baseten
|
7f028b07c4
|
Fix formatting in long code blocks (#10528)
|
2025-09-16 12:02:05 -07:00 |
|
Lifu Huang
|
3f41b48c40
|
[2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (#10286)
|
2025-09-15 16:04:03 -07:00 |
|
Baizhou Zhang
|
8ad700f735
|
Cleaning codes for speculative attention mode (#10149)
|
2025-09-08 17:38:06 -07:00 |
|
Yineng Zhang
|
b7d1f17b8d
|
Revert "enable auto-round quantization model (#6226)" (#10148)
|
2025-09-07 22:31:11 -07:00 |
|
Weiwei
|
c8295d2353
|
enable auto-round quantization model (#6226)
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
|
2025-09-07 22:05:35 -07:00 |
|
Liangsheng Yin
|
6e95f5e5bd
|
Simplify Router arguments passing and build it in docker image (#9964)
|
2025-09-05 12:13:55 +08:00 |
|
Yingchun Lai
|
b32ab0705e
|
metrics: support customer buckets for prompt/generation_tokens_histogram (#9634)
|
2025-09-04 22:22:08 +08:00 |
|
Huapeng Zhou
|
75ee00112d
|
[Doc] Fix SGLang tool parser doc (#9886)
|
2025-09-04 21:52:53 +08:00 |
|
Lianmin Zheng
|
60e37f8028
|
Move parsers under a single folder (#9912)
|
2025-09-02 18:25:04 -07:00 |
|
Lifu Huang
|
1fbfdebe6b
|
[chore] fix dead links in doc (#9913)
|
2025-09-02 00:28:26 -07:00 |
|
Zhiqiang Xie
|
001f51940a
|
[HiCache] change the default policy to write through (#9772)
|
2025-08-28 18:28:39 -07:00 |
|
yhyang201
|
a85363c199
|
[docs] Instructions for bench_serving.py (#9071)
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2025-08-26 18:30:57 -07:00 |
|
Xiaotong Jiang
|
1a0896e9c0
|
[doc] add kimik2 --tool-call-parser (#9647)
|
2025-08-26 10:39:40 -07:00 |
|
Chayenne
|
9b08d975a0
|
[docs] Refactor, remove compiled results and add gpt-oss (#9613)
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
|
2025-08-25 15:27:06 -07:00 |
|
Xinyuan Tong
|
13ec8d427e
|
[Docs]Update reasoning parser doc & fix outdated link (#9492)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2025-08-21 22:08:28 -07:00 |
|
Chayenne
|
05bd789791
|
[docs]: fix reasoning context in docs (#9483)
|
2025-08-21 20:04:12 -07:00 |
|
Lifu Huang
|
b0980af89f
|
Support pinning adapter via server args. (#9249)
|
2025-08-20 16:25:01 -07:00 |
|
Yineng Zhang
|
7e8187e004
|
docs: fix spec (#9326)
|
2025-08-18 19:35:46 -07:00 |
|
Cheng Wan
|
295895120d
|
[6/N] MoE Refactor: Cleanup MoE-related configs (#8849)
|
2025-08-14 21:14:53 -07:00 |
|
jacky.cheng
|
25caa7a8a9
|
[AMD] Support Wave attention backend with AMD GPU optimizations (#8660)
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Co-authored-by: Harsh Menon <harsh@nod-labs.com>
Co-authored-by: Stanley Winata <stanley.winata@amd.com>
Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com>
Co-authored-by: Stanley Winata <stanley@nod-labs.com>
Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com>
Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com>
Co-authored-by: Ivan Butygin <ibutygin@amd.com>
|
2025-08-12 13:49:11 -07:00 |
|
Simo Lin
|
1ce30dd13e
|
[router] update router documentation (#9121)
|
2025-08-12 13:16:34 -07:00 |
|
Zhiqiang Xie
|
0eec4cb6cc
|
HiCache, add bench long context plus minor fixs (#9086)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-08-11 16:54:52 -07:00 |
|
Faraz
|
f508cd3cb7
|
TRTLLM-MLA FP8 path (#8638)
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
|
2025-08-11 14:02:13 -07:00 |
|
Lianmin Zheng
|
8c07fabda7
|
Update hyperparameter_tuning.md (#9083)
|
2025-08-11 13:44:11 -07:00 |
|
Liangsheng Yin
|
f9afa7dceb
|
Fix docs for clip max new tokens (#9082)
|
2025-08-11 13:15:21 -07:00 |
|
Jimmy
|
0d9e89ec69
|
[PD]decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated (#8866)
|
2025-08-11 13:08:11 -07:00 |
|