| Author | Commit | Message | Date |
|---|---|---|---|
| Baizhou Zhang | 983ef22cf3 | [Doc] Update deterministic inference flag in server_arguments.md (#11978)<br>Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> | 2025-10-22 14:12:15 -07:00 |
| Baizhou Zhang | ef4a8097b8 | Rename flashmla kernel options of nsa backend for better readability (#11876) | 2025-10-21 13:14:16 -07:00 |
| b8zhong | f9a7d9b3dc | support server arg override KV cache to bf16 to avoid slow cases (#11749) | 2025-10-19 02:49:48 +08:00 |
| Xun Sun | a40229f6f8 | [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423)<br>Co-authored-by: Hank Han <hanhan7630@outlook.com><br>Co-authored-by: Shangming Cai <csmthu@gmail.com> | 2025-10-14 19:40:54 -07:00 |
| Chenxi Li | 28f80b1244 | Implement LRU eviction policy for LoRA adapters (#11041) | 2025-10-13 20:18:25 -07:00 |
| Lianmin Zheng | 2ac46e94ef | Sync changes on io_struct.py and deterministic ops (#11498) | 2025-10-12 16:03:10 -07:00 |
| Cheng Wan | 3c06b673af | [8/N] MoE Refactor: deprecate EPMoE (#11211) | 2025-10-07 21:51:41 -07:00 |
| Lianmin Zheng | 708f4ff490 | Rename max_micro_batch_size -> pp_max_micro_batch_size (#11279) | 2025-10-06 15:50:56 -07:00 |
| Matt Nappo | 8c57490210 | [Feature] Option to save model weights to CPU when memory saver mode is enabled (#10873)<br>Co-authored-by: molocule <34072934+molocule@users.noreply.github.com> | 2025-10-03 16:48:19 +08:00 |
| fzyzcjy | 5e786cca3a | Support single batch overlap (#10422) | 2025-10-02 18:04:36 +08:00 |
| narutolhy | d17986f8c6 | Enable optional FP32 compute for LM Head (#10729)<br>Thanks to MiniMax Team and Chenyang Zhao's support. | 2025-09-29 20:45:17 -07:00 |
| Lianmin Zheng | f68dd998b9 | Rename customer label -> custom label (#10899)<br>Co-authored-by: Yingchun Lai <laiyingchun@apache.org><br>Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> | 2025-09-25 16:19:53 -07:00 |
| kushanam | d7b20dd65d | chore: Initial support for input config files (#10534)<br>Co-authored-by: root <root@umbriel-b200-017.ipp4a1.colossus.nvidia.com><br>Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2025-09-24 14:45:52 -07:00 |
| Philip Kiely - Baseten | 7f028b07c4 | Fix formatting in long code blocks (#10528) | 2025-09-16 12:02:05 -07:00 |
| Baizhou Zhang | 8ad700f735 | Cleaning codes for speculative attention mode (#10149) | 2025-09-08 17:38:06 -07:00 |
| Yingchun Lai | b32ab0705e | metrics: support customer buckets for prompt/generation_tokens_histogram (#9634) | 2025-09-04 22:22:08 +08:00 |
| Zhiqiang Xie | 001f51940a | [HiCache] change the default policy to write through (#9772) | 2025-08-28 18:28:39 -07:00 |
| Lifu Huang | b0980af89f | Support pinning adapter via server args. (#9249) | 2025-08-20 16:25:01 -07:00 |
| Cheng Wan | 295895120d | [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) | 2025-08-14 21:14:53 -07:00 |
| Lianmin Zheng | 2e8e7e353b | Improve docs and developer guide (#9044) | 2025-08-10 21:05:18 -07:00 |
| Lianmin Zheng | 2449a0afe2 | Refactor the docs (#9031) | 2025-08-10 19:49:45 -07:00 |