134fa43e19 | 2025-07-28 10:38:19 -07:00 | Kaixi Hou | [NVIDIA] Change to use num_local_experts (#8453)
2810338401 | 2025-07-28 11:42:29 +08:00 | Qiaolin Yu | [feat] Support different attention backends for prefill and decode (#6338)
    Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
    Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
44d600cd67 | 2025-07-27 01:14:49 -07:00 | Kevin Xiang Li | Support precomputed_embeddings for Llama 4 (#8156)
    Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
    Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
    Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
d8ee15643b | 2025-07-25 14:59:42 -07:00 | Chang Su | [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (#8363)
8430bfe3e9 | 2025-07-20 21:43:09 -07:00 | Xinyuan Tong | [Refactor] simplify multimodal data processing (#8107)
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
4e3defe5a7 | 2025-07-19 15:38:09 -07:00 | Lifu Huang | Support start up LoRA server without initial adapters (#8019)
bb0e8a32b5 | 2025-07-19 11:32:52 -07:00 | Lianmin Zheng | Clean up server args (#8161)
9c7a46180c | 2025-07-18 16:38:26 -07:00 | Lianmin Zheng | [Doc] Steps to add a new attention backend (#8155)
e2ed9d049a | 2025-07-13 18:36:01 -07:00 | Lifu Huang | Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (#7844)
86044712c6 | 2025-07-11 00:07:51 -07:00 | ronnie_zheng | [feature] kv transfer support of ascend npu (#7795)
    Co-authored-by: liupeng <liupeng374@huawei.com>
615553079d | 2025-07-11 00:02:21 -07:00 | Atream | Support Kimi K2 (#7940)
0870232195 | 2025-07-08 21:05:58 -07:00 | Yikai Zhang | Update native_api doc to match the change in the get_model_info endpoint (#7660)
    Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
    Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
64c5907e12 | 2025-07-08 21:00:34 -07:00 | Shangming Cai | [PD] Add guidance for prefill bootstrap timeout (#7846)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
43f93f632c | 2025-07-03 15:25:00 -07:00 | Xinyuan Tong | fix CI: update native api ipynb (#7754)
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
1e0e549766 | 2025-07-03 09:23:19 -07:00 | ronnie_zheng | Ascend attention backend(PA&MLA) (#7722)
    Co-authored-by: Maksim <makcum888e@mail.ru>
    Co-authored-by: VDV1985 <vladdv85@mail.ru>
22352d47a9 | 2025-06-29 23:16:19 -07:00 | Lianmin Zheng | Improve streaming, log_level, memory report, weight loading, and benchmark script (#7632)
    Co-authored-by: Kan Wu <wukanustc@gmail.com>
5c2142579a | 2025-06-25 18:55:24 -07:00 | Shangming Cai | [PD] Raise error for incompatible mooncake version and some minor fixes (#7527)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
30ceccc74a | 2025-06-22 22:42:55 -07:00 | Lianmin Zheng | Update hyperparameter_tuning.md (#7454)
72676cd6c0 | 2025-06-21 13:21:06 -07:00 | Chang Su | feat(oai refactor): Replace openai_api with entrypoints/openai (#7351)
    Co-authored-by: Jin Pan <jpan236@wisc.edu>
ab74f8f09d | 2025-06-20 19:46:31 -07:00 | Jinn | Remove batches api in docs & example (#7400)
97011abc8a | 2025-06-19 21:53:54 -07:00 | woodx | [Doc] add embedding rerank doc (#7364)
a39d928782 | 2025-06-17 11:24:10 -07:00 | Yijie Zhu | support qwen2 running on ascend npu device (#7022)
    Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com>
21615cc3fe | 2025-06-16 01:03:13 -07:00 | Lianmin Zheng | Minor style and doc fix (#7228)
bd7cfbd2f8 | 2025-06-12 14:58:22 -07:00 | Povilas Kanapickas | [Fix] Reduce busy polling when scheduler is idle (#6026)
dbdf76ca98 | 2025-06-10 19:55:42 -07:00 | Lianmin Zheng | Clean up docs for server args and sampling parameters (generated by grok) (#7076)
f2a75a66c4 | 2025-06-11 10:02:01 +08:00 | Ximingwang-09 | update doc (#7046)
    Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
90bd3e32d6 | 2025-06-10 16:55:04 -07:00 | Lianmin Zheng | Improve perf tuning docs (#7071)
b56de8f943 | 2025-06-10 14:37:29 -07:00 | kyle-pena-kuzco | Open AI API hidden states (#6716)
dd1012fcbe | 2025-06-05 10:56:02 -07:00 | shangmingc | [PD] Fix potential perf spike caused by tracker gc and optimize doc (#6764)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
8e3797be1c | 2025-06-04 22:11:24 -07:00 | zyksir | support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277)
cf9815ba69 | 2025-06-04 11:22:33 -07:00 | Xinyuan Tong | [Refactor] Multimodal data processing for VLM (#6659)
    Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
37f1547587 | 2025-06-03 21:05:29 -07:00 | Marc Sun | [FEAT] Add transformers backend support (#5929)
2d72fc47cf | 2025-05-31 15:53:55 -07:00 | Lianmin Zheng | Improve profiler and integrate profiler in bench_one_batch_server (#6787)
6cb00c6398 | 2025-05-30 00:45:02 -07:00 | shangmingc | [PD] Optimize time out logic and add env var doc for mooncake (#6761)
    Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
e806f708c9 | 2025-05-27 12:47:38 -07:00 | Trevor Morris | [PD] Make bootstrap code common between NIXL and Mooncake (#6473)
45a31a82e4 | 2025-05-27 13:29:13 +08:00 | Vincent Zhong | docs: Update documentation to reflect xgrammar as default grammar backend (#6601)
    Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
7a0bbe6a64 | 2025-05-27 13:05:11 +08:00 | linzhuo | update toc for doc and dockerfile code style format (#6450)
    Co-authored-by: Chayenne <zhaochen20@outlook.com>
e235be16fe | 2025-05-26 01:04:34 +08:00 | simveit | Fix some issues with current docs. (#6588)
ed0c3035cd | 2025-05-23 21:00:37 -07:00 | Chang Su | feat(Tool Calling): Support required and specific function mode (#6550)
a6ae3af15e | 2025-05-22 14:14:49 -07:00 | ryang | Support XiaomiMiMo inference with mtp (#6059)
7513558074 | 2025-05-21 21:22:21 -07:00 | Byron Hsu | [PD] Add doc and simplify sender.send (#6019)
f0653886a5 | 2025-05-19 20:07:43 -07:00 | fzyzcjy | Expert distribution recording without overhead for EPLB (#4957)
f19a9204cd | 2025-05-16 12:26:15 -07:00 | Yury Sulsky | Support precomputed multimodal features for Qwen-VL and Gemma3 models. (#6136)
    Co-authored-by: Yury Sulsky <ysulsky@tesla.com>
2e4babdb0a | 2025-05-15 00:48:09 -07:00 | quinnrong94 | [Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109)
    Co-authored-by: Yingyi <yingyihuang2000@outlook.com>
    Co-authored-by: neiltian <neiltian@tencent.com>
    Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
    Co-authored-by: kexueyu <kexueyu@tencent.com>
    Co-authored-by: vincentmeng <vincentmeng@tencent.com>
    Co-authored-by: pengmeng <pengmeng@tencent.com>
9a91fa0ed1 | 2025-05-14 10:27:19 -07:00 | Brayden Zhong | docs: fix a bad redirect (#6300)
e8e18dcdcc | 2025-05-12 12:53:26 -07:00 | Lianmin Zheng | Revert "fix some typos" (#6244)
d738ab52f8 | 2025-05-13 01:42:38 +08:00 | applesaucethebun | fix some typos (#6209)
    Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
25c83fff6a | 2025-05-11 23:36:29 -07:00 | Cheng Wan | Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558)
    Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
01bdbf7f80 | 2025-05-11 08:36:16 -07:00 | Lianmin Zheng | Improve structured outputs: fix race condition, server crash, metrics and style (#6188)
94d42b6794 | 2025-05-11 08:22:46 -07:00 | Adarsh Shirawalmath | [Docs] minor Qwen3 and reasoning parser docs fix (#6032)