Commit Graph

181 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Vincent Zhong | 45a31a82e4 | docs: Update documentation to reflect xgrammar as default grammar backend (#6601)<br>Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> | 2025-05-27 13:29:13 +08:00 |
| linzhuo | 7a0bbe6a64 | update toc for doc and dockerfile code style format (#6450)<br>Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-05-27 13:05:11 +08:00 |
| simveit | e235be16fe | Fix some issues with current docs. (#6588) | 2025-05-26 01:04:34 +08:00 |
| Chang Su | ed0c3035cd | feat(Tool Calling): Support required and specific function mode (#6550) | 2025-05-23 21:00:37 -07:00 |
| ryang | a6ae3af15e | Support XiaomiMiMo inference with mtp (#6059) | 2025-05-22 14:14:49 -07:00 |
| Byron Hsu | 7513558074 | [PD] Add doc and simplify sender.send (#6019) | 2025-05-21 21:22:21 -07:00 |
| fzyzcjy | f0653886a5 | Expert distribution recording without overhead for EPLB (#4957) | 2025-05-19 20:07:43 -07:00 |
| Yury Sulsky | f19a9204cd | Support precomputed multimodal features for Qwen-VL and Gemma3 models. (#6136)<br>Co-authored-by: Yury Sulsky <ysulsky@tesla.com> | 2025-05-16 12:26:15 -07:00 |
| quinnrong94 | 2e4babdb0a | [Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109)<br>Co-authored-by: Yingyi <yingyihuang2000@outlook.com><br>Co-authored-by: neiltian <neiltian@tencent.com><br>Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com><br>Co-authored-by: kexueyu <kexueyu@tencent.com><br>Co-authored-by: vincentmeng <vincentmeng@tencent.com><br>Co-authored-by: pengmeng <pengmeng@tencent.com> | 2025-05-15 00:48:09 -07:00 |
| Brayden Zhong | 9a91fa0ed1 | docs: fix a bad redirect (#6300) | 2025-05-14 10:27:19 -07:00 |
| Lianmin Zheng | e8e18dcdcc | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00 |
| applesaucethebun | d738ab52f8 | fix some typos (#6209)<br>Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-13 01:42:38 +08:00 |
| Cheng Wan | 25c83fff6a | Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558)<br>Co-authored-by: liusy58 <liusy58@linux.alibaba.com> | 2025-05-11 23:36:29 -07:00 |
| Lianmin Zheng | 01bdbf7f80 | Improve structured outputs: fix race condition, server crash, metrics and style (#6188) | 2025-05-11 08:36:16 -07:00 |
| Adarsh Shirawalmath | 94d42b6794 | [Docs] minor Qwen3 and reasoning parser docs fix (#6032) | 2025-05-11 08:22:46 -07:00 |
| mlmz | 69276f619a | doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (#6199) | 2025-05-11 08:22:11 -07:00 |
| applesaucethebun | 2ce8793519 | Add typo checker in pre-commit (#6179)<br>Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-11 12:55:00 +08:00 |
| Yineng Zhang | 66fc63d6b1 | Revert "feat: add thinking_budget (#6089)" (#6181) | 2025-05-10 16:07:45 -07:00 |
| Ximingwang-09 | 921e4a8185 | [Docs]Delete duplicate content (#6146)<br>Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> | 2025-05-10 15:02:15 -07:00 |
| XinyuanTong | 9d8ec2e67e | Fix and Clean up chat-template requirement for VLM (#6114)<br>Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-05-11 00:14:09 +08:00 |
| thyecust | 63484f9fd6 | feat: add thinking_budget (#6089) | 2025-05-09 08:22:09 -07:00 |
| Zhu Chen | fa7d7fd9e5 | [Feature] Add FlashAttention3 as a backend for VisionAttention (#5764)<br>Co-authored-by: othame <chenzhu_912@zju.edu.cn><br>Co-authored-by: Mick <mickjagger19@icloud.com><br>Co-authored-by: Yi Zhang <1109276519@qq.com> | 2025-05-08 10:01:19 -07:00 |
| Baizhou Zhang | 8f508cc77f | Update doc for MLA attention backends (#6034) | 2025-05-07 18:51:05 -07:00 |
| Baizhou Zhang | fee37d9e8d | [Doc]Fix description for dp_size argument (#6063) | 2025-05-08 00:04:22 +08:00 |
| mlmz | a68ed76682 | feat: append more comprehensive fields in messages instead of merely role and content (#5996) | 2025-05-05 11:43:34 -07:00 |
| Wenxuan Tan | 22da3d978f | Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (#5555) | 2025-05-05 10:32:17 -07:00 |
| vzed | 95c231e50d | Tool Call: Add chat_template_kwargs documentation (#5679) | 2025-05-04 13:12:40 -07:00 |
| Chayenne | 73dcf2b326 | Remove token in token out in Native API (#5967) | 2025-05-01 21:59:43 -07:00 |
| Chang Su | 2b06484bd1 | feat: support pythonic tool call and index in tool call streaming (#5725) | 2025-04-29 17:30:44 -07:00 |
| simveit | ae523675e5 | [Doc] Tables instead of bulletpoints for sampling doc (#5841) | 2025-04-29 13:49:39 -07:00 |
| Qiaolin Yu | 8c0cfca87d | Feat: support cuda graph for LoRA (#4115)<br>Co-authored-by: Beichen Ma <mabeichen12@gmail.com> | 2025-04-28 23:30:44 -07:00 |
| Lianmin Zheng | 849c83a0c0 | [CI] test chunked prefill more (#5798) | 2025-04-28 10:57:17 -07:00 |
| Baizhou Zhang | f48b007c1d | [Doc] Recover history of server_arguments.md (#5851) | 2025-04-28 10:48:21 -07:00 |
| Michael Yao | 966eb90865 | [Docs] Replace lists with tables for cleanup and readability in server_arguments (#5276) | 2025-04-28 00:36:10 -07:00 |
| Trevor Morris | 84810da4ae | Add Cutlass MLA attention backend (#5390) | 2025-04-27 20:58:53 -07:00 |
| Lianmin Zheng | 155890e4d1 | [Minor] fix documentations (#5756) | 2025-04-26 17:48:43 -07:00 |
| Michael Yao | b5be56944b | [Doc] Fix a link to Weilin Zhao (#5706)<br>Signed-off-by: windsonsea <haifeng.yao@daocloud.io> | 2025-04-25 02:02:27 +08:00 |
| Michael Yao | 92bb64bc86 | [Doc] Fix a 404 link to llama-405b (#5615)<br>Signed-off-by: windsonsea <haifeng.yao@daocloud.io> | 2025-04-21 20:39:37 -07:00 |
| simveit | 8de53da989 | smaller and non gated models for docs (#5378) | 2025-04-20 17:38:25 -07:00 |
| Adarsh Shirawalmath | 8b39274e34 | [Feature] Prefill assistant response - add continue_final_message parameter (#4226)<br>Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-04-20 17:37:18 -07:00 |
| Baizhou Zhang | 072b4d0398 | Add document for LoRA serving (#5521) | 2025-04-20 14:37:57 -07:00 |
| Yineng Zhang | a6f892e5d0 | Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (#5544) | 2025-04-18 16:50:21 -07:00 |
| Wenxuan Tan | bfa3922451 | Avoid computing lse in Ragged Prefill when there's no prefix. (#5476)<br>Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2025-04-18 01:13:57 -07:00 |
| Michael Yao | a0fc5bc144 | [docs] Fix several consistency issues in sampling_params.md (#5373)<br>Signed-off-by: windsonsea <haifeng.yao@daocloud.io><br>Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2025-04-18 10:54:40 +08:00 |
| mlmz | f13d65a7ea | Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the unstability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (#5503) | 2025-04-17 11:37:43 -07:00 |
| Baizhou Zhang | 6fb29ffd9e | Deprecate enable-flashinfer-mla and enable-flashmla (#5480) | 2025-04-17 01:43:33 -07:00 |
| Baizhou Zhang | 4fb05583ef | Deprecate disable-mla (#5481) | 2025-04-17 01:43:14 -07:00 |
| Didier Durand | 92d1561b70 | Update attention_backend.md: plural form (#5489) | 2025-04-17 01:42:40 -07:00 |
| Baizhou Zhang | a42736bbb8 | Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) | 2025-04-15 22:01:22 -07:00 |
| mRSun15 | 3efc8e2d2a | add attention backend supporting matrix in the doc (#5211)<br>Co-authored-by: Stefan He <hebiaobuaa@gmail.com> | 2025-04-15 17:16:34 -07:00 |