tarinkk | eb6c2c1663 | Hybrid kv cache for LLaMA4 (#6563); Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>; Co-authored-by: tarinkk <rt572@physics.rutger.edu>; Co-authored-by: tarinkk <rt572@rutgers.physics.edu>; Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> | 2025-06-27 18:58:55 -07:00
Yineng Zhang | 69183f8808 | chore: bump v0.4.8.post1 (#7559) | 2025-06-26 02:21:12 -07:00
Shangming Cai | 5c2142579a | [PD] Raise error for incompatible mooncake version and some minor fixes (#7527); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-06-25 18:55:24 -07:00
Yineng Zhang | 7c3a12c000 | chore: bump v0.4.8 (#7493) | 2025-06-23 23:14:22 -07:00
Lianmin Zheng | 30ceccc74a | Update hyperparameter_tuning.md (#7454) | 2025-06-22 22:42:55 -07:00
Chang Su | 72676cd6c0 | feat(oai refactor): Replace openai_api with entrypoints/openai (#7351); Co-authored-by: Jin Pan <jpan236@wisc.edu> | 2025-06-21 13:21:06 -07:00
Jinn | ab74f8f09d | Remove batches api in docs & example (#7400) | 2025-06-20 19:46:31 -07:00
woodx | 97011abc8a | [Doc] add embedding rerank doc (#7364) | 2025-06-19 21:53:54 -07:00
Yineng Zhang | fadf18fdd5 | docs: update installation (#7366) | 2025-06-19 12:00:19 -07:00
linzhuo | 1de4db9bef | update invalid link in doc (#7297) | 2025-06-18 01:37:36 -07:00
Yijie Zhu | a39d928782 | support qwen2 running on ascend npu device (#7022); Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com> | 2025-06-17 11:24:10 -07:00
Yineng Zhang | f9dc9dd28b | chore: bump v0.4.7.post1 (#7248) | 2025-06-16 15:20:29 -07:00
Lianmin Zheng | 21615cc3fe | Minor style and doc fix (#7228) | 2025-06-16 01:03:13 -07:00
Lifu Huang | 98538822d5 | Add Phi-4-mm to supported VLM supported model list. (#7178) | 2025-06-13 23:17:40 -07:00
Povilas Kanapickas | bd7cfbd2f8 | [Fix] Reduce busy polling when scheduler is idle (#6026) | 2025-06-12 14:58:22 -07:00
Lianmin Zheng | dbdf76ca98 | Clean up docs for server args and sampling parameters (generated by grok) (#7076) | 2025-06-10 19:55:42 -07:00
Ximingwang-09 | f2a75a66c4 | update doc (#7046); Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> | 2025-06-11 10:02:01 +08:00
Lianmin Zheng | 0f218731e3 | Do not run frontend_reasoning.ipynb to reduce the CI load (#7073) | 2025-06-10 17:15:31 -07:00
Yudi Xue | 14c18d25df | Frontend language separate reasoning support (#6031) | 2025-06-10 17:11:29 -07:00
Lianmin Zheng | 90bd3e32d6 | Improve perf tuning docs (#7071) | 2025-06-10 16:55:04 -07:00
kyle-pena-kuzco | b56de8f943 | Open AI API hidden states (#6716) | 2025-06-10 14:37:29 -07:00
Lianmin Zheng | bb185b0e92 | Update README.md (#7040) | 2025-06-10 01:59:14 -07:00
Yineng Zhang | 4f723edd3b | chore: bump v0.4.7 (#7038) | 2025-06-10 01:56:20 -07:00
Yueyang Pan | 98c00a2df1 | Fix torch profiler bugs for bench_offline_throughput.py (#6557) | 2025-06-09 20:33:41 +08:00
HAI | b819381fec | AITER backend extension and workload optimizations (#6838); Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: Hubert Lu <Hubert.Lu@amd.com> | 2025-06-05 23:00:18 -07:00
shangmingc | dd1012fcbe | [PD] Fix potential perf spike caused by tracker gc and optimize doc (#6764); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-06-05 10:56:02 -07:00
zyksir | 8e3797be1c | support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277) | 2025-06-04 22:11:24 -07:00
Xinyuan Tong | cf9815ba69 | [Refactor] Multimodal data processing for VLM (#6659); Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-06-04 11:22:33 -07:00
Marc Sun | 37f1547587 | [FEAT] Add transformers backend support (#5929) | 2025-06-03 21:05:29 -07:00
Lianmin Zheng | 2d72fc47cf | Improve profiler and integrate profiler in bench_one_batch_server (#6787) | 2025-05-31 15:53:55 -07:00
shangmingc | 6cb00c6398 | [PD] Optimize time out logic and add env var doc for mooncake (#6761); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-05-30 00:45:02 -07:00
Baizhou Zhang | 791b3bfabb | [Feature] Support Flashinfer fp8 blockwise GEMM kernel on Blackwell (#6479) | 2025-05-28 16:03:43 -07:00
Trevor Morris | e806f708c9 | [PD] Make bootstrap code common between NIXL and Mooncake (#6473) | 2025-05-27 12:47:38 -07:00
Vincent Zhong | 45a31a82e4 | docs: Update documentation to reflect xgrammar as default grammar backend (#6601); Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> | 2025-05-27 13:29:13 +08:00
Brayden Zhong | 1aa0fbf416 | Add note to add supported model to documentation (#6640) | 2025-05-27 13:18:46 +08:00
linzhuo | 7a0bbe6a64 | update toc for doc and dockerfile code style format (#6450); Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-05-27 13:05:11 +08:00
fzyzcjy | 25be63d0b2 | Auto handle PD disaggregation in bench_serving (#6587); Co-authored-by: yizhang2077 <1109276519@qq.com> | 2025-05-25 22:41:27 -07:00
simveit | e235be16fe | Fix some issues with current docs. (#6588) | 2025-05-26 01:04:34 +08:00
Yineng Zhang | 7e257cd666 | chore: bump v0.4.6.post5 (#6566) | 2025-05-24 00:48:05 -07:00
Chang Su | ed0c3035cd | feat(Tool Calling): Support required and specific function mode (#6550) | 2025-05-23 21:00:37 -07:00
ryang | a6ae3af15e | Support XiaomiMiMo inference with mtp (#6059) | 2025-05-22 14:14:49 -07:00
Byron Hsu | 7513558074 | [PD] Add doc and simplify sender.send (#6019) | 2025-05-21 21:22:21 -07:00
Wenxuan Tan | 66324895c6 | [docs] Fix torch version (#6472) | 2025-05-20 10:53:14 -07:00
fzyzcjy | f0653886a5 | Expert distribution recording without overhead for EPLB (#4957) | 2025-05-19 20:07:43 -07:00
simveit | 506e5de8fe | Improve supported models doc (#6430) | 2025-05-20 01:43:35 +08:00
applesaucethebun | 6dc6b30637 | Add missing model to doc (#6396); Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-18 12:57:58 -07:00
Vincent Zhong | e9ef39d2e9 | docs: Update the MD files (#6373); Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-17 09:23:16 -07:00
Kiv Chen | 64825b8395 | model(vlm): mistral 3.1 (#5099); Co-authored-by: KivenChen <sleigh-queue-0y@icloud.com> | 2025-05-16 18:36:18 -07:00
Yury Sulsky | f19a9204cd | Support precomputed multimodal features for Qwen-VL and Gemma3 models. (#6136); Co-authored-by: Yury Sulsky <ysulsky@tesla.com> | 2025-05-16 12:26:15 -07:00
quinnrong94 | 2e4babdb0a | [Feat] Support FlashMLA backend with MTP and FP8 KV cache (#6109); Co-authored-by: Yingyi <yingyihuang2000@outlook.com>; Co-authored-by: neiltian <neiltian@tencent.com>; Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>; Co-authored-by: kexueyu <kexueyu@tencent.com>; Co-authored-by: vincentmeng <vincentmeng@tencent.com>; Co-authored-by: pengmeng <pengmeng@tencent.com> | 2025-05-15 00:48:09 -07:00