Commit Graph

211 Commits

Author SHA1 Message Date
Tony Lu
1e18a341e9 [Bugfix] fix pd chat completion protocol for batching support (#10016)
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
2025-09-04 01:43:16 -07:00
Liangsheng Yin
5dfcd6c207 add proctitle for tokenizers (#9952)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-03 13:31:38 +08:00
Lianmin Zheng
60e37f8028 Move parsers under a single folder (#9912) 2025-09-02 18:25:04 -07:00
JieXin Liang
1db649ac02 [feat] apply deep_gemm compile_mode to skip launch (#9879) 2025-09-02 03:20:30 -07:00
Yineng Zhang
349b491c63 chore: upgrade flashinfer 0.3.0 (#9864) 2025-09-01 03:07:19 -07:00
ybyang
5f77e1292d Support Multi Process Tokenizer Manager (#6555) (#8964)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-09-01 01:00:13 -07:00
Teng Ma
f05c68733e [HiCache] Clear kvcache in storage backend with fastAPI (#9750)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2025-08-31 17:41:44 +08:00
Yineng Zhang
9970e3bf32 chore: upgrade sgl-kernel 0.3.7.post1 with deepgemm fix (#9822) 2025-08-30 04:02:25 -07:00
Yineng Zhang
3d8fc43400 chore: upgrade flashinfer 0.3.0rc1 (#9793) 2025-08-29 16:24:17 -07:00
gongwei-130
3fd1431df2 support enable in the reasoning field to enable thinking for thinkin… (#9715) 2025-08-29 10:57:32 -07:00
gongwei-130
9a7c8842ba accommodate JSON schema in the "schema" field, not the "json_schema" field, of response_format (#9786) 2025-08-28 23:51:50 -07:00
Yineng Zhang
b962a296ed chore: upgrade sgl-kernel 0.3.7 (#9708) 2025-08-27 14:00:31 -07:00
Xinyuan Tong
68a54e063e Sets default model name in request classes (#9683)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-27 10:43:03 -07:00
cicirori
b6c14ec0b4 add response_format support for completion API (#9665) 2025-08-26 15:01:29 -07:00
Xiaotong Jiang
0936c766ed Fix kimi k2 function calling format (#9606) 2025-08-26 00:50:59 -07:00
GavinZhu-GMI
0ef583b7de fix: allow user to specify function as role (#9635) 2025-08-26 00:47:20 -07:00
Jonas
a0a77d937b Fix Harmony reasoning parser and auto-separation for gpt-oss models (#9190)
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: minleminzui <2969413251@qq.com>
Co-authored-by: maocheng23 <maocheng@berkeley.edu>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-25 15:26:26 -07:00
Binyao Jiang
3affa9dcc3 Fix GLM45 tool call multi-turn bug (#9500) 2025-08-25 13:46:13 -07:00
Yineng Zhang
938e986e15 chore: upgrade flashinfer 0.2.14.post1 (#9578) 2025-08-25 00:12:17 -07:00
Yuhao Zhou
17d5eda887 bugfix for undefined logging functions in HarmonyBrowserTool & HarmonyPythonTool (#9229) 2025-08-25 00:10:35 -07:00
fzyzcjy
2600fc0d47 Overlapped weight offload (#8034) 2025-08-23 02:06:46 -07:00
Chanh Nguyen
127d4b0d5e Support GC Freezing to improve latency & throughput (#9241)
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2025-08-23 13:43:09 +08:00
Xinyuan Tong
6c855db82c Revert "bugfix: Fix output_ids extraction in detokenizer_manager" (#9467) 2025-08-21 17:24:25 -07:00
Xinyuan Tong
e8449ab515 Add deepseek v3.1 thinking parser support and update docs (#9464)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-21 15:09:40 -07:00
gongwei-130
10d34f74e2 fix: should return an invalid-request response when schema is missing (#9461) 2025-08-21 14:06:50 -07:00
gongwei-130
9ba7253094 accommodate reasoning_effort set in chat_template_kwargs (#9458) 2025-08-21 13:22:03 -07:00
hlu1
dae9a80f43 [fix] Fix mxfp4 weight loading bug with TP sharding in GPT-OSS (#9433)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-21 03:50:51 -07:00
fzyzcjy
42c8704560 Add PDL support for quant kernel and rope kernel (#9106) 2025-08-20 01:56:29 -07:00
Keyang Ru
f515449582 Fix gpt-oss response api streaming issue (#9368) 2025-08-19 20:19:42 -07:00
江家瑋
ca533580f2 [Docs] Correct and clarify notes in Engine docstring (#9313)
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
2025-08-18 13:24:19 -07:00
gongwei-130
0cf3fbeb18 should return invalid request for empty prompt (#9315) 2025-08-18 11:44:11 -07:00
Chengxing Xie
c1c7dc4534 feat: Add model version tracking with API endpoints and response metadata (#8795) 2025-08-14 12:13:46 -07:00
Hongbo Xu
2cc9eeab01 [4/n]decouple quantization implementation from vLLM dependency (#9191)
Co-authored-by: AniZpZ <aniz1905@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-08-14 12:05:46 -07:00
eigen
4dbf43601d fix: zero_init buffer (#9065)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-08-14 02:39:09 -07:00
Jiaqi Gu
c9ee738515 Fuse writing KV buffer into rope kernel (part 2: srt) (#9014)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-08-12 13:15:30 -07:00
Chang Su
f2a5de284b [Bugfix] Fix accuracy-test-1-gpu failure caused by builtin_tools (#9114) 2025-08-12 09:56:13 -07:00
Chang Su
a218490136 (gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support (#9043) 2025-08-11 18:59:18 -07:00
Chang Su
a6452b7188 bugfix: Fix output_ids extraction in detokenizer_manager (#9047) 2025-08-11 03:17:32 -07:00
zhyncs
f4ae50e97c fix: use flashinfer v0.2.11.post1 2025-08-11 02:49:25 -07:00
Yineng Zhang
84cb449eec Revert "chore: upgrade flashinfer 0.2.11 (#9036)" (#9057) 2025-08-11 00:16:39 -07:00
Yineng Zhang
dd001a5477 chore: upgrade flashinfer 0.2.11 (#9036) 2025-08-10 17:35:37 -07:00
Lianmin Zheng
4ea9d74a3e Simplify health check (#9034) 2025-08-10 17:35:05 -07:00
Stefan He
8ecf6b9d24 Support Flatten Tensor Update Weights to speed up MOE Update Weights by 20% (#8079) 2025-08-10 16:08:59 -07:00
Lianmin Zheng
9a44b643c6 Fix CI (#9012) 2025-08-09 13:33:42 -07:00
Yineng Zhang
326a901df4 chore: upgrade sgl-kernel 0.3.3 (#8998) 2025-08-09 01:22:01 -07:00
Lianmin Zheng
706bd69cc5 Clean up server_args.py to have a dedicated function for model specific adjustments (#8983) 2025-08-08 19:56:50 -07:00
ishandhanani
4e7f025219 chore(gb200): update to CUDA 12.9 and improve build process (#8772) 2025-08-08 13:42:47 -07:00
Zilin Zhu
dd650e0e21 [RL] fix skip_server_warmup and rl health_generate logic (#8757) 2025-08-08 04:34:38 -07:00
Lianmin Zheng
a947154286 Revert "Support Multi Process Tokenizer Manager" (#8960) 2025-08-08 02:28:27 -07:00
ybyang
7490e3f67d Support Multi Process Tokenizer Manager (#6555)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: lw9527 <952799980@qq.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
2025-08-08 01:45:50 -07:00