Commit Graph

267 Commits

Author SHA1 Message Date
Lianmin Zheng
f68dd998b9 Rename customer label -> custom label (#10899)
Co-authored-by: Yingchun Lai <laiyingchun@apache.org>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-25 16:19:53 -07:00
Xinyuan Tong
71f24ef8f6 feat: add cache_salt support to request (#10718)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-09-23 23:30:25 -07:00
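The cache_salt entry above adds per-request isolation for prefix caching. A minimal sketch of the underlying idea, assuming a hash-based cache key (the `prefix_cache_key` helper is illustrative, not SGLang's actual implementation):

```python
import hashlib

def prefix_cache_key(prompt: str, cache_salt: str = "") -> str:
    """Derive a prefix-cache key; requests with different salts never
    share cached entries, even for byte-identical prompts."""
    h = hashlib.sha256()
    h.update(cache_salt.encode("utf-8"))
    h.update(b"\x00")  # separator keeps salt/prompt boundary unambiguous
    h.update(prompt.encode("utf-8"))
    return h.hexdigest()

# Identical prompts under different salts map to distinct cache entries,
# so tenants cannot probe each other's cached prefixes.
shared = prefix_cache_key("Hello, world")
tenant_a = prefix_cache_key("Hello, world", cache_salt="tenant-a")
tenant_b = prefix_cache_key("Hello, world", cache_salt="tenant-b")
assert len({shared, tenant_a, tenant_b}) == 3
```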
Lianmin Zheng
38c00ed7a1 Fix multimodal registry and code sync scripts (#10759)
Co-authored-by: cctry <shiyang@x.ai>
2025-09-22 15:36:01 -07:00
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-18 16:42:41 -07:00
harrisonlimh
14fdd52740 feat: add priority based scheduling with priority based request acceptance and preemption (#8746)
2025-09-16 17:10:10 -07:00
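Priority-based acceptance as in the entry above can be sketched with a standard heap-backed queue; this toy `PriorityScheduler` (a hypothetical name, with preemption omitted) shows the common tie-breaking pattern that keeps ordering FIFO within a priority level:

```python
import heapq
import itertools

class PriorityScheduler:
    """Toy priority queue: lower number = higher priority; a
    monotonically increasing counter breaks ties FIFO."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, request, priority: int = 0) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

sched = PriorityScheduler()
sched.submit("background-batch", priority=10)
sched.submit("interactive-chat", priority=0)
assert sched.next_request() == "interactive-chat"
```

The counter matters: without it, two requests at the same priority would be compared by payload, which may not be orderable and loses arrival order.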
Yingchun Lai
fc2c3a3d8e metrics: support customer labels specified in request header (#10143)
2025-09-14 20:00:08 -07:00
Liangsheng Yin
305c9e8c2d [4/N]DP refactor: support watching mode get_load and shortest queue strategy (#10201)
2025-09-15 10:06:08 +08:00
Feng Su
4c21b09074 [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 (#9962)
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
2025-09-15 02:08:02 +08:00
艾力可
165abeebca Typo: in --enable-custom-logit-processor: agree with cli arg (#10076)
2025-09-14 02:27:09 -07:00
Sundara Raman Ramachandran
a360511d7b [Generative Score API] Scoring(Prefill-only) optimizations. (#9748)
2025-09-14 01:57:06 +08:00
Sundara Raman Ramachandran
94d0f656fb [Performance] Dynamic Batch Tokenizer (#9382)
2025-09-14 01:56:04 +08:00
Liangsheng Yin
78f139812a [1/N] DP-Refactor: move communicators into tokenizer_communicator_mixin (#10028)
2025-09-08 16:27:37 +08:00
Liangsheng Yin
e719bb0e84 [1/2] Refactor multi-tokenizer manager (#10074)
2025-09-07 19:13:34 +08:00
Jimmy
f40038fb09 [Vulnerability]feat(conn): set bootstrap server host (#9931)
2025-09-05 17:36:17 +08:00
Huang Long
f98366604b fix MultiTokenizerWrapper name (#10049)
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
2025-09-05 13:39:46 +08:00
Yingchun Lai
b32ab0705e metrics: support customer buckets for prompt/generation_tokens_histogram (#9634)
2025-09-04 22:22:08 +08:00
ybyang
5f77e1292d Support Multi Process Tokenizer Manager(#6555) (#8964)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-09-01 01:00:13 -07:00
Liangsheng Yin
6d3c20cf5b fix set_interal_state API (#9850)
2025-09-01 01:31:35 +08:00
Teng Ma
f05c68733e [HiCache] Clear kvcache in storage backend with fastAPI (#9750)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2025-08-31 17:41:44 +08:00
Sundara Raman Ramachandran
ea0696b924 [Performance] Batch Send from Tokenizer Manager. (#9436)
2025-08-26 01:43:54 +08:00
Chanh Nguyen
127d4b0d5e Support GC Freezing to improve latency & throughput (#9241)
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2025-08-23 13:43:09 +08:00
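GC freezing, as in the entry above, typically relies on CPython's `gc.freeze()`: objects that survive warm-up are moved into a permanent generation that later collections skip, shrinking pause times on the serving hot path. A minimal sketch of the pattern (not SGLang's actual code):

```python
import gc

# Warm-up phase: allocate long-lived state (model weights, caches, ...).
long_lived = [{"layer": i} for i in range(1000)]

gc.collect()   # collect warm-up garbage first, so only live objects remain
gc.freeze()    # move every surviving object into the permanent generation

# Frozen objects are no longer traversed by subsequent collections,
# so GC pauses while serving requests get shorter.
assert gc.get_freeze_count() > 0
```

`gc.unfreeze()` reverses this if the process later needs to reclaim that memory.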
Liangsheng Yin
9b5f0f64f5 Fix tiny misalign with previous truncation setting in tokenizer_manager (#9430)
2025-08-21 14:05:35 +08:00
Liangsheng Yin
eb19ccadae [bug] fix errors related to context length in SD (#9388)
2025-08-21 10:32:34 +08:00
Lifu Huang
b0980af89f Support pinning adapter via server args. (#9249)
2025-08-20 16:25:01 -07:00
Liangsheng Yin
08ebdf79d0 Fix the --allow-auto-truncate argument in tokenizer manager. (#9391)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-20 16:56:47 +08:00
datdo-msft
98b44e9e56 [PD] Propagate internal server errors from aborted requests to clients instead of blindly returning 200's (#8936)
2025-08-18 14:23:46 -07:00
Chengxing Xie
c1c7dc4534 feat: Add model version tracking with API endpoints and response metadata (#8795)
2025-08-14 12:13:46 -07:00
Sundara Raman Ramachandran
a027a9b4b3 [Generative Score API] Optimization to Remove Decode. (#8840)
2025-08-14 05:12:24 +08:00
Lifu Huang
5ded39cab2 Fix race condition in async lora unload (#9084)
2025-08-11 22:59:29 -07:00
Lianmin Zheng
4ea9d74a3e Simplify health check (#9034)
2025-08-10 17:35:05 -07:00
Lianmin Zheng
a947154286 Revert "Support Multi Process Tokenizer Manager" (#8960)
2025-08-08 02:28:27 -07:00
ybyang
7490e3f67d Support Multi Process Tokenizer Manager (#6555)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: lw9527 <952799980@qq.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
2025-08-08 01:45:50 -07:00
Lifu Huang
6210e2c4f0 Support GPU pinning for LoRA (#8697)
2025-08-06 19:39:45 -07:00
Chang Su
92cc32d9fc Support v1/responses and use harmony in serving_chat (#8837)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-06 16:20:34 -07:00
Baizhou Zhang
f2d68ded6d Rename lora_path to lora_id in batches (#8437)
2025-08-03 21:08:28 -07:00
ybyang
6f9baf1002 [Improvements] Merge health check route (#8444)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Kan Wu <wukanustc@gmail.com>
2025-08-03 01:59:06 -07:00
Lifu Huang
8675bdf246 Support limiting max loaded loras in CPU. (#8650)
2025-08-03 00:02:23 -07:00
Wenchen Lo
ea93079b30 model: adapt mllama4 to VisionAttention (#8512)
Co-authored-by: root <mickjagger19@icloud.com>
2025-08-02 00:39:40 -07:00
Xinyuan Tong
7e831efee8 Fix chat template handling for OpenAI serving (#8635)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-07-31 21:49:45 -07:00
Lianmin Zheng
a4c3b121d8 Split the scheduler into multiple mixin classes to reduce the file size (#8483)
2025-07-29 12:46:50 -07:00
fzyzcjy
0ce84c822b Support colocating requests (#7973)
2025-07-28 22:51:49 -07:00
harrisonlimh
747dd45077 feat: throttle requests at scheduler based on --max_queued_requests (#7565)
2025-07-28 22:32:33 +08:00
Lifu Huang
df90645525 Support overlapped lora updates (#8213)
2025-07-27 13:00:44 -07:00
Mick
3212c2ad3f vlm: optimize tensor transport (#6003)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-07-26 17:41:01 +08:00
Lifu Huang
8abd3e77fe Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (#8261)
2025-07-23 00:32:16 -07:00
Lianmin Zheng
55381a46ac Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (#8181)
2025-07-19 22:41:30 -07:00
ybyang
4540a4666a [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (#8115)
Signed-off-by: ybyang <ybyang7@iflytek.com>
2025-07-19 18:10:00 -07:00
Lifu Huang
4e3defe5a7 Support start up LoRA server without initial adapters (#8019)
2025-07-19 15:38:09 -07:00
Yingchun Lai
610381b75e [health_generate] fix: fix the /health_generate always success bug (#8028)
2025-07-18 22:08:46 -07:00
ehuaa
0c55cbcfc5 [BugFix] add verify logit_bias to avoid crash because of IndexError (#7749)
2025-07-14 02:44:12 +08:00
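The last entry guards the sampler against out-of-range logit_bias token ids, which would otherwise raise IndexError when indexing the logits tensor. A hypothetical validator in the same spirit (`validate_logit_bias` and its signature are illustrative, not SGLang's actual function):

```python
def validate_logit_bias(logit_bias: dict, vocab_size: int) -> dict:
    """Reject token ids outside [0, vocab_size) before they reach the
    sampler, where an out-of-range index would crash the request."""
    validated = {}
    for token, bias in logit_bias.items():
        token_id = int(token)  # OpenAI-style requests send keys as strings
        if not 0 <= token_id < vocab_size:
            raise ValueError(
                f"logit_bias token id {token_id} is outside [0, {vocab_size})"
            )
        validated[token_id] = float(bias)
    return validated

# Valid ids pass through, normalized to int keys and float biases.
assert validate_logit_bias({"42": 5}, vocab_size=32000) == {42: 5.0}
```

Raising a structured error turns a server-side crash into a 400-style rejection the client can act on.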