huangtingwei
|
cb9e0e4180
|
[HiCacheStorage] fix abort request host memory leaks (#9874)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-09-01 18:59:29 -07:00 |
|
ybyang
|
5f77e1292d
|
Support Multi Process Tokenizer Manager (#6555) (#8964)
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: huanglong <huanglong@linux.alibaba.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2025-09-01 01:00:13 -07:00 |
|
Liangsheng Yin
|
6d3c20cf5b
|
fix set_internal_state API (#9850)
|
2025-09-01 01:31:35 +08:00 |
|
Zhiqiang Xie
|
8b6966d020
|
[HiCache] Storage Refactoring (#9797)
Co-authored-by: pansicheng <27603155+pansicheng@users.noreply.github.com>
|
2025-08-31 22:58:21 +08:00 |
|
Lianmin Zheng
|
25c7395934
|
Fix input logprob index (#9841)
Co-authored-by: Sheng Shen <sheng.s@berkeley.edu>
|
2025-08-31 02:56:47 -07:00 |
|
Teng Ma
|
f05c68733e
|
[HiCache] Clear kvcache in storage backend with fastAPI (#9750)
Co-authored-by: hzh0425 <hzh0425@apache.org>
|
2025-08-31 17:41:44 +08:00 |
|
VDV1985
|
ba861293cf
|
[feat] Ascend NPU Gemma-3-12b and Gemma-3-27b support (#8909)
|
2025-08-31 00:25:07 -07:00 |
|
Liangsheng Yin
|
836873b99f
|
Fix memory leak when aborting decode request in PD-Disagg (#9817)
Co-authored-by: Lianmin Zheng <15100009+merrymercy@users.noreply.github.com>
|
2025-08-30 14:36:03 +08:00 |
|
Zhiqiang Xie
|
54e872d343
|
[HiCache] resolve conflict between chunked-prefill and hicache hit count (#9776)
|
2025-08-30 01:30:54 +08:00 |
|
wangyu
|
a38c149758
|
feat(draft_model): support draft_model for RemoteModelLoader (#6407)
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
|
2025-08-28 16:09:52 -07:00 |
|
huangtingwei
|
55349e361d
|
support mooncake store dp attention (#9684)
|
2025-08-28 12:31:31 +08:00 |
|
hzh0425
|
c04c17edfa
|
refactor(hicache): Introduce generic HiCacheStorageConfig for improved configuration management (#9555)
Co-authored-by: Teng Ma <805522925@qq.com>
|
2025-08-26 17:55:20 -07:00 |
|
Zhiqiang Xie
|
43de1d7304
|
HiCache Storage fix host memory leak (#9648)
|
2025-08-26 10:49:40 -07:00 |
|
hzh0425
|
79ce3688bb
|
BugFix(hicache): Fix host indices out of bound error (#9637)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-08-26 10:42:23 -07:00 |
|
ykwd
|
80dc76e11a
|
[Fix] HiCache Bugfix & Mooncake Error Handling Enhance (#8901)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-08-25 19:05:10 -07:00 |
|
Jonas
|
a0a77d937b
|
Fix Harmony reasoning parser and auto-separation for gpt-oss models (#9190)
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
Co-authored-by: minleminzui <2969413251@qq.com>
Co-authored-by: maocheng23 <maocheng@berkeley.edu>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2025-08-25 15:26:26 -07:00 |
|
Sundara Raman Ramachandran
|
ea0696b924
|
[Performance] Batch Send from Tokenizer Manager. (#9436)
|
2025-08-26 01:43:54 +08:00 |
|
SCDESPERTATE
|
b5c6529e17
|
[PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats (#7317)
|
2025-08-24 23:16:43 -07:00 |
|
hzh0425
|
83871aa12d
|
feat(hicache): Supports 3fs-hicache compatibility with dp-attention (#9372)
|
2025-08-23 02:08:32 -07:00 |
|
fzyzcjy
|
2600fc0d47
|
Overlapped weight offload (#8034)
|
2025-08-23 02:06:46 -07:00 |
|
fzyzcjy
|
0374304a2c
|
Add enable_flashinfer_mxfp4_bf16_moe for higher precision and slower moe backend (#9004)
|
2025-08-23 15:38:40 +08:00 |
|
Chanh Nguyen
|
127d4b0d5e
|
Support GC Freezing to improve latency & throughput (#9241)
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
|
2025-08-23 13:43:09 +08:00 |
|
huangtingwei
|
6078d5fcc0
|
[HiCacheStorage] backup optimization for MLA model (#8865)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-08-22 18:03:51 +08:00 |
|
pansicheng
|
70cf4abccc
|
3fs zerocopy (#9109)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-08-22 17:56:38 +08:00 |
|
Yongfei Xu
|
9708d353b7
|
Support MHA with chunked prefix cache for flashinfer/flashmla backend, support page size > 1 for MHA chunked prefix (#8616)
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
|
2025-08-21 18:19:44 -07:00 |
|
Xinyuan Tong
|
6c855db82c
|
Revert "bugfix: Fix output_ids extraction in detokenizer_manager" (#9467)
|
2025-08-21 17:24:25 -07:00 |
|
Liangsheng Yin
|
9b5f0f64f5
|
Fix tiny misalign with previous truncation setting in tokenizer_manager (#9430)
|
2025-08-21 14:05:35 +08:00 |
|
Liangsheng Yin
|
eb19ccadae
|
[bug] fix errors related to context length in SD (#9388)
|
2025-08-21 10:32:34 +08:00 |
|
Lifu Huang
|
d4bce29721
|
Fix incorrect logic in chat template handling. (#9336)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2025-08-20 16:25:36 -07:00 |
|
Lifu Huang
|
b0980af89f
|
Support pinning adapter via server args. (#9249)
|
2025-08-20 16:25:01 -07:00 |
|
Liangsheng Yin
|
08ebdf79d0
|
Fix the --allow-auto-truncate argument in tokenizer manager. (#9391)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-08-20 16:56:47 +08:00 |
|
datdo-msft
|
98b44e9e56
|
[PD] Propagate internal server errors from aborted requests to clients instead of blindly returning 200's (#8936)
|
2025-08-18 14:23:46 -07:00 |
|
Binyao Jiang
|
66d6be0874
|
Bug fix: use correct mm_items in embed_mm_inputs (#8893)
|
2025-08-16 19:55:56 -07:00 |
|
Shangming Cai
|
384f8ab5ce
|
[PD] Support PD disaggregation with Prefill PP (#8846)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: root <huzhiyuan@xiaohongshu.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com>
Co-authored-by: zitto <zhjc1124@gmail.com>
|
2025-08-16 18:31:31 -07:00 |
|
Brayden Zhong
|
bc938ea13f
|
Fix DP load for embedding (#9165)
|
2025-08-15 23:58:44 -07:00 |
|
Trevor Morris
|
eff4eb3fdd
|
Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667)
|
2025-08-15 22:08:11 -07:00 |
|
Cheng Wan
|
295895120d
|
[6/N] MoE Refactor: Cleanup MoE-related configs (#8849)
|
2025-08-14 21:14:53 -07:00 |
|
Chengxing Xie
|
c1c7dc4534
|
feat: Add model version tracking with API endpoints and response metadata (#8795)
|
2025-08-14 12:13:46 -07:00 |
|
pansicheng
|
733446dd36
|
fix io group (#9154)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-08-14 12:46:42 +08:00 |
|
Cheng Wan
|
b87aacb5c5
|
[DP Attention] Refactor: adding some utility functions (#9136)
|
2025-08-13 21:08:06 -07:00 |
|
Sundara Raman Ramachandran
|
a027a9b4b3
|
[Generative Score API] Optimization to Remove Decode. (#8840)
|
2025-08-14 05:12:24 +08:00 |
|
Lianmin Zheng
|
9e426466af
|
Clean up allocators (#9134)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-08-13 13:56:04 -07:00 |
|
huangtingwei
|
0edda32001
|
Support page first layout zero copy for mooncake store (#8651)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2025-08-12 15:59:26 -07:00 |
|
ronnie_zheng
|
48afa8f14f
|
[feat] Enable Ascend profiling on SGLang (#8610)
Co-authored-by: liyou_b <2953090824@qq.com>
|
2025-08-12 13:28:31 -07:00 |
|
Lifu Huang
|
5ded39cab2
|
Fix race condition in async lora unload (#9084)
|
2025-08-11 22:59:29 -07:00 |
|
Zhiqiang Xie
|
9f78f391ae
|
HiCache Storage: generate hash when inserting new nodes (#9053)
|
2025-08-11 14:18:59 -07:00 |
|
Liangsheng Yin
|
f9afa7dceb
|
Fix docs for clip max new tokens (#9082)
|
2025-08-11 13:15:21 -07:00 |
|
Baizhou Zhang
|
75e6a7cde1
|
Support radix cache for Lora feature (#7216)
|
2025-08-11 10:14:11 -07:00 |
|
Chang Su
|
a6452b7188
|
bugfix: Fix output_ids extraction in detokenizer_manager (#9047)
|
2025-08-11 03:17:32 -07:00 |
|
Lianmin Zheng
|
4ea9d74a3e
|
Simplify health check (#9034)
|
2025-08-10 17:35:05 -07:00 |
|