Commit Graph

30 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Lianmin Zheng | 8b84e69f25 | Fix tp token sync for dp attention (#3062) | 2025-01-22 18:51:40 -08:00 |
| Lianmin Zheng | 022614d26e | Add some flags to allow sync token ids across TP ranks (#3060) | 2025-01-22 15:05:51 -08:00 |
| Lianmin Zheng | 3d8f1c9bcf | Use int64 as indices for set_kv_buffer (#3039) | 2025-01-21 19:46:09 -08:00 |
| Hongpeng Guo | 583697cd71 | [Enhancement] Custom Logit Processor Improvement (#2998) (Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>) | 2025-01-20 02:00:35 -08:00 |
| Hongpeng Guo | e403d23757 | [Feature] Add sampler custom logits processor (#2396) (Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>) | 2025-01-19 14:46:53 -08:00 |
| Lianmin Zheng | 21ec66e59e | Minor follow-up fixes for the logprob refactor (#2670) | 2024-12-30 05:42:08 -08:00 |
| Lianmin Zheng | 9c6ba2484f | Refactor logprob computation to return the real logprob used in sampling (#2664) | 2024-12-30 04:51:38 -08:00 |
| Lianmin Zheng | 7a1aecb938 | Simplify pytorch sampling kernel and logit processor (#2491) | 2024-12-16 14:11:09 -08:00 |
| Qun Yang | 37ee906f61 | Add more support for intel Gaudi accelerators (#2357) | 2024-12-06 01:16:33 -08:00 |
| Lianmin Zheng | ba4ee37fa4 | Update sampler.py to skip the success check (#2197) | 2024-11-26 00:58:57 -08:00 |
| Lianmin Zheng | ffd20fcd03 | Make constrained decoding work for overlap scheduler (#2095) | 2024-11-19 15:04:43 -08:00 |
| Lianmin Zheng | df7fe4521a | Crash the CI jobs on model import errors (#2072) | 2024-11-17 22:18:11 -08:00 |
| Lianmin Zheng | ebaa2f3199 | Rename arguments --disable-nan-detection to --enable-nan-detection (#2066) | 2024-11-17 16:53:44 -08:00 |
| Lianmin Zheng | 8f8f96a621 | Fix the perf regression due to additional_stop_token_ids (#1773) | 2024-10-23 16:45:21 -07:00 |
| Lianmin Zheng | 05b3bf5e8e | Crash the server on warnings in CI (#1772) | 2024-10-23 16:27:13 -07:00 |
| Lianmin Zheng | ad4125d1a9 | Fuse more ops & Simplify token mapping (#1758) | 2024-10-22 23:20:43 -07:00 |
| Lianmin Zheng | f0f8a7699b | Simplify the nan detection and greedy check in sampler (#1709) | 2024-10-18 20:21:24 -07:00 |
| Lianmin Zheng | 6a5b352aaf | Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) (Co-authored-by: Zhang Liangang <liangang.zhang@intel.com>) | 2024-10-06 22:54:05 -07:00 |
| Ying Sheng | c98e84c21e | [Minor, Performance] Use torch.argmax for greedy sampling (#1589) | 2024-10-06 13:15:05 -07:00 |
| Lianmin Zheng | 7f24ea95c3 | Fuse top_k and top_k in the sampler (#1457) | 2024-09-18 04:35:35 -07:00 |
| HAI | 3a6e04185b | [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1420) | 2024-09-17 07:43:52 +00:00 |
| Lianmin Zheng | 2fa5cec775 | Simplify sampler and its error handling (#1441) | 2024-09-16 21:23:31 -07:00 |
| Liangsheng Yin | 70b6802982 | Optimize conflicts between CUDA graph and vocab mask tensors (#1392) | 2024-09-13 20:27:53 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Liangsheng Yin | a5a134f39f | Fix bugs in sampler with CUDA graph / torch.compile (#1306) | 2024-09-02 23:18:48 +00:00 |
| Liangsheng Yin | 47f20da223 | Fix regex mask (#1296) | 2024-09-01 21:50:58 -07:00 |
| Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00 |
| Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00 |
| Liangsheng Yin | 75ce37f401 | Move sampler into CUDA graph (#1201) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-26 07:02:50 -07:00 |
| Liangsheng Yin | 83e23c69b3 | Improve code style of sampler (#1168) | 2024-08-21 16:48:24 -07:00 |