| Author | Commit | Message | Date |
|---|---|---|---|
| Lianmin Zheng | 8b84e69f25 | Fix tp token sync for dp attention (#3062) | 2025-01-22 18:51:40 -08:00 |
| Lianmin Zheng | 022614d26e | Add some flags to allow sync token ids across TP ranks (#3060) | 2025-01-22 15:05:51 -08:00 |
| Lianmin Zheng | 3d8f1c9bcf | Use int64 as indices for set_kv_buffer (#3039) | 2025-01-21 19:46:09 -08:00 |
| Hongpeng Guo | 583697cd71 | [Enhancement] Custom Logit Processor Improvement (#2998) — Signed-off-by: Hongpeng Guo <hpguo@anyscale.com> | 2025-01-20 02:00:35 -08:00 |
| Hongpeng Guo | e403d23757 | [Feature] Add sampler custom logits processor (#2396) — Signed-off-by: Hongpeng Guo <hpguo@anyscale.com> | 2025-01-19 14:46:53 -08:00 |
| Lianmin Zheng | 21ec66e59e | Minor follow-up fixes for the logprob refactor (#2670) | 2024-12-30 05:42:08 -08:00 |
| Lianmin Zheng | 9c6ba2484f | Refactor logprob computation to return the real logprob used in sampling (#2664) | 2024-12-30 04:51:38 -08:00 |
| Lianmin Zheng | 7a1aecb938 | Simplify pytorch sampling kernel and logit processor (#2491) | 2024-12-16 14:11:09 -08:00 |
| Qun Yang | 37ee906f61 | Add more support for intel Gaudi accelerators (#2357) | 2024-12-06 01:16:33 -08:00 |
| Lianmin Zheng | ba4ee37fa4 | Update sampler.py to skip the success check (#2197) | 2024-11-26 00:58:57 -08:00 |
| Lianmin Zheng | ffd20fcd03 | Make constrained decoding work for overlap scheduler (#2095) | 2024-11-19 15:04:43 -08:00 |
| Lianmin Zheng | df7fe4521a | Crash the CI jobs on model import errors (#2072) | 2024-11-17 22:18:11 -08:00 |
| Lianmin Zheng | ebaa2f3199 | Rename arguments --disable-nan-detection to --enable-nan-detection (#2066) | 2024-11-17 16:53:44 -08:00 |
| Lianmin Zheng | 8f8f96a621 | Fix the perf regression due to additional_stop_token_ids (#1773) | 2024-10-23 16:45:21 -07:00 |
| Lianmin Zheng | 05b3bf5e8e | Crash the server on warnings in CI (#1772) | 2024-10-23 16:27:13 -07:00 |
| Lianmin Zheng | ad4125d1a9 | Fuse more ops & Simplify token mapping (#1758) | 2024-10-22 23:20:43 -07:00 |
| Lianmin Zheng | f0f8a7699b | Simplify the nan detection and greedy check in sampler (#1709) | 2024-10-18 20:21:24 -07:00 |
| Lianmin Zheng | 6a5b352aaf | Use is_flashinfer_available to replace is_hip for flashinfer check (#1596) — Co-authored-by: Zhang Liangang <liangang.zhang@intel.com> | 2024-10-06 22:54:05 -07:00 |
| Ying Sheng | c98e84c21e | [Minor, Performance] Use torch.argmax for greedy sampling (#1589) | 2024-10-06 13:15:05 -07:00 |
| Lianmin Zheng | 7f24ea95c3 | Fuse top_k and top_p in the sampler (#1457) | 2024-09-18 04:35:35 -07:00 |
| HAI | 3a6e04185b | [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1420) | 2024-09-17 07:43:52 +00:00 |
| Lianmin Zheng | 2fa5cec775 | Simplify sampler and its error handling (#1441) | 2024-09-16 21:23:31 -07:00 |
| Liangsheng Yin | 70b6802982 | Optimize conflicts between CUDA graph and vocab mask tensors (#1392) | 2024-09-13 20:27:53 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Liangsheng Yin | a5a134f39f | Fix bugs in sampler with CUDA graph / torch.compile (#1306) | 2024-09-02 23:18:48 +00:00 |
| Liangsheng Yin | 47f20da223 | Fix regex mask (#1296) | 2024-09-01 21:50:58 -07:00 |
| Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00 |
| Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00 |
| Liangsheng Yin | 75ce37f401 | Move sampler into CUDA graph (#1201) — Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2024-08-26 07:02:50 -07:00 |
| Liangsheng Yin | 83e23c69b3 | Improve code style of sampler (#1168) | 2024-08-21 16:48:24 -07:00 |