| Author | Commit | Message | Date |
|---|---|---|---|
| Lianmin Zheng | bdf946bf81 | Support loading pre-sharded moe weights (#2716) | 2025-01-02 15:07:37 -08:00 |
| yukavio | 8c8779cd05 | [Fix] fix retract error in eagle speculative decoding (#2711) (Co-authored-by: kavioyu) | 2025-01-02 10:28:39 -08:00 |
| Mick | 1775b963db | [Fix] fix incorrectly overwriting the port specified in ServerArgs (#2714) | 2025-01-02 10:28:22 -08:00 |
| Yineng Zhang | ba5112ff69 | feat: support moe_align_block_size_triton (#2712) (Co-authored-by: WANDY666) | 2025-01-02 21:47:44 +08:00 |
| yukavio | 815dce0554 | Eagle speculative decoding part 4: Add EAGLE2 worker (#2150) (Co-authored-by: kavioyu, Lianmin Zheng) | 2025-01-02 03:22:34 -08:00 |
| Lianmin Zheng | ad20b7957e | Eagle speculative decoding part 3: small modifications to the general scheduler (#2709) (Co-authored-by: kavioyu) | 2025-01-02 02:09:08 -08:00 |
| fzyzcjy | 9183c23eca | Speed up update_weights_from_tensor (#2695) | 2025-01-02 02:05:19 -08:00 |
| kk | 148254d4db | Improve moe reduce sum kernel performance (#2705) (Co-authored-by: wunhuang) | 2025-01-02 01:11:06 -08:00 |
| kk | b6e0cfb5e1 | ROCm base image update (#2692) (Co-authored-by: wunhuang) | 2025-01-01 12:12:19 +08:00 |
| Xiaoyu Zhang | 286cad3ee3 | h200 tuning fused_moe_triton config for Mixtral 8x7B/8x22B and Qwen2 57BA14B (#2689) | 2024-12-31 23:17:36 +08:00 |
| Ying Sheng | dc7eb01f19 | [Fix] fix openai adapter (#2685) | 2024-12-31 10:48:19 +00:00 |
| Lianmin Zheng | b0524c3789 | Eagle speculative decoding part 2: Fix cuda graph + DP attention hanging (#2684) (Co-authored-by: yukavio) | 2024-12-31 02:25:05 -08:00 |
| Yineng Zhang | d49b13c6f8 | feat: use CUDA 12.4 by default (for FA3) (#2682) | 2024-12-31 15:52:09 +08:00 |
| Lianmin Zheng | f44d143949 | Support target model verification in the attention backend (#2678) (Co-authored-by: yukavio) | 2024-12-30 22:58:55 -08:00 |
| Lianmin Zheng | 339c69a243 | Improve the computation for time_per_output_token Prometheus metrics (#2674) | 2024-12-30 21:40:14 -08:00 |
| Lianmin Zheng | 21ec66e59e | Minor follow-up fixes for the logprob refactor (#2670) | 2024-12-30 05:42:08 -08:00 |
| HAI | c5210dfa38 | AMD DeepSeek_V3 FP8 Numerical fix (#2667) | 2024-12-30 21:31:12 +08:00 |
| mobicham | a29dd9501d | Add GemLite caching after each capture (#2669) | 2024-12-30 05:27:29 -08:00 |
| Lianmin Zheng | 9c6ba2484f | Refactor logprob computation to return the real logprob used in sampling (#2664) | 2024-12-30 04:51:38 -08:00 |
| Lianmin Zheng | 8c3b420eec | [Docs] clean up structured outputs docs (#2654) | 2024-12-29 23:57:16 -08:00 |
| HAI | e6f523b5f2 | fix typo in python/sglang/srt/layers/quantization/fp8.py (#2655) | 2024-12-29 23:45:02 -08:00 |
| Lianmin Zheng | 03d5fbfd44 | Release 0.4.1.post3 - upload the config.json to PyPI (#2647) | 2024-12-29 14:25:53 -08:00 |
| Shi Shuai | fad29f7f52 | CI: Fix unittest for engine input token ids and output token ids (#2646) | 2024-12-29 13:28:59 -08:00 |
| Shi Shuai | 35bdb48557 | [Feature] Get Token IDs with Engine.generate() (#2636) (Co-authored-by: Chayenne) | 2024-12-29 12:28:27 -08:00 |
| Yineng Zhang | 3ccf566b0d | chore: bump v0.4.1.post2 (#2643) | 2024-12-30 00:11:46 +08:00 |
| HandH1998 | afa0341e57 | Update Triton configs for block fp8 kernels (#2641) | 2024-12-29 22:53:47 +08:00 |
| HAI | 30828e7192 | AMD: set weights and scaling numbers properly for block FP8 (#2637) | 2024-12-29 03:23:39 -08:00 |
| Ying Sheng | e0e09fceeb | [Session] Update session control interface (#2635) | 2024-12-29 02:10:27 -08:00 |
| Lianmin Zheng | 9c05c6898e | Add llama_eagle.py (#2640) (Co-authored-by: kavioyu) | 2024-12-29 01:45:35 -08:00 |
| Lianmin Zheng | 3815b23ccb | Clean up wrapper in flashinfer backend (#2638) | 2024-12-29 00:45:57 -08:00 |
| Tanjiro | 8ee9a8501a | [Feature] Function Calling (#2544) (Co-authored-by: Haoyu Wang) | 2024-12-28 21:58:52 -08:00 |
| fzyzcjy | fd28640dc5 | Add update_weights_from_tensor (#2631) | 2024-12-28 13:30:27 -08:00 |
| Yineng Zhang | 7863e4368a | add configs for block fp8 related kernels (#2628) (Co-authored-by: HandH1998) | 2024-12-28 23:12:04 +08:00 |
| Lianmin Zheng | 855d0ba381 | [CI] Fix nightly test and raise better error message (#2626) (Co-authored-by: Sangbin) | 2024-12-27 22:16:39 -08:00 |
| Xiaoyu Zhang | 9254a33ad4 | avoid fused_moe_triton padding circular import (#2624) | 2024-12-28 14:01:35 +08:00 |
| Lianmin Zheng | 751e5ca273 | [minor] clean up docs and eos id (#2622) | 2024-12-27 11:23:46 -08:00 |
| Yang Zheng | 7a7ac6bea1 | [FIX] Update EOS from config (#2475) | 2024-12-27 10:59:56 -08:00 |
| Yineng Zhang | ef5b0ff90b | chore: bump v0.4.1.post1 (#2616) | 2024-12-28 00:11:06 +08:00 |
| HandH1998 | 6e5305158c | update sgl_moe_align_block_size usage (#2617) | 2024-12-28 00:01:13 +08:00 |
| kk | 70dc2fbe2d | Change extend attention kernel launch parameter for ROCm platform to … (#2610) (Co-authored-by: wunhuang, HAI) | 2024-12-27 00:32:17 -08:00 |
| kk | 7ca751ff7d | Fused moe triton cfg opt for rocm (#2612) (Co-authored-by: wunhuang) | 2024-12-26 23:38:22 -08:00 |
| HAI | 7722c11c1d | Regression fix to AMD/ROCm from recent change (#2606) | 2024-12-26 20:22:14 -08:00 |
| fzyzcjy | b2ed5c8ea7 | Tiny code cleanup in tokenizer_manager.py (#2586) | 2024-12-26 17:53:09 -08:00 |
| fzyzcjy | 44f011d224 | Super tiny typo fix (#2564) | 2024-12-26 08:28:01 -08:00 |
| yudian0504 | 531d6ea968 | fix: package data missing (#2521) | 2024-12-26 08:16:48 -08:00 |
| Lianmin Zheng | dc3bee4815 | Fix test and benchmark scripts (#2598) | 2024-12-26 07:56:26 -08:00 |
| fzyzcjy | 3169e66c23 | Fix duplicated handling of GetWeightsByNameReqInput (#2565) | 2024-12-26 06:49:32 -08:00 |
| Lianmin Zheng | 773951548d | Fix logprob_start_len for multi modal models (#2597) (Co-authored-by: libra, fzyzcjy, Haoyu Wang) | 2024-12-26 06:27:45 -08:00 |
| Adarsh Shirawalmath | acb340728c | [Feature] Support new parameter - EBNF in xgrammar (#2526) | 2024-12-26 05:12:41 -08:00 |
| Sangchun Ha (Patrick) | 08effbff35 | Error occurs when loading the gemma model in bitsandbytes format. (#2557) | 2024-12-26 05:10:37 -08:00 |