Commit Graph

1191 Commits

Author | SHA1 | Message | Date
Lianmin Zheng | 21ec66e59e | Minor follow-up fixes for the logprob refactor (#2670) | 2024-12-30 05:42:08 -08:00
HAI | c5210dfa38 | AMD DeepSeek_V3 FP8 Numerical fix (#2667) | 2024-12-30 21:31:12 +08:00
mobicham | a29dd9501d | Add GemLite caching after each capture (#2669) | 2024-12-30 05:27:29 -08:00
Lianmin Zheng | 9c6ba2484f | Refactor logprob computation to return the real logprob used in sampling (#2664) | 2024-12-30 04:51:38 -08:00
Lianmin Zheng | 8c3b420eec | [Docs] clean up structured outputs docs (#2654) | 2024-12-29 23:57:16 -08:00
HAI | e6f523b5f2 | fix typo in python/sglang/srt/layers/quantization/fp8.py (#2655) | 2024-12-29 23:45:02 -08:00
Lianmin Zheng | 03d5fbfd44 | Release 0.4.1.post3 - upload the config.json to PyPI (#2647) | 2024-12-29 14:25:53 -08:00
Shi Shuai | fad29f7f52 | CI: Fix unittest for engine input token ids and output token ids (#2646) | 2024-12-29 13:28:59 -08:00
Shi Shuai | 35bdb48557 | [Feature] Get Token IDs with Engine.generate() (#2636) | 2024-12-29 12:28:27 -08:00
    Co-authored-by: Chayenne <zhaochen20@outlook.com>
Yineng Zhang | 3ccf566b0d | chore: bump v0.4.1.post2 (#2643) | 2024-12-30 00:11:46 +08:00
HandH1998 | afa0341e57 | Update Triton configs for block fp8 kernels (#2641) | 2024-12-29 22:53:47 +08:00
HAI | 30828e7192 | AMD: set weights and scaling numbers properly for block FP8 (#2637) | 2024-12-29 03:23:39 -08:00
Ying Sheng | e0e09fceeb | [Session] Update session control interface (#2635) | 2024-12-29 02:10:27 -08:00
Lianmin Zheng | 9c05c6898e | Add llama_eagle.py (#2640) | 2024-12-29 01:45:35 -08:00
    Co-authored-by: kavioyu <kavioyu@tencent.com>
Lianmin Zheng | 3815b23ccb | Clean up wrapper in flashinfer backend (#2638) | 2024-12-29 00:45:57 -08:00
Tanjiro | 8ee9a8501a | [Feature] Function Calling (#2544) | 2024-12-28 21:58:52 -08:00
    Co-authored-by: Haoyu Wang <120358163+HaoyuWang4188@users.noreply.github.com>
fzyzcjy | fd28640dc5 | Add update_weights_from_tensor (#2631) | 2024-12-28 13:30:27 -08:00
Yineng Zhang | 7863e4368a | add configs for block fp8 related kernels (#2628) | 2024-12-28 23:12:04 +08:00
    Co-authored-by: HandH1998 <1335248067@qq.com>
Lianmin Zheng | 855d0ba381 | [CI] Fix nightly test and raise better error message (#2626) | 2024-12-27 22:16:39 -08:00
    Co-authored-by: Sangbin <rkooo567@gmail.com>
Xiaoyu Zhang | 9254a33ad4 | avoid fused_moe_triton padding circular import (#2624) | 2024-12-28 14:01:35 +08:00
Lianmin Zheng | 751e5ca273 | [minor] clean up docs and eos id (#2622) | 2024-12-27 11:23:46 -08:00
Yang Zheng | 7a7ac6bea1 | [FIX] Update EOS from config (#2475) | 2024-12-27 10:59:56 -08:00
Yineng Zhang | ef5b0ff90b | chore: bump v0.4.1.post1 (#2616) | 2024-12-28 00:11:06 +08:00
HandH1998 | 6e5305158c | update sgl_moe_align_block_size usage (#2617) | 2024-12-28 00:01:13 +08:00
kk | 70dc2fbe2d | Change extend attention kernel launch parameter for ROCm platform to … (#2610) | 2024-12-27 00:32:17 -08:00
    Co-authored-by: wunhuang <wunhuang@amd.com>
    Co-authored-by: HAI <hixiao@gmail.com>
kk | 7ca751ff7d | Fused moe triton cfg opt for rocm (#2612) | 2024-12-26 23:38:22 -08:00
    Co-authored-by: wunhuang <wunhuang@amd.com>
HAI | 7722c11c1d | Regression fix to AMD/ROCm from recent change (#2606) | 2024-12-26 20:22:14 -08:00
fzyzcjy | b2ed5c8ea7 | Tiny code cleanup in tokenizer_manager.py (#2586) | 2024-12-26 17:53:09 -08:00
fzyzcjy | 44f011d224 | Super tiny typo fix (#2564) | 2024-12-26 08:28:01 -08:00
yudian0504 | 531d6ea968 | fix: package data missing (#2521) | 2024-12-26 08:16:48 -08:00
Lianmin Zheng | dc3bee4815 | Fix test and benchmark scripts (#2598) | 2024-12-26 07:56:26 -08:00
fzyzcjy | 3169e66c23 | Fix duplicated handling of GetWeightsByNameReqInput (#2565) | 2024-12-26 06:49:32 -08:00
Lianmin Zheng | 773951548d | Fix logprob_start_len for multi modal models (#2597) | 2024-12-26 06:27:45 -08:00
    Co-authored-by: libra <lihu723@gmail.com>
    Co-authored-by: fzyzcjy <ch271828n@outlook.com>
    Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com>
Adarsh Shirawalmath | acb340728c | [Feature] Support new parameter - EBNF in xgrammar (#2526) | 2024-12-26 05:12:41 -08:00
Sangchun Ha (Patrick) | 08effbff35 | Error occurs when loading the gemma model in bitsandbytes format. (#2557) | 2024-12-26 05:10:37 -08:00
Liangsheng Yin | e7ebecf82e | Fix cache hit rate when chunked prefill (#2555) | 2024-12-26 03:14:28 -08:00
Xiaoyu Zhang | 9a23c48456 | h100 tuning fused_moe_triton for qwen2 moe (#2560) | 2024-12-26 03:13:31 -08:00
Yineng Zhang | 635a042623 | docs: update deepseek v3 example (#2592) | 2024-12-26 17:43:37 +08:00
Yineng Zhang | efc52f85e2 | chore: bump v0.4.1 (#2582) | 2024-12-26 07:14:51 +08:00
Yineng Zhang | 60e2fdcf4f | use sgl-kernel moe_align_block_size (#2581) | 2024-12-26 06:29:08 +08:00
    Co-authored-by: ispobock <ispobaoke@163.com>
    Co-authored-by: HandH1998 <1335248067@qq.com>
HandH1998 | 53aed988cb | Refactor MoE (#2575) | 2024-12-26 00:02:14 +08:00
    Co-authored-by: zhyncs <me@zhyncs.com>
Ying Sheng | 8a56b43175 | [Bench] Flush cache before benchmarking (#2566) | 2024-12-24 11:21:21 +08:00
Ke Bao | e835a50021 | Reorg moe code (#2563) | 2024-12-24 01:10:22 +08:00
Lianmin Zheng | 23e5e50fd5 | Fix gemlite import (#2553) | 2024-12-22 20:21:17 -08:00
Lianmin Zheng | 41b1db69b8 | A better aio rwlock that guarantees the order (#2547) | 2024-12-22 15:44:32 -08:00
Lianmin Zheng | 8496701934 | [Misc] Fix metrics, weight update lock, request logging (#2543) | 2024-12-22 06:27:22 -08:00
Lei | 19ba2b0ea9 | Add lora_paths to v1_chat_generate_request (#2529) | 2024-12-22 02:23:33 -08:00
Yineng Zhang | 8f4d04e540 | chore: bump v0.4.0.post2 (#2525) | 2024-12-21 21:16:34 +08:00
Jerry Zhang | feb2b768ba | Add integration with gemlite weight only quant (#2528) | 2024-12-21 00:25:25 +08:00
Yineng Zhang | 4b83db24f1 | fix: continue to use flashinfer 0.1.6 temporarily (#2517) | 2024-12-19 14:03:24 +08:00