Commit Graph

1463 Commits

lukec
21463e321a Expert Parallelism (EP) Support for DeepSeek V3/R1 (#3602) 2025-02-26 02:29:37 -08:00
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: laixin <q865809639@gmail.com>
Kebe
60524920ba [Bug]: Fix maximum recursion depth triggered on exception exit (#3519) 2025-02-25 09:39:38 -08:00
IAN
107710268a [BugFix] Fix crash when receiving a req with structured output in DP attention mode. (#3841) 2025-02-25 09:32:05 -08:00
who who who
4606e2a3fe Bug: fix capture_bs (#3857) 2025-02-25 08:40:35 -08:00
Nicolas Castet
127998cc41 Fix allgather ops inside cuda graphs (#3709) 2025-02-25 08:39:10 -08:00
Shenggui Li
c0bb9eb3b3 [improve] made timeout configurable (#3803) 2025-02-25 00:26:08 -08:00
Yueyang Pan
7036d6fc67 [Bug]: Add missing clamp to llavavid (#3787) 2025-02-24 19:10:15 -08:00
Chaitanya Sri Krishna Lolla
6ce9dbe828 [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 (#3237) 2025-02-24 18:14:31 -08:00
Co-authored-by: HAI <hixiao@gmail.com>
Wang Ran (汪然)
60b771c815 Improve: fix typos (#3801) 2025-02-24 16:51:23 -08:00
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Lianmin Zheng
d7934cde45 Fix CI and install docs (#3821) 2025-02-24 16:17:38 -08:00
Lianmin Zheng
62bbd34393 Revert "Extract generation_manager from tokenizer_manager" (#3829) 2025-02-24 14:49:16 -08:00
Lianmin Zheng
f2388f6b95 Revert "Rename TokenizerManager to StdOrchestrator" (#3828) 2025-02-24 14:47:59 -08:00
Lianmin Zheng
c9745ee082 Fix pandas dependency in CI (#3818) 2025-02-24 05:56:57 -08:00
laixin
1a6e97577a Feature: DeepSeek V3/R1 INT8 Quantization (block-wise) (#3730) 2025-02-24 05:43:35 -08:00
Co-authored-by: HandH1998 <1335248067@qq.com>
Baizhou Zhang
b110084654 Refactor flashinfer logic for deepseek v3 and fix accuracy bug (#3785) 2025-02-24 04:07:25 -08:00
Lianmin Zheng
27a46317b6 Fix dependency (#3813) 2025-02-24 03:50:58 -08:00
Zhiqiang Xie
6c7a152c5a Hierarchical Caching for SGLang (#2693) 2025-02-23 21:56:30 -08:00
Co-authored-by: Wenxuan Tan <wenxuan.tan@wisc.edu>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
fzyzcjy
45360b2fa9 Improve: Rename TokenizerManager to StdOrchestrator (#3116) 2025-02-23 00:30:58 -08:00
fzyzcjy
3f41b18455 Improve: Extract generation_manager from tokenizer_manager (#3115) 2025-02-22 23:25:45 -08:00
Mick
45205d88a0 bench: Add MMMU benchmark for VLM (#3562) 2025-02-22 08:10:59 -08:00
fzyzcjy
9087694006 Improve: Use TypeBasedDispatcher in DetokenizerManager (#3117) 2025-02-21 19:50:46 -08:00
fzyzcjy
a3339d8cac Bug: Fix weight loader error when LM head weights are tied (#3766) 2025-02-21 17:53:12 -08:00
Chayenne
14d90617b0 Bug: fix lm head weights in Qwen models (#3777) 2025-02-21 16:49:31 -08:00
fzyzcjy
d37f95511d Improve: Tiny fix Olmo2 (#3348) 2025-02-21 16:09:35 -08:00
Zhiyu
c66b2c9cf1 Add support for nvidia modelopt fp8 kv cache (#3223) 2025-02-22 07:04:58 +08:00
simveit
20b765a26e Model: Support Qwen 72B RM model. (#3772) 2025-02-21 14:38:21 -08:00
Shenggui Li
9af0e21ef5 [bug] fixed batch api for DeepSeek V3/R1 (#3754) 2025-02-21 10:28:16 -08:00
Shi Shuai
c7c79b16cd [Fix] OpenAI API adapter tokenizer encoding (#3432) 2025-02-21 09:24:15 -08:00
Andrew Smith
1df6eabd5d feat: Add SageMaker support (#3740) 2025-02-21 19:31:09 +08:00
zixuanzhang226
0c227ee373 feat: update grouped_topk to support softmax and sigmoid (#3680) 2025-02-21 16:30:15 +08:00
HAI
5c54ef0352 AMD/ROCm: update AITER repo to ROCm/aiter (#3747) 2025-02-21 00:18:08 -08:00
aoshen524
e79f7420be [Fix] Fix bugs and refactor code in lora for better scalability. (#3652) 2025-02-20 11:51:57 -08:00
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
chenxiaobing
d5d80ab477 [Bugfix] Fix scores mask for moe topk (#3705) 2025-02-21 02:17:23 +08:00
Ke Bao
ddcf9fe3be Optimize triton attention custom mask (#3731) 2025-02-21 00:54:41 +08:00
HAI
6252ade985 revert BLOCK and num_warps on HIP (#3722) 2025-02-20 23:30:18 +08:00
yizhang2077
1eb8eade2b add control for cutlass fp8 blockwise gemm (#3727) 2025-02-20 16:10:35 +08:00
Shi Shuai
55de40f782 [Docs]: Fix Multi-User Port Allocation Conflicts (#3601) 2025-02-19 11:15:44 -08:00
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: simveit <simp.veitner@gmail.com>
Cheng Wan
6b0aeb58fd [moe] optim: reduce memory consumption in fused_moe (#3692) 2025-02-20 02:25:05 +08:00
Mick
99c1b9d2ee fix: apply cache size limit of attention mask for VisionAttention (#3657) 2025-02-19 20:16:48 +08:00
who who who
634a3561ac Optimize AMD prefill (#3665) 2025-02-18 09:35:58 -08:00
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Mick
424848d26f fix: remove dependency on latest transformers impl (#3635) 2025-02-19 01:14:11 +08:00
Ke Bao
e5ce395a6c Fix draft decode max batch size (#3676) 2025-02-18 23:03:26 +08:00
Yineng Zhang
f983213a1f update pr-test (#3663) 2025-02-18 17:23:43 +08:00
yigex
ddf39d3fce [ROCm] Optimal MOE Tuning for AMD Radeon Graphics (#3567) 2025-02-17 17:54:10 -08:00
Wen-Heng (Jack) Chung
2eab113206 [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. (#3616) 2025-02-17 15:54:18 -08:00
Co-authored-by: HAI <hixiao@gmail.com>
Yineng Zhang
058d199d4e use transformers 4.48.3 (#3650) 2025-02-18 04:40:47 +08:00
Yineng Zhang
a5375adc3a chore: bump v0.4.3.post2 (#3645) 2025-02-18 02:48:30 +08:00
Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>
Yineng Zhang
75d171a9c5 chore: update flashinfer v0.2.1.post2 (#3644) 2025-02-18 02:47:42 +08:00
Yineng Zhang
714f3e6362 feat: support flashinfer mla with prefix cache (#3643) 2025-02-18 02:06:43 +08:00
Xiaoyu Zhang
c38f3aed24 support multi-gpu block-gemm tuning (#3639) 2025-02-18 00:00:35 +08:00