1651 Commits

Author SHA1 Message Date
Zhiyu
287427e2e6 Enable Nvidia's ModelOpt fp8 quantized models (#2535) 2025-01-06 14:54:52 -08:00
Lianmin Zheng
b8574f6953 Clean up eagle code (#2756) 2025-01-06 14:54:18 -08:00
王博伟
2855caa481 feat: add devcontainer.json for VSCode development (#2745) 2025-01-06 14:00:55 -08:00
Xu-Chen
2329e1ddd0 Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied (#2748)
Co-authored-by: chenxu02 <chenxu02@zhihu.com>
2025-01-06 13:56:28 -08:00
Ke Bao
0f3eb1d294 Support cutlass Int8 gemm (#2752) 2025-01-06 22:51:22 +08:00
Ke Bao
06dd2eab84 Remove unused var in moe_align_kernel (#2751) 2025-01-06 22:13:28 +08:00
Ke Bao
439f65809f Fix sgl-kernel cu118 compile issue (#2750) 2025-01-06 21:59:31 +08:00
Yineng Zhang
2f0d386496 chore: bump v0.4.1.post4 (#2713) 2025-01-06 01:29:54 +08:00
yizhang2077
3900a94afe Support twoshot kernel (#2688) 2025-01-06 00:47:16 +08:00
Xiaoyu Zhang
ded9fcd09a improve moe_align_kernel for deepseek v3 (#2735) 2025-01-06 00:28:22 +08:00
Yineng Zhang
bc6ad367c2 fix lint (#2733) 2025-01-05 14:45:42 +08:00
Lianmin Zheng
3a22a303d1 Revert the GLOO_SOCKET_IFNAME change (#2731) 2025-01-04 20:13:16 -08:00
libra
bdb3929dbb Refactor SchedulePolicy to improve code organization (#2571) 2025-01-04 00:05:16 +08:00
Ce Gao
f5d0865b25 feat: Support VLM in reference_hf (#2726)
Signed-off-by: Ce Gao <gaocegege@hotmail.com>
2025-01-03 22:32:30 +08:00
Ce Gao
afdee7b1a9 [Docs] fix 404 - Contributor Guide, again (#2727)
Signed-off-by: Ce Gao <gaocegege@hotmail.com>
2025-01-03 22:21:38 +08:00
Lianmin Zheng
cb34d848ac Update README.md (#2722)
Co-authored-by: Yangmin Li <2682000734@qq.com>
Co-authored-by: Mingyuan Ma <mamingyuan2001@berkeley.edu>
Co-authored-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-01-03 00:32:20 -08:00
Lianmin Zheng
0f9cc6d8d3 Fix package loss for small models (#2717)
Co-authored-by: sdli1995 <mmlmonkey@163.com>
2025-01-02 18:25:26 -08:00
yigex
c7ae474a49 [Feature, Hardware] Enable DeepseekV3 on AMD GPUs (#2601)
Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-01-02 16:23:19 -08:00
Lianmin Zheng
bdf946bf81 Support loading pre-sharded moe weights (#2716) 2025-01-02 15:07:37 -08:00
yukavio
8c8779cd05 [Fix] fix retract error in eagle speculative decoding (#2711)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 10:28:39 -08:00
Mick
1775b963db [Fix] fix incorrectly overwriting the port specified in ServerArgs (#2714) 2025-01-02 10:28:22 -08:00
Shi Shuai
dd2e2d275f Docs: Update documentation workflow and contribution guide (#2704)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-01-02 09:18:31 -08:00
Rodrigo Garcia
a990daff9c Included multi-node DeepSeekv3 example (#2707) 2025-01-02 22:17:03 +08:00
Yineng Zhang
ba5112ff69 feat: support moe_align_block_size_triton (#2712)
Co-authored-by: WANDY666 <1060304770@qq.com>
2025-01-02 21:47:44 +08:00
yukavio
815dce0554 Eagle speculative decoding part 4: Add EAGLE2 worker (#2150)
Co-authored-by: kavioyu <kavioyu@tencent.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-01-02 03:22:34 -08:00
Lianmin Zheng
ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 02:09:08 -08:00
fzyzcjy
9183c23eca Speed up update_weights_from_tensor (#2695) 2025-01-02 02:05:19 -08:00
kk
148254d4db Improve moe reduce sum kernel performance (#2705)
Co-authored-by: wunhuang <wunhuang@amd.com>
2025-01-02 01:11:06 -08:00
Xiaotong Jiang
a4d6d6f1dd [feat]: Add math eval to CI nightly run (#2663)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-01-01 15:29:35 -08:00
Shi Shuai
062c48d2bd [Docs] Add Support for Pydantic Structured Output Format (#2697) 2025-01-01 15:08:43 -08:00
kk
b6e0cfb5e1 ROCm base image update (#2692)
Co-authored-by: wunhuang <wunhuang@amd.com>
2025-01-01 12:12:19 +08:00
Chayenne
0d8d97b8e6 Doc: Rename contribution_guide.md (#2691) 2024-12-31 14:35:48 -08:00
Shi Shuai
0a765bbccc Docs: Refactor Contribution Guide (#2690) 2024-12-31 14:11:00 -08:00
Xiaoyu Zhang
286cad3ee3 h200 tuning fused_moe_triton config for Mixtral 8x7B/8x22B and Qwen2 57BA14B (#2689) 2024-12-31 23:17:36 +08:00
Ying Sheng
dc7eb01f19 [Fix] fix openai adapter (#2685) 2024-12-31 10:48:19 +00:00
Lianmin Zheng
b0524c3789 Eagle speculative decoding part 2: Fix cuda graph + DP attention hanging (#2684)
Co-authored-by: yukavio <kavioyu@gmail.com>
2024-12-31 02:25:05 -08:00
Lianmin Zheng
6c42fa229d Update README.md (#2683) 2024-12-31 00:13:10 -08:00
Yineng Zhang
d49b13c6f8 feat: use CUDA 12.4 by default (for FA3) (#2682) 2024-12-31 15:52:09 +08:00
Yineng Zhang
bedc4c7a50 misc: update CODEOWNERS (#2680) 2024-12-31 15:04:50 +08:00
Lianmin Zheng
f44d143949 Support target model verification in the attention backend (#2678)
Co-authored-by: yukavio <kavioyu@gmail.com>
2024-12-30 22:58:55 -08:00
Yineng Zhang
b6b57fc200 minor: cleanup sgl-kernel (#2679) 2024-12-31 14:52:00 +08:00
Ke Bao
b4403985d0 Add cutlass submodule for sgl-kernel (#2676) 2024-12-31 14:28:29 +08:00
Lianmin Zheng
339c69a243 Improve the computation for time_per_output_token Prometheus metrics (#2674) 2024-12-30 21:40:14 -08:00
fzyzcjy
f707470019 CI: Update scripts to fail fast (#2672) 2024-12-30 19:04:01 -08:00
Lianmin Zheng
21ec66e59e Minor follow-up fixes for the logprob refactor (#2670) 2024-12-30 05:42:08 -08:00
HAI
c5210dfa38 AMD DeepSeek_V3 FP8 Numerical fix (#2667) 2024-12-30 21:31:12 +08:00
mobicham
a29dd9501d Add GemLite caching after each capture (#2669) 2024-12-30 05:27:29 -08:00
Lianmin Zheng
9c6ba2484f Refactor logprob computation to return the real logprob used in sampling (#2664) 2024-12-30 04:51:38 -08:00
Ke Bao
b02da24a5b Refactor sgl-kernel build (#2642) 2024-12-30 18:07:01 +08:00
Lianmin Zheng
bdd2827a80 Update structured_outputs.ipynb (#2666) 2024-12-30 00:46:41 -08:00