1674 Commits

Author SHA1 Message Date
Chang Su
f290bd4332 [Bugfix] Fix embedding model hangs with --enable-metrics (#2822) 2025-01-10 13:14:51 -08:00
Pratyush Patel
8f15789314 Add more metrics to serving benchmark. (#2819) 2025-01-10 23:30:44 +08:00
Lianmin Zheng
2db03a04ca Update README.md (#2833)
Co-authored-by: Heiner <heiner@x.ai>
2025-01-10 03:49:04 -08:00
Chayenne
5cc1170552 Doc: add block-wise FP8 in dpsk model reference (#2830) 2025-01-10 00:26:59 -08:00
Xiaotong Jiang
11fffbc95a [Doc]: Deepseek reference docs (#2787) 2025-01-09 13:43:12 -08:00
sleepcoo
4f077c01b8 minor: support specifying local dataset path for gsm8k and hellaswag (#2816) 2025-01-09 22:24:42 +08:00
Lianmin Zheng
679c3bcacf Fix typo in cuda_graph_bs (#2813) 2025-01-09 03:03:24 -08:00
Yunmeng
656aed58c6 Remove vllm dependency in model config (#2809) 2025-01-09 17:51:56 +08:00
Ke Bao
b5fb4ef58a Update modelopt config and fix running issue (#2792) 2025-01-08 18:04:30 +08:00
Chayenne
2e6346fc2e Docs:Update the style of llma 3.1 405B docs (#2789) 2025-01-08 01:07:54 -08:00
mlmz
977f785dad Docs: Rewrite docs for LLama 405B and ModelSpace (#2773)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-01-08 00:02:59 -08:00
Lianmin Zheng
8a6906127a Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-07 23:29:10 -08:00
JJJJOHNSON
694e41925e [eagle2] fix end check when target model verify (#2723) 2025-01-07 21:46:02 -08:00
Lianmin Zheng
b22f3f6475 Fix nightly accuracy tests (#2780) 2025-01-07 21:02:35 -08:00
Lianmin Zheng
6fb5768372 Disable math eval on nightly CI temporarily (#2779) 2025-01-07 18:17:34 -08:00
Zhiqiang Xie
51caee740f Host memory pool for hierarchical caching (#2771) 2025-01-07 21:38:37 +00:00
Ke Bao
58f9060efe Update int8 gemm config (#2774) 2025-01-07 19:47:37 +08:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00
HAI
6d08ce2aa9 Use Optional with None default (#2770) 2025-01-07 01:35:08 -08:00
Xiaoyu Zhang
380930a959 add benchmark_moe_align_blocks (#2767) 2025-01-07 14:20:50 +08:00
Lianmin Zheng
9dec582dab Remove --modelopt-config in server_args (#2758) 2025-01-06 16:35:45 -08:00
Lianmin Zheng
b01febdca0 Update README.md (#2757)
Co-authored-by: Junjie Jin <jjjjohnsonjin@gmail.com>
Co-authored-by: justdoit <24875266+coolhok@users.noreply.github.com>
2025-01-06 15:36:23 -08:00
Xingyao Wang
1acbaf1b5a Add generator-style run_batch function (#2513)
Co-authored-by: openhands <openhands@all-hands.dev>
2025-01-06 15:04:55 -08:00
Zhiyu
287427e2e6 Enable Nvidia's ModelOpt fp8 quantized models (#2535) 2025-01-06 14:54:52 -08:00
Lianmin Zheng
b8574f6953 Clean up eagle code (#2756) 2025-01-06 14:54:18 -08:00
王博伟
2855caa481 feat: add devcontainer.json for VSCode development (#2745) 2025-01-06 14:00:55 -08:00
Xu-Chen
2329e1ddd0 Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied (#2748)
Co-authored-by: chenxu02 <chenxu02@zhihu.com>
2025-01-06 13:56:28 -08:00
Ke Bao
0f3eb1d294 Support cutlass Int8 gemm (#2752) 2025-01-06 22:51:22 +08:00
Ke Bao
06dd2eab84 Remove unused var in moe_align_kernel (#2751) 2025-01-06 22:13:28 +08:00
Ke Bao
439f65809f Fix sgl-kernel cu118 compile issue (#2750) 2025-01-06 21:59:31 +08:00
Yineng Zhang
2f0d386496 chore: bump v0.4.1.post4 (#2713) 2025-01-06 01:29:54 +08:00
yizhang2077
3900a94afe Support twoshot kernel (#2688) 2025-01-06 00:47:16 +08:00
Xiaoyu Zhang
ded9fcd09a improve moe_align_kernel for deepseek v3 (#2735) 2025-01-06 00:28:22 +08:00
Yineng Zhang
bc6ad367c2 fix lint (#2733) 2025-01-05 14:45:42 +08:00
Lianmin Zheng
3a22a303d1 Revert the GLOO_SOCKET_IFNAME change (#2731) 2025-01-04 20:13:16 -08:00
libra
bdb3929dbb Refactor SchedulePolicy to improve code organization (#2571) 2025-01-04 00:05:16 +08:00
Ce Gao
f5d0865b25 feat: Support VLM in reference_hf (#2726)
Signed-off-by: Ce Gao <gaocegege@hotmail.com>
2025-01-03 22:32:30 +08:00
Ce Gao
afdee7b1a9 [Docs] fix 404 - Contributor Guide, again (#2727)
Signed-off-by: Ce Gao <gaocegege@hotmail.com>
2025-01-03 22:21:38 +08:00
Lianmin Zheng
cb34d848ac Update README.md (#2722)
Co-authored-by: Yangmin Li <2682000734@qq.com>
Co-authored-by: Mingyuan Ma <mamingyuan2001@berkeley.edu>
Co-authored-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-01-03 00:32:20 -08:00
Lianmin Zheng
0f9cc6d8d3 Fix package loss for small models (#2717)
Co-authored-by: sdli1995 <mmlmonkey@163.com>
2025-01-02 18:25:26 -08:00
yigex
c7ae474a49 [Feature, Hardware] Enable DeepseekV3 on AMD GPUs (#2601)
Co-authored-by: root <root@banff-cyxtera-s83-5.amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-01-02 16:23:19 -08:00
Lianmin Zheng
bdf946bf81 Support loading pre-sharded moe weights (#2716) 2025-01-02 15:07:37 -08:00
yukavio
8c8779cd05 [Fix] fix retract error in eagle speculative decoding (#2711)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 10:28:39 -08:00
Mick
1775b963db [Fix] fix incorrectly overwriting the port specified in ServerArgs (#2714) 2025-01-02 10:28:22 -08:00
Shi Shuai
dd2e2d275f Docs: Update documentation workflow and contribution guide (#2704)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-01-02 09:18:31 -08:00
Rodrigo Garcia
a990daff9c Included multi-node DeepSeekv3 example (#2707) 2025-01-02 22:17:03 +08:00
Yineng Zhang
ba5112ff69 feat: support moe_align_block_size_triton (#2712)
Co-authored-by: WANDY666 <1060304770@qq.com>
2025-01-02 21:47:44 +08:00
yukavio
815dce0554 Eagle speculative decoding part 4: Add EAGLE2 worker (#2150)
Co-authored-by: kavioyu <kavioyu@tencent.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-01-02 03:22:34 -08:00
Lianmin Zheng
ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 02:09:08 -08:00
fzyzcjy
9183c23eca Speed up update_weights_from_tensor (#2695) 2025-01-02 02:05:19 -08:00