109 Commits

Author SHA1 Message Date
Rodrigo Garcia
a990daff9c Included multi-node DeepSeekv3 example (#2707) 2025-01-02 22:17:03 +08:00
Lianmin Zheng
ad20b7957e Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
2025-01-02 02:09:08 -08:00
Lianmin Zheng
8c3b420eec [Docs] clean up structured outputs docs (#2654) 2024-12-29 23:57:16 -08:00
Yineng Zhang
098d659c0e docs: update README (#2651) 2024-12-30 13:33:29 +08:00
Lzhang-hub
76d14f8cb9 add 2*h20 node serving example for deepseek v3 (#2650)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2024-12-30 13:04:38 +08:00
Lianmin Zheng
03d5fbfd44 Release 0.4.1.post3 - upload the config.json to PyPI (#2647) 2024-12-29 14:25:53 -08:00
Yineng Zhang
763dd55d17 docs: update README (#2644) 2024-12-30 01:24:06 +08:00
HandH1998
afa0341e57 Update Triton configs for block fp8 kernels (#2641) 2024-12-29 22:53:47 +08:00
Yineng Zhang
7863e4368a add configs for block fp8 related kernels (#2628)
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-28 23:12:04 +08:00
Ke Bao
8a2681e26a Update readme (#2625) 2024-12-28 13:39:56 +08:00
Yineng Zhang
d9e6ee382b docs: update README (#2618) 2024-12-28 00:21:53 +08:00
Lianmin Zheng
f46f394f4d Update README.md (#2605) 2024-12-26 10:58:49 -08:00
Lianmin Zheng
773951548d Fix logprob_start_len for multi modal models (#2597)
Co-authored-by: libra <lihu723@gmail.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com>
2024-12-26 06:27:45 -08:00
fsygd
637de9e8ce update readme of DeepSeek V3 (#2596) 2024-12-26 21:31:56 +08:00
Xiaoyu Zhang
9a23c48456 h100 tuning fused_moe_triton for qwen2 moe (#2560) 2024-12-26 03:13:31 -08:00
Yineng Zhang
635a042623 docs: update deepseek v3 example (#2592) 2024-12-26 17:43:37 +08:00
Yineng Zhang
75ad0a143f docs: add deepseek v3 launch instructions (#2589) 2024-12-25 23:26:54 -08:00
Ke Bao
e835a50021 Reorg moe code (#2563) 2024-12-24 01:10:22 +08:00
Xiaoyu Zhang
7d672d277b [kernel optimize] benchmark write_req_to_token_pool_triton and optimize kernel (#2509) 2024-12-22 02:31:02 -08:00
bjmsong
e21026690d benchmark decoding attention kernel with cudnn (#2467)
Co-authored-by: root <bjmsong@126.com>
2024-12-17 03:31:57 -08:00
Lianmin Zheng
56198b45d9 Add a benchmark script for in-batch prefix caching (#2494) 2024-12-16 18:49:02 -08:00
Xiaoyu Zhang
a0592c059f [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm (#2486) 2024-12-15 13:52:08 +08:00
bjmsong
f67723940d decoding attention kernel benchmark (#2425)
Co-authored-by: root <bjmsong@126.com>
2024-12-11 04:46:59 -08:00
Xiaoyu Zhang
3844feb9bb Add a unittest for fused_moe (#2416) 2024-12-08 22:46:10 -08:00
Lianmin Zheng
07ec07ad1f Improve torch compile for fused moe (#2327) 2024-12-03 01:58:25 -08:00
Lianmin Zheng
33deca81b5 Add more fused moe benchmark utilities (#2314) 2024-12-02 04:26:55 -08:00
Xiaoyu Zhang
262e370f78 [benchmark] Add fused_moe_triton benchmark and tuning tools (#2225)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
2024-11-29 13:36:45 -08:00
Henry Hyeonmok Ko
dbe1729395 Merged three native APIs into one: get_server_info (#2152) 2024-11-24 01:37:58 -08:00
Byron Hsu
cbedd1db1d [router] cache-aware load-balancing router v1 (#2114) 2024-11-23 08:34:48 -08:00
Xuehai Pan
62a4a339eb docs: fix module docstrings and copyright headers (#2077) 2024-11-22 22:16:53 +08:00
Lianmin Zheng
c29b98e043 Fix json benchmark (#2043) 2024-11-15 05:33:43 -08:00
DarkSharpness
954f4e6bd6 benchmark json schema (#2030) 2024-11-15 05:06:19 -08:00
Byron Hsu
f9633fa9b9 [rust] cache-aware DP - approx tree (#1934) 2024-11-10 21:57:32 -08:00
Xuehai Pan
a5e0defb5a minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926) 2024-11-06 13:46:04 +00:00
Lianmin Zheng
dd3809fad8 Fix engine unit test (#1701) 2024-10-17 09:53:32 -07:00
Ying Sheng
9c064bf78a [LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587) 2024-10-06 10:33:44 -07:00
Theresa Barton
2c7d0a5b8b [Fix] Fix all the Huggingface paths (#1553) 2024-10-02 10:12:07 -07:00
Ying Sheng
37963394aa [Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433) 2024-09-15 12:46:04 -07:00
Lianmin Zheng
e4d68afcf0 [Minor] Many cleanup (#1357) 2024-09-09 04:14:11 -07:00
Kai-Hsun Chen
c9b75917d5 [server] Passing model_override_args to launch_server via the CLI. (#1298)
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
2024-09-09 02:14:25 -07:00
Yineng Zhang
62f15eea5a docs: add conclusion (#1340) 2024-09-06 04:25:14 +10:00
Yineng Zhang
79794af52d docs: highlight ttft itl and throughput (#1337) 2024-09-06 00:00:06 +10:00
Yineng Zhang
3494b32c3a docs: update README (#1336) 2024-09-05 23:39:44 +10:00
Lianmin Zheng
57d0bd91ec Improve benchmark (#1140) 2024-08-17 17:43:23 -07:00
Lianmin Zheng
5a261bd055 Fix the deadlock in multi-node tp (#1122) 2024-08-16 01:39:24 -07:00
Lianmin Zheng
326df4bab2 Use a single workspace for flashinfer (#1077) 2024-08-14 19:25:37 -07:00
Yineng Zhang
1c2b5f5240 docs: update nsys usage (#1103) 2024-08-15 01:39:15 +08:00
Liangsheng Yin
a34dd86a7d Use dtype to control generate (#1082)
Co-authored-by: zhyncs <me@zhyncs.com>
2024-08-14 15:58:07 +00:00
Lianmin Zheng
a59636bb5e Update grok 1 model (#1095) 2024-08-14 04:40:44 -07:00
Meng, Peng
41bb1ab10d fix nsys cannot profile cuda kernel (#957) 2024-08-07 11:51:21 +08:00