Rodrigo Garcia
|
a990daff9c
|
Included multi-node DeepSeekv3 example (#2707)
|
2025-01-02 22:17:03 +08:00 |
|
Lianmin Zheng
|
ad20b7957e
|
Eagle speculative decoding part 3: small modifications to the general scheduler (#2709)
Co-authored-by: kavioyu <kavioyu@tencent.com>
|
2025-01-02 02:09:08 -08:00 |
|
Lianmin Zheng
|
8c3b420eec
|
[Docs] clean up structured outputs docs (#2654)
|
2024-12-29 23:57:16 -08:00 |
|
Yineng Zhang
|
098d659c0e
|
docs: update README (#2651)
|
2024-12-30 13:33:29 +08:00 |
|
Lzhang-hub
|
76d14f8cb9
|
add 2*h20 node serving example for deepseek v3 (#2650)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2024-12-30 13:04:38 +08:00 |
|
Lianmin Zheng
|
03d5fbfd44
|
Release 0.4.1.post3 - upload the config.json to PyPI (#2647)
|
2024-12-29 14:25:53 -08:00 |
|
Yineng Zhang
|
763dd55d17
|
docs: update README (#2644)
|
2024-12-30 01:24:06 +08:00 |
|
HandH1998
|
afa0341e57
|
Update Triton configs for block fp8 kernels (#2641)
|
2024-12-29 22:53:47 +08:00 |
|
Yineng Zhang
|
7863e4368a
|
add configs for block fp8 related kernels (#2628)
Co-authored-by: HandH1998 <1335248067@qq.com>
|
2024-12-28 23:12:04 +08:00 |
|
Ke Bao
|
8a2681e26a
|
Update readme (#2625)
|
2024-12-28 13:39:56 +08:00 |
|
Yineng Zhang
|
d9e6ee382b
|
docs: update README (#2618)
|
2024-12-28 00:21:53 +08:00 |
|
Lianmin Zheng
|
f46f394f4d
|
Update README.md (#2605)
|
2024-12-26 10:58:49 -08:00 |
|
Lianmin Zheng
|
773951548d
|
Fix logprob_start_len for multi modal models (#2597)
Co-authored-by: libra <lihu723@gmail.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com>
|
2024-12-26 06:27:45 -08:00 |
|
fsygd
|
637de9e8ce
|
update readme of DeepSeek V3 (#2596)
|
2024-12-26 21:31:56 +08:00 |
|
Xiaoyu Zhang
|
9a23c48456
|
h100 tuning fused_moe_triton for qwen2 moe (#2560)
|
2024-12-26 03:13:31 -08:00 |
|
Yineng Zhang
|
635a042623
|
docs: update deepseek v3 example (#2592)
|
2024-12-26 17:43:37 +08:00 |
|
Yineng Zhang
|
75ad0a143f
|
docs: add deepseek v3 launch instructions (#2589)
|
2024-12-25 23:26:54 -08:00 |
|
Ke Bao
|
e835a50021
|
Reorg moe code (#2563)
|
2024-12-24 01:10:22 +08:00 |
|
Xiaoyu Zhang
|
7d672d277b
|
[kernel optimize] benchmark write_req_to_token_pool_triton and optimize kernel (#2509)
|
2024-12-22 02:31:02 -08:00 |
|
bjmsong
|
e21026690d
|
benchmark decoding attention kernel with cudnn (#2467)
Co-authored-by: root <bjmsong@126.com>
|
2024-12-17 03:31:57 -08:00 |
|
Lianmin Zheng
|
56198b45d9
|
Add a benchmark script for in-batch prefix caching (#2494)
|
2024-12-16 18:49:02 -08:00 |
|
Xiaoyu Zhang
|
a0592c059f
|
[Benchmark] add a benchmark for hf/vllm/sglang rmsnorm (#2486)
|
2024-12-15 13:52:08 +08:00 |
|
bjmsong
|
f67723940d
|
decoding attention kernel benchmark (#2425)
Co-authored-by: root <bjmsong@126.com>
|
2024-12-11 04:46:59 -08:00 |
|
Xiaoyu Zhang
|
3844feb9bb
|
Add a unittest for fused_moe (#2416)
|
2024-12-08 22:46:10 -08:00 |
|
Lianmin Zheng
|
07ec07ad1f
|
Improve torch compile for fused moe (#2327)
|
2024-12-03 01:58:25 -08:00 |
|
Lianmin Zheng
|
33deca81b5
|
Add more fused moe benchmark utilities (#2314)
|
2024-12-02 04:26:55 -08:00 |
|
Xiaoyu Zhang
|
262e370f78
|
[benchmark] Add fused_moe_triton benchmark and tuning tools (#2225)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
|
2024-11-29 13:36:45 -08:00 |
|
Henry Hyeonmok Ko
|
dbe1729395
|
Merged three native APIs into one: get_server_info (#2152)
|
2024-11-24 01:37:58 -08:00 |
|
Byron Hsu
|
cbedd1db1d
|
[router] cache-aware load-balancing router v1 (#2114)
|
2024-11-23 08:34:48 -08:00 |
|
Xuehai Pan
|
62a4a339eb
|
docs: fix module docstrings and copyright headers (#2077)
|
2024-11-22 22:16:53 +08:00 |
|
Lianmin Zheng
|
c29b98e043
|
Fix json benchmark (#2043)
|
2024-11-15 05:33:43 -08:00 |
|
DarkSharpness
|
954f4e6bd6
|
benchmark json schema (#2030)
|
2024-11-15 05:06:19 -08:00 |
|
Byron Hsu
|
f9633fa9b9
|
[rust] cache-aware DP - approx tree (#1934)
|
2024-11-10 21:57:32 -08:00 |
|
Xuehai Pan
|
a5e0defb5a
|
minor: Add basic editorconfig and pre-commit hooks to enforce style for whitespaces (#1926)
|
2024-11-06 13:46:04 +00:00 |
|
Lianmin Zheng
|
dd3809fad8
|
Fix engine unit test (#1701)
|
2024-10-17 09:53:32 -07:00 |
|
Ying Sheng
|
9c064bf78a
|
[LoRA, Performance] Speedup multi-LoRA serving - Step 1 (#1587)
|
2024-10-06 10:33:44 -07:00 |
|
Theresa Barton
|
2c7d0a5b8b
|
[Fix] Fix all the Huggingface paths (#1553)
|
2024-10-02 10:12:07 -07:00 |
|
Ying Sheng
|
37963394aa
|
[Feature] Support LoRA path renaming and add LoRA serving benchmarks (#1433)
|
2024-09-15 12:46:04 -07:00 |
|
Lianmin Zheng
|
e4d68afcf0
|
[Minor] Many cleanup (#1357)
|
2024-09-09 04:14:11 -07:00 |
|
Kai-Hsun Chen
|
c9b75917d5
|
[server] Passing model_override_args to launch_server via the CLI. (#1298)
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
|
2024-09-09 02:14:25 -07:00 |
|
Yineng Zhang
|
62f15eea5a
|
docs: add conclusion (#1340)
|
2024-09-06 04:25:14 +10:00 |
|
Yineng Zhang
|
79794af52d
|
docs: highlight ttft itl and throughput (#1337)
|
2024-09-06 00:00:06 +10:00 |
|
Yineng Zhang
|
3494b32c3a
|
docs: update README (#1336)
|
2024-09-05 23:39:44 +10:00 |
|
Lianmin Zheng
|
57d0bd91ec
|
Improve benchmark (#1140)
|
2024-08-17 17:43:23 -07:00 |
|
Lianmin Zheng
|
5a261bd055
|
Fix the deadlock in multi-node tp (#1122)
|
2024-08-16 01:39:24 -07:00 |
|
Lianmin Zheng
|
326df4bab2
|
Use a single workspace for flashinfer (#1077)
|
2024-08-14 19:25:37 -07:00 |
|
Yineng Zhang
|
1c2b5f5240
|
docs: update nsys usage (#1103)
|
2024-08-15 01:39:15 +08:00 |
|
Liangsheng Yin
|
a34dd86a7d
|
Use dtype to control generate (#1082)
Co-authored-by: zhyncs <me@zhyncs.com>
|
2024-08-14 15:58:07 +00:00 |
|
Lianmin Zheng
|
a59636bb5e
|
Update grok 1 model (#1095)
|
2024-08-14 04:40:44 -07:00 |
|
Meng, Peng
|
41bb1ab10d
|
fix nsys cannot profile cuda kernel (#957)
|
2024-08-07 11:51:21 +08:00 |
|