fzyzcjy | 55f6005f53 | Fix bench_one_batch_server (#6503) | 2025-05-21 11:08:17 -07:00 |
fzyzcjy | 7222e1dacc | Let bench_one_batch_server use sharegpt data to make expert distribution more natural (#5573) | 2025-05-21 02:08:43 -07:00 |
fzyzcjy | ccfe5c009d | Support redundant experts in expert parallel (#6461) | 2025-05-21 02:05:53 -07:00 |
fzyzcjy | a071dc4084 | Tiny add stage assertions to DeepEPDispatcher to avoid misuse (#6467) | 2025-05-21 02:05:05 -07:00 |
fzyzcjy | a40aecc5a3 | Fix num_qps_per_rank computation when providing custom DeepEP configuration (#6468) | 2025-05-21 02:04:33 -07:00 |
fzyzcjy | d6e1d28c8a | Refactor DeepSeek attention dispatching (#6476) | 2025-05-21 02:03:39 -07:00 |
Zilin Zhu | 7c347259ff | [RL] allow weight updation with dp attention enabled (#6311) | 2025-05-21 01:58:55 -07:00 |
Jiajun Li | 4024e1d2a8 | Implement Siglip Vision model, and support BNB quantization for gemma3-mm (#5339) | 2025-05-20 23:53:46 -07:00 |
HAI | 5c0b38f369 | aiter attention-backend (default enabled on AMD/ROCm) (#6381) | 2025-05-20 22:52:41 -07:00 |
Yuan Luo | 30ca18f423 | Refactor group_concurrent_contiguous in NIXL (#6214); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2025-05-21 11:55:04 +08:00 |
Lianmin Zheng | 03886917bd | Disable all two stream overlap on amd (#6475) | 2025-05-20 19:06:59 -07:00 |
fzyzcjy | 13feffd082 | Fix master CI for DeepSeek (#6447) | 2025-05-20 00:31:42 -07:00 |
fzyzcjy | e98afbe042 | Support dispatching logical to physical experts (#6385) | 2025-05-19 22:13:55 -07:00 |
JieXin Liang | 69af3ec35f | [doc] add note for get_num_kv_splits in triton_backend (#6444) | 2025-05-19 21:40:21 -07:00 |
PGFLMG | 83f2d9d4ed | [QuickFix] fix gptq model initialize (#6429) | 2025-05-19 21:17:10 -07:00 |
HAI | 6317c5c61f | Address performance regression: disable multiple streams on ROCm (#6412) | 2025-05-19 21:16:20 -07:00 |
fzyzcjy | cba1cdbc46 | Support DeepSeek EPLB algorithm with static distributions (#6387) | 2025-05-19 21:06:21 -07:00 |
fzyzcjy | c471d39eb9 | Support loading weights when physical experts are different from logical experts (#6386) | 2025-05-19 21:05:53 -07:00 |
fzyzcjy | d0443275f0 | Refactor DeepSeek logic into atomic operations (#6326) | 2025-05-19 21:05:30 -07:00 |
fzyzcjy | 1b19df4b2a | Refactor communication logic of DeepSeek for extensibility and understandability (#6321) | 2025-05-19 20:14:48 -07:00 |
fzyzcjy | f0653886a5 | Expert distribution recording without overhead for EPLB (#4957) | 2025-05-19 20:07:43 -07:00 |
Yineng Zhang | b146555749 | Revert "Implement return_hidden_states for the OpenAI API (#6137)" (#6440) | 2025-05-19 18:21:29 -07:00 |
Yi Zhang | b06215daed | [BUG] fix stop_profile crash (#6431) | 2025-05-19 17:30:33 -07:00 |
Trevor Morris | 7adf245ba2 | [Metrics] Add KV events publishing (#6098) | 2025-05-19 14:19:54 -07:00 |
lukec | 844e2f227a | Fix nodeepgemm init (#6417) | 2025-05-19 00:44:03 -07:00 |
kyle-pena-kuzco | 4f39bcf7ab | Implement return_hidden_states for the OpenAI API (#6137) | 2025-05-18 22:30:25 -07:00 |
fzyzcjy | 31c9569bb8 | Fix request id error (#6401) | 2025-05-18 18:58:59 -07:00 |
Chang Su | 1be6956d1b | [Bugfix] Fix field error in v1_embedding_request (#6400) | 2025-05-18 15:58:29 -07:00 |
Mick | 626ccb7d3f | vlm: tensor hash kernel (#5974) | 2025-05-18 15:38:16 -07:00 |
fzyzcjy | 72bfb0baf0 | Refactor DeepSeek MoE layer to unify the two forward branches (#6325) | 2025-05-18 15:34:36 -07:00 |
wangxiyu191 | 155214952b | refactor: Extract repeated member variables in KVCache subclasses to base class. (#6323) | 2025-05-18 15:28:15 -07:00 |
Chang Su | ebe58d545d | [Misc] Implement RankZeroFilter for rank-specific logging in model_runner.py (#6333) | 2025-05-18 15:27:13 -07:00 |
Chang Su | 066cf44546 | [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397) | 2025-05-18 13:05:38 -07:00 |
JieXin Liang | 1f30c05d4a | [fix] fix fa3 forward_decode with spec_decode (#6395); Co-authored-by: Stefan He <hebiaobuaa@gmail.com> | 2025-05-18 12:50:15 -07:00 |
doujiang24 | 9d24c3ffb0 | chore: tiny remove duplicated code (#6392); Signed-off-by: doujiang24 <doujiang24@gmail.com> | 2025-05-18 02:17:32 -07:00 |
Yury Sulsky | 24161c5913 | The Gemma template is missing a newline after the user role. (#6331); Co-authored-by: Yury Sulsky <ysulsky@tesla.com> | 2025-05-18 01:57:27 -07:00 |
libra | 11553c1a37 | Add pipeline parallelism for Qwen2 and Qwen3 Model (#6250) | 2025-05-18 00:42:55 -07:00 |
Mick | 01dd39bac1 | refactor: minor refactors regarding multimodal processing (#6187) | 2025-05-17 22:53:20 -07:00 |
Lianmin Zheng | b3f3d610fd | Do not use FA3 for mistral (#6379) | 2025-05-17 19:47:34 -07:00 |
Yineng Zhang | f07c6a009b | chore: upgrade sgl-kernel v0.1.3 (#6377) | 2025-05-17 19:47:05 -07:00 |
Lianmin Zheng | 4bb816d444 | Fix CI tests (#6362) | 2025-05-17 19:16:45 -07:00 |
ybyang | c250939ecb | [Fix Chat API] add request id for chat/completion for tracing (#6364) | 2025-05-17 18:58:22 -07:00 |
ishandhanani | b6909aa223 | fix: allow launch_dummy_health_check_server to start inside of running asyncio loop (#6330) | 2025-05-17 18:32:41 -07:00 |
fzyzcjy | f87283573e | Add expert distribution APIs for engine (#6290) | 2025-05-17 18:31:51 -07:00 |
fzyzcjy | 73187152a4 | Reland tiny refactor DefaultModelLoader.Source (#6041) | 2025-05-17 17:11:20 -07:00 |
fzyzcjy | 4086566516 | Fix expert distribution recorder and profiler command stuck forever (#6284) | 2025-05-17 17:10:44 -07:00 |
fzyzcjy | fd08c04821 | Support custom DeepEP tuning config (#6257) | 2025-05-17 17:09:42 -07:00 |
fzyzcjy | 26ebb849eb | Tiny refactor bench_serving to extract RequestFuncOutput.init_new (#6108) | 2025-05-17 17:08:52 -07:00 |
fzyzcjy | 02973cd9a4 | Tiny refactor bench_serving to improve extensibility (#6134) | 2025-05-17 17:07:58 -07:00 |
fzyzcjy | 6d95a35abf | Support outputing details for bench_serving (#6107) | 2025-05-17 17:06:52 -07:00 |