Commit Graph

3344 Commits

Author SHA1 Message Date
fzyzcjy
d6e1d28c8a Refactor DeepSeek attention dispatching (#6476) 2025-05-21 02:03:39 -07:00
Zilin Zhu
7c347259ff [RL] allow weight updation with dp attention enabled (#6311) 2025-05-21 01:58:55 -07:00
Zilin Zhu
669caa0a3f [router] support http2 in router (#6487) 2025-05-21 01:42:45 -07:00
Jiajun Li
4024e1d2a8 Implement Siglip Vision model, and support BNB quantization for gemma3-mm (#5339) 2025-05-20 23:53:46 -07:00
HAI
5c0b38f369 aiter attention-backend (default enabled on AMD/ROCm) (#6381) 2025-05-20 22:52:41 -07:00
Yuan Luo
30ca18f423 Refactor group_concurrent_contiguous in NIXL (#6214) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> 2025-05-21 11:55:04 +08:00
Lianmin Zheng
03886917bd Disable all two stream overlap on amd (#6475) 2025-05-20 19:06:59 -07:00
Wenxuan Tan
66324895c6 [docs] Fix torch version (#6472) 2025-05-20 10:53:14 -07:00
fzyzcjy
13feffd082 Fix master CI for DeepSeek (#6447) 2025-05-20 00:31:42 -07:00
fzyzcjy
e98afbe042 Support dispatching logical to physical experts (#6385) 2025-05-19 22:13:55 -07:00
JieXin Liang
69af3ec35f [doc] add note for get_num_kv_splits in triton_backend (#6444) 2025-05-19 21:40:21 -07:00
YanbingJiang
32cc66efa5 Update extend/decode attention kernel for CPU in sgl-kernel and add UTs (#6405) Co-authored-by: mingfeima <mingfei.ma@intel.com> 2025-05-19 21:23:17 -07:00
PGFLMG
83f2d9d4ed [QuickFix] fix gptq model initialize (#6429) 2025-05-19 21:17:10 -07:00
HAI
6317c5c61f Address performance regression: disable multiple streams on ROCm (#6412) 2025-05-19 21:16:20 -07:00
fzyzcjy
cba1cdbc46 Support DeepSeek EPLB algorithm with static distributions (#6387) 2025-05-19 21:06:21 -07:00
fzyzcjy
c471d39eb9 Support loading weights when physical experts are different from logical experts (#6386) 2025-05-19 21:05:53 -07:00
fzyzcjy
d0443275f0 Refactor DeepSeek logic into atomic operations (#6326) 2025-05-19 21:05:30 -07:00
Liangsheng Yin
17d080b7ae Remove Cargo.lock, add it into .gitignore (#6438) 2025-05-20 12:01:32 +08:00
fzyzcjy
1b19df4b2a Refactor communication logic of DeepSeek for extensibility and understandability (#6321) 2025-05-19 20:14:48 -07:00
fzyzcjy
f0653886a5 Expert distribution recording without overhead for EPLB (#4957) 2025-05-19 20:07:43 -07:00
Yineng Zhang
b146555749 Revert "Implement return_hidden_states for the OpenAI API (#6137)" (#6440) 2025-05-19 18:21:29 -07:00
Yi Zhang
b06215daed [BUG] fix stop_profile crash (#6431) 2025-05-19 17:30:33 -07:00
Trevor Morris
7adf245ba2 [Metrics] Add KV events publishing (#6098) 2025-05-19 14:19:54 -07:00
Baizhou Zhang
299fd22f9e Fix throughput threshold for amd ci test (#6414) 2025-05-19 14:17:41 -07:00
simveit
506e5de8fe Improve supported models doc (#6430) 2025-05-20 01:43:35 +08:00
lukec
844e2f227a Fix nodeepgemm init (#6417) 2025-05-19 00:44:03 -07:00
kyle-pena-kuzco
4f39bcf7ab Implement return_hidden_states for the OpenAI API (#6137) 2025-05-18 22:30:25 -07:00
fzyzcjy
31c9569bb8 Fix request id error (#6401) 2025-05-18 18:58:59 -07:00
Chang Su
1be6956d1b [Bugfix] Fix field error in v1_embedding_request (#6400) 2025-05-18 15:58:29 -07:00
Mick
626ccb7d3f vlm: tensor hash kernel (#5974) 2025-05-18 15:38:16 -07:00
fzyzcjy
72bfb0baf0 Refactor DeepSeek MoE layer to unify the two forward branches (#6325) 2025-05-18 15:34:36 -07:00
wangxiyu191
155214952b refactor: Extract repeated member variables in KVCache subclasses to base class. (#6323) 2025-05-18 15:28:15 -07:00
Chang Su
ebe58d545d [Misc] Implement RankZeroFilter for rank-specific logging in model_runner.py (#6333) 2025-05-18 15:27:13 -07:00
Chang Su
066cf44546 [OAI] Add rid tracing for v1/embeddings and fix rid type in Chat (#6397) 2025-05-18 13:05:38 -07:00
applesaucethebun
6dc6b30637 Add missing model to doc (#6396) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> 2025-05-18 12:57:58 -07:00
JieXin Liang
1f30c05d4a [fix] fix fa3 forward_decode with spec_decode (#6395) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> 2025-05-18 12:50:15 -07:00
Chunyuan WU
5dd62c3a6f Add fp8 shared_expert kernel for CPU in sgl-kernel and add UT (#6339) Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com> Co-authored-by: mingfeima <mingfei.ma@intel.com> 2025-05-18 12:42:15 -07:00
fzyzcjy
f11481b921 Add 4-GPU runner tests and split existing tests (#6383) 2025-05-18 11:56:51 -07:00
doujiang24
9d24c3ffb0 chore: tiny remove duplicated code (#6392) Signed-off-by: doujiang24 <doujiang24@gmail.com> 2025-05-18 02:17:32 -07:00
Yury Sulsky
24161c5913 The Gemma template is missing a newline after the user role. (#6331) Co-authored-by: Yury Sulsky <ysulsky@tesla.com> 2025-05-18 01:57:27 -07:00
Yineng Zhang
eabcf82acb feat: add long context example (#6391) 2025-05-18 01:45:17 -07:00
Sai Enduri
c47a51db7e Clean up AMD CI (#6365) 2025-05-18 01:17:28 -07:00
libra
11553c1a37 Add pipeline parallelism for Qwen2 and Qwen3 Model (#6250) 2025-05-18 00:42:55 -07:00
Mick
01dd39bac1 refactor: minor refactors regarding multimodal processing (#6187) 2025-05-17 22:53:20 -07:00
Lianmin Zheng
b3f3d610fd Do not use FA3 for mistral (#6379) 2025-05-17 19:47:34 -07:00
Yineng Zhang
f07c6a009b chore: upgrade sgl-kernel v0.1.3 (#6377) 2025-05-17 19:47:05 -07:00
Lianmin Zheng
4bb816d444 Fix CI tests (#6362) 2025-05-17 19:16:45 -07:00
ybyang
c250939ecb [Fix Chat API] add request id for chat/completion for tracing (#6364) 2025-05-17 18:58:22 -07:00
ishandhanani
b6909aa223 fix: allow launch_dummy_health_check_server to start inside of running asyncio loop (#6330) 2025-05-17 18:32:41 -07:00
fzyzcjy
f87283573e Add expert distribution APIs for engine (#6290) 2025-05-17 18:31:51 -07:00