109 Commits

Author SHA1 Message Date
Netanel Haber
4cd08dc592 model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 (#9301) 2025-08-26 15:33:40 +08:00
Netanel Haber
845d12a979 model: support nvidia/Llama-3_3-Nemotron-Super-49B-v1 (#9067)
Co-authored-by: Kyle Huang <kylhuang@nvidia.com>
2025-08-17 01:48:15 -07:00
Lianmin Zheng
2c7f01bc89 Reorganize CI and test files (#9027) 2025-08-10 12:30:06 -07:00
Zheng Wengang
2d120f8b18 [Feature][Multimodal] Implement LRU cache for multimodal embeddings (#8292)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-06 23:21:40 -07:00
Lifu Huang
6210e2c4f0 Support GPU pinning for LoRA (#8697) 2025-08-06 19:39:45 -07:00
Praneth Paruchuri
d26ca84f39 Support bailing moe (#8680) 2025-08-05 20:40:34 -07:00
Lifu Huang
8675bdf246 Support limiting max loaded loras in CPU. (#8650) 2025-08-03 00:02:23 -07:00
Lifu Huang
46e9d1c7c1 Increase tolerance to address CI failures (#8643) 2025-08-01 02:32:10 -07:00
Lifu Huang
67e53b16f5 Bump transfomers to 4.54.1 to fix Gemma cache issue. (#8541) 2025-07-30 19:50:54 -07:00
Stefan He
4ad9737045 chore: bump transformer to 4.54.0 (#8416)
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
2025-07-27 21:27:25 -07:00
Lifu Huang
8abd3e77fe Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (#8261) 2025-07-23 00:32:16 -07:00
Praneth Paruchuri
83c104b188 Feat: Support for Persimmon Model (#7983) 2025-07-19 23:07:47 -07:00
Pavel Logachev
877e35d775 Add get_hidden_dim to qwen3.py for correct lora (#7312) 2025-07-19 19:31:16 -07:00
Clay
cbdfb77123 Enable FlashInfer support encoder models and add head_dim padding workaround (#6230) 2025-07-19 19:30:16 -07:00
Lifu Huang
4e3defe5a7 Support start up LoRA server without initial adapters (#8019) 2025-07-19 15:38:09 -07:00
Lifu Huang
3de617a75b Fix LoRA buffer contamination during adapter eviction (#8103) 2025-07-19 13:14:08 -07:00
Lianmin Zheng
bb0e8a32b5 Clean up server args (#8161) 2025-07-19 11:32:52 -07:00
Praneth Paruchuri
cb736df854 Support for Phi-1.5 & Phi-2 models (#7862) 2025-07-13 18:43:40 -07:00
Lifu Huang
e2ed9d049a Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (#7844) 2025-07-13 18:36:01 -07:00
Binyao Jiang
2d54d4bb64 Feat: Support Phi-3.5-MoE in SGLang (#7907) 2025-07-09 23:51:33 -07:00
Lifu Huang
01f9873048 Fix CI test OOM issue. (#7799) 2025-07-05 15:11:02 -07:00
YanbingJiang
4de0395343 Add V2-lite model test (#7390)
Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
2025-07-03 22:25:50 -07:00
Lifu Huang
1a08358aed Improve error handling for requests with unloaded LoRA path(s) (#7642) 2025-07-01 20:05:34 -07:00
Lifu Huang
49538d111b Support dynamic LoRA loading / unloading in engine/server API (#7446) 2025-06-27 21:00:27 -07:00
Lifu Huang
2373faa317 Fix flakiness in LoRA batch test. (#7552) 2025-06-27 19:51:43 -07:00
woodx
e30ef368ab Feat/support rerank (#6058) 2025-06-16 10:50:01 -07:00
Baizhou Zhang
3b014bc13d Fix test_lora.py CI (#7061) 2025-06-10 12:24:46 -07:00
Pan Lyu
451ffe74d9 support qwen3 emebedding (#6990) 2025-06-09 01:32:49 -07:00
Marc Sun
37f1547587 [FEAT] Add transformers backend support (#5929) 2025-06-03 21:05:29 -07:00
Ravi Theja
c6a0cacc35 Update CI tests for Llama4 models (#6421) 2025-06-01 11:52:15 +08:00
ryang
a6ae3af15e Support XiaomiMiMo inference with mtp (#6059) 2025-05-22 14:14:49 -07:00
HAI
5c0b38f369 aiter attention-backend (default enabled on AMD/ROCm) (#6381) 2025-05-20 22:52:41 -07:00
Kiv Chen
5380cd7ea3 model(vlm): pixtral (#5084) 2025-05-13 00:16:10 -07:00
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-13 01:42:38 +08:00
Lifu Huang
6e2da51561 Replace time.time() to time.perf_counter() for benchmarking. (#6178)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-11 14:32:49 -07:00
XinyuanTong
9d8ec2e67e Fix and Clean up chat-template requirement for VLM (#6114)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
2025-05-11 00:14:09 +08:00
Qiaolin Yu
3042f1da61 Fix flaky issues of lora and add multi batch tests (#5957) 2025-05-04 13:11:40 -07:00
Qiaolin Yu
7bcd8b1cb2 Fix lora batch processing when input lora_path contains None (#5930) 2025-04-30 19:42:42 -07:00
Qiaolin Yu
8c0cfca87d Feat: support cuda graph for LoRA (#4115)
Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
2025-04-28 23:30:44 -07:00
Lianmin Zheng
849c83a0c0 [CI] test chunked prefill more (#5798) 2025-04-28 10:57:17 -07:00
DavidBao
d8fbc7c096 [feature] support for roberta embedding models (#5730) 2025-04-26 18:47:06 -07:00
Ravi Theja
7d9679b74d Add MMMU benchmark results (#4491)
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
2025-04-25 15:23:53 +08:00
Xiaoyu Zhang
bf86c5e990 restruct compressed_tensors_w8a8_fp8 (#5475) 2025-04-19 04:52:15 -07:00
woodx
3bface15e6 Feat/support encoder model (like bert) (#4887) 2025-04-17 01:50:48 -07:00
eigen
8f783c1943 [Model Support] unsloth/Phi-4-mini bnb model (#4982)
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-04-16 19:58:20 -07:00
saienduri
7f875f1293 update grok test (#5171) 2025-04-09 11:09:47 -07:00
fzyzcjy
39efad4fbc Tiny disable model that does not work (#5175) 2025-04-08 18:42:37 -07:00
Baizhou Zhang
42873eac09 [Fix] Improve Lora tests and reduce CI runtime (#4925) 2025-03-30 19:40:14 -07:00
Lianmin Zheng
4ede6770cd Fix retract for page size > 1 (#4914) 2025-03-30 02:57:15 -07:00