| Author | Commit | Message | Date |
|---|---|---|---|
| DiweiSun | 029e0af31d | ci: enhance xeon ci (#9395) | 2025-08-21 03:35:17 -07:00 |
| Lifu Huang | b0980af89f | Support pinning adapter via server args. (#9249) | 2025-08-20 16:25:01 -07:00 |
| kousakawang | 0fc54b971e | [fix]: fix cutlass moe ut and Opt H20 cutlass groupGemm performance (#9272); Co-authored-by: wanghanpei <wanghanpei@bytedance.com> | 2025-08-17 13:09:49 -07:00 |
| Netanel Haber | 845d12a979 | model: support nvidia/Llama-3_3-Nemotron-Super-49B-v1 (#9067); Co-authored-by: Kyle Huang <kylhuang@nvidia.com> | 2025-08-17 01:48:15 -07:00 |
| Hank Han | 81da16f6d3 | [CI] add deepseek w4a8 test on h20 ci (#7758) | 2025-08-16 01:54:13 -07:00 |
| Cheng Wan | 295895120d | [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) | 2025-08-14 21:14:53 -07:00 |
| Hongbo Xu | a669bc2f74 | Replace sglang.srt.layers.quantization.scalar_types with sgl_kernel.scalar_type (#8951) | 2025-08-13 19:41:41 -07:00 |
| Faraz | f508cd3cb7 | TRTLLM-MLA FP8 path (#8638); Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> | 2025-08-11 14:02:13 -07:00 |
| Lianmin Zheng | 2449a0afe2 | Refactor the docs (#9031) | 2025-08-10 19:49:45 -07:00 |
| Lianmin Zheng | b58ae7a2a0 | Simplify frontend language (#9029) | 2025-08-10 10:59:30 -07:00 |
| fzyzcjy | 442534aa44 | Add CI for gpt-oss model on hopper (#8851) | 2025-08-09 00:34:23 -07:00 |
| Trevor Morris | a60f88b5a4 | Add unit test for flashinfer fp4 moe (#8330); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2025-08-08 17:55:37 -07:00 |
| Minglei Zhu | 6ee6619b7a | add zai-org/GLM-4.5-Air-FP8 model into nightly CI (#8894) | 2025-08-08 01:44:19 -07:00 |
| Lifu Huang | 6210e2c4f0 | Support GPU pinning for LoRA (#8697) | 2025-08-06 19:39:45 -07:00 |
| Lifu Huang | 8675bdf246 | Support limiting max loaded loras in CPU. (#8650) | 2025-08-03 00:02:23 -07:00 |
| Lianmin Zheng | e314b084c5 | [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (#8693) | 2025-08-02 18:43:14 -07:00 |
| Cheng Wan | 6c88f6c8d9 | [5/N] MoE Refactor: Update MoE parallelism arguments (#8658) | 2025-08-01 01:20:03 -07:00 |
| Faraz | 4b04998d38 | TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632); Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> | 2025-07-31 16:03:40 -07:00 |
| harrisonlimh | 747dd45077 | feat: throttle requests at scheduler based on --max_queued_requests (#7565) | 2025-07-28 22:32:33 +08:00 |
| Qiaolin Yu | 2810338401 | [feat] Support different attention backends for prefill and decode (#6338); Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>; Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2025-07-28 11:42:29 +08:00 |
| fzyzcjy | 62222bd27e | Minor tool for comparison of benchmark results (#7974) | 2025-07-27 00:27:50 -07:00 |
| Lifu Huang | 5c705b1dce | Add perf tests for LoRA (#8314) | 2025-07-26 14:55:22 -07:00 |
| Hubert Lu | af4b9bae95 | [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (#7135); Co-authored-by: yiakwy-xpu-ml-framework-team <961186938@qq.com>; Co-authored-by: HAI <hixiao@gmail.com> | 2025-07-24 23:44:28 -07:00 |
| Hubert Lu | e50109f2ed | [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (#7484) | 2025-07-21 17:33:19 -07:00 |
| Pavel Logachev | 877e35d775 | Add get_hidden_dim to qwen3.py for correct lora (#7312) | 2025-07-19 19:31:16 -07:00 |
| Lifu Huang | 4e3defe5a7 | Support starting up the LoRA server without initial adapters (#8019) | 2025-07-19 15:38:09 -07:00 |
| Lianmin Zheng | bb0e8a32b5 | Clean up server args (#8161) | 2025-07-19 11:32:52 -07:00 |
| Cheng Wan | 15ad6c9086 | [1/N] MoE Refactor: refactor select_experts (#7966) | 2025-07-19 00:51:15 -07:00 |
| Hongbo Xu | 1f76fc8747 | [3/n] chore: decouple AWQ implementation from vLLM dependency (#8113); Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> | 2025-07-18 11:45:22 -07:00 |
| Hank Han | 2117f82def | [ci] CI supports using cached models (#7874) | 2025-07-14 11:42:21 +00:00 |
| Lifu Huang | e2ed9d049a | Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (#7844) | 2025-07-13 18:36:01 -07:00 |
| SijiaYang | cb9d91ea8a | feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (#7762); Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com> | 2025-07-07 14:47:21 -07:00 |
| YanbingJiang | 4de0395343 | Add V2-lite model test (#7390); Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com> | 2025-07-03 22:25:50 -07:00 |
| Lifu Huang | 49538d111b | Support dynamic LoRA loading / unloading in engine/server API (#7446) | 2025-06-27 21:00:27 -07:00 |
| Lifu Huang | 2373faa317 | Fix flakiness in LoRA batch test. (#7552) | 2025-06-27 19:51:43 -07:00 |
| xutizhou | 506c4928f5 | feat: integrate deepgemm into EPMoE (#6821); Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>; Co-authored-by: TianQiLin666666 <1834987979@qq.com>; Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> | 2025-06-23 01:38:58 -07:00 |
| Stefan He | 3774f07825 | Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099) | 2025-06-19 00:56:37 -07:00 |
| JieXin Liang | 5ca07eed90 | [fix] fix DeepGEMM blackwell input quant & ut & fix style and log (#7247) | 2025-06-16 11:45:54 -07:00 |
| woodx | e30ef368ab | Feat/support rerank (#6058) | 2025-06-16 10:50:01 -07:00 |
| Lianmin Zheng | f47a1b1d0f | Increase timeout in test/srt/test_disaggregation.py (#7175) | 2025-06-13 23:12:14 -07:00 |
| Quanfeng Li | ef32677444 | Fix positional argument (#7093) | 2025-06-11 18:31:13 -07:00 |
| Baizhou Zhang | 25a6a9aa22 | Fix circular import in test_prefix_chunk_info.py (#7097) | 2025-06-11 10:57:45 -07:00 |
| Yineng Zhang | 56ccd3c22c | chore: upgrade flashinfer v0.2.6.post1 jit (#6958); Co-authored-by: alcanderian <alcanderian@gmail.com>; Co-authored-by: Qiaolin Yu <qy254@cornell.edu>; Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>; Co-authored-by: Mick <mickjagger19@icloud.com>; Co-authored-by: ispobock <ispobaoke@gmail.com> | 2025-06-09 09:22:39 -07:00 |
| Sai Enduri | 77e928d00e | Update server timeout time in AMD CI. (#6953) | 2025-06-07 15:10:27 -07:00 |
| Zaili Wang | 562f279a2d | [CPU] enable CI for PRs, add Dockerfile and auto build task (#6458); Co-authored-by: diwei sun <diwei.sun@intel.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2025-06-05 13:43:54 -07:00 |
| Pavani Majety | 0df6765c83 | [CUTLASS-FP4-MOE] Introduce CutlassMoEParams class for easy initialization of Cutlass Grouped Gems Metadata (#6887); Signed-off-by: Pavani Majety <pmajety@nvidia.com> | 2025-06-05 13:13:14 -07:00 |
| Marc Sun | 37f1547587 | [FEAT] Add transformers backend support (#5929) | 2025-06-03 21:05:29 -07:00 |
| Pavani Majety | eb38c7d1ca | [1/2] Add Kernel support for Cutlass based Fused FP4 MoE (#6093); Signed-off-by: Pavani Majety <pmajety@nvidia.com> | 2025-06-02 13:48:03 -07:00 |
| Lianmin Zheng | 20fd53b8f6 | Correctly abort the failed grammar requests & Improve the handling of abort (#6803) | 2025-06-01 19:00:07 -07:00 |
| Lianmin Zheng | 2d72fc47cf | Improve profiler and integrate profiler in bench_one_batch_server (#6787) | 2025-05-31 15:53:55 -07:00 |