c1c7dc4534 | 2025-08-14 12:13:46 -07:00 | Chengxing Xie | feat: Add model version tracking with API endpoints and response metadata (#8795)
2cc9eeab01 | 2025-08-14 12:05:46 -07:00 | Hongbo Xu | [4/n]decouple quantization implementation from vLLM dependency (#9191)
    Co-authored-by: AniZpZ <aniz1905@gmail.com>
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
63d82a776a | 2025-08-14 10:57:29 -07:00 | Xiaoyu Zhang | refine mxfp4 shuffling log (#9194)
5aa1ebd242 | 2025-08-14 03:19:03 -07:00 | Peng Zhang | [2/n]decouple quantization implementation from vLLM dependency (#8112)
    Co-authored-by: walker-ai <yiyun.wyt@antgroup.com>
    Co-authored-by: leoneo <1320612015@qq.com>
4dbf43601d | 2025-08-14 02:39:09 -07:00 | eigen | fix: zero_init buffer (#9065)
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
3d6be1fbce | 2025-08-13 23:15:09 -07:00 | lukec | add w8a8-fp8-block-wise H20-3e triton config (#8018)
4063234c1a | 2025-08-13 23:14:09 -07:00 | Jun Liu | Add H200 fused MoE kernel configs for DeepSeek-V3 in triton 3.3.1 (#7687)
83feef5b2c | 2025-08-13 23:13:22 -07:00 | Tommy Yang | Add H20 fused MoE kernel configs for Dpsk & Qwen3 (#7631)
2871eacc05 | 2025-08-13 23:12:27 -07:00 | Brayden Zhong | Add Triton Fused MoE kernel config for E=16 on B200 (#7004)
ac15bdc194 | 2025-08-13 23:11:11 -07:00 | forestlee95 | Add H200 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (#8852)
d6451c3f65 | 2025-08-13 23:03:17 -07:00 | Li Hui | Add A800 fused MoE kernel tuning configs for GLM4.5 and GLM4.5-Air (#8808)
733446dd36 | 2025-08-14 12:46:42 +08:00 | pansicheng | fix io group (#9154)
    Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
4c22897a66 | 2025-08-13 21:10:29 -07:00 | wxzhoucs | Feature: support qwen and llama4 reducescatter for dp attention padding (#9101)
1bc183c6de | 2025-08-13 21:09:34 -07:00 | Alex Yang | Faster weight processing (trtllm-gen moe nvfp4) (#9162)
b87aacb5c5 | 2025-08-13 21:08:06 -07:00 | Cheng Wan | [DP Attention] Refactor: adding some utility functions (#9136)
98457c0453 | 2025-08-13 21:04:41 -07:00 | Huaixin Chang | [Bugfix] Avoid unnecessary reduce-scatter call in prepare_mlp (#9169)
a669bc2f74 | 2025-08-13 19:41:41 -07:00 | Hongbo Xu | Replace sglang.srt.layers.quantization.scalar_types with sgl_kernel.scalar_type (#8951)
6b7c24712c | 2025-08-13 16:40:55 -07:00 | Nicolas Castet | Fix broken trtllm_mha attn backend with gpt-oss (#9161)
a027a9b4b3 | 2025-08-14 05:12:24 +08:00 | Sundara Raman Ramachandran | [Generative Score API] Optimization to Remove Decode. (#8840)
9e426466af | 2025-08-13 13:56:04 -07:00 | Lianmin Zheng | Clean up allocators (#9134)
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
65736dc524 | 2025-08-13 11:14:54 -07:00 | Zhihao Liu | [Model] Support Qwen3ForSequenceClassification for Qwen3-Embed Model (#7957)
7b56e494be | 2025-08-13 10:44:14 -07:00 | Yineng Zhang | chore: bump v0.5.0rc1 (#9069)
0ff6d1fce1 | 2025-08-13 10:41:50 -07:00 | Ke Bao | Support FA3 backend for gpt-oss (#9028)
4a16a71c36 | 2025-08-13 09:54:34 -07:00 | Teng Ma | [PD] feat: mooncake use batch reg/dereg (#8910)
    Co-authored-by: Shangming Cai <csmthu@gmail.com>
a16923efab | 2025-08-13 09:54:14 -07:00 | Francis | [PD] optimize kv cache transfer directly using batch transfer (#9149)
    Co-authored-by: Shangming Cai <csmthu@gmail.com>
3b3b3baf9f | 2025-08-13 02:08:30 -07:00 | Kevin Xiang Li | Double vision prefill throughput by defaulting to optimal vision attention backend (#8484)
    Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
9394ed6386 | 2025-08-13 00:11:43 -07:00 | fzyzcjy | Fix gpt-oss ~2x memory consumption issue (#9146)
930fe467bd | 2025-08-12 21:21:55 -07:00 | Stefan He | Support Triton FP8 Gemm can handle hidden_dim not divisible by 16 (#9093)
    Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
8723b4f146 | 2025-08-12 20:08:40 -07:00 | Elfie Guo | Use FlashInfer's TRTLLM FP8 Blockscale GEMM (#8588)
0edda32001 | 2025-08-12 15:59:26 -07:00 | huangtingwei | Support page first layout zero copy for mooncake store (#8651)
    Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
25caa7a8a9 | 2025-08-12 13:49:11 -07:00 | jacky.cheng | [AMD] Support Wave attention backend with AMD GPU optimizations (#8660)
    Signed-off-by: Stanley Winata <stanley.winata@amd.com>
    Signed-off-by: Harsh Menon <harsh@nod-labs.com>
    Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>
    Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
    Signed-off-by: xintin <gaurav.verma@amd.com>
    Co-authored-by: Harsh Menon <harsh@nod-labs.com>
    Co-authored-by: Stanley Winata <stanley.winata@amd.com>
    Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com>
    Co-authored-by: Stanley Winata <stanley@nod-labs.com>
    Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com>
    Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com>
    Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com>
    Co-authored-by: Ivan Butygin <ibutygin@amd.com>
83123f481e | 2025-08-12 13:31:18 -07:00 | ichernob | [Quantization] Supported w8a8 int8 quantized Gemma3 and Qwen-VL models (#8619)
    Co-authored-by: ronnie_zheng <zl19940307@163.com>
48afa8f14f | 2025-08-12 13:28:31 -07:00 | ronnie_zheng | [feat] Enable Ascend profiling on SGLang (#8610)
    Co-authored-by: liyou_b <2953090824@qq.com>
c9ee738515 | 2025-08-12 13:15:30 -07:00 | Jiaqi Gu | Fuse writing KV buffer into rope kernel (part 2: srt) (#9014)
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
5f5b3b2449 | 2025-08-12 12:23:46 -07:00 | Cheng Wan | [5/n] DP Enhancement: Correct num_token_non_padded (#9107)
f2a5de284b | 2025-08-12 09:56:13 -07:00 | Chang Su | [Bugfix] Fix accuracy-test-1-gpu failure caused by builtin_tools (#9114)
9aea255522 | 2025-08-12 01:46:40 -07:00 | fzyzcjy | Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077)
5190ba7f42 | 2025-08-12 01:20:13 -07:00 | fzyzcjy | Fuse two kernels of hidden states padding into quantization kernel (#9005)
    Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
9c83d74da3 | 2025-08-11 23:53:10 -07:00 | Chang Su | bugfix: Fix the commentary msg extraction in GptOssDetector (#9097)
b4ac2b9c0c | 2025-08-11 23:50:23 -07:00 | DarkSharpness | [Fix] Fix dual chunk model default behavior (#9032)
83262dcb29 | 2025-08-11 23:44:40 -07:00 | Jianwei Dong | Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization (#8766)
c46c75f8c0 | 2025-08-11 23:25:36 -07:00 | zixuanzhang226 | feat: add fused moe config for Qwen3-30B-A3B on B200 (#9087)
2aaf22c46c | 2025-08-11 23:06:39 -07:00 | Makcum888e | Optimization for AscendPagedTokenToKVPoolAllocator (#8293)
    Co-authored-by: ronnie_zheng <zl19940307@163.com>
    Co-authored-by: VDV1985 <vladdv85@mail.ru>
5ded39cab2 | 2025-08-11 22:59:29 -07:00 | Lifu Huang | Fix race condition in async lora unload (#9084)
a218490136 | 2025-08-11 18:59:18 -07:00 | Chang Su | (gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support (#9043)
0eec4cb6cc | 2025-08-11 16:54:52 -07:00 | Zhiqiang Xie | HiCache, add bench long context plus minor fixs (#9086)
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
ff1f68252c | 2025-08-11 14:20:39 -07:00 | Faradawn Yang | [fix] Set Radix tree root node hash to None - Nvidia Dynamo Integration (#9030)
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
9f78f391ae | 2025-08-11 14:18:59 -07:00 | Zhiqiang Xie | HiCache Storage: generate hash when inserting new nodes (#9053)
f508cd3cb7 | 2025-08-11 14:02:13 -07:00 | Faraz | TRTLLM-MLA FP8 path (#8638)
    Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
44e86480e8 | 2025-08-11 13:50:53 -07:00 | Xiaoyu Zhang | fuse allreduce and residual_rmsnorm (#8731)