Commit Graph

  • 30c6e1f569 Qwen3-Next support (#10233) Yi Zhang 2025-09-11 19:11:49 +08:00
  • bfe01a5eef chore: upgrade v0.3.9.post2 sgl-kernel (#10297) Yineng Zhang 2025-09-11 04:10:29 -07:00
  • 3dd6420a4d [CI] add pyproject.toml to deepseek w4a8 ci (#10314) Hank Han 2025-09-11 17:10:50 +08:00
  • 532f998b0f chore: bump sgl-kernel 0.3.9.post2 (#10311) Yineng Zhang 2025-09-11 01:29:50 -07:00
  • de15d1405a Revert "Fix flashinfer version in sgl-kernel (#10135)" (#10310) Yineng Zhang 2025-09-11 01:27:58 -07:00
  • 37367da639 [fix CI] Fix logical condition in fused MoE layer for compressed tensor quantization (#10299) Xiaoyu Zhang 2025-09-11 14:54:09 +08:00
  • ef959d7b85 [CPU] fix OOM when mem-fraction is not set (#9090) Zaili Wang 2025-09-11 14:52:22 +08:00
  • 4aa1e69bc7 [chore]Add sgl-router to npu images (#10229) BourneSun0527 2025-09-11 14:51:16 +08:00
  • dc491b399d add flash linear attention triton kernel (#10239) Yi Zhang 2025-09-11 12:47:20 +08:00
  • 5b64f006ec [Feature] Support DeepEP normal & Redundant Experts on NPU (#9881) Even Zhou 2025-09-11 11:35:26 +08:00
  • 5b7448de77 chore: bump sgl-kernel 0.3.9.post1 (#10294) Yineng Zhang 2025-09-10 18:26:34 -07:00
  • 6d55f60e77 Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292) Yineng Zhang 2025-09-10 18:24:23 -07:00
  • 033b75f559 [Auto Sync] Update serving_base.py, serving_chat.py, servin... (20250910) (#10282) Lianmin Zheng 2025-09-10 16:58:59 -07:00
  • f3b5db6ee8 Feat: support disable tool parser (#10184) Xinyuan Tong 2025-09-10 14:03:55 -07:00
  • 2286e85e77 pass a_scale from fp8 quant result instead of hard code to 1.0f (#10241) Rain Jiang 2025-09-10 12:56:05 -07:00
  • 91b3555d2d Add tests to AMD CI for MI35x (#9662) Hubert Lu 2025-09-10 12:50:05 -07:00
  • 9e2f7252db add dual stream for qwen2_moe (#10252) Yi Zhang 2025-09-11 03:49:43 +08:00
  • 21176b0093 [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint (#9940) Pavani Majety 2025-09-10 12:00:23 -07:00
  • 941002945b [1/2] Refactor LoRA to support backend-specific batch preprocessing. (#10251) Lifu Huang 2025-09-10 09:58:37 -07:00
  • cda7e47ce7 [router] Add PD router mmlu test (#10256) Keyang Ru 2025-09-10 08:47:24 -07:00
  • e903f695c8 Fix potential flakiness in test_lora_qwen3 (#10250) Lifu Huang 2025-09-10 01:04:39 -07:00
  • 27760fc1b6 [Auto Sync] Update io_struct.py (20250910) (#10262) Lianmin Zheng 2025-09-10 00:16:37 -07:00
  • 0ac809de33 Fix assertion typo in tp_worker.py (#9954) Seunggeun Cho 2025-09-10 14:43:50 +09:00
  • 4efe2c57c9 support vlm model spec bench (#10173) Lzhang-hub 2025-09-10 13:37:04 +08:00
  • 5be8c2f7f7 Page first direct IO kernel (#10060) huangtingwei 2025-09-10 13:35:34 +08:00
  • 737d73ed5b Fix: the default choice is wrong for flashinfer mxfp4 moe precision (#10253) Yiyu Liu 2025-09-10 00:10:38 -04:00
  • ebd0e1c18b [doc] add walkthrough for implementing and hosting a simple llama wrapper m… (#10093) Glen Liu 2025-09-10 00:05:06 -04:00
  • a1d038924b [Benchmark] Prefil-only benchmark scripts (#10240) Sundara Raman Ramachandran 2025-09-09 19:59:07 -07:00
  • dccf52f9c8 [UT for RL] Add UT to cover release/resume memory case for moe model (#8803) ryang 2025-09-10 10:25:12 +08:00
  • 676a7b51bd make --speculative-draft-model an alias of --speculative-draft-model-path (#10246) Lianmin Zheng 2025-09-09 19:12:24 -07:00
  • 15f993472c refactor(InternVL): Use gpu to preprocess the input image (#9795) Kevin Tuan 2025-09-10 10:09:04 +08:00
  • bcf1955f7e Revert "chore: upgrade v0.3.9 sgl-kernel" (#10245) Lianmin Zheng 2025-09-09 19:05:20 -07:00
  • a06bf66425 [Auto Sync] Update collector.py, startup_func_log_and_timer... (20250910) (#10242) Lianmin Zheng 2025-09-09 18:05:16 -07:00
  • bf72b80122 [Auto Sync] Update io_struct.py (20250909) (#10236) Lianmin Zheng 2025-09-09 14:15:21 -07:00
  • 8cbe1538ef Add mamba kernel (#10234) Yi Zhang 2025-09-10 03:58:43 +08:00
  • 8471e5e616 [HiCache] feat: add mooncake backend extra config (#10213) Teng Ma 2025-09-10 03:50:00 +08:00
  • 4582931ac3 Revert "Revert the changes on NCCL symmetric memory" (#10238) Lianmin Zheng 2025-09-09 12:11:49 -07:00
  • d352c29aa0 Revert the changes on NCCL symmetric memory (#10210) Lianmin Zheng 2025-09-09 11:01:33 -07:00
  • d3ee70985f chore: upgrade v0.3.9 sgl-kernel (#10220) Yineng Zhang 2025-09-09 03:16:25 -07:00
  • 71fc7b7fad [Fix] KV-cache eviction mismatch across PP ranks in DeepSeek V3/R1 (#10214) Rain H 2025-09-09 17:07:13 +08:00
  • 9ab72f9895 add variable TP Decode > Prefill size support (#9960) shaharmor98 2025-09-09 11:47:26 +03:00
  • f3817cb0b2 chore: bump v0.3.9 sgl-kernel (#10208) Yineng Zhang 2025-09-09 01:40:05 -07:00
  • 71133a0426 [Auto Sync] Update sampling_batch_info.py (20250909) (#10212) Lianmin Zheng 2025-09-09 01:29:52 -07:00
  • 2cd94dd07e tool-call(dsv3): Fixed a parse problem when there are multiple function definitions in tool_calls (#10209) Yiming 2025-09-09 15:47:28 +08:00
  • f5f6b3b4b5 Refactor fused_add_rmsnorm import logic (#10207) Shangming Cai 2025-09-09 15:23:58 +08:00
  • 94fb4e9e54 feat: support fa cute in sgl-kernel (#10205) Yineng Zhang 2025-09-09 00:14:39 -07:00
  • d1d4074c4e [CPU] Add gelu_and_mul kernel in sgl-kernel and add ut (#9300) blzheng 2025-09-09 14:23:13 +08:00
  • 718f25ae6e Explicitly export CMAKE_BUILD_PARALLEL_LEVEL (#10193) Keyang Ru 2025-09-08 22:35:27 -07:00
  • 948b01a04c [Refactor] Remove Hicache Load & Write threads (#10127) DarkSharpness 2025-09-08 22:18:50 -07:00
  • cdc56ef6c1 feat: use sgl-kernel cu129 as default (#10188) Yineng Zhang 2025-09-08 22:01:17 -07:00
  • 16ff3d4b05 Support opt model (#10165) wenhuipeng 2025-09-09 12:45:00 +08:00
  • 83d55ac51f [1/N]DP refactor: Improve dp rank scheduling in PD disaggregation mode. (#10169) Liangsheng Yin 2025-09-09 12:27:55 +08:00
  • 2fe17735a6 Updated Nvidia Jetson docs (#4422) Shakhizat Nurgaliyev 2025-09-09 08:41:21 +05:00
  • 97fff98c68 [CPU] Fix phi4-mm prompt issue in bench_serving (#9900) blzheng 2025-09-09 11:12:32 +08:00
  • ba066ca02f Update link for EAGLE speculative decoding (#10191) geray 2025-09-09 11:09:50 +08:00
  • 96784a65fd [Fix] Orphan process in data parallel (#7995) Caproni 2025-09-09 11:09:09 +08:00
  • df5407fb53 Revert "feat: add fused moe config for Qwen3-30B-A3B on B200" (#10185) Rain Jiang 2025-09-08 18:11:15 -07:00
  • 8ad700f735 Cleaning codes for speculative attention mode (#10149) Baizhou Zhang 2025-09-08 17:38:06 -07:00
  • 148022fc36 gb200: update dockerfile to latest kernel (#9522) ishandhanani 2025-09-08 17:32:36 -07:00
  • 7a40e4f4a6 fix the cutlass moe tests (#10182) Rain Jiang 2025-09-08 16:24:55 -07:00
  • 19d64f2b72 fix: resolve lint issue (#10181) Yineng Zhang 2025-09-08 15:09:55 -07:00
  • a02071a12c [Bench] feat: mooncake trace integration (#9839) Teng Ma 2025-09-09 02:50:54 +08:00
  • 45b3a6a256 Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" (#10176) Yineng Zhang 2025-09-08 11:28:15 -07:00
  • 9a18aa54c2 [fix] Relax white space rules in EBNFComposer (#9595) LukasBluebaum 2025-09-08 19:47:19 +02:00
  • 91f0fd95a4 pref: Add H20 fp8 fused MoE kernel configs for Qwen3 (#10166) Zhiy-Zhang 2025-09-09 00:57:21 +08:00
  • 8085aca791 [Bug fix] Fix ascend mla in aclgraph (#9925) alanhe151220037 2025-09-09 00:49:43 +08:00
  • 0096798ed6 [1/2] Speed up prefill mla attention (#10156) fzyzcjy 2025-09-09 00:00:33 +08:00
  • 2c2b19b18b [CI] fix ambiguous argument in testing hybrid attentions. (#10161) Liangsheng Yin 2025-09-08 18:16:52 +08:00
  • 72f9fc5f11 Monkey patch uvicorn multi worker is_alive timeout (#10159) Liangsheng Yin 2025-09-08 17:43:23 +08:00
  • ec99668ab7 [Hicache]: Add E2E CI For 3FS-KVStore (#10131) hzh0425 2025-09-08 16:54:50 +08:00
  • 78f139812a [1/N] DP-Refactor: move communicators into tokenizer_communicator_mixin (#10028) Liangsheng Yin 2025-09-08 16:27:37 +08:00
  • bfd7a18d8d update xgrammar 0.1.24 and transformers 4.56.1 (#10155) Swipe4057 2025-09-08 11:20:31 +03:00
  • 5dd8c6444b [Bug fix] Fix Gemma 2 and fix Gemma 3 multimodal with bs > 1 on NPU (#9871) ssshinigami 2025-09-08 11:19:40 +03:00
  • ee21817c6b enable llama3.1-8B on xpu (#9434) Huaiyu, Zheng 2025-09-08 13:34:20 +08:00
  • b7d1f17b8d Revert "enable auto-round quantization model (#6226)" (#10148) Yineng Zhang 2025-09-07 22:31:11 -07:00
  • c8295d2353 enable auto-round quantization model (#6226) Weiwei 2025-09-08 13:05:35 +08:00
  • b67c277f86 [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph (#10013) Even Zhou 2025-09-08 12:50:49 +08:00
  • 8116804e4f Fix: (glm4v) Add missing field (#10147) Xinyuan Tong 2025-09-08 04:47:14 +00:00
  • 8c5930f08a Add speculator attention backend switch (#9981) cicirori 2025-09-08 06:44:36 +02:00
  • 3b99f23c44 [Bugfix] Retract not releasing enough memory when page size > 1 (#9989) Zhiqiang Xie 2025-09-07 21:41:50 -07:00
  • ee0b3c5bad [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel, fixed) (#10108) Yuhao Yao 2025-09-08 12:39:07 +08:00
  • 6049ca209e move compile threads to an option to avoid OOM on low memory host (#10123) Rain Jiang 2025-09-07 21:36:14 -07:00
  • 7577f0e40f Add graph runner support with torch compile on CPU (#7843) Cao E 2025-09-08 12:33:58 +08:00
  • 8cda5a622c Standalone speculative decoding (#10090) Qiaolin Yu 2025-09-07 20:55:09 -07:00
  • 400d3b97ae Fix run time error in dsv3-fp8 model on mi35x (#10104) kk 2025-09-08 11:45:17 +08:00
  • 37d83c6e6d Qwen2.5-VL eagle3 infer (#8801) Lzhang-hub 2025-09-08 11:44:34 +08:00
  • 7802586cab fix the fp8 topk_config.correction_bias is none bug (#10040) Rain Jiang 2025-09-07 20:28:14 -07:00
  • bc5fc332f7 Fix slow fused add RMSNorm (#10141) fzyzcjy 2025-09-08 11:20:39 +08:00
  • c06fd64d75 Update README.md v0.5.2rc1 maxiao1 2025-09-08 03:02:07 +00:00
  • f3440adcb5 vlm: enable GLM4.1V server testing & fix video processing (#10095) Xinyuan Tong 2025-09-08 02:53:08 +00:00
  • 5a7e10fe4c [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 (#10144) Cheng Wan 2025-09-07 19:43:59 -07:00
  • 33467c05a4 [BUG FIX] add fail check when get fail in case wait complete block (#9971) Shisong Ma 2025-09-08 09:34:04 +08:00
  • b0fcbb74d0 [DOC]: some minor updates (#10134) eigen 2025-09-07 17:58:15 -04:00
  • 76a2c86b88 Fix flashinfer version in sgl-kernel (#10135) Lianmin Zheng 2025-09-07 12:54:07 -07:00
  • e719bb0e84 [1/2] Refactor multi-tokenizer manager (#10074) Liangsheng Yin 2025-09-07 19:13:34 +08:00
  • 067246830d [Minor] fix lint in main (#10128) DarkSharpness 2025-09-07 02:36:46 -07:00
  • 617aa2b248 [Auto Sync] Update parallel_state.py (20250907) (#10126) Lianmin Zheng 2025-09-07 02:12:32 -07:00
  • 111b137964 add dataset_path for bench_one_batch_server.py (#10113) miter 2025-09-07 14:07:09 +08:00
  • 41628dc1b1 [HiCache] fix: check clear() method for storage backend (#10096) Teng Ma 2025-09-07 13:59:58 +08:00
  • a12061df4c Fix cuda graph mode in flashinfer attn backend (#10056) Ben Barsdell 2025-09-07 15:59:48 +10:00