Commit Graph

  • c9bcffd2a5 Add dcu_alloc_decode_kernel implementation v0.5.4_dev_liucong liucong 2025-11-04 20:27:27 +08:00
  • 46da95569f Merge branch 'v0.5.4_dev_linhai' into 'v0.5.4_dev' v0.5.4_dev maxiao1 2025-11-04 12:04:48 +00:00
  • ee77577211 V0.5.4 dev linhai linhai1 2025-11-04 12:04:47 +00:00
  • a9e0e668c4 Merge branch 'v0.5.4_dev_maxiao' into 'v0.5.4_dev' maxiao1 2025-11-04 09:33:25 +00:00
  • 0c5532b0c1 enable custom_allreduce maxiao1 2025-11-04 09:33:24 +00:00
  • fa3882d218 delete print info v0.5.4_dev_maxiao maxiao 2025-11-04 16:30:38 +08:00
  • 6e2b63d4c4 Use vllm custom allreduce maxiao 2025-11-04 15:28:54 +08:00
  • 785e5e900b Merge branch 'v0.5.4_dev_maxiao' into 'v0.5.4_dev' maxiao1 2025-11-03 08:30:29 +00:00
  • d2fdeac22f Call custom all reduce in vllm maxiao 2025-11-03 16:28:21 +08:00
  • 47e4d92348 Merge branch 'v0.5.4_dev_qwennext' into 'v0.5.4_dev' maxiao1 2025-11-03 05:18:16 +00:00
  • 0fbecc4364 Merge branch 'v0.5.4_dev_maxiao' into 'v0.5.4_dev' maxiao1 2025-11-03 03:33:05 +00:00
  • 75cd34d172 change sgl_kernel WARP_SIZE to 64 maxiao1 2025-11-03 10:17:53 +08:00
  • 477fddf28d Adapt to qwen3-next maxiao 2025-10-30 18:03:07 +08:00
  • 8fc552638f Merge branch 'v0.5.4_dev_maxiao' into 'v0.5.4_dev' maxiao1 2025-10-29 02:09:59 +00:00
  • eb4ba1c295 update UNBALANCED_MODEL_LOADING_TIMEOUT_S=3600 maxiao1 2025-10-29 10:06:23 +08:00
  • 4b9b337b39 Adapt w8a8 models maxiao1 2025-10-29 09:06:22 +08:00
  • f6528b74be Add hipprof support; fix synchronization issue in async scheduling lizhigong 2025-10-28 16:25:06 +08:00
  • a5718531b7 Disable custom_allreduce to preserve correctness maxiao1 2025-10-28 10:57:25 +08:00
  • c333f12547 Add tpot and other metrics to bench_serving.py guobj 2025-10-28 02:11:36 +00:00
  • f9a026ad2b fix fused_add_rms_norm bug maxiao 2025-10-27 10:27:57 +08:00
  • b80ae5e9ff adaptation w4a8 tp maxiao1 2025-10-25 16:33:07 +08:00
  • b091a7a5c9 adapt w4a8 marlin deepep dp ep lizhigong 2025-10-22 15:16:12 +08:00
  • 143ec5f36c adaptation w4A8 quantization lizhigong 2025-10-21 16:27:31 +08:00
  • 67510e0172 adaptation part w4A8 quantization lizhigong 2025-10-21 14:06:43 +08:00
  • b4dff7f5ef adaptation w4A8 quantization v0.5.4 lizhigong 2025-10-21 16:27:31 +08:00
  • c0352f4aab adapt w4a8 marlin deepep dp ep lizhigong 2025-10-22 15:16:12 +08:00
  • 32b1ccaf62 Modify setup_hip.py under sgl-kernel maxiao1 2025-10-25 13:11:02 +08:00
  • 251235c229 Adapt to v0.5.4 maxiao 2025-10-25 12:16:25 +08:00
  • 1053e1be17 chore: bump SGLang version to 0.5.4 (#12027) sglang-bot 2025-10-24 09:01:40 +08:00
  • 9a71500cfb Fixed aarch64 flash-mla (#12009) nvjullin 2025-10-24 08:47:04 +08:00
  • 6d6e24bcc4 [router] Add builder pattern for RouterConfig with zero duplication (#12030) Simo Lin 2025-10-23 16:46:10 -07:00
  • 2c057fbfa8 Update Github action title for kernel build (#12029) Kangyan-Zhou 2025-10-23 13:39:40 -07:00
  • dbd9435dc1 Fix mamba radix cache eviction logic in alloc_req_slots (#11616) Roger Young 2025-10-24 04:07:43 +08:00
  • 8ae9d4bb41 Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" (#12028) b8zhong 2025-10-23 12:42:59 -07:00
  • 1c304aa9bc Log iteration # for prefill and decode (#9366) Nicolas Castet 2025-10-23 14:28:03 -05:00
  • 770529a731 model: support deepseek-ocr (#11891) Mick 2025-10-24 03:15:17 +08:00
  • 39c237f02c Add AWQ quantization support for NPU. (#10158) ErvinXie 2025-10-24 03:08:05 +08:00
  • 28b8a4064d [router][CI] Clean up imports and print statements in sgl-router/py_test (#12024) Chang Su 2025-10-23 11:56:57 -07:00
  • 8bd26dd4e6 ci: fix night-ci with push retry mechanism (#11765) Mick 2025-10-24 02:31:05 +08:00
  • ab07cd3e5a [Auto Sync] Update test_deterministic_utils.py (20251023) (#12022) Lianmin Zheng 2025-10-23 11:20:45 -07:00
  • a98496834b Feature/nano v2 offline modelopt fp8 and nvfp4 (#12018) Netanel Haber 2025-10-23 21:16:46 +03:00
  • a4b637d87a [router] change ci names and update log level in ci (#12021) Simo Lin 2025-10-23 10:36:19 -07:00
  • 96a5e4dd79 [Feature] Support loading weights from ckpt engine worker (#11755) Teng Ma 2025-10-24 00:23:30 +08:00
  • b0b4f71679 [Fix] memory leak by overlap + retract (#11981) cctry 2025-10-23 07:59:23 -07:00
  • 6c18addb6f Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" (#12015) Liangsheng Yin 2025-10-23 21:27:58 +08:00
  • 32852fe9e9 Move memory runtime checker to mixin class (#12014) Liangsheng Yin 2025-10-23 20:53:26 +08:00
  • 53c2934dce [Router] Consolidate ConnectionMode enum to core module (#11937) Arthur Cheng 2025-10-23 05:15:49 -07:00
  • e321c97113 [router] Add comprehensive E2E tests for Response API (#11988) Keyang Ru 2025-10-23 05:13:51 -07:00
  • d6fee73d1f Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 (#11866) Netanel Haber 2025-10-23 12:29:02 +03:00
  • 36a4cad7b0 Support overlap-spec-v2 with trtllm_mla attention backend (#11821) Qiaolin Yu 2025-10-23 01:55:35 -07:00
  • 65d376b491 aiter update to v0.1.6.post1 (#12004) HAI 2025-10-22 23:53:05 -07:00
  • c23eda8589 Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend (#11985) yinghui 2025-10-22 22:44:45 -07:00
  • 138ff23187 Allow to disable batch decoding. (#11944) Jue WANG 2025-10-22 21:57:12 -07:00
  • 13fb8b5489 [CPU] Optimize FP16 decode_attention_cpu (#10652) blzheng 2025-10-23 12:39:51 +08:00
  • 81fd2b0ee0 fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 (#11965) Zhengyi Lai 2025-10-23 12:20:54 +08:00
  • 007b849b0e [CPU] misc updates (#11906) Zaili Wang 2025-10-23 12:10:05 +08:00
  • 8612811d85 Bump grace blackwell DeepEP version (#11990) fzyzcjy 2025-10-23 12:08:12 +08:00
  • e7aa4664b3 [NVIDIA] Build CUDA 13 (#11299) Johnny 2025-10-22 20:03:12 -07:00
  • 4d4feccbb2 [ROCm] Remove vLLM rope dependency & use AITER impl (#11322) b8zhong 2025-10-22 19:17:34 -07:00
  • 99c92ff24b [AMD] Support a new flag to disable quant on parallelLinear layer if required (#11811) jacky.cheng 2025-10-23 10:16:15 +08:00
  • 6ade6a02d4 [grpc] Support gRPC standard health check (#11955) Chang Su 2025-10-22 16:59:09 -07:00
  • 983ef22cf3 [Doc] Update deterministic inference flag in server_arguments.md (#11978) Baizhou Zhang 2025-10-22 16:12:15 -05:00
  • 164302c7df Implement BGE-M3 Sparse Embeddings in SGLang (#10869) Christian Bahls 2025-10-22 22:46:16 +02:00
  • 5dccf69713 [router] create worker removal step and clean up worker manager (#11921) Simo Lin 2025-10-22 13:26:06 -07:00
  • eec9e471ca [NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#11563) jiahanc 2025-10-22 13:11:16 -07:00
  • 6d535b719f Revert "Recapture cuda graph after model weight update to resolve IMA error " (#11980) Lianmin Zheng 2025-10-22 11:50:26 -07:00
  • fdcb1d13c5 [BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… (#11977) yuho 2025-10-23 02:29:55 +08:00
  • d7e834d6ba [6/n]decouple quantization implementation from vLLM dependency (#10750) Hongbo Xu 2025-10-23 02:07:55 +08:00
  • 200a3c0bb1 [Documentation] add doc for deterministic inference (#11956) Minglei Zhu 2025-10-22 10:36:15 -07:00
  • 77258ce039 [router] Support multiple worker URLs for OpenAI router (#11723) Keyang Ru 2025-10-22 09:27:58 -07:00
  • 1d097aac87 [Fix] Remove unused import from triton_kernels_moe.py (#11967) Fan Yin 2025-10-22 21:02:57 +08:00
  • 7fceeef599 Fix flaky hicache test with mooncake backend (#11953) Shangming Cai 2025-10-22 21:00:47 +08:00
  • 88568c01eb [model] Support POINTSV15Chat (#9651) 996_icu 2025-10-22 16:58:17 +08:00
  • 904655c5fd [2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank (#10606) Hank Han 2025-10-22 16:13:31 +08:00
  • e028af6998 Fix mooncake dispatcher (#11908) Xun Sun 2025-10-22 16:11:49 +08:00
  • a0fb70e9c1 adapt w4a8 marlin deepep dp ep v0.5.3_dev lizhigong 2025-10-22 15:16:12 +08:00
  • 80b2b3207a Enable native ModelOpt quantization support (3/3) (#10154) Zhiyu 2025-10-21 21:44:29 -07:00
  • 4b65ed42cc [NVIDIA] upstream FA4 and fix cccl path (#11929) Johnny 2025-10-21 21:18:25 -07:00
  • 23afdfd1c2 [sgl-kernel] support flashmla libtorch (#11717) Fan Yin 2025-10-22 12:17:50 +08:00
  • 9d61205dac [lint] improve ruff check (#11922) Liangsheng Yin 2025-10-22 11:32:50 +08:00
  • 590bc4b7a7 [router][grpc] Fix background tasks stored with wrong id (#11945) Chang Su 2025-10-21 18:38:51 -07:00
  • 63cfe1b032 [router] Add gRPC E2E test suite (#11790) Keyang Ru 2025-10-21 17:51:21 -07:00
  • 70f6309cd4 [router][grpc] Support v1/responses API (#11926) Chang Su 2025-10-21 17:41:48 -07:00
  • 704160017d fix: resolve flashinfer 0.4.1 import (#11940) Yineng Zhang 2025-10-21 17:19:57 -07:00
  • 87a92e459a Fix openai input_text type compatibility (#11935) Keyang Ru 2025-10-21 16:10:35 -07:00
  • c461e7714d [Auto Sync] Update forward_batch_info.py (20251021) (#11934) Yineng Zhang 2025-10-21 15:52:15 -07:00
  • fde2decf8b [BugFix][Qwen3-VL]: add metadata for video in qwen3-vl (#11377) Zheng Wengang 2025-10-22 06:36:01 +08:00
  • 9792b9d7e3 chore: upgrade flashinfer 0.4.1 (#11933) Yineng Zhang 2025-10-21 14:46:31 -07:00
  • ef4a8097b8 Rename flashmla kernel options of nsa backend for better readability (#11876) Baizhou Zhang 2025-10-21 15:14:16 -05:00
  • ebff4ee648 Update sgl-kernel and remove fast hadamard dependency (#11844) Baizhou Zhang 2025-10-21 15:13:54 -05:00
  • 2b1da821b5 [NVIDIA] Add new SMs support for Spark & Thor (#11287) Serge Panev 2025-10-21 11:02:24 -07:00
  • 97710ccd1a Fix flush cache API for spec v2 (#11918) Liangsheng Yin 2025-10-21 23:01:16 +08:00
  • f3cd5d2510 [CI] Fix b200 flashinfer installation (#11915) Shangming Cai 2025-10-21 22:28:50 +08:00
  • c61b0b294c [quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n (#11702) Kai-Hsun Chen 2025-10-21 06:25:28 -07:00
  • e8640ee9be [smol] [perf] Inverse perm improvement (#11482) Vincent Zhong 2025-10-21 07:18:10 -04:00
  • d0a64c7e2c vlm: enforce pybase64 for image and str encode/decode (#10700) b8zhong 2025-10-21 04:05:32 -07:00
  • 05d3667ab9 [CI] disable glm4.1v and fix the flashinfer installation (#11902) Shangming Cai 2025-10-21 18:38:35 +08:00
  • 260fe755b6 Simplify multi-tokenizer (#11295) Zhengke Zhou 2025-10-21 16:33:29 +08:00
  • dbb16bedd5 Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] (#11416) ybyang 2025-10-21 16:27:56 +08:00
  • 848c5b8290 adaptation w4A8 quantization lizhigong 2025-10-21 16:27:31 +08:00