Commit Graph

  • eb06dbcbf8 Move rope and bmm into sgl-kernel (#4241) Lianmin Zheng 2025-03-09 18:38:15 -07:00
  • 9dfafa743c Fix test of flashinfer mla with nextn (#4237) Baizhou Zhang 2025-03-09 12:45:39 -07:00
  • f1d09a6541 Update bench speculative script (#4235) Ke Bao 2025-03-10 03:19:01 +08:00
  • df84ab2a5b update sgl-kernel 3rdparty (#4228) Yineng Zhang 2025-03-09 01:16:05 -08:00
  • 34c8898755 Check eagle server args (#4217) Ying Sheng 2025-03-09 01:10:43 -08:00
  • 0dd6cda288 Apply sgl w8a8 fp8 kernel (#3148) HandH1998 2025-03-09 16:03:32 +08:00
  • 9fb48f951f Support nextn for flashinfer mla attention backend (#4218) Baizhou Zhang 2025-03-09 00:01:54 -08:00
  • 89ccb533ad use sgl-kernel 0.0.4 (#4224) Yineng Zhang 2025-03-08 23:43:09 -08:00
  • dceb256f1b [docs] Unhide production metrics page (#4193) Stefan He 2025-03-08 23:41:40 -08:00
  • 0e90ae628a [docker] Distributed Serving with k8s Statefulset ( good example for DeepSeek-R1) (#3631) Peter Pan 2025-03-09 15:41:20 +08:00
  • 1361ab9e03 Lazily import lora backends (#4225) Lianmin Zheng 2025-03-08 23:39:26 -08:00
  • 5c7dd14ba1 chore: bump v0.0.4 for sgl-kernel (#4223) Yineng Zhang 2025-03-08 23:01:59 -08:00
  • 8abf74e3c9 Rename files in sgl kernel to avoid nested folder structure (#4213) Lianmin Zheng 2025-03-08 22:54:51 -08:00
  • ee132a4515 use latest sgl-kernel for mla test (#4222) Yineng Zhang 2025-03-08 22:27:47 -08:00
  • 79a321af55 revert pr 3628 to pass test_mla ci (#4219) Xiaoyu Zhang 2025-03-09 13:15:14 +08:00
  • 6eec3cdce6 docs(reasoning content): 📝 deepseek-r1 parser support qwq (#4124) Xihuai Wang 2025-03-09 12:14:50 +08:00
  • 48473684cc Split test_mla.py into two files (#4216) Lianmin Zheng 2025-03-08 15:40:49 -08:00
  • b3251e9f40 refine quant kernel code style (#4211) Xiaoyu Zhang 2025-03-08 21:47:35 +08:00
  • 2cadd51d11 Test no vllm custom allreduce (#4210) Lianmin Zheng 2025-03-08 05:23:06 -08:00
  • 4a893d142d Refactor Dockerfile: unify CUDA logic and reduce image size by ~2.6 GB (#3749) Kebe 2025-03-08 19:01:13 +08:00
  • 8d323e95e4 Use clang format 18 in pr-test-sgl-kernel.yml (#4203) Lianmin Zheng 2025-03-08 01:28:10 -08:00
  • 0fe7c13be1 Fix bench_serving flush cache not recognizing OPENAI_API_KEY (#4181) Mingshan 2025-03-08 17:03:38 +08:00
  • 08c4d764a5 lazy import attn backends (#4200) Lianmin Zheng 2025-03-08 00:41:35 -08:00
  • 96d0e37fa7 Revert "Minor improvement to per_tensor_quant_fp8 (#4197)" (#4198) Yineng Zhang 2025-03-07 22:57:09 -08:00
  • 90bb2be27e Minor improvement to per_tensor_quant_fp8 (#4197) Rex 2025-03-07 22:52:12 -08:00
  • b93ef5e56d Remove the vllm dependency from the moe_align function (#4164) lukec 2025-03-08 14:42:16 +08:00
  • d4017a6b63 [EAGLE] many fixes for eagle (#4195) Lianmin Zheng 2025-03-07 22:12:13 -08:00
  • d052f4c8a9 New clang format for sgl kernel (#4194) Lianmin Zheng 2025-03-07 20:21:08 -08:00
  • e1aaa79ac9 Update amd ci docker image to v0.4.3.post4-rocm630. (#4189) saienduri 2025-03-07 13:02:02 -08:00
  • 20c8119915 Fix eagle hang issue for max_new_tokens=1 (#4185) Ke Bao 2025-03-08 04:11:18 +08:00
  • 70866b6f4f use same version for ci and pyproject (#4187) Yineng Zhang 2025-03-07 10:39:55 -08:00
  • eb61f5c9af Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" (#4186) Yineng Zhang 2025-03-07 10:27:52 -08:00
  • 0beea4503f ROCm: Flex Attention Enablement with custom backends (#4178) HAI 2025-03-07 04:38:53 -08:00
  • c827c671f7 [Docs] Improve bullets appearance and grammar (#4174) Michael Yao 2025-03-07 19:16:25 +08:00
  • b55a621ffb fix int8 doc link (#4179) Yineng Zhang 2025-03-07 02:49:19 -08:00
  • ffa1b3e318 Add an example of using deepseekv3 int8 sglang. (#4177) lukec 2025-03-07 17:56:09 +08:00
  • 7e3bb52705 update release-pypi-kernel Yineng Zhang 2025-03-07 01:48:47 -08:00
  • 96263f275c chore: bump v0.0.3.post7 for sgl-kernel (#4176) Yineng Zhang 2025-03-07 01:15:34 -08:00
  • 9376ac361d Memory pool fix for upstream change about eagle (#4170) Zhiqiang Xie 2025-03-07 00:58:20 -08:00
  • 94a2b9d33e Put utils in ifndef USE_ROCM to fix CI (#4167) (#4168) Yineng Zhang 2025-03-07 00:01:17 -08:00
  • 3c3eb374b2 Remove non-existent AMD header include (#4166) Stefan He 2025-03-06 23:29:30 -08:00
  • d557319a8b [Docs] Fix links and grammar issues (#4162) Michael Yao 2025-03-07 15:14:18 +08:00
  • 95085d65e9 [Refactor] Reducing code duplication across FP8 CUDA quantization kernels (#4163) Stefan He 2025-03-06 22:58:52 -08:00
  • c7f254468f [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) (#3888) HandH1998 2025-03-07 12:54:52 +08:00
  • 63ee26d162 Add sgl_per_token_quant_fp8 (#4089) Stefan He 2025-03-06 20:53:05 -08:00
  • ad55f17182 [quant kernel] sgl-kernel support per_tensor_quant fp8 (#3786) Xiaoyu Zhang 2025-03-07 10:05:43 +08:00
  • 361971b859 Add Support for Qwen2-VL Multi-modal Embedding Models (#3694) Pan Lyu 2025-03-07 08:46:20 +08:00
  • 13bc39c5d6 ROCm: enable trillion-parameter MoE models with INT4-FP8 single node (#4152) HAI 2025-03-06 15:33:02 -08:00
  • 9854a18a51 Hot fix small vocal eagle in docs (#4154) Chayenne 2025-03-06 15:13:26 -08:00
  • ebddb65aed Docs: add torch compile cache (#4151) Chayenne 2025-03-06 14:27:09 -08:00
  • 19fd57bcd7 [docs] fix HF reference script command (#4148) Adarsh Shirawalmath 2025-03-07 02:51:54 +05:30
  • 9c58e68b4c Release v0.4.3.post4 (#4140) Lianmin Zheng 2025-03-06 12:50:28 -08:00
  • d03b3467b8 Fix constrained generation errors by adding datasets dependency (#4142) Oliver Stanley 2025-03-06 20:07:51 +00:00
  • ab7fba0ece Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant (#4147) yinfan98 2025-03-07 03:50:07 +08:00
  • bc1534ff32 Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134) Lianmin Zheng 2025-03-06 06:13:59 -08:00
  • 3a3918121f fix bench serving bug (#4135) Lzhang-hub 2025-03-06 21:34:02 +08:00
  • 800bf018fb Update CODEOWNER (#4138) Lianmin Zheng 2025-03-06 03:42:10 -08:00
  • b16af90bc3 AMD/ROCm: update base image string (#4137) kk 2025-03-06 19:38:54 +08:00
  • 98c73d71cb [Minor] make the __init__ function of model_runner.py shorter (#4132) Lianmin Zheng 2025-03-06 01:51:12 -08:00
  • fcc2e37f69 Split the __init__ of scheduler as smaller functions. Improve the eagle tests (#4128) Lianmin Zheng 2025-03-06 00:13:20 -08:00
  • 0804dd11a0 remove unused max_jobs in setup_rocm.py (#4126) Liu Jinjie 2025-03-06 00:12:19 -08:00
  • 55dc8e4d52 Add tag suffix to nightly docker builds. (#4129) saienduri 2025-03-05 23:22:36 -08:00
  • 02e9e9f1cf Add codeowners for eagle implementations (#4131) Ying Sheng 2025-03-05 23:16:49 -08:00
  • 8f0b63139e Docs: improve EAGLE docs (#4038) simveit 2025-03-06 07:40:21 +01:00
  • b9b3b098b9 feat: support docs auto live-reload with sphinx-autobuild (#4111) samzong 2025-03-06 14:39:34 +08:00
  • aee30630d8 Add a pointer to the real KV cache pool (#4113) Zhiqiang Xie 2025-03-05 21:39:07 -08:00
  • 286e6540a6 Remove prefill-only-one-req (#4117) Lianmin Zheng 2025-03-05 20:58:48 -08:00
  • 718c391fd7 [Hoxfix] Fix incomplete token_to_kv_pool refactor (#4121) Wenxuan Tan 2025-03-05 21:32:42 -06:00
  • fc671f66c1 chore: bump v0.4.3.post3 (#4114) Yineng Zhang 2025-03-05 17:26:10 -08:00
  • 197751e9a1 fix Non-consecutive header level increase in docs/router/router.md (#4099) samzong 2025-03-06 09:02:32 +08:00
  • d2d0d061d9 fix cross-reference error and spelling mistakes (#4101) samzong 2025-03-06 08:39:02 +08:00
  • 25482edb5c Online serving benchmarks of real datasets for hierarchical KV caching (#3211) Yueyang Pan 2025-03-06 01:16:43 +01:00
  • 62b362b1f1 Debug radixcache: refactor recursive helper methods (#3029) luzengxiangcn 2025-03-06 08:11:42 +08:00
  • 44d7646371 remove testing on PR workflow change (#4110) saienduri 2025-03-05 16:03:18 -08:00
  • cd85b78f94 Create release-docker-amd-nightly.yml (#4105) saienduri 2025-03-05 14:46:26 -08:00
  • 0aaccbbfec revert deepseek docs (#4109) Yineng Zhang 2025-03-05 13:23:11 -08:00
  • 357671e216 Add examples for server token-in-token-out (#4103) Qiaolin Yu 2025-03-05 16:16:31 -05:00
  • e70fa279bc Docs: reorganize dpsk docs (#4108) Chayenne 2025-03-05 13:01:03 -08:00
  • abe74b7b59 Docs: Add DeepSeek optimization ablations documentation (#4107) Tommy Yang 2025-03-06 04:25:51 +08:00
  • 70b3c6eeb1 Add update_weights_from_disk endpoint to Engine (#4102) Jhin 2025-03-05 14:25:18 -06:00
  • ef9d3b3c2c Fix triton kernel illegal memory issue for eagle (#4100) Ke Bao 2025-03-06 03:23:53 +08:00
  • fc91d08a8f [Revision] Add fast decode plan for flashinfer mla (#4012) Baizhou Zhang 2025-03-05 11:20:41 -08:00
  • 71ab0dabe0 Fix the moe padding conditional logic (#4081) HAI 2025-03-05 10:56:51 -08:00
  • d3d4d76758 [Eagle] Refactor eagle speculative decoding (#3986) Ying Sheng 2025-03-05 08:06:07 -08:00
  • 5be8f1ed98 ROCM: AITER BLOCK GEMM (#4075) yigex 2025-03-05 19:10:49 +08:00
  • e5760bc40a bench: add dataset param for bench_multiturn (#3990) Lu Changqi 2025-03-05 17:21:37 +08:00
  • 56a724eba3 [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790) Qubitium-ModelCloud 2025-03-05 17:11:00 +08:00
  • 583d6af71b example: add vlm to token in & out example (#3941) Mick 2025-03-05 14:18:26 +08:00
  • e074d84e5b [Minor] more code cleanup (#4077) Lianmin Zheng 2025-03-04 21:23:47 -08:00
  • 4725e3f652 Add examples for returning hidden states when using the server (#4074) Qiaolin Yu 2025-03-04 22:31:50 -05:00
  • 77a3954bf7 Simplify eagle tests and TP sync in grammar backend (#4066) Lianmin Zheng 2025-03-04 13:40:40 -08:00
  • 03b0364f76 Update nextn ci test (#4071) Ke Bao 2025-03-05 05:01:24 +08:00
  • 2dd7d0c533 Revert "Fix nightly-test CI" (#4065) Lianmin Zheng 2025-03-04 05:38:24 -08:00
  • 0d4e3228cf [Feature] Add test for speculative_token_map (#4016) William 2025-03-04 20:26:24 +08:00
  • 926f8efc0c remove unused max_jobs (#3607) Liu Jinjie 2025-03-04 04:23:39 -08:00
  • 9545bfb28a fix: support gelu_new activation function in gpt2 (#3712) Xiuyu Li 2025-03-04 04:09:52 -08:00
  • 37373ef2bb sgl-router - issues on routing and project build. (#3870) (#3948) Michael Feil 2025-03-04 04:06:30 -08:00
  • 61261b3996 [XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. (#3954) Chen Shengzhi 2025-03-04 20:05:56 +08:00
  • 19120f71f3 [Fix & Style] Refactor the grammar backend to reduce human errors and improve readability (#4030) DarkSharpness 2025-03-04 20:56:45 +09:00
  • 2415ec3896 Remove grafana dashboard's datasource uid (#4051) Kebe 2025-03-04 19:44:51 +08:00