Vincent Zhong
|
327f7b7c87
|
fix(grok): remove duplicate replicate_lm_head configuration (#9549)
|
2025-08-23 19:49:24 -07:00 |
|
Lianmin Zheng
|
86d10d220f
|
Update grok.py and tiktoken tokenizer (#9532)
|
2025-08-23 05:40:18 -07:00 |
|
Cheng Wan
|
295895120d
|
[6/N] MoE Refactor: Cleanup MoE-related configs (#8849)
|
2025-08-14 21:14:53 -07:00 |
|
Cheng Wan
|
6c88f6c8d9
|
[5/N] MoE Refactor: Update MoE parallelism arguments (#8658)
|
2025-08-01 01:20:03 -07:00 |
|
Cheng Wan
|
9effeb5bdd
|
Support EPLB in FusedMoE (#8448)
|
2025-07-29 16:02:41 -07:00 |
|
Cheng Wan
|
15ad6c9086
|
[1/N] MoE Refactor: refactor select_experts (#7966)
|
2025-07-19 00:51:15 -07:00 |
|
Yun Dai
|
2695ab0537
|
Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
Co-authored-by: qingquansong <ustcsqq@gmail.com>
|
2025-04-08 09:11:35 -07:00 |
|
Lianmin Zheng
|
c6d7f8d370
|
Add some fused elementwise kernels for grok-1 (#4398)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
|
2025-03-13 13:39:10 -07:00 |
|
Qubitium-ModelCloud
|
56a724eba3
|
[QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization (#3790)
Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
|
2025-03-05 01:11:00 -08:00 |
|
Lianmin Zheng
|
1a8f995c46
|
remove cache configs in model definitions (#4031)
|
2025-03-03 05:00:50 -08:00 |
|
Lianmin Zheng
|
52c03f16b9
|
Add activation parameters to fused_moe (#3170)
|
2025-01-27 00:23:37 -08:00 |
|
Yineng Zhang
|
2add697d7a
|
feat: remove vllm get_rope (#2964)
|
2025-01-18 19:38:01 +08:00 |
|
Yineng Zhang
|
5dc54f1a62
|
feat: remove vllm distributed (#2907)
Co-authored-by: Zhangyi <1109276519@qq.com>
|
2025-01-17 22:31:51 +08:00 |
|
Lianmin Zheng
|
8a6906127a
|
Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm (#2784)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-01-07 23:29:10 -08:00 |
|
HAI
|
6d08ce2aa9
|
Use Optional with None default (#2770)
|
2025-01-07 01:35:08 -08:00 |
|
Lianmin Zheng
|
bdf946bf81
|
Support loading pre-sharded moe weights (#2716)
|
2025-01-02 15:07:37 -08:00 |
|
Ke Bao
|
e835a50021
|
Reorg moe code (#2563)
|
2024-12-24 01:10:22 +08:00 |
|
Lianmin Zheng
|
5282a4735f
|
[Minor] Fix grok model loader (#2473)
|
2024-12-12 14:34:47 -08:00 |
|
Jerry Zhang
|
9cc733b38c
|
move apply_torchao_config_ to model_runner (#2342)
|
2024-12-04 17:26:42 -08:00 |
|
Yineng Zhang
|
85e1a6f3aa
|
Update model_loader deps and qqq quantization deps (#2220) (#2318)
Co-authored-by: HandH1998 <1335248067@qq.com>
|
2024-12-02 23:22:13 +08:00 |
|
Lianmin Zheng
|
4936be8acc
|
Revert "Revert "[FEAT] Support GGUF format"" (#2287)
|
2024-11-30 22:14:48 -08:00 |
|
Lianmin Zheng
|
7e4c6dd8da
|
Revert "[FEAT] Support GGUF format" (#2285)
|
2024-11-30 19:03:26 -08:00 |
|
Yang Zheng
|
883c955489
|
[FEAT] Support GGUF format (#2215)
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
|
2024-11-30 00:44:48 -08:00 |
|
Lianmin Zheng
|
dd5eba4c88
|
Remove fused_moe_grok (#2223)
|
2024-11-27 14:28:55 -08:00 |
|
Lianmin Zheng
|
be0124bda0
|
Rename triton_fused_moe -> fused_moe_triton (#2163)
|
2024-11-24 08:12:35 -08:00 |
|
Xuehai Pan
|
62a4a339eb
|
docs: fix module docstrings and copyright headers (#2077)
|
2024-11-22 22:16:53 +08:00 |
|
Ke Bao
|
16eb33ffe2
|
Update vocab embedding deps and add TP switch (#1856)
|
2024-10-31 20:13:07 -07:00 |
|
Byron Hsu
|
56503d9bc9
|
[1/N] Remove CacheConfig import in all model files (#1658)
|
2024-10-14 09:06:34 -07:00 |
|
Lianmin Zheng
|
36d5acfca5
|
Rename InputMetadata -> ForwardBatch (#1543)
|
2024-09-30 02:41:11 -07:00 |
|
Yineng Zhang
|
b4408b0d16
|
feat: update linear deps 1/N (#1305)
|
2024-09-19 20:53:11 +08:00 |
|
Liangsheng Yin
|
70b6802982
|
Optimize conflicts between CUDA graph and vocab mask tensors (#1392)
|
2024-09-13 20:27:53 -07:00 |
|
Liangsheng Yin
|
381dd57bd6
|
Sampler cudagraph (#1253)
|
2024-08-28 18:58:52 -07:00 |
|
Lianmin Zheng
|
bf53bf5142
|
[Fix] Fix llava on multi images (#1247)
|
2024-08-28 06:33:05 -07:00 |
|
Yineng Zhang
|
f25f4dfde5
|
hotfix: revert sampler CUDA Graph (#1242)
|
2024-08-28 21:16:47 +10:00 |
|
Liangsheng Yin
|
75ce37f401
|
Move sampler into CUDA graph (#1201)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
|
2024-08-26 07:02:50 -07:00 |
|
Lianmin Zheng
|
f6af3a6561
|
Cleanup readme, llava examples, usage examples and nccl init (#1194)
|
2024-08-24 08:02:23 -07:00 |
|
Lianmin Zheng
|
5a261bd055
|
Fix the deadlock in multi-node tp (#1122)
|
2024-08-16 01:39:24 -07:00 |
|
Lianmin Zheng
|
a59636bb5e
|
Update grok 1 model (#1095)
|
2024-08-14 04:40:44 -07:00 |
|
Yineng Zhang
|
6a38efa834
|
feat: replace all rmsnorm and silu (#1057)
|
2024-08-13 02:15:59 +10:00 |
|
Liangsheng Yin
|
87e8c090e9
|
Organize code (rename, movement) (#953)
|
2024-08-06 20:50:32 -07:00 |
|
Liangsheng Yin
|
cdcbde5fc3
|
Code structure refactor (#807)
|
2024-07-29 23:04:48 -07:00 |
|
Yineng Zhang
|
dd7e8b9421
|
chore: add copyright for srt (#790)
|
2024-07-28 23:07:12 +10:00 |
|
Liangsheng Yin
|
c9ee3d3559
|
Fix model forward grad (#628)
|
2024-07-15 22:09:09 -07:00 |
|
sglang
|
11616fc6bd
|
Minor fix in compiler & format (#545)
|
2024-06-29 23:42:14 -07:00 |
|
Ying Sheng
|
fb9296f0ed
|
Higher priority for user input of max_prefill_tokens & format (#540)
|
2024-06-12 21:48:40 -07:00 |
|
Lianmin Zheng
|
bf3e271fe0
|
Update vllm to v0.4.3 (#511)
Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
Co-authored-by: ZX <zx@lbx.dev>
|
2024-06-07 12:11:31 -07:00 |
|
Ying Sheng
|
0463f7fb52
|
Support data parallelism (static) (#480)
Co-authored-by: Ying Sheng <ying.sheng@databricks.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
|
2024-05-27 21:24:10 -07:00 |
|
Lianmin Zheng
|
09de730dee
|
Improve benchmark scripts & add more models (#484)
|
2024-05-27 14:13:26 -07:00 |
|