This model was released on 2023-06-07 and added to Hugging Face Transformers on 2024-05-14.

JetMoe

PyTorch FlashAttention SDPA

Overview

JetMoe-8B is an 8B Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. The JetMoe project aims to deliver LLaMA2-level performance in an efficient language model trained on a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by ModuleFormer. Each JetMoe block consists of two MoE layers: a Mixture of Attention Heads and a Mixture of MLP Experts. For each input token, a layer activates only a subset of its experts to process it. This sparse activation scheme gives JetMoe much higher training throughput than dense models of similar size. JetMoe-8B trains at around 100B tokens per day on a cluster of 96 H100 GPUs using a straightforward 3-way pipeline parallelism strategy.
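The sparse activation described above can be illustrated with a small, framework-free sketch of top-k expert routing for a single token. This is not the JetMoe implementation: the helper names (`top_k_route`, `moe_forward`), the toy scaling "experts", the expert count of 8, and `k=2` are all assumptions chosen for illustration, and normalizing the gate weights with a softmax over only the selected logits is one common choice rather than a confirmed detail of this model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalize their gate weights.

    Hypothetical helper: returns (expert_indices, gate_weights).
    """
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return chosen, weights


def moe_forward(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    chosen, weights = top_k_route(router_logits, k)
    mixed = [0.0] * len(token)
    for index, gate in zip(chosen, weights):
        expert_out = experts[index](token)  # the other experts are never called
        mixed = [m + gate * y for m, y in zip(mixed, expert_out)]
    return mixed, chosen


# Toy experts (assumption): expert i multiplies the input by i + 1.
experts = [lambda x, s=scale: [s * v for v in x] for scale in range(1, 9)]

token = [1.0, 2.0]
router_logits = [0.0, 3.0, 1.0, 3.0, -2.0, 0.5, 0.2, -1.0]

output, chosen = moe_forward(token, router_logits, experts, k=2)
print("experts used:", chosen)
print("mixed output:", output)
```

Only 2 of the 8 toy experts run for this token, which is where the compute saving of a sparse layer comes from: the parameter count grows with the number of experts, while the per-token cost grows only with `k`.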

This model was contributed by Yikang Shen.

JetMoeConfig

autodoc JetMoeConfig

JetMoeModel

autodoc JetMoeModel - forward

JetMoeForCausalLM

autodoc JetMoeForCausalLM - forward

JetMoeForSequenceClassification

autodoc JetMoeForSequenceClassification - forward