ModelHub XC fd0614593a Initialize project; model provided by the ModelHub XC community
Model: Flink-ddd/MoE-Pilot-Align-2.7B
Source: Original Platform
2026-05-15 04:56:10 +08:00

language: en
pipeline_tag: text-generation
tags: moe, alignment
license: apache-2.0
library_name: transformers

MoE-Pilot-Align-2.7B

Overview

MoE-Pilot-Align-2.7B is a Sparse Mixture-of-Experts (MoE) model with 14.3B total parameters, of which 2.7B are active per token. It was adapted and fine-tuned on a single 8-GPU A100 SXM node.

This project is not a conventional fine-tuning exercise; it is an Infrastructure-Level Alignment showcase. It validates a production-ready pipeline for scaling MoE architectures by resolving critical bottlenecks in distributed initialization, heterogeneous memory management, and collective communication.

To improve throughput and memory efficiency during the alignment phase, this model uses Kernel-Align, a post-training infrastructure optimized for NVIDIA and AMD GPUs.

  • Breaking the Memory Wall: Optimized for GRPO (Group Relative Policy Optimization), enabling a Group Size of G=256 on a single A100 by keeping additional VRAM usage constant (~0.5GB).
  • Low Sampling Latency: Integrated with FlashInfer and custom fused kernels, achieving up to a 399x speedup in the rollout phase compared to native PyTorch.
  • Universal Backend: Native support for AMD (ROCm/AITER) and NVIDIA (CUDA), ensuring seamless cross-platform performance.
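GRPO scores each sampled completion against the other completions drawn for the same prompt, which is why the group size G drives memory use. The following is a minimal sketch of that group-relative normalization, not Kernel-Align's implementation; the function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G samples (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of G=4 rollouts for one prompt (the model card reports G=256).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.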

Engineering & Infrastructure Innovations

1. Zero-Deadlock AOT Kernel Orchestration

In high-density 8-GPU environments, simultaneous JIT (Just-in-Time) compilation of MoE fused kernels often triggers distributed deadlocks due to I/O race conditions during the NCCL handshake.

  • Innovation: Implemented an Ahead-of-Time compilation orchestrator. By decoupling CUDA/ROCm kernel building from the training lifecycle, we achieved zero-latency "Cold Starts" and eliminated synchronization timeouts.
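One way to picture the decoupling described above is a prebuild step that runs every kernel build serially before any training rank starts, then publishes a manifest that ranks check instead of compiling. This is a hedged sketch only: `prebuild_kernels`, `kernels_ready`, the `manifest.json` file name, and the builder callables are hypothetical and are not taken from the project.

```python
import json
import os
import tempfile
import time

def prebuild_kernels(builders, cache_dir):
    """Run every kernel build once, serially, before training is launched.

    `builders` maps a kernel name to a zero-argument callable that performs
    the build (hypothetical stand-ins for real compile steps).
    """
    os.makedirs(cache_dir, exist_ok=True)
    manifest = {}
    for name, build in builders.items():
        start = time.perf_counter()
        build()
        manifest[name] = round(time.perf_counter() - start, 3)
    # Write to a temp file and rename, so ranks never read a partial manifest.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
    os.replace(tmp_path, os.path.join(cache_dir, "manifest.json"))
    return manifest

def kernels_ready(required, cache_dir):
    """Training ranks call this at startup instead of compiling anything."""
    path = os.path.join(cache_dir, "manifest.json")
    if not os.path.exists(path):
        return False
    with open(path) as f:
        built = json.load(f)
    return all(name in built for name in required)
```

Because the builds finish before the process group is created, no rank is compiling while the others wait in a collective handshake.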

2. Reflection-Based Parameter Metadata Injection

Standard DeepSpeed-ZeRO engines require precise expert-group tagging to avoid AssertionError during state partitioning. Hard-coded tagging is fragile and lacks portability.

  • Solution: Developed a Reflection-based Metadata Injector. At the optimizer setup stage, the system dynamically scans tensors for allreduce attributes and naming patterns to inject moe: True tags.
  • Impact: Ensured seamless compatibility between the MoE expert layers and the ZeRO-1/2 optimizer state sharding, maximizing VRAM efficiency on the 80GB A100 footprint.
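The scan described above can be sketched as follows. It checks each parameter for an `allreduce` attribute set to `False`, falls back to a name pattern, and tags the resulting optimizer group with `moe: True`. The `"experts."` pattern, the group names, and the `SimpleNamespace` stand-ins for real parameters are assumptions for illustration, not the project's code.

```python
from types import SimpleNamespace

def is_expert_param(name, param, patterns=("experts.",)):
    """Detect expert parameters by attribute reflection, then by name pattern."""
    if getattr(param, "allreduce", True) is False:
        return True
    return any(p in name for p in patterns)

def build_param_groups(named_params):
    """Split parameters into a dense group and a tagged expert group."""
    dense = {"name": "dense", "params": []}
    expert = {"name": "experts", "moe": True, "params": []}
    for name, param in named_params:
        target = expert if is_expert_param(name, param) else dense
        target["params"].append(param)
    # Drop empty groups so the optimizer never receives an empty param list.
    return [g for g in (dense, expert) if g["params"]]

# Stand-in parameters; a real model would supply model.named_parameters().
params = [
    ("layers.0.attention.weight", SimpleNamespace()),
    ("layers.0.mlp.experts.3.weight", SimpleNamespace()),
    ("layers.1.mlp.local_ffn.weight", SimpleNamespace(allreduce=False)),
]
groups = build_param_groups(params)
```

Checking the attribute first means a layer that was already marked as an expert is caught even when its name does not follow the expected pattern.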

3. Topology-Aware Parallelism (EP8)

  • Parallelism Strategy: Optimized for a Single-Node 8-GPU Full-Mesh topology using Expert Parallelism (EP=8).
  • Communication Tuning: Refactored the All-to-All collective primitives to match the NVLink 3.0 bi-directional bandwidth. By balancing the load across 32 total experts (4 experts per GPU), we maintained a stable throughput of 185+ TFLOPS per device.
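With EP=8 and 32 experts, each GPU hosts 4 experts, and the All-to-All exchange sends every routed token copy to the rank that owns its expert. The sketch below shows the bookkeeping under one assumption that the model card does not state: experts are placed contiguously (experts 0-3 on rank 0, and so on). The function names are hypothetical.

```python
NUM_EXPERTS = 32   # from the model card
EP_SIZE = 8        # one expert-parallel rank per GPU
EXPERTS_PER_RANK = NUM_EXPERTS // EP_SIZE  # 4

def owner_rank(expert_id):
    """Contiguous placement (assumption): experts 0-3 on rank 0, 4-7 on rank 1."""
    return expert_id // EXPERTS_PER_RANK

def all_to_all_send_counts(token_experts):
    """Count the token copies this rank must send to each destination rank.

    `token_experts` holds, per local token, the expert ids chosen by the router.
    """
    counts = [0] * EP_SIZE
    for experts in token_experts:
        for expert_id in experts:
            counts[owner_rank(expert_id)] += 1
    return counts

# Three local tokens, each routed to two experts (top-2).
counts = all_to_all_send_counts([(0, 5), (3, 31), (4, 6)])
```

Skewed send counts like these are what load balancing aims to flatten, since the slowest destination rank bounds the duration of the collective.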

Model Specifications

Feature | Configuration
Architecture | Sparse MoE (Decoder-only)
Total Parameters | 14.3 Billion
Active Parameters | 2.7 Billion
Experts Count | 32 (Total)
Routing Strategy | Top-2 Gating with Load Balancing
Precision | Mixed Precision (BF16-O2)
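Top-2 gating sends each token to its two highest-scoring experts, which is why only a fraction of the total parameters is active per token. The sketch below illustrates the idea in plain Python. Renormalizing the two selected weights and measuring load as a fraction of routed slots are common choices assumed here; the model card does not specify the exact gating function or balancing loss.

```python
import math

def top2_gate(logits):
    """Return the two highest-scoring experts with renormalized weights."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]

def expert_load(assignments, num_experts):
    """Fraction of routed token slots that landed on each expert."""
    counts = [0] * num_experts
    for choice in assignments:
        for expert_id, _ in choice:
            counts[expert_id] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Router logits for two tokens over a toy set of 4 experts.
choices = [top2_gate([0.1, 2.0, -1.0, 1.5]), top2_gate([1.0, 0.0, 3.0, 0.5])]
load = expert_load(choices, num_experts=4)
```

A perfectly balanced router would give every expert the same load fraction; a load-balancing term penalizes deviation from that.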

Hardware & Software Environment

  • Compute: 1 x Node | 8 x NVIDIA A100 80GB SXM
  • Interconnect: NVLink 3.0 (600 GB/s Full-Mesh)
  • Framework: Custom Fork of Megatron-DeepSpeed
  • CUDA/ROCm Compatibility: Validated on CUDA 12.4. The architecture is designed to support porting to AMD MI300X/ROCm environments via RCCL optimization.

Community Quantizations

Special thanks to @mradermacher for providing GGUF weights.

This repository serves as a technical benchmark for implementing production-scale MoE alignment protocols on next-generation heterogeneous compute clusters.
