license: other
language: en
library_name: transformers
tags: llama, llama-3, causal-lm, clinical, grpo, lora, merged-adapter, transformers
base_model: meta-llama/Llama-3.2-3B-Instruct
pipeline_tag: text-generation

LLMasRNN GRPO Policy Epoch 001 Merged

This repository contains a causal language model produced by merging a LoRA adapter trained in the LLMasRNN project into the base model meta-llama/Llama-3.2-3B-Instruct.

This is the merged checkpoint for:

  • Project: LLMasRNN
  • Training stage: GRPO Phase 1
  • Epoch artifact: training/artifacts/epoch_001
  • Base model: meta-llama/Llama-3.2-3B-Instruct
  • Output repo type: merged full weights

What This Model Is

The policy model is intended for longitudinal clinical prediction workflows in the LLMasRNN project. In this training phase, the model is optimized as a memory-update / policy head using GRPO-style reinforcement learning over trajectory rollouts and rubric-based rewards.

The weights uploaded to this repository are not just the adapter; they are the full merged model weights, created from:

  • Base model weights from meta-llama/Llama-3.2-3B-Instruct
  • A LoRA adapter saved at training/artifacts/epoch_001/policy_lora

Training Summary

The training run used a custom GRPO training loop implemented in training/train/policy_trainer.py and configured by training/configs/grpo_phase1.yaml.

High-level pipeline for one epoch:

  1. Collect RLN trajectories over the training split.
  2. Sample rollout candidates from the policy.
  3. Score candidates with a fixed rubric judge.
  4. Train the LoRA policy with a custom GRPO loss.
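
A minimal sketch of that loop, with stand-in callables in place of the project's actual functions in training/train/policy_trainer.py:

# Hypothetical outline of one GRPO epoch; the callables are stand-ins,
# not the project's real API.
def run_epoch(policy, train_split, *, collect, sample, score, update,
              group_size=8, inner_steps=4):
    trajectories = collect(policy, train_split)        # 1. collect RLN trajectories
    groups = [sample(policy, t, n=group_size)          # 2. rollout candidates per state
              for t in trajectories]
    rewards = [score(g) for g in groups]               # 3. rubric-judge scoring
    for _ in range(inner_steps):                       # 4. inner GRPO update steps
        update(policy, groups, rewards)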

This merged checkpoint corresponds to the first completed epoch.

Base Model

  • Base model: meta-llama/Llama-3.2-3B-Instruct
  • Architecture: causal LM
  • Total parameters after merge: 3,221,924,864

LoRA Configuration

The policy was trained as a LoRA adapter before merge.

  • PEFT type: LoRA
  • Task type: CAUSAL_LM
  • Rank r: 16
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Target modules:
    • q_proj
    • k_proj
    • v_proj
    • o_proj
  • Trainable parameters before merge: 9,175,040
  • Trainable fraction before merge: 0.2848%
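
For reference, this corresponds to a PEFT configuration along these lines (a sketch; the adapter_config.json saved with the adapter checkpoint is authoritative):

from peft import LoraConfig

# Values copied from the list above.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)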

GRPO / RL Training Configuration

From training/configs/grpo_phase1.yaml:

  • Framework: custom GRPO implementation
  • KL coefficient: 0.05
  • Clip range: 0.2
  • Importance clip: 5.0
  • Inner steps per epoch: 4
  • Batch size: 4 trajectory groups per update step
  • Rollouts per state / group size G: 8
  • Learning rate: 5e-6
  • Gradient clipping: 1.0
  • Warmup steps: 50
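
As a rough illustration of how these hyperparameters typically enter a GRPO-style objective (a sketch only; the actual loss is implemented in training/train/policy_trainer.py and may differ):

import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              kl_coef=0.05, clip_range=0.2, importance_clip=5.0):
    # Group-relative advantage: normalize each reward against its
    # group of G = 8 rollouts sampled from the same state.
    adv = (rewards - rewards.mean(dim=-1, keepdim=True)) \
        / (rewards.std(dim=-1, keepdim=True) + 1e-8)
    # Importance ratio between the current policy and the rollout-time
    # policy, clipped at 5.0 for stability.
    ratio = torch.exp(logp_new - logp_old).clamp(max=importance_clip)
    # PPO-style clipped surrogate objective with clip range 0.2.
    surrogate = torch.minimum(ratio * adv,
                              ratio.clamp(1 - clip_range, 1 + clip_range) * adv)
    # KL penalty toward the reference (base) policy, weighted by 0.05.
    kl = logp_new - logp_ref
    return -(surrogate - kl_coef * kl).mean()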

Sampling and Reward Setup

Policy rollout sampling

  • Temperature: 0.8
  • Top-p: 0.95
  • Max new tokens: 512
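
In transformers terms, these rollout settings map onto sampling arguments like the following (illustrative; the training loop's actual generation call is in the project code):

# Rollout sampling settings as listed above.
rollout_kwargs = dict(
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=512,
)
# e.g. model.generate(**inputs, **rollout_kwargs)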

Predictor model

  • Model: meta-llama/Llama-3.2-3B-Instruct
  • Backend: vLLM
  • Max model length: 8192

RLN judge

  • Model: jinrui123/sft_llama3.2_3b_merged
  • Backend: vLLM
  • Max model length: 8192
  • Temperature: 0.3
  • Max tokens: 1536

Rubric judge

  • Model path during training: /data/jf44684/TrainingDataParepation/models/RubricARM-8B-Judge
  • Backend: vLLM
  • Temperature: 0.3
  • Max tokens: 1536
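
Both judges (and the predictor) run behind vLLM; a minimal sketch of serving one of them, using the RLN judge settings above (illustrative, not the project's actual scoring harness):

from vllm import LLM, SamplingParams

# RLN judge configuration from above; the rubric judge is loaded the
# same way from its local model path.
judge = LLM(model="jinrui123/sft_llama3.2_3b_merged", max_model_len=8192)
params = SamplingParams(temperature=0.3, max_tokens=1536)
outputs = judge.generate(["<judge prompt here>"], params)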

Reward composition

  • Downstream accuracy bonus enabled: true
  • Epsilon accuracy bonus: 0.2
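
One plausible reading of this composition, shown only as a hypothetical sketch (the exact formula is defined in the training code):

# Hypothetical: rubric score plus a fixed bonus when the downstream
# prediction is correct, weighted by epsilon = 0.2.
def total_reward(rubric_score: float, downstream_correct: bool,
                 epsilon: float = 0.2) -> float:
    return rubric_score + (epsilon if downstream_correct else 0.0)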

Data / Run Size

From the saved epoch metadata:

  • Training split: data/splits/cleaned_df_train_100.json
  • Validation split: data/splits/cleaned_df_val_100.json
  • Number of collected trajectory steps: 390
  • Number of scored rollout groups: 390
  • Debug mode: false

Epoch 001 Metrics

Saved in training/artifacts/epoch_001/meta.json:

  • Loss: 0.436767578125
  • Policy loss: 0.435302734375
  • KL: 0.0322265625
  • Mean absolute advantage: 0.82421875
  • Mean ratio: 0.8505859375
  • Mean reward: 0.7509765625

These are training-time epoch averages over the 4 inner GRPO update steps for epoch 1.

Merge Details

This repository was created by:

  1. Loading the base model meta-llama/Llama-3.2-3B-Instruct
  2. Loading the saved LoRA adapter from training/artifacts/epoch_001/policy_lora
  3. Calling merge_and_unload() with PEFT
  4. Saving the resulting merged full model weights
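
A minimal sketch of that procedure with PEFT (paths as listed above; dtype and device handling omitted):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
policy = PeftModel.from_pretrained(base, "training/artifacts/epoch_001/policy_lora")
merged = policy.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("llamasrnn-grpo-epoch001-merged")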

The original adapter checkpoint remains the more storage-efficient representation for continued adapter-based training.

Intended Use

This checkpoint is intended for research and experimentation within the LLMasRNN project setting:

  • longitudinal clinical prediction
  • diagnosis prediction conditioned on evolving patient summaries
  • RL / GRPO policy experiments
  • ablation or evaluation of merged policy checkpoints

It is not validated for clinical deployment, medical decision support in production, or unsupervised real-world medical use.

Limitations

  • This is an epoch-1 checkpoint, not a converged final model.
  • The training objective is project-specific and depends on custom reward shaping and judge models.
  • Training and evaluation were performed on internal project data/configuration rather than a standardized public benchmark release.
  • Merged weights inherit the usage constraints and limitations of the base Llama model.

Loading

Example with Transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jinrui123/llamasrnn-grpo-epoch001-merged"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # optional; full fp32 works but uses ~2x memory
    device_map="auto",           # optional; requires accelerate
)
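
Since the base model is an Instruct variant, prompts should normally go through the chat template. A brief usage sketch continuing the example above (the prompt is purely illustrative):

messages = [{"role": "user", "content": "Summarize the patient's latest visit."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))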

Notes on Licensing

This repository contains merged weights derived from meta-llama/Llama-3.2-3B-Instruct. Use and redistribution must comply with the license and access terms of the original base model.
