---
license: other
language:
- en
library_name: transformers
tags:
- llama
- llama-3
- causal-lm
- clinical
- grpo
- lora
- merged-adapter
- transformers
base_model: meta-llama/Llama-3.2-3B-Instruct
pipeline_tag: text-generation
---
# LLMasRNN GRPO Policy Epoch 001 Merged
This repository contains a merged causal language model produced by merging a LoRA adapter trained in the `LLMasRNN` project into the base model `meta-llama/Llama-3.2-3B-Instruct`.
This is the merged checkpoint for:
- Project: `LLMasRNN`
- Training stage: `GRPO Phase 1`
- Epoch artifact: `training/artifacts/epoch_001`
- Base model: `meta-llama/Llama-3.2-3B-Instruct`
- Output repo type: merged full weights
## What This Model Is
The policy model is intended for longitudinal clinical prediction workflows in the `LLMasRNN` project. In this training phase, the model is optimized as a memory-update / policy head using GRPO-style reinforcement learning over trajectory rollouts and rubric-based rewards.
The weights in this repository are not just the LoRA adapter; they are the full merged model weights created from:
- Base model weights from `meta-llama/Llama-3.2-3B-Instruct`
- A LoRA adapter saved at `training/artifacts/epoch_001/policy_lora`
## Training Summary
The training run used a custom GRPO training loop implemented in `training/train/policy_trainer.py` and configured by `training/configs/grpo_phase1.yaml`.
High-level pipeline for one epoch:
1. Collect RLN trajectories over the training split.
2. Sample rollout candidates from the policy.
3. Score candidates with a fixed rubric judge.
4. Train the LoRA policy with a custom GRPO loss.
This merged checkpoint corresponds to the first completed epoch.
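A minimal structural sketch of one epoch is shown below; the function names are illustrative placeholders, not the actual API of `training/train/policy_trainer.py`.

```python
# Structural sketch of one GRPO epoch; function names and signatures are
# illustrative placeholders, not the project's actual API.

def collect_trajectories(policy, split):         # step 1: RLN trajectories
    ...

def sample_rollouts(policy, state, group_size):  # step 2: candidates per state
    ...

def score_with_rubric(judge, candidates):        # step 3: rubric-based rewards
    ...

def grpo_update(policy, scored_groups):          # step 4: custom GRPO loss step
    ...

def run_epoch(policy, judge, train_split, group_size=8, inner_steps=4):
    trajectories = collect_trajectories(policy, train_split)
    groups = [sample_rollouts(policy, s, group_size) for s in trajectories]
    scored = [score_with_rubric(judge, g) for g in groups]
    for _ in range(inner_steps):
        grpo_update(policy, scored)
```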
## Base Model
- Base model: `meta-llama/Llama-3.2-3B-Instruct`
- Architecture: causal LM
- Total parameters after merge: `3,221,924,864`
## LoRA Configuration
The policy was trained as a LoRA adapter before merge.
- PEFT type: `LoRA`
- Task type: `CAUSAL_LM`
- Rank `r`: `16`
- LoRA alpha: `32`
- LoRA dropout: `0.05`
- Target modules:
- `q_proj`
- `k_proj`
- `v_proj`
- `o_proj`
- Trainable parameters before merge: `9,175,040`
- Trainable fraction before merge: `0.2848%`
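For reference, a `peft` configuration matching these values could be built as in the sketch below; the exact construction used by the project code may differ.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# LoRA configuration mirroring the hyperparameters listed above.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
policy = get_peft_model(base, lora_config)
policy.print_trainable_parameters()  # ~9.2M trainable parameters (~0.28%)
```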
## GRPO / RL Training Configuration
From `training/configs/grpo_phase1.yaml`:
- Framework: custom GRPO implementation
- KL coefficient: `0.05`
- Clip range: `0.2`
- Importance clip: `5.0`
- Inner steps per epoch: `4`
- Batch size: `4` trajectory groups per update step
- Rollouts per state / group size `G`: `8`
- Learning rate: `5e-6`
- Gradient clipping: `1.0`
- Warmup steps: `50`
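The actual loss is implemented in `training/train/policy_trainer.py`. The sketch below shows a generic GRPO-style clipped surrogate with a KL penalty, wired to the coefficients listed above; it illustrates the idea rather than reproducing the project's exact objective.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_range=0.2, kl_coef=0.05, importance_clip=5.0):
    """Generic GRPO-style objective: clipped surrogate plus KL-to-reference penalty.

    logp_new / logp_old / logp_ref: per-rollout summed log-probs, shape (G,)
    advantages: group-normalized rewards, shape (G,)
    """
    # Importance ratio between the current and rollout-time policies,
    # bounded by the configured importance clip.
    ratio = torch.exp(logp_new - logp_old).clamp(max=importance_clip)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Simple KL estimate against the frozen reference policy.
    kl = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl
```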
## Sampling and Reward Setup
### Policy rollout sampling
- Temperature: `0.8`
- Top-p: `0.95`
- Max new tokens: `512`
### Predictor model
- Model: `meta-llama/Llama-3.2-3B-Instruct`
- Backend: `vLLM`
- Max model length: `8192`
### RLN judge
- Model: `jinrui123/sft_llama3.2_3b_merged`
- Backend: `vLLM`
- Max model length: `8192`
- Temperature: `0.3`
- Max tokens: `1536`
### Rubric judge
- Model path during training: `/data/jf44684/TrainingDataParepation/models/RubricARM-8B-Judge`
- Backend: `vLLM`
- Temperature: `0.3`
- Max tokens: `1536`
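The predictor and judge models are served through vLLM. A minimal sketch of that setup with the values above (illustrative only, not the project's launch code):

```python
from vllm import LLM, SamplingParams

# RLN judge served with vLLM, using the settings listed above.
judge = LLM(model="jinrui123/sft_llama3.2_3b_merged", max_model_len=8192)
judge_params = SamplingParams(temperature=0.3, max_tokens=1536)

outputs = judge.generate(["<judge prompt here>"], judge_params)
print(outputs[0].outputs[0].text)
```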
### Reward composition
- Downstream accuracy bonus enabled: `true`
- Epsilon accuracy bonus: `0.2`
## Data / Run Size
From the saved epoch metadata:
- Training split: `data/splits/cleaned_df_train_100.json`
- Validation split: `data/splits/cleaned_df_val_100.json`
- Number of collected trajectory steps: `390`
- Number of scored rollout groups: `390`
- Debug mode: `false`
## Epoch 001 Metrics
Saved in `training/artifacts/epoch_001/meta.json`:
- Loss: `0.436767578125`
- Policy loss: `0.435302734375`
- KL: `0.0322265625`
- Mean absolute advantage: `0.82421875`
- Mean ratio: `0.8505859375`
- Mean reward: `0.7509765625`
These are training-time epoch averages over the 4 inner GRPO update steps for epoch 1.
## Merge Details
This repository was created by:
1. Loading the base model `meta-llama/Llama-3.2-3B-Instruct`
2. Loading the saved LoRA adapter from `training/artifacts/epoch_001/policy_lora`
3. Calling `merge_and_unload()` with PEFT
4. Saving the resulting merged full model weights
The original adapter checkpoint remains the more storage-efficient representation for continued adapter-based training.
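A minimal sketch of that procedure with PEFT (paths are the ones named above; the actual merge script may differ):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.2-3B-Instruct"
adapter_dir = "training/artifacts/epoch_001/policy_lora"

base = AutoModelForCausalLM.from_pretrained(base_id)
peft_model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = peft_model.merge_and_unload()
merged.save_pretrained("llamasrnn-grpo-epoch001-merged")

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.save_pretrained("llamasrnn-grpo-epoch001-merged")
```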
## Intended Use
This checkpoint is intended for research and experimentation within the `LLMasRNN` project setting:
- longitudinal clinical prediction
- diagnosis prediction conditioned on evolving patient summaries
- RL / GRPO policy experiments
- ablation or evaluation of merged policy checkpoints
It is not validated for clinical deployment, medical decision support in production, or unsupervised real-world medical use.
## Limitations
- This is an epoch-1 checkpoint, not a converged final model.
- The training objective is project-specific and depends on custom reward shaping and judge models.
- Training and evaluation were performed on internal project data/configuration rather than a standardized public benchmark release.
- Merged weights inherit the usage constraints and limitations of the base Llama model.
## Loading
Example with Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jinrui123/llamasrnn-grpo-epoch001-merged"

# Load the tokenizer and the merged full-weight model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```
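Continuing from the snippet above, a short generation example that reuses the rollout sampling settings from this card (the prompt is illustrative):

```python
# Build a chat-formatted prompt and sample with the rollout settings above.
messages = [{"role": "user", "content": "Summarize the patient's latest visit."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(
    input_ids, do_sample=True, temperature=0.8, top_p=0.95, max_new_tokens=512
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```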
## Notes on Licensing
This repository contains merged weights derived from `meta-llama/Llama-3.2-3B-Instruct`. Use and redistribution must comply with the license and access terms of the original base model.