---
license: other
language:
- en
library_name: transformers
tags:
- llama
- llama-3
- causal-lm
- clinical
- grpo
- lora
- merged-adapter
- transformers
base_model: meta-llama/Llama-3.2-3B-Instruct
pipeline_tag: text-generation
---

# LLMasRNN GRPO Policy Epoch 001 Merged

This repository contains a merged causal language model produced by merging a LoRA adapter trained in the `LLMasRNN` project into the base model `meta-llama/Llama-3.2-3B-Instruct`.

This is the merged checkpoint for:

- Project: `LLMasRNN`
- Training stage: `GRPO Phase 1`
- Epoch artifact: `training/artifacts/epoch_001`
- Base model: `meta-llama/Llama-3.2-3B-Instruct`
- Output repo type: merged full weights

## What This Model Is

The policy model is intended for longitudinal clinical prediction workflows in the `LLMasRNN` project. In this training phase, the model is optimized as a memory-update / policy head using GRPO-style reinforcement learning over trajectory rollouts and rubric-based rewards.

The weights uploaded in this repository are not the adapter alone; they are the full merged model weights created from:

- Base model weights from `meta-llama/Llama-3.2-3B-Instruct`
- A LoRA adapter saved at `training/artifacts/epoch_001/policy_lora`

## Training Summary

The training run used a custom GRPO training loop implemented in `training/train/policy_trainer.py` and configured by `training/configs/grpo_phase1.yaml`.

High-level pipeline for one epoch (see the sketch below):

1. Collect RLN trajectories over the training split.
2. Sample rollout candidates from the policy.
3. Score candidates with a fixed rubric judge.
4. Train the LoRA policy with a custom GRPO loss.

This merged checkpoint corresponds to the first completed epoch.
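
For orientation, the sketch below shows the shape of that per-epoch loop. All function and method names here are hypothetical stand-ins; the real implementation is the custom trainer in `training/train/policy_trainer.py`.

```python
# Hypothetical sketch of one GRPO epoch; names do not match the project code.
def run_grpo_epoch(policy, rubric_judge, train_states, group_size=8, inner_steps=4):
    # Steps 1-3: walk the trajectory states, sample a group of rollout
    # candidates per state, and score every candidate with the fixed judge.
    groups = []
    for state in train_states:
        rollouts = [policy.sample(state) for _ in range(group_size)]
        rewards = [rubric_judge.score(state, rollout) for rollout in rollouts]
        groups.append((state, rollouts, rewards))

    # Step 4: run the inner GRPO updates on the LoRA policy.
    for _ in range(inner_steps):
        for state, rollouts, rewards in groups:
            policy.grpo_update(state, rollouts, rewards)
```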

## Base Model

- Base model: `meta-llama/Llama-3.2-3B-Instruct`
- Architecture: causal LM
- Total parameters after merge: `3,221,924,864`
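
If you want to verify the parameter count locally, a quick check after loading the merged weights (repo id as in the Loading section below):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("jinrui123/llamasrnn-grpo-epoch001-merged")
total = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total:,}")  # expected: 3,221,924,864
```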

## LoRA Configuration

The policy was trained as a LoRA adapter before the merge.

- PEFT type: `LoRA`
- Task type: `CAUSAL_LM`
- Rank `r`: `16`
- LoRA alpha: `32`
- LoRA dropout: `0.05`
- Target modules:
  - `q_proj`
  - `k_proj`
  - `v_proj`
  - `o_proj`
- Trainable parameters before merge: `9,175,040`
- Trainable fraction before merge: `0.2848%`
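
For reference, a PEFT `LoraConfig` matching these hyperparameters would look roughly like the sketch below; it is reconstructed from the values above, not copied from the project code.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```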

## GRPO / RL Training Configuration

From `training/configs/grpo_phase1.yaml`:

- Framework: custom GRPO implementation
- KL coefficient: `0.05`
- Clip range: `0.2`
- Importance clip: `5.0`
- Inner steps per epoch: `4`
- Batch size: `4` trajectory groups per update step
- Rollouts per state / group size `G`: `8`
- Learning rate: `5e-6`
- Gradient clipping: `1.0`
- Warmup steps: `50`
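
The loss itself lives in the custom trainer; purely as an illustration of the general technique, a GRPO-style objective using the hyperparameters above (group-normalized advantages, ratio clipping at `0.2`, importance clipping at `5.0`, KL coefficient `0.05`) can be sketched as follows. This is not the project's implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              kl_coef=0.05, clip_range=0.2, importance_clip=5.0):
    """Sketch of a GRPO-style loss for one rollout group.

    logp_new / logp_old / logp_ref: per-rollout sequence log-probs under the
    current policy, the behavior policy, and the frozen reference model.
    rewards: per-rollout scalar rewards from the judges.
    """
    # Group-relative advantages: normalize rewards within the rollout group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Importance ratios between current and behavior policy, clipped for stability.
    ratio = torch.exp(logp_new - logp_old).clamp(max=importance_clip)

    # PPO-style clipped surrogate objective.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Approximate KL penalty against the reference model.
    kl = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl
```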

## Sampling and Reward Setup

### Policy rollout sampling

- Temperature: `0.8`
- Top-p: `0.95`
- Max new tokens: `512`
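
The rollout backend for the policy is not specified in this card; as an illustration with the Transformers API, these parameters correspond to a `generate` call like the following (the prompt is made up):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jinrui123/llamasrnn-grpo-epoch001-merged"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Summarize the patient's latest visit.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # stochastic rollout sampling
    temperature=0.8,     # policy rollout temperature
    top_p=0.95,          # nucleus sampling cutoff
    max_new_tokens=512,  # rollout length cap
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```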

### Predictor model

- Model: `meta-llama/Llama-3.2-3B-Instruct`
- Backend: `vLLM`
- Max model length: `8192`

### RLN judge

- Model: `jinrui123/sft_llama3.2_3b_merged`
- Backend: `vLLM`
- Max model length: `8192`
- Temperature: `0.3`
- Max tokens: `1536`

### Rubric judge

- Model path during training: `/data/jf44684/TrainingDataParepation/models/RubricARM-8B-Judge`
- Backend: `vLLM`
- Temperature: `0.3`
- Max tokens: `1536`
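
All three auxiliary models are served with vLLM. As an illustration (prompt construction and exact engine options are project-specific and not shown in this card), the RLN judge could be instantiated like this:

```python
from vllm import LLM, SamplingParams

# RLN judge served with vLLM, per the settings above.
judge = LLM(model="jinrui123/sft_llama3.2_3b_merged", max_model_len=8192)
judge_params = SamplingParams(temperature=0.3, max_tokens=1536)

outputs = judge.generate(["<judge prompt goes here>"], judge_params)
print(outputs[0].outputs[0].text)
```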

### Reward composition

- Downstream accuracy bonus enabled: `true`
- Epsilon accuracy bonus: `0.2`
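
The card does not spell out how these two terms are combined; purely as an illustration, one plausible reading is a rubric score plus an epsilon-sized bonus when the downstream prediction is correct:

```python
def combined_reward(rubric_score: float, downstream_correct: bool,
                    accuracy_bonus: float = 0.2) -> float:
    # Hypothetical composition: rubric reward plus an epsilon-sized bonus when
    # the downstream accuracy check passes. The real shaping is defined by the
    # project's reward code, not by this sketch.
    return rubric_score + (accuracy_bonus if downstream_correct else 0.0)
```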

## Data / Run Size

From the saved epoch metadata:

- Training split: `data/splits/cleaned_df_train_100.json`
- Validation split: `data/splits/cleaned_df_val_100.json`
- Number of collected trajectory steps: `390`
- Number of scored rollout groups: `390`
- Debug mode: `false`

## Epoch 001 Metrics

Saved in `training/artifacts/epoch_001/meta.json`:

- Loss: `0.436767578125`
- Policy loss: `0.435302734375`
- KL: `0.0322265625`
- Mean absolute advantage: `0.82421875`
- Mean ratio: `0.8505859375`
- Mean reward: `0.7509765625`

These are training-time epoch averages over the 4 inner GRPO update steps for epoch 1.

## Merge Details

This repository was created by:

1. Loading the base model `meta-llama/Llama-3.2-3B-Instruct`
2. Loading the saved LoRA adapter from `training/artifacts/epoch_001/policy_lora`
3. Calling `merge_and_unload()` with PEFT
4. Saving the resulting merged full model weights

The original adapter checkpoint remains the more storage-efficient representation for continued adapter-based training.
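
A minimal sketch of that merge with Transformers and PEFT (adapter path as in the artifacts above; dtype, device placement, and output paths are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B-Instruct"
adapter_path = "training/artifacts/epoch_001/policy_lora"

base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()

merged.save_pretrained("llamasrnn-grpo-epoch001-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("llamasrnn-grpo-epoch001-merged")
```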

## Intended Use

This checkpoint is intended for research and experimentation within the `LLMasRNN` project setting:

- longitudinal clinical prediction
- diagnosis prediction conditioned on evolving patient summaries
- RL / GRPO policy experiments
- ablation or evaluation of merged policy checkpoints

It is not validated for clinical deployment, medical decision support in production, or unsupervised real-world medical use.

## Limitations

- This is an epoch-1 checkpoint, not a converged final model.
- The training objective is project-specific and depends on custom reward shaping and judge models.
- Training and evaluation were performed on internal project data and configuration rather than a standardized public benchmark release.
- Merged weights inherit the usage constraints and limitations of the base Llama model.

## Loading

Example with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jinrui123/llamasrnn-grpo-epoch001-merged"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```
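
Since the base model is an Instruct checkpoint, the usual Llama 3.2 chat template applies. Continuing from the snippet above, a short generation example (the prompt is illustrative):

```python
messages = [
    {"role": "user", "content": "List three differential diagnoses for acute chest pain."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```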

## Notes on Licensing

This repository contains merged weights derived from `meta-llama/Llama-3.2-3B-Instruct`. Use and redistribution must comply with the license and access terms of the original base model.