| license | language | library_name | tags | base_model | pipeline_tag |
|---|---|---|---|---|---|
| other | | transformers | | meta-llama/Llama-3.2-3B-Instruct | text-generation |
# LLMasRNN GRPO Policy Epoch 001 Merged

This repository contains a merged causal language model produced by taking a LoRA adapter trained in the LLMasRNN project and merging it into the base model `meta-llama/Llama-3.2-3B-Instruct`.
This is the merged checkpoint for:
- Project: LLMasRNN
- Training stage: GRPO Phase 1
- Epoch artifact: `training/artifacts/epoch_001`
- Base model: `meta-llama/Llama-3.2-3B-Instruct`
- Output repo type: merged full weights
## What This Model Is
The policy model is intended for longitudinal clinical prediction workflows in the LLMasRNN project. In this training phase, the model is optimized as a memory-update / policy head using GRPO-style reinforcement learning over trajectory rollouts and rubric-based rewards.
The uploaded weights in this repository are not just the adapter. They are the full merged model weights created from:
- Base model weights from `meta-llama/Llama-3.2-3B-Instruct`
- A LoRA adapter saved at `training/artifacts/epoch_001/policy_lora`
## Training Summary
The training run used a custom GRPO training loop implemented in `training/train/policy_trainer.py` and configured by `training/configs/grpo_phase1.yaml`.
High-level pipeline for one epoch:
- Collect RLN trajectories over the training split.
- Sample rollout candidates from the policy.
- Score candidates with a fixed rubric judge.
- Train the LoRA policy with a custom GRPO loss.
This merged checkpoint corresponds to the first completed epoch.
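The custom GRPO loss itself lives in `training/train/policy_trainer.py` and is not reproduced in this card. For orientation only, here is a minimal sketch of the kind of group-relative clipped objective the hyperparameters below imply; the function name, tensor shapes, and the exact KL estimator are assumptions, not the project's implementation:

```python
import torch

def grpo_step_loss(
    logprobs: torch.Tensor,      # [B, G] sequence log-probs under the current policy (assumed shape)
    old_logprobs: torch.Tensor,  # [B, G] log-probs recorded at rollout time
    ref_logprobs: torch.Tensor,  # [B, G] log-probs under the frozen base policy
    rewards: torch.Tensor,       # [B, G] judge/rubric rewards per rollout
    clip_range: float = 0.2,
    kl_coef: float = 0.05,
    importance_clip: float = 5.0,
) -> torch.Tensor:
    # Group-relative advantages: normalize rewards within each group of G rollouts.
    adv = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-6)
    # Importance ratio vs. the rollout-time policy, clamped for stability.
    ratio = torch.exp(logprobs - old_logprobs).clamp(max=importance_clip)
    # PPO-style clipped surrogate objective.
    surrogate = torch.min(ratio * adv, ratio.clamp(1 - clip_range, 1 + clip_range) * adv)
    # Simple KL penalty against the reference (base) policy.
    kl = (logprobs - ref_logprobs).mean()
    return -surrogate.mean() + kl_coef * kl
```

The default values above mirror the KL coefficient, clip range, and importance clip reported in the GRPO configuration section of this card.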
## Base Model

- Base model: `meta-llama/Llama-3.2-3B-Instruct`
- Architecture: causal LM
- Total parameters after merge: 3,221,924,864
## LoRA Configuration

The policy was trained as a LoRA adapter before being merged.

- PEFT type: LoRA
- Task type: `CAUSAL_LM`
- Rank `r`: 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- Trainable parameters before merge: 9,175,040
- Trainable fraction before merge: 0.2848%
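For reference, the same configuration expressed as a PEFT `LoraConfig` (a sketch mirroring the values listed above, not the project's saved config object):

```python
from peft import LoraConfig

# Illustrative reconstruction of the adapter configuration from this card.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```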
## GRPO / RL Training Configuration

From `training/configs/grpo_phase1.yaml`:

- Framework: custom GRPO implementation
- KL coefficient: 0.05
- Clip range: 0.2
- Importance clip: 5.0
- Inner steps per epoch: 4
- Batch size: 4 trajectory groups per update step
- Rollouts per state / group size `G`: 8
- Learning rate: 5e-6
- Gradient clipping: 1.0
- Warmup steps: 50
## Sampling and Reward Setup

### Policy rollout sampling

- Temperature: 0.8
- Top-p: 0.95
- Max new tokens: 512
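These settings map directly onto Transformers' sampling arguments. The sketch below is a hedged illustration, not the project's rollout loop (which lives in `training/train/policy_trainer.py`); the prompt is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jinrui123/llamasrnn-grpo-epoch001-merged"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Example rollout prompt", return_tensors="pt")
candidates = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,         # rollout temperature from this card
    top_p=0.95,              # nucleus sampling threshold
    max_new_tokens=512,      # rollout length cap
    num_return_sequences=8,  # group size G from the GRPO config
)
```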
### Predictor model

- Model: `meta-llama/Llama-3.2-3B-Instruct`
- Backend: vLLM
- Max model length: 8192
### RLN judge

- Model: `jinrui123/sft_llama3.2_3b_merged`
- Backend: vLLM
- Max model length: 8192
- Temperature: 0.3
- Max tokens: 1536
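A hedged sketch of querying the RLN judge through vLLM with these settings; the prompt contents and judging harness are assumptions, as the project's actual judging code is not published in this card:

```python
from vllm import LLM, SamplingParams

# Illustrative only: loads the judge with the max model length from this card.
judge = LLM(model="jinrui123/sft_llama3.2_3b_merged", max_model_len=8192)
params = SamplingParams(temperature=0.3, max_tokens=1536)

# Placeholder prompt; the real judging prompt format is project-internal.
outputs = judge.generate(["<judging prompt for one trajectory step>"], params)
print(outputs[0].outputs[0].text)
```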
### Rubric judge

- Model path during training: `/data/jf44684/TrainingDataParepation/models/RubricARM-8B-Judge`
- Backend: vLLM
- Temperature: 0.3
- Max tokens: 1536
### Reward composition

- Downstream accuracy bonus enabled: true
- Epsilon accuracy bonus: 0.2
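The exact reward composition is project-specific and not published here. Purely as an illustration of how a rubric score and the epsilon accuracy bonus could combine, a heavily hedged sketch (function name and additive form are assumptions):

```python
def total_reward(rubric_score: float, downstream_correct: bool, eps: float = 0.2) -> float:
    # Illustrative only: adds the epsilon bonus when the downstream
    # prediction is correct. The project's actual reward shaping may differ.
    return rubric_score + (eps if downstream_correct else 0.0)
```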
## Data / Run Size

From the saved epoch metadata:

- Training split: `data/splits/cleaned_df_train_100.json`
- Validation split: `data/splits/cleaned_df_val_100.json`
- Number of collected trajectory steps: 390
- Number of scored rollout groups: 390
- Debug mode: false
## Epoch 001 Metrics

Saved in `training/artifacts/epoch_001/meta.json`:

- Loss: 0.436767578125
- Policy loss: 0.435302734375
- KL: 0.0322265625
- Mean absolute advantage: 0.82421875
- Mean ratio: 0.8505859375
- Mean reward: 0.7509765625
These are training-time epoch averages over the 4 inner GRPO update steps for epoch 1.
## Merge Details

This repository was created by:

- Loading the base model `meta-llama/Llama-3.2-3B-Instruct`
- Loading the saved LoRA adapter from `training/artifacts/epoch_001/policy_lora`
- Calling `merge_and_unload()` with PEFT
- Saving the resulting merged full model weights
The original adapter checkpoint remains the more storage-efficient representation for continued adapter-based training.
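A minimal sketch of this merge procedure with PEFT; note the adapter path is the project-internal location named above and is not downloadable from this repository:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, attach the trained LoRA adapter, and fold the
# adapter weights into the base weights.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
policy = PeftModel.from_pretrained(base, "training/artifacts/epoch_001/policy_lora")
merged = policy.merge_and_unload()
merged.save_pretrained("llamasrnn-grpo-epoch001-merged")
```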
## Intended Use
This checkpoint is intended for research and experimentation within the LLMasRNN project setting:
- longitudinal clinical prediction
- diagnosis prediction conditioned on evolving patient summaries
- RL / GRPO policy experiments
- ablation or evaluation of merged policy checkpoints
It is not validated for clinical deployment, medical decision support in production, or unsupervised real-world medical use.
## Limitations
- This is an epoch-1 checkpoint, not a converged final model.
- The training objective is project-specific and depends on custom reward shaping and judge models.
- Training and evaluation were performed on internal project data/configuration rather than a standardized public benchmark release.
- Merged weights inherit the usage constraints and limitations of the base Llama model.
## Loading

Example with Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jinrui123/llamasrnn-grpo-epoch001-merged"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```
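Since the base model is an Instruct variant, prompts are typically formatted with the tokenizer's chat template. A hedged usage sketch continuing from the snippet above; the prompt is illustrative, as the project's actual clinical prompting format is not documented in this card:

```python
# Continues from the loading snippet above.
messages = [{"role": "user", "content": "Summarize the key findings in this patient history."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```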
## Notes on Licensing

This repository contains merged weights derived from `meta-llama/Llama-3.2-3B-Instruct`. Use and redistribution must comply with the license and access terms of the original base model.