license: other
language: en
library_name: transformers
tags: llama, llama-3, causal-lm, clinical, grpo, lora, merged-adapter, transformers
base_model: meta-llama/Llama-3.2-3B-Instruct
pipeline_tag: text-generation

LLMasRNN GRPO Policy Epoch 001 Merged

This repository contains a causal language model produced by merging a LoRA adapter trained in the LLMasRNN project into the base model meta-llama/Llama-3.2-3B-Instruct.

This is the merged checkpoint for:

  • Project: LLMasRNN
  • Training stage: GRPO Phase 1
  • Epoch artifact: training/artifacts/epoch_001
  • Base model: meta-llama/Llama-3.2-3B-Instruct
  • Output repo type: merged full weights

What This Model Is

The policy model is intended for longitudinal clinical prediction workflows in the LLMasRNN project. In this training phase, the model is optimized as a memory-update / policy head using GRPO-style reinforcement learning over trajectory rollouts and rubric-based rewards.

The weights uploaded to this repository are not just the adapter; they are the full merged model weights, created from:

  • Base model weights from meta-llama/Llama-3.2-3B-Instruct
  • A LoRA adapter saved at training/artifacts/epoch_001/policy_lora

Training Summary

The training run used a custom GRPO training loop implemented in training/train/policy_trainer.py and configured by training/configs/grpo_phase1.yaml.

High-level pipeline for one epoch:

  1. Collect RLN trajectories over the training split.
  2. Sample rollout candidates from the policy.
  3. Score candidates with a fixed rubric judge.
  4. Train the LoRA policy with a custom GRPO loss.
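
A minimal sketch of that loop, with stand-in callables in place of the project's actual functions in training/train/policy_trainer.py:

# Hypothetical outline of one GRPO epoch; the callables are stand-ins,
# not the project's real API.
def run_epoch(policy, train_split, *, collect, sample, score, update,
              group_size=8, inner_steps=4):
    trajectories = collect(policy, train_split)        # 1. collect RLN trajectories
    groups = [sample(policy, t, n=group_size)          # 2. rollout candidates per state
              for t in trajectories]
    rewards = [score(g) for g in groups]               # 3. rubric-judge scoring
    for _ in range(inner_steps):                       # 4. inner GRPO update steps
        update(policy, groups, rewards)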

This merged checkpoint corresponds to the first completed epoch.

Base Model

  • Base model: meta-llama/Llama-3.2-3B-Instruct
  • Architecture: causal LM
  • Total parameters after merge: 3,221,924,864

LoRA Configuration

The policy was trained as a LoRA adapter before merge.

  • PEFT type: LoRA
  • Task type: CAUSAL_LM
  • Rank r: 16
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Target modules:
    • q_proj
    • k_proj
    • v_proj
    • o_proj
  • Trainable parameters before merge: 9,175,040
  • Trainable fraction before merge: 0.2848%
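
For reference, this corresponds to a PEFT configuration along these lines (a sketch; the adapter_config.json saved with the adapter checkpoint is authoritative):

from peft import LoraConfig

# Values copied from the list above.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)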

GRPO / RL Training Configuration

From training/configs/grpo_phase1.yaml:

  • Framework: custom GRPO implementation
  • KL coefficient: 0.05
  • Clip range: 0.2
  • Importance clip: 5.0
  • Inner steps per epoch: 4
  • Batch size: 4 trajectory groups per update step
  • Rollouts per state / group size G: 8
  • Learning rate: 5e-6
  • Gradient clipping: 1.0
  • Warmup steps: 50
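
As a rough illustration of how these hyperparameters typically enter a GRPO-style objective (a sketch only; the actual loss is implemented in training/train/policy_trainer.py and may differ):

import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              kl_coef=0.05, clip_range=0.2, importance_clip=5.0):
    # Group-relative advantage: normalize each reward against its
    # group of G = 8 rollouts sampled from the same state.
    adv = (rewards - rewards.mean(dim=-1, keepdim=True)) \
        / (rewards.std(dim=-1, keepdim=True) + 1e-8)
    # Importance ratio between the current policy and the rollout-time
    # policy, clipped at 5.0 for stability.
    ratio = torch.exp(logp_new - logp_old).clamp(max=importance_clip)
    # PPO-style clipped surrogate objective with clip range 0.2.
    surrogate = torch.minimum(ratio * adv,
                              ratio.clamp(1 - clip_range, 1 + clip_range) * adv)
    # KL penalty toward the reference (base) policy, weighted by 0.05.
    kl = logp_new - logp_ref
    return -(surrogate - kl_coef * kl).mean()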

Sampling and Reward Setup

Policy rollout sampling

  • Temperature: 0.8
  • Top-p: 0.95
  • Max new tokens: 512
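
In transformers terms, these rollout settings map onto sampling arguments like the following (illustrative; the training loop's actual generation call is in the project code):

# Rollout sampling settings as listed above.
rollout_kwargs = dict(
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=512,
)
# e.g. model.generate(**inputs, **rollout_kwargs)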

Predictor model

  • Model: meta-llama/Llama-3.2-3B-Instruct
  • Backend: vLLM
  • Max model length: 8192

RLN judge

  • Model: jinrui123/sft_llama3.2_3b_merged
  • Backend: vLLM
  • Max model length: 8192
  • Temperature: 0.3
  • Max tokens: 1536

Rubric judge

  • Model path during training: /data/jf44684/TrainingDataParepation/models/RubricARM-8B-Judge
  • Backend: vLLM
  • Temperature: 0.3
  • Max tokens: 1536
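
Both judges (and the predictor) run behind vLLM; a minimal sketch of serving one of them, using the RLN judge settings above (illustrative, not the project's actual scoring harness):

from vllm import LLM, SamplingParams

# RLN judge configuration from above; the rubric judge is loaded the
# same way from its local model path.
judge = LLM(model="jinrui123/sft_llama3.2_3b_merged", max_model_len=8192)
params = SamplingParams(temperature=0.3, max_tokens=1536)
outputs = judge.generate(["<judge prompt here>"], params)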

Reward composition

  • Downstream accuracy bonus enabled: true
  • Epsilon accuracy bonus: 0.2
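
One plausible reading of this composition, shown only as a hypothetical sketch (the exact formula is defined in the training code):

# Hypothetical: rubric score plus a fixed bonus when the downstream
# prediction is correct, weighted by epsilon = 0.2.
def total_reward(rubric_score: float, downstream_correct: bool,
                 epsilon: float = 0.2) -> float:
    return rubric_score + (epsilon if downstream_correct else 0.0)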

Data / Run Size

From the saved epoch metadata:

  • Training split: data/splits/cleaned_df_train_100.json
  • Validation split: data/splits/cleaned_df_val_100.json
  • Number of collected trajectory steps: 390
  • Number of scored rollout groups: 390
  • Debug mode: false

Epoch 001 Metrics

Saved in training/artifacts/epoch_001/meta.json:

  • Loss: 0.436767578125
  • Policy loss: 0.435302734375
  • KL: 0.0322265625
  • Mean absolute advantage: 0.82421875
  • Mean ratio: 0.8505859375
  • Mean reward: 0.7509765625

These are training-time epoch averages over the 4 inner GRPO update steps for epoch 1.

Merge Details

This repository was created by:

  1. Loading the base model meta-llama/Llama-3.2-3B-Instruct
  2. Loading the saved LoRA adapter from training/artifacts/epoch_001/policy_lora
  3. Calling merge_and_unload() with PEFT
  4. Saving the resulting merged full model weights
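
A minimal sketch of that procedure with PEFT (paths as listed above; dtype and device handling omitted):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
policy = PeftModel.from_pretrained(base, "training/artifacts/epoch_001/policy_lora")
merged = policy.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("llamasrnn-grpo-epoch001-merged")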

The original adapter checkpoint remains the more storage-efficient representation for continued adapter-based training.

Intended Use

This checkpoint is intended for research and experimentation within the LLMasRNN project setting:

  • longitudinal clinical prediction
  • diagnosis prediction conditioned on evolving patient summaries
  • RL / GRPO policy experiments
  • ablation or evaluation of merged policy checkpoints

It is not validated for clinical deployment, medical decision support in production, or unsupervised real-world medical use.

Limitations

  • This is an epoch-1 checkpoint, not a converged final model.
  • The training objective is project-specific and depends on custom reward shaping and judge models.
  • Training and evaluation were performed on internal project data/configuration rather than a standardized public benchmark release.
  • Merged weights inherit the usage constraints and limitations of the base Llama model.

Loading

Example with Transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jinrui123/llamasrnn-grpo-epoch001-merged"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # optional; full fp32 works but uses ~2x memory
    device_map="auto",           # optional; requires accelerate
)
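
Since the base model is an Instruct variant, prompts should normally go through the chat template. A brief usage sketch continuing the example above (the prompt is purely illustrative):

messages = [{"role": "user", "content": "Summarize the patient's latest visit."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))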

Notes on Licensing

This repository contains merged weights derived from meta-llama/Llama-3.2-3B-Instruct. Use and redistribution must comply with the license and access terms of the original base model.
