Model: Jincenzi/SocialR1-4B Source: Original Platform
| license | language | base_model | tags | datasets | pipeline_tag |
|---|---|---|---|---|---|
| mit | | | | | text-generation |
SocialR1-4B
SocialR1-4B is a social reasoning model built on Qwen3-4B and trained with trajectory-level reinforcement learning (GRPO) using the Social-R1 framework. It improves social reasoning by aligning the model's reasoning process with Social Information Processing (SIP) theory.
📄 Paper: Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning
Highlights
- 🧠 SIP-Guided Reasoning: Enforces stage-consistent social inference — Cue Encoding → Cue Interpretation → Goal Clarification → Response Generation
- 🎯 Multi-Dimensional Reward: Combines structural reward, content reward, inference efficiency, and format reward with curriculum-style weighting
- 📊 Strong Performance: Enables a 4B-parameter model to match or outperform substantially larger baselines across static MCQ benchmarks, open-ended generation (FanToM), and interactive settings (SOTOPIA)
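The multi-dimensional reward with curriculum-style weighting mentioned above can be pictured as a weighted sum whose weights change over training. The following is only an illustrative sketch: the `RewardParts` container, the `curriculum_weights` schedule, and every numeric weight are hypothetical and are not taken from the paper, which should be consulted for the actual formulation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RewardParts:
    """Per-trajectory reward components (assumed here to lie in [0, 1])."""
    struct: float   # SIP structural reward
    content: float  # SIP content reward
    length: float   # inference-efficiency reward
    fmt: float      # format reward


def curriculum_weights(step: int, total_steps: int) -> Dict[str, float]:
    """Hypothetical linear schedule: weight moves from format to content."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return {
        "struct": 0.3,
        "content": 0.2 + 0.3 * progress,
        "length": 0.1,
        "fmt": 0.4 - 0.3 * progress,
    }


def total_reward(parts: RewardParts, step: int, total_steps: int) -> float:
    """Weighted sum of the four components at a given training step."""
    w = curriculum_weights(step, total_steps)
    return (
        w["struct"] * parts.struct
        + w["content"] * parts.content
        + w["length"] * parts.length
        + w["fmt"] * parts.fmt
    )
```

In this sketch the weights always sum to 1, so a trajectory that scores 1 on every component receives a total reward of 1 at any step.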
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Jincenzi/SocialR1-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
messages = [
    # This prompt only carries the output-format instruction; append the actual question to it.
    {"role": "user", "content": "You should first think about the reasoning process in the mind and then provide with the answer.The reasoning process and answer are enclosed within <think> </think> and <Answer> </Answer> tags, respectively."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
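Because the prompt asks for the reasoning and the answer inside `<think>` and `<Answer>` tags, the decoded text can be split with a small helper. This is a minimal sketch, not part of the released code: `split_response` is a hypothetical name, and it assumes the tags survive decoding as literal text. Since the decoded string above also contains the prompt (which mentions the tags itself), the helper keeps the last match of each tag.

```python
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<Answer>(.*?)</Answer>", re.DOTALL)


def split_response(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) from a decoded generation.

    Uses the last occurrence of each tag pair so that tags quoted in the
    prompt are ignored. Either element is None when its tag is missing.
    """
    thinks = THINK_RE.findall(text)
    answers = ANSWER_RE.findall(text)
    reasoning = thinks[-1].strip() if thinks else None
    answer = answers[-1].strip() if answers else None
    return reasoning, answer
```

A call such as `reasoning, answer = split_response(tokenizer.decode(outputs[0], skip_special_tokens=True))` would then give the two parts separately. Checking for `None` is advisable, since a generation cut off by `max_new_tokens` may lack the closing tags.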
Training Details
- Base Model: Qwen3-4B
- Training Method: Group Relative Policy Optimization (GRPO)
- Training Steps: 600
- Hardware: 8× NVIDIA A100 (80GB)
- Group Size: 5
- KL Coefficient: 0.04
- Learning Rate: 5×10⁻⁷
- Reward Design: SIP structural reward ($R_\text{struct}$) + SIP content reward ($R_\text{cont}$) + inference efficiency ($R_\text{len}$) + format reward ($R_\text{fmt}$)
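For readers unfamiliar with GRPO, the core idea is that each prompt is sampled several times (here, a group of 5) and every sample's reward is normalized against its own group instead of a learned value function. The sketch below shows that normalization only. The function name, the `eps` constant, and the choice of population standard deviation are assumptions of this example; implementations differ on these details, and the KL penalty (coefficient 0.04 above) is applied separately in the loss.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 5 sampled trajectories for the same prompt (made-up rewards).
advantages = group_relative_advantages([0.2, 0.9, 0.4, 0.4, 0.6])
```

Trajectories that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. If all five rewards are equal, every advantage is zero and the group contributes no policy gradient signal.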
Evaluation
SocialR1-4B is evaluated across three complementary settings:
- Static MCQ: ToMBench, ToMBench-Hard, SocialIQA, SimpleToM, EmoBench, MotiveBench, Hi-ToM, TactfulToM
- Open-ended Generation: FanToM
- Interactive Social Intelligence: SOTOPIA
Related Resources
| Resource | Link |
|---|---|
| Paper | arXiv:2603.09249 |
| SocialR1-8B | Jincenzi/SocialR1-8B |
Citation
@inproceedings{wu2026socialr1,
title={Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning},
author={Wu, Jincenzi and Lei, Yuxuan and Lian, Jianxun and Huang, Yitian and Zhou, Lexin and Li, Haotian and Yang, Deng and Xie, Xing and Meng, Helen},
booktitle={Arxiv},
year={2026}
}
Contact
For questions or discussions, please contact jincenziwu@gmail.com.