SocialR1-4B is a social reasoning model built on Qwen3-4B, trained with trajectory-level reinforcement learning (GRPO) using the Social-R1 framework. It enhances social reasoning capabilities by aligning reasoning processes with the Social Information Processing (SIP) theory.
🎯 Multi-Dimensional Reward: Combines structural reward, content reward, inference efficiency, and format reward with curriculum-style weighting.
📊 Strong Performance: Enables a 4B-parameter model to match or outperform substantially larger baselines across static MCQ benchmarks, open-ended generation (FanToM), and interactive settings (SOTOPIA).
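To make the idea of a multi-dimensional reward with curriculum-style weighting concrete, here is a minimal sketch. It is not the Social-R1 implementation: the `RewardParts` container, the `curriculum_weights` schedule, and every numeric weight are hypothetical and chosen only for illustration. The sketch assumes each component reward is already a float and that training progress is a fraction in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Hypothetical container for the four component rewards."""
    structural: float
    content: float
    efficiency: float
    fmt: float


def curriculum_weights(progress: float) -> dict:
    """Illustrative schedule (assumption, not from the paper).

    Early in training the structural term dominates; later the content
    and efficiency terms gain weight. The format weight stays constant.
    """
    p = min(max(progress, 0.0), 1.0)  # clamp progress to [0, 1]
    return {
        "structural": 1.0 - 0.5 * p,
        "content": 0.5 + 0.5 * p,
        "efficiency": 0.2 * p,
        "fmt": 0.5,
    }


def combined_reward(parts: RewardParts, progress: float) -> float:
    """Weighted sum of the component rewards at a given training progress."""
    w = curriculum_weights(progress)
    return (
        w["structural"] * parts.structural
        + w["content"] * parts.content
        + w["efficiency"] * parts.efficiency
        + w["fmt"] * parts.fmt
    )
```

A linear schedule is used here only because it is easy to read; the actual weighting in Social-R1 may differ in both form and values.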
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
model_name = "Jincenzi/SocialR1-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Instruction asking for reasoning in <think> tags and the answer in <Answer> tags
messages = [{"role": "user", "content": "You should first think about the reasoning process in the mind and then provide with the answer.The reasoning process and answer are enclosed within <think> </think> and <Answer> </Answer> tags, respectively."}]

# Apply the chat template, tokenize, and generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)

# outputs[0] holds the prompt tokens followed by the generated tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
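Because the prompt asks for the reasoning inside `<think> </think>` and the answer inside `<Answer> </Answer>`, the decoded text can be split into those two parts. The helper below is a hypothetical sketch (the name `parse_response` is not part of the model or of `transformers`). Since the decoded string from the usage snippet also contains the prompt, which itself mentions both tag pairs, the sketch takes the last match of each pattern rather than the first.

```python
import re


def parse_response(text: str):
    """Return (reasoning, answer) from a decoded response.

    Takes the last <think>...</think> and <Answer>...</Answer> spans,
    so tag mentions inside the prompt are skipped. Either element is
    None when the corresponding tags are missing.
    """
    thinks = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answers = re.findall(r"<Answer>(.*?)</Answer>", text, flags=re.DOTALL)
    reasoning = thinks[-1].strip() if thinks else None
    answer = answers[-1].strip() if answers else None
    return reasoning, answer


# Example with a made-up response string
sample = (
    "enclosed within <think> </think> and <Answer> </Answer> tags, respectively.\n"
    "<think>Anna did not see the box being moved.</think>\n"
    "<Answer>B</Answer>"
)
reasoning, answer = parse_response(sample)
print(reasoning)
print(answer)
```

If generation stops at `max_new_tokens` before the closing tag is produced, the corresponding element comes back as `None` (or as the empty placeholder from the prompt), so callers may want to check for that case.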
Training Details
Base Model: Qwen3-4B
Training Method: Group Relative Policy Optimization (GRPO)
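For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, which removes the need for a learned value function. The snippet below is a minimal sketch of that group-relative normalization only, not the Social-R1 training code. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero
    when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical trajectory-level rewards for one prompt
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages.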
@inproceedings{wu2026socialr1,
  title={Social-R1: Enhancing Social Reasoning in LLMs through Trajectory-Level Reinforcement Learning},
  author={Wu, Jincenzi and Lei, Yuxuan and Lian, Jianxun and Huang, Yitian and Zhou, Lexin and Li, Haotian and Yang, Deng and Xie, Xing and Meng, Helen},
  booktitle={Arxiv},
  year={2026}
}