---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
tags:
- qwen3
- memory
- memory-extraction
- tool-calling
- reasoning
- agent
base_model:
- Qwen/Qwen3-4B
---

# MemReader-4B-thinking

## Introduction

MemReader-4B-thinking is a 4B language model for long-term agent memory management. Instead of treating memory writing as a one-step structured extraction task, it formulates memory construction as a reasoning-and-action process: the model first evaluates whether incoming information is valuable, complete, and unambiguous, and then selects one of four memory operations:

- `add_memory`: write useful and complete information into long-term memory
- `search_memory`: retrieve historical memory for disambiguation
- `buffer_memory`: temporarily hold incomplete but potentially valuable information
- `ignore_memory`: discard low-value or repetitive content

Built on top of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), MemReader-4B-thinking is further optimized for memory management with supervised fine-tuning and GRPO. It is designed for long-horizon dialogue systems, personalized assistants, and agent frameworks that require low-noise, updatable, and retrievable long-term memory.

## News

- MemReader-4B-thinking is released as an open model for active memory management.
- The model is designed for tool-calling workflows and memory-centric agent systems.
- It is part of the MemReader family introduced in the paper *MemReader: Active Memory Management for Long-Term Agent Memory*.

## Usage

- Model ID: `IAAR-Shanghai/MemReader-4B-thinking`
- Base model: `Qwen/Qwen3-4B`
- Primary use: long-term memory extraction and memory management for agents
- Inference modes: `transformers`, OpenAI-compatible serving, `vLLM`, and SGLang

## Citation

If you use MemReader in your research or product, please cite:

```bibtex
@misc{kang2025memreader,
  title={MemReader: Active Memory Management for Long-Term Agent Memory},
  author={Kang, Jingyi and Li, Chunyu and Chen, Ding and Tang, Bo and Xiong, Feiyu and Li, Zhiyu},
  year={2026},
  note={Manuscript in preparation}
}
```

## Highlights

- Active memory management instead of passive memory extraction
- Explicit reasoning with thinking traces and tool calls
- Strong performance on ambiguity resolution, knowledge update, and temporal reasoning
- Native fit for OpenAI-style tool-calling workflows
- Efficient local deployment with a 4B parameter footprint
- Designed for integration with memory-centric agent systems such as MemOS

## What Makes MemReader Different

Most memory pipelines directly convert the current dialogue into JSON memories. In realistic settings, that approach is often insufficient:

- low-value chatter can pollute memory
- pronouns and missing references may require historical lookup
- some information is useful but not yet complete
- newer facts may need to update or overwrite older memory

MemReader-4B-thinking reframes memory writing as active memory management. Under a ReAct-style workflow, the model reasons before acting, making memory construction closer to how practical agent systems maintain state over time.

## Benchmark Performance

MemReader was evaluated on LOCOMO, LongMemEval, and HaluMem. The 4B GRPO version showed especially strong gains on knowledge update, temporal reasoning, and end-to-end memory usability.

### LOCOMO

| Model | Single Hop | Multi Hop | Temporal | Open Domain | Overall | F1 | Avg. Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MemOS (4o-mini) | 84.06% | 73.16% | 75.90% | 57.29% | 78.70% | 51.90% | 1854 |
| MemReader-0.6B | 84.70% | 76.95% | 76.22% | 53.40% | 79.56% | 52.54% | 1976 |
| MemReader-4B-SFT | 81.88% | 76.12% | 71.02% | 62.15% | 77.33% | 47.77% | 784 |
| MemReader-4B-GRPO | **85.37%** | **81.44%** | 75.80% | **65.62%** | **81.42%** | 49.45% | 1950 |

### LongMemEval

| Model | Avg. Token | SS-User | SS-Asst | SS-Pref | Multi-Session | Knowledge Update | Temporal Reasoning | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MemOS | 1400 | 95.71% | 67.86% | **96.67%** | 70.67% | 74.26% | 77.44% | 77.80% |
| EverMemOS | 2800 | **97.14%** | **85.71%** | 93.33% | 73.68% | 89.74% | 77.44% | **83.00%** |
| MemReader-0.6B | 1166 | 95.71% | 75.00% | 90.00% | **75.18%** | 82.05% | 75.90% | 80.20% |
| MemReader-4B-SFT | 963 | 97.10% | 69.64% | 90.00% | 71.42% | 85.80% | 78.19% | 80.00% |
| MemReader-4B-GRPO | **922** | 94.29% | 73.21% | 90.00% | 73.68% | **91.03%** | **84.21%** | **83.00%** |

### HaluMem

The full HaluMem table in the paper is relatively long. Below we report a compact subset of the memory extraction and memory updating results.

| Model | Extraction Recall | Extraction Weighted Recall | Extraction F1 | Update Correctness | Update Hallucination | Update Omission |
| --- | --- | --- | --- | --- | --- | --- |
| MemOS | 74.07% | 84.81% | 79.70% | 62.11% | 0.42% | 37.48% |
| MemReader-0.6B | 88.40% | 91.38% | 93.76% | 82.69% | 0.77% | 16.51% |
| MemReader-4B-SFT | 93.56% | 95.49% | 96.61% | 90.78% | 0.26% | 8.74% |
| MemReader-4B-GRPO | **96.57%** | **97.19%** | **98.21%** | **94.55%** | 0.32% | **5.12%** |

These results show that stronger memory writing quality also translates into better memory updating behavior, especially on correctness and omission.

## Recommended Use Cases

- long-term conversational agents
- personalized assistants
- agent memory extraction pipelines
- memory update and conflict resolution workflows
- retrieval-augmented memory systems

## Intended Use

MemReader-4B-thinking is intended for research and production scenarios where an agent needs to convert conversational context into structured long-term memory. Typical use cases include memory extraction, ambiguity resolution with retrieval, memory update pipelines, and persistent assistant systems.

The model is especially suitable when the application requires explicit control over memory-writing behavior through tool calls such as `search_memory`, `add_memory`, `buffer_memory`, and `ignore_memory`.

## Model Specs

- Base model: `Qwen/Qwen3-4B`
- Parameters: 4B
- Tensor type: BF16
- Architecture: `Qwen3ForCausalLM`
- Context length: 40,960 tokens
- Primary capability: reasoning-driven memory extraction with tool calling

## Quickstart

### OpenAI-Compatible API Example

The following example calls the model through an OpenAI-compatible endpoint with required tool calling.

```python
import requests

url = "https://YOUR_ENDPOINT/v1/chat/completions"

payload = {
    "model": "IAAR-Shanghai/MemReader-4B-thinking",
    # When posting raw JSON to a vLLM-style OpenAI-compatible server,
    # chat_template_kwargs belongs at the top level of the request body;
    # the "extra_body" wrapper is an OpenAI Python SDK concept, not a JSON field.
    "chat_template_kwargs": {"enable_thinking": True},
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a memory extraction agent. Your job is to analyze "
                "conversations and decide what information is worth storing in "
                "long-term memory.\n\n"
                "Available actions (call exactly one per turn):\n"
                "- search_memory: Search existing memories for context\n"
                "- add_memory: Extract and store valuable facts, preferences, or events\n"
                "- buffer_memory: Accumulate this turn and wait for more context\n"
                "- ignore_memory: Nothing worth storing\n\n"
                "Guidelines:\n"
                "- Store specific, verifiable facts\n"
                "- Do not store generic greetings, chitchat, or vague statements\n"
                "- UserMemory: personal attributes or preferences about the user\n"
                "- LongTermMemory: facts, events, or shared knowledge from the conversation\n"
                "- If unsure whether information already exists, call search_memory first"
            ),
        },
        {
            "role": "user",
            "content": (
                "Please analyze the following conversation and decide what to store:\n\n"
                "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
                "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
                "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
            ),
        },
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_memory",
                "description": "Search historical memories for context.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "add_memory",
                "description": "Extract and store memories.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "memory_list": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "key": {"type": "string"},
                                    "memory_type": {
                                        "type": "string",
                                        "enum": ["LongTermMemory", "UserMemory"],
                                    },
                                    "value": {"type": "string"},
                                    "tags": {
                                        "type": "array",
                                        "items": {"type": "string"},
                                    },
                                },
                                "required": ["key", "memory_type", "value", "tags"],
                            },
                        },
                        "summary": {"type": "string"},
                    },
                    "required": ["memory_list", "summary"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "buffer_memory",
                "description": "Buffer for later processing.",
                "parameters": {
                    "type": "object",
                    "properties": {"reason": {"type": "string"}},
                    "required": ["reason"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "ignore_memory",
                "description": "Ignore low-value content.",
                "parameters": {
                    "type": "object",
                    "properties": {"reason": {"type": "string"}},
                    "required": ["reason"],
                },
            },
        },
    ],
    "tool_choice": "required",
    "temperature": 0.2,
    "max_tokens": 1024,
}

headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

response = requests.post(url, headers=headers, json=payload)
print(response.text)
```

### Hugging Face Transformers Usage

You can also load the model directly from Hugging Face and run memory extraction with tool calling.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IAAR-Shanghai/MemReader-4B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a memory extraction agent. Analyze conversations and decide "
            "what information should be stored in long-term memory."
        ),
    },
    {
        "role": "user",
        "content": (
            "Please analyze the following conversation and decide what to store:\n\n"
            "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
            "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
            "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
        ),
    },
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Search historical memories for context.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "add_memory",
            "description": "Extract and store memories.",
            "parameters": {
                "type": "object",
                "properties": {
                    "memory_list": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "key": {"type": "string"},
                                "memory_type": {
                                    "type": "string",
                                    "enum": ["LongTermMemory", "UserMemory"],
                                },
                                "value": {"type": "string"},
                                "tags": {
                                    "type": "array",
                                    "items": {"type": "string"},
                                },
                            },
                            "required": ["key", "memory_type", "value", "tags"],
                        },
                    },
                    "summary": {"type": "string"},
                },
                "required": ["memory_list", "summary"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "buffer_memory",
            "description": "Buffer for later processing.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ignore_memory",
            "description": "Ignore low-value content.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
# Strip the prompt tokens so only the newly generated text is decoded.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)
```

### vLLM Usage

Start an OpenAI-compatible vLLM server:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model IAAR-Shanghai/MemReader-4B-thinking \
    --served-model-name MemReader-4B-thinking \
    --port 8000 \
    --tensor-parallel-size 1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```

Then send a standard chat completion request to `http://localhost:8000/v1/chat/completions`.

### SGLang Usage

MemReader-4B-thinking can also be deployed with SGLang through its OpenAI-compatible serving interface. Please make sure tool calling and thinking mode are enabled in your serving configuration.
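
As a starting point, a launch command along the following lines should enable both; the `--tool-call-parser` and `--reasoning-parser` values are assumptions based on SGLang's Qwen3 support, so verify them against your SGLang version's documentation:

```bash
python -m sglang.launch_server \
    --model-path IAAR-Shanghai/MemReader-4B-thinking \
    --port 30000 \
    --tool-call-parser qwen25 \
    --reasoning-parser qwen3
```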

## Output Format

MemReader-4B-thinking is trained to produce thinking traces and tool calls. A typical response looks like this:

```xml
<think>
The conversation refers to an already known project and adds a new update:
Michael plans to produce a Python vs Rust benchmark report this week.
This is valuable project-state information and should be added to memory.
</think>

<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael said the recommendation system refactoring project is still in evaluation, and he plans to produce a Python-vs-Rust benchmark report this week for the core modules under consideration for Rust rewriting.", "tags": ["project", "Rust", "benchmark", "refactoring"]}], "summary": "Added one memory about the project update and the planned benchmark report."}}
</tool_call>
```

## Best Practices

- Use `search_memory` first when the conversation contains pronouns, ellipsis, or implicit historical references.
- Use `buffer_memory` only when the information is genuinely incomplete and cannot be resolved from history.
- Keep tool definitions stable between training and inference.
- For production pipelines, execute tool calls externally and feed tool responses back to the model when multi-step reasoning is needed (see the sketch after this list).
- If you want shorter outputs, reduce `max_tokens` and control whether thinking traces are exposed in your serving layer.
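
The feedback loop mentioned in the fourth bullet can be as small as the sketch below, which reuses the `client`, `messages`, and `tools` from the vLLM section and the `handle_tool_call` helper from the Introduction; the two-round wiring is an illustrative assumption, not a fixed MemReader interface:

```python
# Round 1: let the model pick an operation.
first = client.chat.completions.create(
    model="MemReader-4B-thinking",
    messages=messages,
    tools=tools,
    tool_choice="required",
)
call = first.choices[0].message.tool_calls[0]

# Execute the call externally, e.g. run the actual memory search.
result = handle_tool_call(call.function.name, call.function.arguments)

# Round 2: feed the tool result back so the model can finish, e.g. turn
# search_memory hits into a disambiguated add_memory call.
followup = client.chat.completions.create(
    model="MemReader-4B-thinking",
    messages=messages
    + [
        first.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": result},
    ],
    tools=tools,
    tool_choice="required",
)
print(followup.choices[0].message.tool_calls[0].function.name)
```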

## Limitations

- The model is optimized for memory-management scenarios rather than general-purpose chatting.
- Quality depends on the external memory schema, retrieval quality, and tool-execution loop.
- For highly domain-specific memory schemas, additional instruction tuning may still be beneficial.
- As with other LLMs, outputs may still contain mistakes, omissions, or unsupported inferences and should be validated in safety-critical workflows.

## License Notice

This model is released under the Apache-2.0 license. As it is derived from [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), users should also review and comply with the upstream base model license, usage terms, and any applicable third-party requirements before deployment.

## Links

- GitHub: [MemTensor/MemOS](https://github.com/MemTensor/MemOS)
- API Documentation: [docs.openmem.net](https://docs.openmem.net/)
- Model: [IAAR-Shanghai/MemReader-4B-thinking](https://huggingface.co/IAAR-Shanghai/MemReader-4B-thinking)
- Base model: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)