From df76905a233d6d9c6021b8bba8c39794f4313ca7 Mon Sep 17 00:00:00 2001
From: Homy-Xu <muzhihai@sjtu.edu.cn>
Date: Sun, 5 Jul 2026 11:24:22 +0800
Subject: [PATCH] Expand benchmarks section in README.md

Regrouped the benchmarks into Accuracy, Recall, Robustness, and Efficiency subsections, added new benchmarks, and listed the metrics each one reports. Also corrected the LoCoMo author name and arXiv link, and added dataset, project, and GitHub links.
---
 README.md | 98 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 80 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index d770a7e..a6296c7 100644
--- a/README.md
+++ b/README.md
@@ -113,29 +113,91 @@ Systems that package memory into complex, multi-part data containers combining u
 
 The following benchmarks are used to evaluate agent memory systems, covering task effectiveness, retrieval fidelity, update robustness, long-horizon stability, and operational cost.
 
-1. **LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents**
-   Adyasha Maharana, Dong-Ho Lee, Sergey Turishcheva, et al. *ACL 2024*. [[Paper](https://arxiv.org/abs/2402.10790)]
-   - Long-conversation QA benchmark testing episodic, temporal, open-domain, and single-hop memory over multi-turn interactions (50 multi-modal chats; 9,209 tokens and 304 turns avg.)
+### Accuracy
 
-2. **LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory**
-   Di Wu, Hongwei Wang, Wenhao Yu, et al. *ICLR 2025*. [[Paper](https://arxiv.org/abs/2410.10813)]
-   - Multi-session long-memory benchmark evaluating cross-session QA and temporal knowledge updates (500 QA pairs; up to 1.5M tokens)
+Benchmarks primarily reporting exact answer correctness, task success, tool-call correctness, or state consistency.
 
-3. **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks**
-   Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. *ACL 2025 / arXiv 2024*. [[Paper](https://arxiv.org/abs/2412.15204)]
-   - Extreme long-context benchmark with 503 multiple-choice questions spanning 8K to 2M-word contexts, used for short, medium, and long context-length stability analysis
+1. **HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering** ![](https://img.shields.io/badge/-113K_QA-lightgrey) ![](https://img.shields.io/badge/-multi_hop_QA-blue) ![](https://img.shields.io/badge/-explainable_reasoning-yellowgreen) ![](https://img.shields.io/badge/-supporting_facts-orange)
+   *Zhilin Yang, Peng Qi, Saizheng Zhang, et al. EMNLP, 2018.* [[Paper](https://arxiv.org/abs/1809.09600)] [[Dataset](https://hotpotqa.github.io/)] [[Github](https://github.com/hotpotqa/hotpot)]
+   - Metrics: Exact Match, F1, Supporting Fact EM/F1, Joint EM/F1.
 
-4. **LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners**
-   Junhao Zheng, Xidi Cai, Qiuke Li, et al. *arXiv 2025*. [[Paper](https://arxiv.org/abs/2505.11942)]
-   - Evaluates sequential procedural skill transfer across structurally related database, operating system, and knowledge graph tasks (1,396 tasks sharing atomic skills)
+2. **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks** ![](https://img.shields.io/badge/-503_QA-lightgrey) ![](https://img.shields.io/badge/-single_doc_QA-blue) ![](https://img.shields.io/badge/-multi_doc_QA-blue) ![](https://img.shields.io/badge/-long_context-purple) ![](https://img.shields.io/badge/-structured_data-orange)
+   *Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. ACL 2025 / arXiv 2024.* [[Paper](https://arxiv.org/abs/2412.15204)] [[Dataset](https://huggingface.co/datasets/zai-org/LongBench-v2)] [[Github](https://github.com/THUDM/LongBench)]
+   - Metrics: Accuracy over single-document QA, multi-document QA, long in-context learning, long-dialogue history, code repository, and long structured data.
 
-5. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents**
-   Haoran Tan, Zeyu Zhang, Chen Ma, et al. *ACL 2025 (Findings)*. [[Paper](https://arxiv.org/abs/2506.21605)]
-   - Measures memory capabilities across different abstraction levels (factual vs. reflective) and noise conditions, with stress tests up to 100K sessions
+3. **MuSiQue: Multihop Questions via Single-hop Question Composition** ![](https://img.shields.io/badge/-49.6K_QA-lightgrey) ![](https://img.shields.io/badge/-multi_hop_QA-blue) ![](https://img.shields.io/badge/-connected_reasoning-yellowgreen) ![](https://img.shields.io/badge/-answerability-orange)
+   *Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal. TACL, 2022.* [[Paper](https://arxiv.org/abs/2108.00573)] [[Dataset](https://github.com/stonybrooknlp/musique)]
+   - Metrics: Answer F1, Support F1, An+Sf, Sp+Sf.
 
-6. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions**
-   Yuanzhe Hu, Yu Wang, Julian McAuley. *arXiv 2025*. [[Paper](https://arxiv.org/abs/2507.05257)]
-   - Tests four core competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting across 14 datasets with context lengths ranging from 103K to 1.44M tokens
+4. **Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents** ![](https://img.shields.io/badge/-2K_sessions-lightgrey) ![](https://img.shields.io/badge/-400_tasks-lightgrey) ![](https://img.shields.io/badge/-tool_use-purple) ![](https://img.shields.io/badge/-memory_action_alignment-green) ![](https://img.shields.io/badge/-parameter_grounding-orange)
+   *Yiting Shen, Kun Li, Wei Zhou, Songlin Hu. ACL 2026 / arXiv 2026.* [[Paper](https://arxiv.org/abs/2601.19935)] [[Dataset](https://anonymous.4open.science/r/Mem2ActBench-29AC/)] [[Github](https://github.com/Cantaloupe-M/Mem2ActBench)]
+   - Metrics: Parameter-level F1, BLEU-1, Tool Accuracy.
+
+5. **MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments** ![](https://img.shields.io/badge/-128_tasks-lightgrey) ![](https://img.shields.io/badge/-26_apps-lightgrey) ![](https://img.shields.io/badge/-mobile_GUI-purple) ![](https://img.shields.io/badge/-cross_app_workflow-green) ![](https://img.shields.io/badge/-progressive_judge-orange)
+   *Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.06075)] [[Dataset](https://memgui-bench.github.io/)] [[Github](https://github.com/lgy0404/MemGUI-Bench)]
+   - Metrics: Pass@k, task success rate, memory-task proficiency ratio, staged LLM-as-a-judge result.
+
+[⬆️top](#table-of-contents)
+
+### Recall
+
+Benchmarks primarily measuring whether relevant facts, evidence, memories, or long-context needles are retrieved and used.
+
+1. **LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents** ![](https://img.shields.io/badge/-1.9K_QA-lightgrey) ![](https://img.shields.io/badge/-10_dialogues-lightgrey) ![](https://img.shields.io/badge/-long_conversation-green) ![](https://img.shields.io/badge/-temporal_QA-orange) ![](https://img.shields.io/badge/-multimodal_dialogue-purple)
+   *Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, et al. ACL 2024.* [[Paper](https://arxiv.org/abs/2402.17753)] [[Dataset](https://github.com/snap-research/LoCoMo)] [[Project](https://snap-research.github.io/locomo/)]
+   - Metrics: F1, Recall, ROUGE, FactScore, MM-Relevance, BLEU.
+
+2. **LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory** ![](https://img.shields.io/badge/-500_queries-lightgrey) ![](https://img.shields.io/badge/-115K_to_1.5M_tokens-lightgrey) ![](https://img.shields.io/badge/-cross_session_QA-green) ![](https://img.shields.io/badge/-knowledge_update-orange) ![](https://img.shields.io/badge/-temporal_reasoning-yellowgreen)
+   *Di Wu, Hongwei Wang, Wenhao Yu, et al. ICLR 2025.* [[Paper](https://arxiv.org/abs/2410.10813)] [[Dataset](https://github.com/xiaowu0162/LongMemEval)] [[Project](https://xiaowu0162.github.io/long-mem-eval/)]
+   - Metrics: Accuracy / LLM-Judge over information extraction, multi-session reasoning, temporal reasoning, knowledge update, and abstention.
+
+3. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents** ![](https://img.shields.io/badge/-53K_questions-lightgrey) ![](https://img.shields.io/badge/-65K_sessions-lightgrey) ![](https://img.shields.io/badge/-factual_memory-green) ![](https://img.shields.io/badge/-reflective_memory-yellowgreen) ![](https://img.shields.io/badge/-capacity_test-orange) ![](https://img.shields.io/badge/-efficiency-red)
+   *Haoran Tan, Zeyu Zhang, Chen Ma, et al. ACL 2025 Findings.* [[Paper](https://arxiv.org/abs/2506.21605)] [[Dataset](https://github.com/import-myself/Membench)]
+   - Metrics: Memory Accuracy, Memory Recall, Memory Capacity, Memory Efficiency.
+
+4. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions** ![](https://img.shields.io/badge/-2,071_QA-lightgrey) ![](https://img.shields.io/badge/-2.1K_items-lightgrey) ![](https://img.shields.io/badge/-accurate_retrieval-green) ![](https://img.shields.io/badge/-test_time_learning-orange) ![](https://img.shields.io/badge/-long_range_understanding-purple) ![](https://img.shields.io/badge/-conflict_resolution-yellowgreen)
+   *Yuanzhe Hu, Yu Wang, Julian McAuley. arXiv 2025.* [[Paper](https://arxiv.org/abs/2507.05257)] [[Dataset](https://www.modelscope.cn/datasets/AI-ModelScope/MemoryAgentBench)] [[Github](https://github.com/HUST-AI-HYZ/MemoryAgentBench)]
+   - Metrics: Accuracy, Exact Match/SubEM, Recall@5, and F1 or LLM-as-a-judge scores across accurate retrieval, test-time learning, long-range understanding, and conflict resolution.
+
+5. **RULER: What's the Real Context Size of Your Long-Context Language Models?** ![](https://img.shields.io/badge/-13_tasks-lightgrey) ![](https://img.shields.io/badge/-synthetic_benchmark-purple) ![](https://img.shields.io/badge/-needle_in_haystack-green) ![](https://img.shields.io/badge/-multi_hop_tracing-blue) ![](https://img.shields.io/badge/-aggregation-orange)
+   *Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, et al. COLM 2024.* [[Paper](https://arxiv.org/abs/2404.06654)] [[Dataset](https://github.com/NVIDIA/RULER)]
+   - Metrics: Accuracy and effective context length over retrieval, multi-hop tracing, aggregation, and QA tasks.
+
+[⬆️top](#table-of-contents)
+
+### Robustness
+
+Benchmarks primarily stressing conflict resolution, temporal updates, failure recovery, catastrophic forgetting, or long-horizon stability.
+
+1. **StreamBench: Towards Benchmarking Continuous Improvement of Language Agents** ![](https://img.shields.io/badge/-9.7K_instances-lightgrey) ![](https://img.shields.io/badge/-streaming_learning-orange) ![](https://img.shields.io/badge/-conflict_resolution-yellowgreen) ![](https://img.shields.io/badge/-online_feedback-green)
+   *Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, et al. NeurIPS 2024.* [[Paper](https://arxiv.org/abs/2406.08747)] [[Dataset](https://github.com/stream-bench/stream-bench)] [[Project](https://stream-bench.github.io/)]
+   - Metrics: Execution accuracy, Pass@1, API-call accuracy, diagnostic accuracy, exact match.
+
+2. **LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners** ![](https://img.shields.io/badge/-1.4K_tasks-lightgrey) ![](https://img.shields.io/badge/-database-purple) ![](https://img.shields.io/badge/-operating_system-purple) ![](https://img.shields.io/badge/-knowledge_graph-purple) ![](https://img.shields.io/badge/-skill_transfer-green) ![](https://img.shields.io/badge/-forgetting_drop-orange)
+   *Junhao Zheng, Xidi Cai, Qiuke Li, et al. arXiv 2025.* [[Paper](https://arxiv.org/abs/2505.11942)] [[Dataset](https://github.com/caixd-220529/LifelongAgentBench)] [[Project](https://caixd-220529.github.io/LifelongAgentBench/)]
+   - Metrics: Task Success Rate, transfer success, retention, and forgetting drop across database, operating-system, and knowledge-graph tasks.
+
+3. **MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks** ![](https://img.shields.io/badge/-766_tasks-lightgrey) ![](https://img.shields.io/badge/-5_subsets-lightgrey) ![](https://img.shields.io/badge/-multi_session_agentic_tasks-purple) ![](https://img.shields.io/badge/-interdependent_subtasks-yellowgreen) ![](https://img.shields.io/badge/-decision_memory-green)
+   *Zexue He, Yu Wang, Churan Zhi, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.16313)] [[Dataset](https://huggingface.co/datasets/ZexueHe/memoryarena)] [[Github](https://github.com/ZexueHe/MemoryArena)]
+   - Metrics: Task Success Rate (SR) and Task Progress Score (PS) over bundled shopping, progressive search, group travel planning, formal-reasoning math, and formal-reasoning physics.
+
+[⬆️top](#table-of-contents)
+
+### Efficiency
+
+Benchmarks reporting operational cost, time, step ratio, token usage, or memory-system overhead in addition to task quality.
+
+1. **MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments** ![](https://img.shields.io/badge/-128_tasks-lightgrey) ![](https://img.shields.io/badge/-step_ratio-red) ![](https://img.shields.io/badge/-time_per_step-red) ![](https://img.shields.io/badge/-cost_per_step-red) ![](https://img.shields.io/badge/-mobile_GUI-purple)
+   *Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.06075)] [[Dataset](https://memgui-bench.github.io/)] [[Github](https://github.com/lgy0404/MemGUI-Bench)]
+   - Metrics: Step Ratio, Time per Step, Cost per Step, task completion latency.
+
+2. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents** ![](https://img.shields.io/badge/-53K_questions-lightgrey) ![](https://img.shields.io/badge/-65K_sessions-lightgrey) ![](https://img.shields.io/badge/-latency-red) ![](https://img.shields.io/badge/-capacity-orange) ![](https://img.shields.io/badge/-memory_overhead-yellowgreen)
+   *Haoran Tan, Zeyu Zhang, Chen Ma, et al. ACL 2025 Findings.* [[Paper](https://arxiv.org/abs/2506.21605)] [[Dataset](https://github.com/import-myself/Membench)]
+   - Metrics: Inference time, recall-efficiency trade-off, memory capacity degradation threshold.
+
+3. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions** ![](https://img.shields.io/badge/-2.1K_items-lightgrey) ![](https://img.shields.io/badge/-context_103K_to_1.44M-lightgrey) ![](https://img.shields.io/badge/-latency-red) ![](https://img.shields.io/badge/-fragmentation-orange) ![](https://img.shields.io/badge/-retrieval_cost-yellowgreen)
+   *Yuanzhe Hu, Yu Wang, Julian McAuley. arXiv 2025.* [[Paper](https://arxiv.org/abs/2507.05257)] [[Dataset](https://www.modelscope.cn/datasets/AI-ModelScope/MemoryAgentBench)] [[Github](https://github.com/HUST-AI-HYZ/MemoryAgentBench)]
+   - Metrics: Runtime, retrieval overhead, memory fragmentation, context-length stress cost.
 
 [⬆️top](#table-of-contents)