From df76905a233d6d9c6021b8bba8c39794f4313ca7 Mon Sep 17 00:00:00 2001
From: Homy-Xu
Date: Sun, 5 Jul 2026 11:24:22 +0800
Subject: [PATCH] Expand benchmarks section in README.md

Added new benchmarks and metrics for evaluating agent memory systems, including accuracy, recall, robustness, and efficiency. Updated references and added descriptions for various benchmarks.
---
 README.md | 98 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 80 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index d770a7e..a6296c7 100644
--- a/README.md
+++ b/README.md
@@ -113,29 +113,91 @@ Systems that package memory into complex, multi-part data containers combining u
 
 The following benchmarks are used to evaluate agent memory systems, covering task effectiveness, retrieval fidelity, update robustness, long-horizon stability, and operational cost.
 
-1. **LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents**
-   Adyasha Maharana, Dong-Ho Lee, Sergey Turishcheva, et al. *ACL 2024*. [[Paper](https://arxiv.org/abs/2402.10790)]
-   Long-conversation QA benchmark testing episodic, temporal, open-domain, and single-hop memory over multi-turn interactions (50 multi-modal chats; 9,209 tokens and 304 turns avg.)
+### Accuracy
 
-2. **LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory**
-   Di Wu, Hongwei Wang, Wenhao Yu, et al. *ICLR 2025*. [[Paper](https://arxiv.org/abs/2410.10813)]
-   Multi-session long-memory benchmark evaluating cross-session QA and temporal knowledge updates (500 QA pairs; up to 1.5M tokens)
+Benchmarks primarily reporting exact answer correctness, task success, tool-call correctness, or state consistency.
 
-3. **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks**
-   Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. *ACL 2025 / arXiv 2024*. [[Paper](https://arxiv.org/abs/2412.15204)]
-   Extreme long-context benchmark with 503 multiple-choice questions spanning 8K to 2M-word contexts, used for short, medium, and long context-length stability analysis
+1. **HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering** ![](https://img.shields.io/badge/-113K_QA-lightgrey) ![](https://img.shields.io/badge/-multi_hop_QA-blue) ![](https://img.shields.io/badge/-explainable_reasoning-yellowgreen) ![](https://img.shields.io/badge/-supporting_facts-orange)
+   *Zhilin Yang, Peng Qi, Saizheng Zhang, et al. EMNLP, 2018.* [[Paper](https://arxiv.org/abs/1809.09600)] [[Dataset](https://hotpotqa.github.io/)] [[Github](https://github.com/hotpotqa/hotpot)]
+   - Metrics: Exact Match, F1, Supporting Fact EM/F1, Joint EM/F1.
 
-4. **LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners**
-   Junhao Zheng, Xidi Cai, Qiuke Li, et al. *arXiv 2025*. [[Paper](https://arxiv.org/abs/2505.11942)]
-   Evaluates sequential procedural skill transfer across structurally related database, operating system, and knowledge graph tasks (1,396 tasks sharing atomic skills)
+2. **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks** ![](https://img.shields.io/badge/-503_QA-lightgrey) ![](https://img.shields.io/badge/-single_doc_QA-blue) ![](https://img.shields.io/badge/-multi_doc_QA-blue) ![](https://img.shields.io/badge/-long_context-purple) ![](https://img.shields.io/badge/-structured_data-orange)
+   *Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. ACL 2025 / arXiv 2024.* [[Paper](https://arxiv.org/abs/2412.15204)] [[Dataset](https://huggingface.co/datasets/zai-org/LongBench-v2)] [[Github](https://github.com/THUDM/LongBench)]
+   - Metrics: Accuracy over single-document QA, multi-document QA, long in-context learning, long-dialogue history, code repository, and long structured data.
 
-5. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents**
-   Haoran Tan, Zeyu Zhang, Chen Ma, et al. *ACL 2025 (Findings)*. [[Paper](https://arxiv.org/abs/2506.21605)]
-   Measures memory capabilities across different abstraction levels (factual vs. reflective) and noise conditions, with stress tests up to 100K sessions
+3. **MuSiQue: Multihop Questions via Single-hop Question Composition** ![](https://img.shields.io/badge/-49.6K_QA-lightgrey) ![](https://img.shields.io/badge/-multi_hop_QA-blue) ![](https://img.shields.io/badge/-connected_reasoning-yellowgreen) ![](https://img.shields.io/badge/-answerability-orange)
+   *Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal. TACL, 2022.* [[Paper](https://arxiv.org/abs/2108.00573)] [[Dataset](https://github.com/stonybrooknlp/musique)]
+   - Metrics: Answer F1, Support F1, An+Sf, Sp+Sf.
 
-6. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions**
-   Yuanzhe Hu, Yu Wang, Julian McAuley. *arXiv 2025*. [[Paper](https://arxiv.org/abs/2507.05257)]
-   Tests four core competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting across 14 datasets with context lengths ranging from 103K to 1.44M tokens
+4. **Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents** ![](https://img.shields.io/badge/-2K_sessions-lightgrey) ![](https://img.shields.io/badge/-400_tasks-lightgrey) ![](https://img.shields.io/badge/-tool_use-purple) ![](https://img.shields.io/badge/-memory_action_alignment-green) ![](https://img.shields.io/badge/-parameter_grounding-orange)
+   *Yiting Shen, Kun Li, Wei Zhou, Songlin Hu. ACL 2026 / arXiv 2026.* [[Paper](https://arxiv.org/abs/2601.19935)] [[Dataset](https://anonymous.4open.science/r/Mem2ActBench-29AC/)] [[Github](https://github.com/Cantaloupe-M/Mem2ActBench)]
+   - Metrics: Parameter-level F1, BLEU-1, Tool Accuracy.
+
+5. **MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments** ![](https://img.shields.io/badge/-128_tasks-lightgrey) ![](https://img.shields.io/badge/-26_apps-lightgrey) ![](https://img.shields.io/badge/-mobile_GUI-purple) ![](https://img.shields.io/badge/-cross_app_workflow-green) ![](https://img.shields.io/badge/-progressive_judge-orange)
+   *Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.06075)] [[Dataset](https://memgui-bench.github.io/)] [[Github](https://github.com/lgy0404/MemGUI-Bench)]
+   - Metrics: Pass@k, task success rate, memory-task proficiency ratio, staged LLM-as-a-judge result.
+
+[⬆️top](#table-of-contents)
+
+### Recall
+
+Benchmarks primarily measuring whether relevant facts, evidence, memories, or long-context needles are retrieved and used.
+
+1. **LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents** ![](https://img.shields.io/badge/-1.9K_QA-lightgrey) ![](https://img.shields.io/badge/-10_dialogues-lightgrey) ![](https://img.shields.io/badge/-long_conversation-green) ![](https://img.shields.io/badge/-temporal_QA-orange) ![](https://img.shields.io/badge/-multimodal_dialogue-purple)
+   *Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, et al. ACL 2024.* [[Paper](https://arxiv.org/abs/2402.17753)] [[Dataset](https://github.com/snap-research/LoCoMo)] [[Project](https://snap-research.github.io/locomo/)]
+   - Metrics: F1, Recall, ROUGE, FactScore, MM-Relevance, BLEU.
+
+2. **LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory** ![](https://img.shields.io/badge/-500_queries-lightgrey) ![](https://img.shields.io/badge/-115K_to_1.5M_tokens-lightgrey) ![](https://img.shields.io/badge/-cross_session_QA-green) ![](https://img.shields.io/badge/-knowledge_update-orange) ![](https://img.shields.io/badge/-temporal_reasoning-yellowgreen)
+   *Di Wu, Hongwei Wang, Wenhao Yu, et al. ICLR 2025.* [[Paper](https://arxiv.org/abs/2410.10813)] [[Dataset](https://github.com/xiaowu0162/LongMemEval)] [[Project](https://xiaowu0162.github.io/long-mem-eval/)]
+   - Metrics: Accuracy / LLM-Judge over information extraction, multi-session reasoning, temporal reasoning, knowledge update, and abstention.
+
+3. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents** ![](https://img.shields.io/badge/-53K_questions-lightgrey) ![](https://img.shields.io/badge/-65K_sessions-lightgrey) ![](https://img.shields.io/badge/-factual_memory-green) ![](https://img.shields.io/badge/-reflective_memory-yellowgreen) ![](https://img.shields.io/badge/-capacity_test-orange) ![](https://img.shields.io/badge/-efficiency-red)
+   *Haoran Tan, Zeyu Zhang, Chen Ma, et al. ACL 2025 Findings.* [[Paper](https://arxiv.org/abs/2506.21605)] [[Dataset](https://github.com/import-myself/Membench)]
+   - Metrics: Memory Accuracy, Memory Recall, Memory Capacity, Memory Efficiency.
+
+4. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions** ![](https://img.shields.io/badge/-2,071_QA-lightgrey) ![](https://img.shields.io/badge/-2.1K_items-lightgrey) ![](https://img.shields.io/badge/-accurate_retrieval-green) ![](https://img.shields.io/badge/-test_time_learning-orange) ![](https://img.shields.io/badge/-long_range_understanding-purple) ![](https://img.shields.io/badge/-conflict_resolution-yellowgreen)
+   *Yuanzhe Hu, Yu Wang, Julian McAuley. arXiv 2025.* [[Paper](https://arxiv.org/abs/2507.05257)] [[Dataset](https://www.modelscope.cn/datasets/AI-ModelScope/MemoryAgentBench)] [[Github](https://github.com/HUST-AI-HYZ/MemoryAgentBench)]
+   - Metrics: Accuracy, Exact Match/SubEM, Recall@5, F1 / LLM-as-a-judge across accurate retrieval, test-time learning, long-range understanding, and conflict-resolution accuracy.
+
+5. **RULER: What's the Real Context Size of Your Long-Context Language Models?** ![](https://img.shields.io/badge/-13_tasks-lightgrey) ![](https://img.shields.io/badge/-synthetic_benchmark-purple) ![](https://img.shields.io/badge/-needle_in_haystack-green) ![](https://img.shields.io/badge/-multi_hop_tracing-blue) ![](https://img.shields.io/badge/-aggregation-orange)
+   *Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, et al. COLM 2024.* [[Paper](https://arxiv.org/abs/2404.06654)] [[Dataset](https://github.com/NVIDIA/RULER)]
+   - Metrics: Accuracy and effective context length over retrieval, multi-hop tracing, aggregation, and QA tasks.
+
+[⬆️top](#table-of-contents)
+
+### Robustness
+
+Benchmarks primarily stressing conflict resolution, temporal updates, failure recovery, catastrophic forgetting, or long-horizon stability.
+
+1. **StreamBench: Towards Benchmarking Continuous Improvement of Language Agents** ![](https://img.shields.io/badge/-9.7K_instances-lightgrey) ![](https://img.shields.io/badge/-streaming_learning-orange) ![](https://img.shields.io/badge/-conflict_resolution-yellowgreen) ![](https://img.shields.io/badge/-online_feedback-green)
+   *Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, et al. NeurIPS 2024.* [[Paper](https://arxiv.org/abs/2406.08747)] [[Dataset](https://github.com/stream-bench/stream-bench)] [[Project](https://stream-bench.github.io/)]
+   - Metrics: Execution accuracy, Pass@1, API-call accuracy, diagnostic accuracy, exact match.
+
+2. **LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners** ![](https://img.shields.io/badge/-1.4K_tasks-lightgrey) ![](https://img.shields.io/badge/-database-purple) ![](https://img.shields.io/badge/-operating_system-purple) ![](https://img.shields.io/badge/-knowledge_graph-purple) ![](https://img.shields.io/badge/-skill_transfer-green) ![](https://img.shields.io/badge/-forgetting_drop-orange)
+   *Junhao Zheng, Xidi Cai, Qiuke Li, et al. arXiv 2025.* [[Paper](https://arxiv.org/abs/2505.11942)] [[Dataset](https://github.com/caixd-220529/LifelongAgentBench)] [[Project](https://caixd-220529.github.io/LifelongAgentBench/)]
+   - Metrics: Task Success Rate, transfer success, retention, forgetting drop across database, operating-system, and knowledge-graph tasks.
+
+3. **MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks** ![](https://img.shields.io/badge/-766_tasks-lightgrey) ![](https://img.shields.io/badge/-5_subsets-lightgrey) ![](https://img.shields.io/badge/-multi_session_agentic_tasks-purple) ![](https://img.shields.io/badge/-interdependent_subtasks-yellowgreen) ![](https://img.shields.io/badge/-decision_memory-green)
+   *Zexue He, Yu Wang, Churan Zhi, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.16313)] [[Dataset](https://huggingface.co/datasets/ZexueHe/memoryarena)] [[Github](https://github.com/ZexueHe/MemoryArena)]
+   - Metrics: Task Success Rate (SR), Task Progress Score (PS) over bundled shopping, progressive search, group travel planning, formal-reasoning math, and formal-reasoning physics.
+
+[⬆️top](#table-of-contents)
+
+### Efficiency
+
+Benchmarks reporting operational cost, time, step ratio, token usage, or memory-system overhead in addition to task quality.
+
+1. **MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments** ![](https://img.shields.io/badge/-128_tasks-lightgrey) ![](https://img.shields.io/badge/-step_ratio-red) ![](https://img.shields.io/badge/-time_per_step-red) ![](https://img.shields.io/badge/-cost_per_step-red) ![](https://img.shields.io/badge/-mobile_GUI-purple)
+   *Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.06075)] [[Dataset](https://memgui-bench.github.io/)] [[Github](https://github.com/lgy0404/MemGUI-Bench)]
+   - Metrics: Step Ratio, Time per Step, Cost per Step, task completion latency.
+
+2. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents** ![](https://img.shields.io/badge/-53K_questions-lightgrey) ![](https://img.shields.io/badge/-65K_sessions-lightgrey) ![](https://img.shields.io/badge/-latency-red) ![](https://img.shields.io/badge/-capacity-orange) ![](https://img.shields.io/badge/-memory_overhead-yellowgreen)
+   *Haoran Tan, Zeyu Zhang, Chen Ma, et al. ACL 2025 Findings.* [[Paper](https://arxiv.org/abs/2506.21605)] [[Dataset](https://github.com/import-myself/Membench)]
+   - Metrics: Inference time, recall-efficiency trade-off, memory capacity degradation threshold.
+
+3. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions** ![](https://img.shields.io/badge/-2.1K_items-lightgrey) ![](https://img.shields.io/badge/-context_103K_to_1.44M-lightgrey) ![](https://img.shields.io/badge/-latency-red) ![](https://img.shields.io/badge/-fragmentation-orange) ![](https://img.shields.io/badge/-retrieval_cost-yellowgreen)
+   *Yuanzhe Hu, Yu Wang, Julian McAuley. arXiv 2025.* [[Paper](https://arxiv.org/abs/2507.05257)] [[Dataset](https://www.modelscope.cn/datasets/AI-ModelScope/MemoryAgentBench)] [[Github](https://github.com/HUST-AI-HYZ/MemoryAgentBench)]
+   - Metrics: Runtime, retrieval overhead, memory fragmentation, context-length stress cost.
 
 [⬆️top](#table-of-contents)
 