98 changes: 80 additions & 18 deletions README.md
@@ -113,29 +113,91 @@ Systems that package memory into complex, multi-part data containers combining u

The following benchmarks are used to evaluate agent memory systems, covering task effectiveness, retrieval fidelity, update robustness, long-horizon stability, and operational cost.

1. **LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents**
Adyasha Maharana, Dong-Ho Lee, Sergey Turishcheva, et al. *ACL 2024*. [[Paper](https://arxiv.org/abs/2402.10790)]
- Long-conversation QA benchmark testing episodic, temporal, open-domain, and single-hop memory over multi-turn interactions (50 multi-modal chats; 9,209 tokens and 304 turns avg.)
### Accuracy

2. **LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory**
Di Wu, Hongwei Wang, Wenhao Yu, et al. *ICLR 2025*. [[Paper](https://arxiv.org/abs/2410.10813)]
- Multi-session long-memory benchmark evaluating cross-session QA and temporal knowledge updates (500 QA pairs; up to 1.5M tokens)
Benchmarks primarily reporting exact answer correctness, task success, tool-call correctness, or state consistency.

3. **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks**
Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. *ACL 2025 / arXiv 2024*. [[Paper](https://arxiv.org/abs/2412.15204)]
- Extreme long-context benchmark with 503 multiple-choice questions spanning 8K to 2M-word contexts, used for short, medium, and long context-length stability analysis
1. **HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering** ![](https://img.shields.io/badge/-113K_QA-lightgrey) ![](https://img.shields.io/badge/-multi_hop_QA-blue) ![](https://img.shields.io/badge/-explainable_reasoning-yellowgreen) ![](https://img.shields.io/badge/-supporting_facts-orange)
*Zhilin Yang, Peng Qi, Saizheng Zhang, et al. EMNLP, 2018.* [[Paper](https://arxiv.org/abs/1809.09600)] [[Dataset](https://hotpotqa.github.io/)] [[Github](https://github.com/hotpotqa/hotpot)]
- Metrics: Exact Match, F1, Supporting Fact EM/F1, Joint EM/F1.

4. **LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners**
Junhao Zheng, Xidi Cai, Qiuke Li, et al. *arXiv 2025*. [[Paper](https://arxiv.org/abs/2505.11942)]
- Evaluates sequential procedural skill transfer across structurally related database, operating system, and knowledge graph tasks (1,396 tasks sharing atomic skills)
2. **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks** ![](https://img.shields.io/badge/-503_QA-lightgrey) ![](https://img.shields.io/badge/-single_doc_QA-blue) ![](https://img.shields.io/badge/-multi_doc_QA-blue) ![](https://img.shields.io/badge/-long_context-purple) ![](https://img.shields.io/badge/-structured_data-orange)
*Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. ACL 2025 / arXiv 2024.* [[Paper](https://arxiv.org/abs/2412.15204)] [[Dataset](https://huggingface.co/datasets/zai-org/LongBench-v2)] [[Github](https://github.com/THUDM/LongBench)]

Review comment (Collaborator): Only retain ACL 2025

- Metrics: Accuracy over single-document QA, multi-document QA, long in-context learning, long-dialogue history, code repository, and long structured data.

5. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents**
Haoran Tan, Zeyu Zhang, Chen Ma, et al. *ACL 2025 (Findings)*. [[Paper](https://arxiv.org/abs/2506.21605)]
- Measures memory capabilities across different abstraction levels (factual vs. reflective) and noise conditions, with stress tests up to 100K sessions
3. **MuSiQue: Multihop Questions via Single-hop Question Composition** ![](https://img.shields.io/badge/-49.6K_QA-lightgrey) ![](https://img.shields.io/badge/-multi_hop_QA-blue) ![](https://img.shields.io/badge/-connected_reasoning-yellowgreen) ![](https://img.shields.io/badge/-answerability-orange)
*Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal. TACL, 2022.* [[Paper](https://arxiv.org/abs/2108.00573)] [[Dataset](https://github.com/stonybrooknlp/musique)]
- Metrics: Answer F1, Support F1, An+Sf, Sp+Sf.

6. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions**
Yuanzhe Hu, Yu Wang, Julian McAuley. *arXiv 2025*. [[Paper](https://arxiv.org/abs/2507.05257)]
- Tests four core competencies: accurate retrieval, test-time learning, long-range understanding, and selective forgetting across 14 datasets with context lengths ranging from 103K to 1.44M tokens
4. **Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents** ![](https://img.shields.io/badge/-2K_sessions-lightgrey) ![](https://img.shields.io/badge/-400_tasks-lightgrey) ![](https://img.shields.io/badge/-tool_use-purple) ![](https://img.shields.io/badge/-memory_action_alignment-green) ![](https://img.shields.io/badge/-parameter_grounding-orange)
*Yiting Shen, Kun Li, Wei Zhou, Songlin Hu. ACL 2026 / arXiv 2026.* [[Paper](https://arxiv.org/abs/2601.19935)] [[Dataset](https://anonymous.4open.science/r/Mem2ActBench-29AC/)] [[Github](https://github.com/Cantaloupe-M/Mem2ActBench)]
- Metrics: Parameter-level F1, BLEU-1, Tool Accuracy.

5. **MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments** ![](https://img.shields.io/badge/-128_tasks-lightgrey) ![](https://img.shields.io/badge/-26_apps-lightgrey) ![](https://img.shields.io/badge/-mobile_GUI-purple) ![](https://img.shields.io/badge/-cross_app_workflow-green) ![](https://img.shields.io/badge/-progressive_judge-orange)
*Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.06075)] [[Dataset](https://memgui-bench.github.io/)] [[Github](https://github.com/lgy0404/MemGUI-Bench)]
- Metrics: Pass@k, task success rate, memory-task proficiency ratio, staged LLM-as-a-judge result.
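
Several of the benchmarks above report Exact Match and token-level F1. As a rough illustration of how these two scores are commonly computed, the sketch below uses SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The function names `normalize`, `exact_match`, and `token_f1` are illustrative assumptions, not taken from any benchmark's official scorer; reported numbers should come from each benchmark's own evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Convention assumed here: two empty answers match, one empty answer does not.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("Paris, France", "Paris")` gives precision 0.5 and recall 1.0, so the F1 is 2/3, while `exact_match` on the same pair is 0.0.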

[⬆️top](#table-of-contents)

### Recall

Benchmarks primarily measuring whether relevant facts, evidence, memories, or long-context needles are retrieved and used.

1. **LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents** ![](https://img.shields.io/badge/-1.9K_QA-lightgrey) ![](https://img.shields.io/badge/-10_dialogues-lightgrey) ![](https://img.shields.io/badge/-long_conversation-green) ![](https://img.shields.io/badge/-temporal_QA-orange) ![](https://img.shields.io/badge/-multimodal_dialogue-purple)
*Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, et al. ACL 2024.* [[Paper](https://arxiv.org/abs/2402.17753)] [[Dataset](https://github.com/snap-research/LoCoMo)] [[Project](https://snap-research.github.io/locomo/)]
- Metrics: F1, Recall, ROUGE, FactScore, MM-Relevance, BLEU.

2. **LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory** ![](https://img.shields.io/badge/-500_queries-lightgrey) ![](https://img.shields.io/badge/-115K_to_1.5M_tokens-lightgrey) ![](https://img.shields.io/badge/-cross_session_QA-green) ![](https://img.shields.io/badge/-knowledge_update-orange) ![](https://img.shields.io/badge/-temporal_reasoning-yellowgreen)
*Di Wu, Hongwei Wang, Wenhao Yu, et al. ICLR 2025.* [[Paper](https://arxiv.org/abs/2410.10813)] [[Dataset](https://github.com/xiaowu0162/LongMemEval)] [[Project](https://xiaowu0162.github.io/long-mem-eval/)]
- Metrics: Accuracy / LLM-Judge over information extraction, multi-session reasoning, temporal reasoning, knowledge update, and abstention.

3. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents** ![](https://img.shields.io/badge/-53K_questions-lightgrey) ![](https://img.shields.io/badge/-65K_sessions-lightgrey) ![](https://img.shields.io/badge/-factual_memory-green) ![](https://img.shields.io/badge/-reflective_memory-yellowgreen) ![](https://img.shields.io/badge/-capacity_test-orange) ![](https://img.shields.io/badge/-efficiency-red)
*Haoran Tan, Zeyu Zhang, Chen Ma, et al. ACL 2025 Findings.* [[Paper](https://arxiv.org/abs/2506.21605)] [[Dataset](https://github.com/import-myself/Membench)]
- Metrics: Memory Accuracy, Memory Recall, Memory Capacity, Memory Efficiency.

4. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions** ![](https://img.shields.io/badge/-2,071_QA-lightgrey) ![](https://img.shields.io/badge/-2.1K_items-lightgrey) ![](https://img.shields.io/badge/-accurate_retrieval-green) ![](https://img.shields.io/badge/-test_time_learning-orange) ![](https://img.shields.io/badge/-long_range_understanding-purple) ![](https://img.shields.io/badge/-conflict_resolution-yellowgreen)
*Yuanzhe Hu, Yu Wang, Julian McAuley. arXiv 2025.* [[Paper](https://arxiv.org/abs/2507.05257)] [[Dataset](https://www.modelscope.cn/datasets/AI-ModelScope/MemoryAgentBench)] [[Github](https://github.com/HUST-AI-HYZ/MemoryAgentBench)]
- Metrics: Accuracy, Exact Match/SubEM, Recall@5, F1 / LLM-as-a-judge across accurate retrieval, test-time learning, long-range understanding, and conflict resolution.

5. **RULER: What's the Real Context Size of Your Long-Context Language Models?** ![](https://img.shields.io/badge/-13_tasks-lightgrey) ![](https://img.shields.io/badge/-synthetic_benchmark-purple) ![](https://img.shields.io/badge/-needle_in_haystack-green) ![](https://img.shields.io/badge/-multi_hop_tracing-blue) ![](https://img.shields.io/badge/-aggregation-orange)
*Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, et al. COLM 2024.* [[Paper](https://arxiv.org/abs/2404.06654)] [[Dataset](https://github.com/NVIDIA/RULER)]
- Metrics: Accuracy and effective context length over retrieval, multi-hop tracing, aggregation, and QA tasks.
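
Recall-style metrics such as Recall@5 ask what fraction of the relevant memory items appear among the top-k retrieved ones. The sketch below is a minimal illustration under assumed inputs: `retrieved` is a ranked list of hypothetical item IDs and `relevant` is the gold set. The helper name `recall_at_k` and the choice to return 0.0 for an empty gold set are assumptions, not any benchmark's official definition.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k retrieved items."""
    if not relevant:
        # Assumed convention: no gold items means nothing can be recalled.
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical session IDs ranked by a retriever; two sessions hold the evidence.
ranked = ["s3", "s7", "s1", "s9", "s4", "s2"]
gold = {"s1", "s2"}
score = recall_at_k(ranked, gold, k=5)  # only "s1" is inside the top 5
```

Here one of the two gold sessions falls inside the top five, so `score` is 0.5; raising `k` to 6 would bring it to 1.0.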

[⬆️top](#table-of-contents)

### Robustness

Benchmarks primarily stressing conflict resolution, temporal updates, failure recovery, catastrophic forgetting, or long-horizon stability.

1. **StreamBench: Towards Benchmarking Continuous Improvement of Language Agents** ![](https://img.shields.io/badge/-9.7K_instances-lightgrey) ![](https://img.shields.io/badge/-streaming_learning-orange) ![](https://img.shields.io/badge/-conflict_resolution-yellowgreen) ![](https://img.shields.io/badge/-online_feedback-green)
*Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, et al. NeurIPS 2024.* [[Paper](https://arxiv.org/abs/2406.08747)] [[Dataset](https://github.com/stream-bench/stream-bench)] [[Project](https://stream-bench.github.io/)]
- Metrics: Execution accuracy, Pass@1, API-call accuracy, diagnostic accuracy, exact match.

2. **LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners** ![](https://img.shields.io/badge/-1.4K_tasks-lightgrey) ![](https://img.shields.io/badge/-database-purple) ![](https://img.shields.io/badge/-operating_system-purple) ![](https://img.shields.io/badge/-knowledge_graph-purple) ![](https://img.shields.io/badge/-skill_transfer-green) ![](https://img.shields.io/badge/-forgetting_drop-orange)
*Junhao Zheng, Xidi Cai, Qiuke Li, et al. arXiv 2025.* [[Paper](https://arxiv.org/abs/2505.11942)] [[Dataset](https://github.com/caixd-220529/LifelongAgentBench)] [[Project](https://caixd-220529.github.io/LifelongAgentBench/)]
- Metrics: Task Success Rate, transfer success, retention, forgetting drop across database, operating-system, and knowledge-graph tasks.

3. **MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks** ![](https://img.shields.io/badge/-766_tasks-lightgrey) ![](https://img.shields.io/badge/-5_subsets-lightgrey) ![](https://img.shields.io/badge/-multi_session_agentic_tasks-purple) ![](https://img.shields.io/badge/-interdependent_subtasks-yellowgreen) ![](https://img.shields.io/badge/-decision_memory-green)
*Zexue He, Yu Wang, Churan Zhi, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.16313)] [[Dataset](https://huggingface.co/datasets/ZexueHe/memoryarena)] [[Github](https://github.com/ZexueHe/MemoryArena)]
- Metrics: Task Success Rate (SR), Task Progress Score (PS) over bundled shopping, progressive search, group travel planning, formal-reasoning math, and formal-reasoning physics.

[⬆️top](#table-of-contents)

### Efficiency

Benchmarks reporting operational cost, time, step ratio, token usage, or memory-system overhead in addition to task quality.

1. **MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments** ![](https://img.shields.io/badge/-128_tasks-lightgrey) ![](https://img.shields.io/badge/-step_ratio-red) ![](https://img.shields.io/badge/-time_per_step-red) ![](https://img.shields.io/badge/-cost_per_step-red) ![](https://img.shields.io/badge/-mobile_GUI-purple)
*Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, et al. arXiv 2026.* [[Paper](https://arxiv.org/abs/2602.06075)] [[Dataset](https://memgui-bench.github.io/)] [[Github](https://github.com/lgy0404/MemGUI-Bench)]
- Metrics: Step Ratio, Time per Step, Cost per Step, task completion latency.

2. **MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents** ![](https://img.shields.io/badge/-53K_questions-lightgrey) ![](https://img.shields.io/badge/-65K_sessions-lightgrey) ![](https://img.shields.io/badge/-latency-red) ![](https://img.shields.io/badge/-capacity-orange) ![](https://img.shields.io/badge/-memory_overhead-yellowgreen)
*Haoran Tan, Zeyu Zhang, Chen Ma, et al. ACL 2025 Findings.* [[Paper](https://arxiv.org/abs/2506.21605)] [[Dataset](https://github.com/import-myself/Membench)]
- Metrics: Inference time, recall-efficiency trade-off, memory capacity degradation threshold.

3. **MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions** ![](https://img.shields.io/badge/-2.1K_items-lightgrey) ![](https://img.shields.io/badge/-context_103K_to_1.44M-lightgrey) ![](https://img.shields.io/badge/-latency-red) ![](https://img.shields.io/badge/-fragmentation-orange) ![](https://img.shields.io/badge/-retrieval_cost-yellowgreen)
*Yuanzhe Hu, Yu Wang, Julian McAuley. arXiv 2025.* [[Paper](https://arxiv.org/abs/2507.05257)] [[Dataset](https://www.modelscope.cn/datasets/AI-ModelScope/MemoryAgentBench)] [[Github](https://github.com/HUST-AI-HYZ/MemoryAgentBench)]
- Metrics: Runtime, retrieval overhead, memory fragmentation, context-length stress cost.

[⬆️top](#table-of-contents)
