Expand benchmarks section in README.md #2

Open
Homy-Xu wants to merge 1 commit into OpenDataBox:main from Homy-Xu:patch-1

Conversation

@Homy-Xu Homy-Xu commented Jul 5, 2026

Added new benchmarks and metrics for evaluating agent memory systems, including accuracy, recall, robustness, and efficiency. Updated references and added descriptions for various benchmarks.

Comment thread README.md
Junhao Zheng, Xidi Cai, Qiuke Li, et al. *arXiv 2025*. [[Paper](https://arxiv.org/abs/2505.11942)]
- Evaluates sequential procedural skill transfer across structurally related database, operating system, and knowledge graph tasks (1,396 tasks sharing atomic skills)
2. **LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks** ![](https://img.shields.io/badge/-503_QA-lightgrey) ![](https://img.shields.io/badge/-single_doc_QA-blue) ![](https://img.shields.io/badge/-multi_doc_QA-blue) ![](https://img.shields.io/badge/-long_context-purple) ![](https://img.shields.io/badge/-structured_data-orange)
*Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. ACL 2025 / arXiv 2024.* [[Paper](https://arxiv.org/abs/2412.15204)] [[Dataset](https://huggingface.co/datasets/zai-org/LongBench-v2)] [[Github](https://github.com/THUDM/LongBench)]

Collaborator

Retain only ACL 2025 as the venue; drop "arXiv 2024".

2 participants