Summary
evaluation/src/agent_runner.py::_wrap_prompt hard-codes a Chinese instruction head/tail around every task prompt, regardless of the dataset language. When running the English split (task_lite_clean_en + filesys_en), each task is therefore sent to the agent as a bilingual prompt: Chinese wrapper → English task → English [Note] → Chinese wrapper.
I'd like to confirm whether the public leaderboard numbers were produced with this same Chinese wrapper, since it appears to bias models toward Chinese output.
Where
evaluation/src/agent_runner.py, _wrap_prompt (identical on main and master):
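head = (
    # "[Important requirement 1: working directory]"
    "【重要要求 1:工作目录】\n"
    # "The working directory this test round may access is: <abs path>"
    f"本轮测试允许访问的工作目录是:{os.path.abspath(work_dir)}\n"
    # "Only read/write files under this directory via relative paths; no access outside it."
    "你只能在该目录下使用相对路径读写文件;禁止访问工作目录以外的位置。\n"
    ...
)
tail = (
    # "[Important requirement 2: output path list]"
    "\n【重要要求 2:输出路径列表】\n"
    # "In the final step, output only a Python list (list[str])..."
    "在最后一步,请仅输出一个 Python 列表(list[str])...\n"
)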
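To check what the agent actually receives, a small script-mixing check can be run on the wrapped prompt. This is only a rough sketch: `cjk_ratio` and `is_mixed_language` are hypothetical helpers, not part of the repository, and the 5%/95% thresholds are arbitrary assumptions.

```python
import re

# CJK punctuation, basic CJK Unified Ideographs, and fullwidth forms.
_CJK = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]")


def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if _CJK.match(c)) / len(chars)


def is_mixed_language(prompt: str, low: float = 0.05, high: float = 0.95) -> bool:
    """Flag prompts that are neither almost fully CJK nor almost fully non-CJK.

    The thresholds are illustrative assumptions, not tuned values.
    """
    return low < cjk_ratio(prompt) < high
```

Running `is_mixed_language` on the output of `_wrap_prompt` for an English task should flag the bilingual case described in the Summary, while a fully English prompt should not be flagged.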
Questions
- Were the leaderboard results (e.g. ClaudeCode + GLM-5.1 = 52.6 on Lite) produced with this same Chinese wrapper applied to English tasks, or with an English wrapper?
- For the English dataset, is _wrap_prompt intended to emit English instructions? Is there a language switch we're missing, or should the wrapper be localized to match the task language?
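If localizing the wrapper is the intended fix, one possible shape is sketched below. Everything here is an assumption for illustration: `wrap_prompt`, the `lang` parameter, and the `_WRAPPERS` table do not exist in the repository, the English text is my own translation of the lines quoted above, and the parts of the real wrapper that are elided in the quote are simply left out.

```python
import os

# Hypothetical per-language templates. Only the lines quoted in this issue are
# reproduced; the elided parts of the real wrapper are intentionally omitted.
_WRAPPERS = {
    "zh": {
        "head": (
            "【重要要求 1:工作目录】\n"
            "本轮测试允许访问的工作目录是:{work_dir}\n"
            "你只能在该目录下使用相对路径读写文件;禁止访问工作目录以外的位置。\n"
        ),
        "tail": "\n【重要要求 2:输出路径列表】\n",
    },
    "en": {
        # Unofficial translation, an assumption rather than the maintainers' wording.
        "head": (
            "[Important requirement 1: working directory]\n"
            "The working directory you may access in this test round is: {work_dir}\n"
            "Only read and write files under this directory using relative paths; "
            "do not access locations outside the working directory.\n"
        ),
        "tail": "\n[Important requirement 2: output path list]\n",
    },
}


def wrap_prompt(task_prompt: str, work_dir: str, lang: str = "zh") -> str:
    """Wrap a task prompt with head/tail instructions in the requested language."""
    if lang not in _WRAPPERS:
        raise ValueError(f"unsupported wrapper language: {lang!r}")
    wrapper = _WRAPPERS[lang]
    head = wrapper["head"].format(work_dir=os.path.abspath(work_dir))
    return head + task_prompt + wrapper["tail"]
```

Defaulting `lang` to `"zh"` would keep existing Chinese runs unchanged, while English runs could pass `lang="en"` so the agent sees a single-language prompt.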