Are Chinese instructions used for the leaderboard runs?

### Summary
`evaluation/src/agent_runner.py::_wrap_prompt` hard-codes a **Chinese** instruction head/tail around **every** task prompt, regardless of the dataset language. When running the **English** split (`task_lite_clean_en` + `filesys_en`), each task is therefore sent to the agent as a **bilingual** prompt: *Chinese wrapper → English task → English [Note] → Chinese wrapper*.

I'd like to confirm whether the public leaderboard numbers were produced with this same Chinese wrapper, since it appears to bias models toward Chinese output.

### Where
`evaluation/src/agent_runner.py`, `_wrap_prompt` (identical on `main` and `master`):

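```python
head = (
    # "[Important requirement 1: working directory]"
    "【重要要求 1：工作目录】\n"
    # "The working directory this test run may access is: <abs path>"
    f"本轮测试允许访问的工作目录是：{os.path.abspath(work_dir)}\n"
    # "You may only read and write files under this directory using relative
    #  paths; accessing locations outside the working directory is forbidden."
    "你只能在该目录下使用相对路径读写文件；禁止访问工作目录以外的位置。\n"
    ...
)
tail = (
    # "[Important requirement 2: output path list]"
    "\n【重要要求 2：输出路径列表】\n"
    # "In the final step, output only a Python list (list[str])..."
    "在最后一步，请仅输出一个 Python 列表（list[str]）...\n"
)
```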
### Questions

1. Were the leaderboard results (e.g. ClaudeCode + GLM-5.1 = 52.6 on Lite) produced with this same Chinese wrapper applied to English tasks, or with an English wrapper?
2. For the English dataset, is `_wrap_prompt` intended to emit English instructions? Is there a language switch we're missing, or should the wrapper be localized to match the task language?
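
To make question 2 concrete, here is a minimal sketch of what a localized wrapper could look like. It is not the repository's code: `wrap_prompt_localized`, the `lang` parameter, and the `_TEMPLATES` table are hypothetical names, and the English strings are my own translation of only the lines quoted above (the elided parts of the real wrapper are not reproduced).

```python
import os

# Hypothetical templates. Only the wrapper lines quoted in this issue are
# covered; the elided remainder of the real head/tail is intentionally omitted.
_TEMPLATES = {
    "zh": {
        "head": (
            "【重要要求 1：工作目录】\n"
            "本轮测试允许访问的工作目录是：{work_dir}\n"
            "你只能在该目录下使用相对路径读写文件；禁止访问工作目录以外的位置。\n"
        ),
        "tail": (
            "\n【重要要求 2：输出路径列表】\n"
            "在最后一步，请仅输出一个 Python 列表（list[str]）\n"
        ),
    },
    "en": {
        "head": (
            "[Important requirement 1: working directory]\n"
            "The working directory this test run may access is: {work_dir}\n"
            "You may only read and write files under this directory using "
            "relative paths; do not access locations outside it.\n"
        ),
        "tail": (
            "\n[Important requirement 2: output path list]\n"
            "In the final step, output only a Python list (list[str])\n"
        ),
    },
}


def wrap_prompt_localized(task_prompt: str, work_dir: str, lang: str = "zh") -> str:
    """Wrap a task prompt with instructions in the requested language."""
    if lang not in _TEMPLATES:
        raise ValueError(f"unsupported wrapper language: {lang!r}")
    template = _TEMPLATES[lang]
    # Resolve the directory the same way the quoted excerpt does.
    head = template["head"].format(work_dir=os.path.abspath(work_dir))
    return head + task_prompt + template["tail"]
```

With something like this, the English split could call `wrap_prompt_localized(task, work_dir, lang="en")` so the agent sees a single-language prompt, while the default keeps the current Chinese behavior. Whether the default should stay `"zh"` depends on how the leaderboard runs were produced, which is what question 1 asks.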
