Thinkbench

A benchmark for autonomous coding agents, built by Thinkwright. 72 diverse, real-shaped tasks across five dimensions — not homogeneous puzzles, and not all greenfield. Each task is the kind of work people actually hand a coding agent.

The five dimensions

dimension	n	what it tests
implement	15	build a project from scratch from a spec (greenfield)
bug-fix	15	find and fix a planted bug in working code
feature-add	15	add a capability to existing code without breaking it
repair-to-green	15	a library ships with multiple interacting bugs + a failing test suite; make it green
ambiguous-spec	12	observed, not scored — a deliberately vague brief; we watch how each model interprets it

The first four are graded by a held-out behavioral grader. The fifth is the interesting one: real specs are underspecified, so instead of forcing a pass/fail we observe how a model fills the gaps — its assumptions, completeness, and where two models diverge.

Methodology

Each model runs each task in a fresh workspace through an autonomous agent loop with read_file / write_file / run_command tools. It sees only brief.txt (and setup/ starter code for non-greenfield tasks). When it stops, the held-out grader grade.py is dropped in and run with a timeout; it prints a JSON scorecard.
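To make the loop concrete, here is a minimal Python sketch of a dispatcher for the three tools named above. The actual runner is written in Rust and its internals are not reproduced here, so the function name `run_tool`, the argument shapes, and the 60-second command limit are all assumptions for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def run_tool(workspace: Path, name: str, args: dict) -> str:
    """Dispatch one tool call inside a workspace (hypothetical signature)."""
    if name == "read_file":
        return (workspace / args["path"]).read_text()
    if name == "write_file":
        target = workspace / args["path"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(args["content"])
        return f"wrote {len(args['content'])} characters"
    if name == "run_command":
        # Assumed 60 s limit; the real runner's limit is not stated here.
        done = subprocess.run(
            args["command"], shell=True, cwd=workspace,
            capture_output=True, text=True, timeout=60,
        )
        return done.stdout + done.stderr
    return f"unknown tool: {name}"


# Small demo in a throwaway directory standing in for a fresh workspace.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    run_tool(ws, "write_file", {"path": "pkg/__init__.py", "content": "X = 1\n"})
    print(run_tool(ws, "read_file", {"path": "pkg/__init__.py"}))
```

An agent loop would call something like this once per tool request from the model and feed the returned string back as the tool result.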

Fixed-denominator scoring — an empty or broken solution scores 0.0, never a misleading partial. A per-task score is passed / total checks.
Graders are held out — never shown to the model; a reference/ solution is included only to self-test each grader (every grader scores its reference 1.0).
Hard bar — graded tasks are calibrated so a naive solution lands ~0.4–0.8 and only a careful one reaches 1.0 (subtle, often interacting, bugs and edge-case graders).
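The fixed-denominator rule can be sketched as follows. This is an illustration, not the project's scoring code: `task_score` is a hypothetical helper, and it assumes the number of checks is known ahead of time and that the scorecard carries a `passed` field.

```python
import json


def task_score(scorecard_text: str, total_checks: int) -> float:
    """Return passed / total with a fixed denominator.

    total_checks comes from the grader's known check count, not from the
    (possibly missing or malformed) output, so a broken run scores 0.0.
    """
    try:
        passed = int(json.loads(scorecard_text)["passed"])
    except (ValueError, KeyError, TypeError):
        # Empty output, invalid JSON, or no "passed" field: score zero.
        return 0.0
    # Clamp so a misreported count can never push the score outside [0, 1].
    return max(0, min(passed, total_checks)) / total_checks


print(task_score('{"passed": 3, "total": 4}', 4))  # 0.75
print(task_score("", 4))                           # 0.0
```

Because the denominator never shrinks, a solution that crashes halfway cannot look better than one that fails the same checks cleanly.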

Results — MiniMax M3 vs GLM 5.2

First head-to-head (both at thinking-disabled parity, 3 trials/task, cache-aware cost). Full tables, both human- and machine-readable, are in results/minimax-m3-vs-glm-5.2/ (RESULTS.md + results.json). Each benchmark run gets its own folder under results/.

model	full-pass (60 graded)	mean score	avg latency	total cost
GLM 5.2	92%	0.976	80s	$18.47
MiniMax M3	84%	0.961	45s	$6.67

By dimension — modify-existing-code is largely solved by both; the separation is in greenfield builds:

dimension	M3	GLM
implement	0.844 (40% solve)	0.902 (67%)
bug-fix	0.999	1.000
feature-add	1.000	1.000
repair-to-green	1.000	1.000
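As a quick consistency check, the per-dimension means can be averaged and compared with the overall mean scores in the first table. The sketch assumes the overall mean is an unweighted average over the four graded dimensions, which holds if every task counts equally (15 tasks each).

```python
# Per-dimension mean scores copied from the table above.
m3 = [0.844, 0.999, 1.000, 1.000]
glm = [0.902, 1.000, 1.000, 1.000]

# Assumption: the four graded dimensions weigh equally (15 tasks each).
m3_mean = sum(m3) / len(m3)
glm_mean = sum(glm) / len(glm)

print(f"M3  {m3_mean:.4f}")
print(f"GLM {glm_mean:.4f}")
```

Both values land within rounding distance of the reported 0.961 and 0.976; the small residual is expected because the per-dimension figures are themselves rounded to three decimals.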

Takeaway: GLM 5.2 is the more reliable solver (higher full-pass rate, zero package-delivery failures, perfect on modify-code) and earns its premium on hard greenfield work. MiniMax M3 is ~2.8× cheaper and ~1.8× faster, and effectively tied on modify-code work (0.999–1.000 across bug-fix/feature-add/repair-to-green). That makes it the value pick for the bulk of real work, with one soft spot on hard greenfield builds.
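The cost and speed ratios in the takeaway follow directly from the results table. The arithmetic, using only the reported totals:

```python
glm_cost, m3_cost = 18.47, 6.67      # total cost in USD, from the results table
glm_latency, m3_latency = 80, 45     # average latency in seconds

cost_ratio = glm_cost / m3_cost
speed_ratio = glm_latency / m3_latency

print(f"cost ratio  {cost_ratio:.2f}x")   # cost ratio  2.77x
print(f"speed ratio {speed_ratio:.2f}x")  # speed ratio 1.78x
```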

Total spend to produce this benchmark and its results: $48.88 across 605 metered runs (two full runs + iteration/correction overhead).

Evaluate your own model

Each task is self-contained and standard-library-only.

Give your agent tasks/<slug>/brief.txt (and copy in tasks/<slug>/setup/ if present) in a fresh working directory.
Have it produce a package importable as <slug> (the brief says so).
From that directory, run the held-out grader (copy tasks/<slug>/grade.py in first, since it is never shown to the agent): python3 grade.py. It prints a JSON scorecard with score, passed, total, and per-check detail.
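The grading step above could be scripted roughly like this. It is a sketch under stated assumptions: `grade_workspace` is a hypothetical helper, the 120-second timeout is arbitrary, and the grader is assumed to print nothing but the JSON scorecard on stdout.

```python
import json
import shutil
import subprocess
import sys
from pathlib import Path


def grade_workspace(task_dir: Path, workspace: Path, timeout_s: int = 120) -> dict:
    """Copy the held-out grader into the workspace, run it, parse its JSON."""
    shutil.copy(task_dir / "grade.py", workspace / "grade.py")
    try:
        done = subprocess.run(
            [sys.executable, "grade.py"], cwd=workspace,
            capture_output=True, text=True, timeout=timeout_s,
        )
        return json.loads(done.stdout)
    except (subprocess.TimeoutExpired, ValueError):
        # Timeout or non-JSON output: report a zero score rather than crash.
        return {"score": 0.0, "passed": 0, "error": "grader failed"}
```

A call such as `grade_workspace(Path("tasks/<slug>"), Path("work"))`, with the placeholder filled in, would return the parsed scorecard as a dict.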

For ambiguous-spec tasks there is no grader — read what the model built and compare.

Run the included runner

The Rust runner can drive OpenAI-compatible chat-completions models through the same agent loop used for this run.

cd runner
cargo run -- --list
FIREWORKS_API_KEY=... THINKBENCH_TRIALS=1 cargo run -- minimax-m3 glm-5.2

Raw runner output is written under results/runs/<run_id>/ and is ignored by git. Aggregate a raw run with:

python3 tools/analyze_run.py results/runs/<run_id> <run-name>

Safety note

This benchmark intentionally evaluates autonomous coding agents with shell access. The runner clears the environment for commands and graders so provider keys are not passed into child processes, and saved workspaces skip symlinks instead of following them. That is not a full sandbox. Run untrusted models and task suites on a disposable machine or container with no credentials, no private source trees, and no sensitive files in the workspace.
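The environment-clearing idea can be illustrated in Python, although the real runner is written in Rust and may differ in detail. `run_without_secrets` and `FAKE_PROVIDER_KEY` are made-up names for this sketch, and a fully empty environment is assumed to be acceptable on the host (it generally is on Linux and macOS, but not on Windows).

```python
import os
import subprocess
import sys


def run_without_secrets(argv: list[str], cwd: str = ".") -> str:
    """Run a child process with an empty environment (assumed policy)."""
    done = subprocess.run(argv, cwd=cwd, env={}, capture_output=True, text=True)
    return done.stdout


os.environ["FAKE_PROVIDER_KEY"] = "secret"  # stand-in for a real API key
probe = "import os; print(os.environ.get('FAKE_PROVIDER_KEY'))"
print(run_without_secrets([sys.executable, "-c", probe]))  # prints None
```

This only keeps variables out of child processes. Files, network access, and anything else reachable from the workspace remain exposed, which is why the disposable-machine advice above still applies.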

Layout

tasks/<slug>/      brief.txt + grade.py + setup/ (where applicable) + reference/   (graded)
                   brief.txt only                                                  (ambiguous-spec)
manifest.json      machine-readable task index (slug, type, brief, graded behaviors)
TASKS.md           human-readable catalog grouped by dimension
results/<run>/     per-run results, one folder per benchmark run, e.g.
                     results/minimax-m3-vs-glm-5.2/RESULTS.md   (human: per-type + per-task)
                     results/minimax-m3-vs-glm-5.2/results.json (machine-readable)
tools/             build_dataset.py (regenerate the catalog), analyze_run.py (score a run)
LICENSE / NOTICE   Apache-2.0, © Thinkwright

To regenerate the root task catalog and a git-ignored downloadable bundle:

python3 tools/build_dataset.py
