fix: add Python code extraction from tool_calls in PythonCodeExtraction validator by akihikokuroda · Pull Request #1353 · generative-computing/mellea

akihikokuroda · 2026-06-26T21:09:58Z

Pull Request

Issue

Description

Extends the PythonCodeExtraction validator to extract and analyze Python code from LLM tool_calls, not just response body text. When tool_use=True is enabled in model requests, Python code is often placed in tool arguments—this change ensures such code is captured and validated.

Changes

New function _extract_code_from_tool_call() in mellea/stdlib/requirements/python_reqs.py
- Extracts Python code from ModelToolCall objects with intelligent tool name matching
- Supports tools named "python", "python_tool", or "python" + execution keywords (executor, interpreter, runner)
- Tries common code field names in priority order: code, script, command, source
- Includes detailed debug logging for filtering and extraction diagnostics
Enhanced _has_python_code_listing() to check both text blocks and tool_calls
- Collects code from both sources with priority scoring (+10 for text, +5 for tool_calls)
- Improves clarity in docstring and comments about scoring rationale

Testing

Tests added to the respective file if code was changed
New code has 100% coverage if code was added
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)
Comprehensive test coverage in test/stdlib/requirements/test_python_tools.py
- Tests for tool name pattern matching (Python, python_tool, python_executor, etc.)
- Tests for field extraction (code, script, command, source)
- Tests for whitespace handling and non-string field rejection
- Tests for integration with ImportRestrictions and OutputSizeLimit validators
- Tests for cross-requirement validation scenarios

Attribution

AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

Component
Requirement
Sampling Strategy
Tool

NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

psschwei

Reviewed the diff against the live python_tool() / ModelToolCall definitions and ran the suite locally (all 248 requirements tests pass, ruff/mypy clean). This is solid and correct for the real path — the "python" name-match plus code-first field priority lines up exactly with what python_tool() produces (interpreter.py:1045, run_python(code: str)).

A few minor, non-blocking nits left inline. Approving with nits.

psschwei · 2026-06-26T21:58:48Z

+        pass
+    elif tool_name_lower == "python_tool":
+        # "python_tool" from mellea is a strong match
+        pass


Dead special-case (minor). No mellea tool is ever named "python_tool". The factory is the function python_tool(), but it sets name="python" by default (interpreter.py:1045). So this branch never fires for real mellea output, and the comment "from mellea" is inaccurate. The standalone "python" branch above already covers the live case. Suggest dropping this elif, or folding python_tool into the keyword set below if you want belt-and-suspenders coverage of custom names.

"python_tool" is taken out from the condition. Related tests are upated.

psschwei · 2026-06-26T21:58:48Z

+    # Only attempt extraction from tools that explicitly handle Python
+    if not hasattr(tool_call, "name"):
+        logger.debug("Tool call missing 'name' attribute, skipping extraction")
+        return None


Unreachable guard (minor). ModelToolCall is a dataclass with required field name: str (mellea/core/base.py:1364), so hasattr(tool_call, "name") is always True — this branch is dead code that can't be covered (relevant since the 100% coverage box is unchecked). Same applies to the hasattr(tool_call, "args") and tool_call.args is None checks below (args: Mapping[str, Any], required and non-optional). Either drop them, or — if you want runtime resilience against duck-typed inputs — loosen the param type annotation to justify keeping them.

Took it out.

psschwei · 2026-06-26T21:58:48Z

+
+    # Step 2: Fallback to tool_calls if no text code found
+    if not all_blocks and last_output.tool_calls:
+        for tool_name, tool_call in last_output.tool_calls.items():


Trivial: tool_name (the dict key) is unused here — _extract_code_from_tool_call reads tool_call.name instead. In practice the key always equals tool_call.name everywhere the dict is built (backends/utils.py:150, backends/ollama.py:687, etc.), so this is correct, but for tool_call in last_output.tool_calls.values(): would be cleaner.

tool_name is taken out

psschwei · 2026-06-26T21:58:48Z

+                # Score tool_call code with +5 bonus (vs +10 for text python blocks).
+                # This keeps tool_calls below text-visible code in priority, while above
+                # generic text blocks. Text blocks are preferred since they're visible to users.
+                all_blocks.append((extracted, _score_code_block(extracted) + 5))


Test gap (optional). The +5 comment makes a specific ordering claim — tool-call code sits below text-python (+10) and above generic-text (no bonus). test_text_code_takes_priority_over_tool_calls covers the text-vs-tool case, but nothing directly exercises the tool-vs-generic-text ordering. Consider a test that asserts the relative ranking, since the scoring constants are easy to drift.

2 tests are added for this.

psschwei · 2026-06-26T21:58:48Z

+        content = last_output.value
+
+        # Look for code blocks with python specifier
+        import re


Pre-existing nit (carried over, not introduced here): import re inside the function. Since you're already restructuring this block, a module-level import would be slightly cleaner. Non-blocking.

Import is moved.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda · 2026-06-27T00:19:01Z

All comments are addressed. Thanks!

psschwei

small nit on the comment, otherwise lgtm

psschwei · 2026-06-29T12:26:48Z

+                # Score tool_call code with +5 bonus (vs +10 for text python blocks).
+                # This keeps tool_calls below text-visible code in priority, while above
+                # generic text blocks. Text blocks are preferred since they're visible to users.


The "while above generic text blocks" claim is incorrect. Tool_calls are only scored when text produces zero blocks (the if not all_blocks and last_output.tool_calls: guard on line 176). So a generic text block scoring +0 still wins over a tool_call scoring +5, because tool_calls are never scored when any text block exists. The +5 only arbitrates among tool_calls themselves; it never competes with text scores.

This makes the behavior two-tier (any text block > tool_calls), not the three-tier ordering the comment describes. test_generic_text_blocks_fallback_tier actually confirms the two-tier behavior (generic import os beats tool_call import sys), so it contradicts this comment.

Suggest rewording to match the real gating:

Suggested change

# Score tool_call code with +5 bonus (vs +10 for text python blocks).

# This keeps tool_calls below text-visible code in priority, while above

# generic text blocks. Text blocks are preferred since they're visible to users.

# Tool_calls are a fallback: only reached when text yields no code blocks

# (see the `if not all_blocks` guard above). The +5 here only orders

# multiple tool_calls relative to each other; it never competes with text

# scores. Text is always preferred since it's visible to users.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

jakelorocco

Checking tools / tool inputs seems particularly fragile. I think we ought to reason out the approach Mellea should take to these situations. Should we provide an interface for requirement checking tool requests / tool request outputs?

This seems reasonable for python-based tools, but we should figure out a larger approach and take an opinionated stance on what Mellea does.

akihikokuroda · 2026-06-29T14:19:29Z

Jake. You're right that the heuristic field matching feels fragile as a long-term solution.

Here's how we're thinking about the scope:

Current implementation (this PR): We're scoping this to Python requirements only, using conservative heuristics:

Only match tools with "python" in their name (case-insensitive)
Try common field names in priority order: code, script, command, source
Fall back to tool_calls only if no text code is found
Text extraction always takes priority (user-visible content)

This is good enough for the MVP because Python requirements are the only place we currently extract code from tool_calls, and the pattern matches the common tools we see (e.g., Claude Code's Python tool).

Larger design question: You're right that we should establish an opinionated interface for how Mellea validates tool requests/outputs. This is worth its own issue because it affects:

Other tool types (shell, SQL, etc.)
Tool capability discovery
Whether tools should advertise their interface (registry pattern, metadata, etc.)

Next steps: Let's create a follow-up issue to design a proper tool capability interface. In the meantime, this PR is Python-requirements-specific and documents the heuristic approach in the docstring.

akihikokuroda · 2026-06-29T15:01:27Z

@jakelorocco I opened the follow up issue #1356

akihikokuroda added 2 commits June 26, 2026 15:42

add tool use code extraction

2618bca

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

self review

d2fe734

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from a team as a code owner June 26, 2026 21:09

akihikokuroda requested review from jakelorocco, markstur and psschwei June 26, 2026 21:09

akihikokuroda changed the title ~~bug: add Python code extraction from tool_calls in PythonCodeExtraction validator~~ fix: add Python code extraction from tool_calls in PythonCodeExtraction validator Jun 26, 2026

github-actions Bot added the bug Something isn't working label Jun 26, 2026

psschwei reviewed Jun 26, 2026

View reviewed changes

review comments

7177d27

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from psschwei June 27, 2026 00:18

psschwei approved these changes Jun 29, 2026

View reviewed changes

review comment

256bd27

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

jakelorocco reviewed Jun 29, 2026

View reviewed changes

akihikokuroda mentioned this pull request Jun 29, 2026

Design opinionated interface for tool capability validation in requirements #1356

Closed

13 tasks

akihikokuroda requested a review from jakelorocco June 29, 2026 15:01

jakelorocco approved these changes Jun 29, 2026

View reviewed changes

akihikokuroda added this pull request to the merge queue Jun 29, 2026

Merged via the queue into generative-computing:main with commit 514bd9d Jun 29, 2026
9 checks passed

akihikokuroda deleted the pythonextraction branch June 29, 2026 16:16

-                # Score tool_call code with +5 bonus (vs +10 for text python blocks).
-                # This keeps tool_calls below text-visible code in priority, while above
-                # generic text blocks. Text blocks are preferred since they're visible to users.
+                # Tool_calls are a fallback: only reached when text yields no code blocks
+                # (see the `if not all_blocks` guard above). The +5 here only orders
+                # multiple tool_calls relative to each other; it never competes with text
+                # scores. Text is always preferred since it's visible to users.

Uh oh!

Conversation

akihikokuroda commented Jun 26, 2026

Pull Request

Issue

Description

Testing

Attribution

Adding a new component, requirement, sampling strategy, or tool?

Uh oh!

psschwei left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akihikokuroda commented Jun 27, 2026

Uh oh!

psschwei left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jakelorocco left a comment

Choose a reason for hiding this comment

Uh oh!

akihikokuroda commented Jun 29, 2026

Uh oh!

akihikokuroda commented Jun 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants