Skip to content

fix: add Python code extraction from tool_calls in PythonCodeExtraction validator #1353

Merged
akihikokuroda merged 4 commits into
generative-computing:mainfrom
akihikokuroda:pythonextraction
Jun 29, 2026
Merged

fix: add Python code extraction from tool_calls in PythonCodeExtraction validator #1353
akihikokuroda merged 4 commits into
generative-computing:mainfrom
akihikokuroda:pythonextraction

Conversation

@akihikokuroda

Copy link
Copy Markdown
Member

Pull Request

Issue

Fixes #1352

Description

Extends the PythonCodeExtraction validator to extract and analyze Python code from LLM tool_calls, not just response body text. When tool_use=True is enabled in model requests, Python code is often placed in tool arguments—this change ensures such code is captured and validated.

Changes

  • New function _extract_code_from_tool_call() in mellea/stdlib/requirements/python_reqs.py
    • Extracts Python code from ModelToolCall objects with intelligent tool name matching
    • Supports tools named "python", "python_tool", or "python" + execution keywords (executor, interpreter, runner)
    • Tries common code field names in priority order: code, script, command, source
    • Includes detailed debug logging for filtering and extraction diagnostics
  • Enhanced _has_python_code_listing() to check both text blocks and tool_calls
    • Collects code from both sources with priority scoring (+10 for text, +5 for tool_calls)
    • Improves clarity in docstring and comments about scoring rationale

Testing

  • Tests added to the respective file if code was changed

  • New code has 100% coverage if code was added

  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

  • Comprehensive test coverage in test/stdlib/requirements/test_python_tools.py

    • Tests for tool name pattern matching (Python, python_tool, python_executor, etc.)
    • Tests for field extraction (code, script, command, source)
    • Tests for whitespace handling and non-string field rejection
    • Tests for integration with ImportRestrictions and OutputSizeLimit validators
    • Tests for cross-requirement validation scenarios

Attribution

  • AI coding assistants used

Adding a new component, requirement, sampling strategy, or tool?

If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.

  • Component
  • Requirement
  • Sampling Strategy
  • Tool

NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda akihikokuroda requested a review from a team as a code owner June 26, 2026 21:09
@akihikokuroda akihikokuroda changed the title bug: add Python code extraction from tool_calls in PythonCodeExtraction validator fix: add Python code extraction from tool_calls in PythonCodeExtraction validator Jun 26, 2026
@github-actions github-actions Bot added the bug Something isn't working label Jun 26, 2026

@psschwei psschwei left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the diff against the live python_tool() / ModelToolCall definitions and ran the suite locally (all 248 requirements tests pass, ruff/mypy clean). This is solid and correct for the real path — the "python" name-match plus code-first field priority lines up exactly with what python_tool() produces (interpreter.py:1045, run_python(code: str)).

A few minor, non-blocking nits left inline. Approving with nits.

pass
elif tool_name_lower == "python_tool":
# "python_tool" from mellea is a strong match
pass

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead special-case (minor). No mellea tool is ever named "python_tool". The factory is the function python_tool(), but it sets name="python" by default (interpreter.py:1045). So this branch never fires for real mellea output, and the comment "from mellea" is inaccurate. The standalone "python" branch above already covers the live case. Suggest dropping this elif, or folding python_tool into the keyword set below if you want belt-and-suspenders coverage of custom names.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"python_tool" is taken out from the condition. Related tests are upated.

# Only attempt extraction from tools that explicitly handle Python
if not hasattr(tool_call, "name"):
logger.debug("Tool call missing 'name' attribute, skipping extraction")
return None

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreachable guard (minor). ModelToolCall is a dataclass with required field name: str (mellea/core/base.py:1364), so hasattr(tool_call, "name") is always True — this branch is dead code that can't be covered (relevant since the 100% coverage box is unchecked). Same applies to the hasattr(tool_call, "args") and tool_call.args is None checks below (args: Mapping[str, Any], required and non-optional). Either drop them, or — if you want runtime resilience against duck-typed inputs — loosen the param type annotation to justify keeping them.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took it out.


# Step 2: Fallback to tool_calls if no text code found
if not all_blocks and last_output.tool_calls:
for tool_name, tool_call in last_output.tool_calls.items():

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial: tool_name (the dict key) is unused here — _extract_code_from_tool_call reads tool_call.name instead. In practice the key always equals tool_call.name everywhere the dict is built (backends/utils.py:150, backends/ollama.py:687, etc.), so this is correct, but for tool_call in last_output.tool_calls.values(): would be cleaner.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tool_name is taken out

# Score tool_call code with +5 bonus (vs +10 for text python blocks).
# This keeps tool_calls below text-visible code in priority, while above
# generic text blocks. Text blocks are preferred since they're visible to users.
all_blocks.append((extracted, _score_code_block(extracted) + 5))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test gap (optional). The +5 comment makes a specific ordering claim — tool-call code sits below text-python (+10) and above generic-text (no bonus). test_text_code_takes_priority_over_tool_calls covers the text-vs-tool case, but nothing directly exercises the tool-vs-generic-text ordering. Consider a test that asserts the relative ranking, since the scoring constants are easy to drift.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 tests are added for this.

content = last_output.value

# Look for code blocks with python specifier
import re

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing nit (carried over, not introduced here): import re inside the function. Since you're already restructuring this block, a module-level import would be slightly cleaner. Non-blocking.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import is moved.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda akihikokuroda requested a review from psschwei June 27, 2026 00:18
@akihikokuroda

Copy link
Copy Markdown
Member Author

All comments are addressed. Thanks!

@psschwei psschwei left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit on the comment, otherwise lgtm

Comment on lines +180 to +182
# Score tool_call code with +5 bonus (vs +10 for text python blocks).
# This keeps tool_calls below text-visible code in priority, while above
# generic text blocks. Text blocks are preferred since they're visible to users.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "while above generic text blocks" claim is incorrect. Tool_calls are only scored when text produces zero blocks (the if not all_blocks and last_output.tool_calls: guard on line 176). So a generic text block scoring +0 still wins over a tool_call scoring +5, because tool_calls are never scored when any text block exists. The +5 only arbitrates among tool_calls themselves; it never competes with text scores.

This makes the behavior two-tier (any text block > tool_calls), not the three-tier ordering the comment describes. test_generic_text_blocks_fallback_tier actually confirms the two-tier behavior (generic import os beats tool_call import sys), so it contradicts this comment.

Suggest rewording to match the real gating:

Suggested change
# Score tool_call code with +5 bonus (vs +10 for text python blocks).
# This keeps tool_calls below text-visible code in priority, while above
# generic text blocks. Text blocks are preferred since they're visible to users.
# Tool_calls are a fallback: only reached when text yields no code blocks
# (see the `if not all_blocks` guard above). The +5 here only orders
# multiple tool_calls relative to each other; it never competes with text
# scores. Text is always preferred since it's visible to users.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

@jakelorocco jakelorocco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking tools / tool inputs seems particularly fragile. I think we ought to reason out the approach Mellea should take to these situations. Should we provide an interface for requirement checking tool requests / tool request outputs?

This seems reasonable for python-based tools, but we should figure out a larger approach and take an opinionated stance on what Mellea does.

@akihikokuroda

Copy link
Copy Markdown
Member Author

Jake. You're right that the heuristic field matching feels fragile as a long-term solution.

Here's how we're thinking about the scope:

Current implementation (this PR): We're scoping this to Python requirements only, using conservative heuristics:

  • Only match tools with "python" in their name (case-insensitive)
  • Try common field names in priority order: code, script, command, source
  • Fall back to tool_calls only if no text code is found
  • Text extraction always takes priority (user-visible content)

This is good enough for the MVP because Python requirements are the only place we currently extract code from tool_calls, and the pattern matches the common tools we see (e.g., Claude Code's Python tool).

Larger design question: You're right that we should establish an opinionated interface for how Mellea validates tool requests/outputs. This is worth its own issue because it affects:

  • Other tool types (shell, SQL, etc.)
  • Tool capability discovery
  • Whether tools should advertise their interface (registry pattern, metadata, etc.)

Next steps: Let's create a follow-up issue to design a proper tool capability interface. In the meantime, this PR is Python-requirements-specific and documents the heuristic approach in the docstring.

@akihikokuroda

Copy link
Copy Markdown
Member Author

@jakelorocco I opened the follow up issue #1356

@akihikokuroda akihikokuroda added this pull request to the merge queue Jun 29, 2026
Merged via the queue into generative-computing:main with commit 514bd9d Jun 29, 2026
9 checks passed
@akihikokuroda akihikokuroda deleted the pythonextraction branch June 29, 2026 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Support Python code extraction from tool_calls in PythonCodeExtraction validator

3 participants