fix: add Python code extraction from tool_calls in PythonCodeExtraction validator #1353
Conversation
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
psschwei
left a comment
There was a problem hiding this comment.
Reviewed the diff against the live python_tool() / ModelToolCall definitions and ran the suite locally (all 248 requirements tests pass, ruff/mypy clean). This is solid and correct for the real path — the "python" name-match plus code-first field priority lines up exactly with what python_tool() produces (interpreter.py:1045, run_python(code: str)).
A few minor, non-blocking nits left inline. Approving with nits.
| pass | ||
| elif tool_name_lower == "python_tool": | ||
| # "python_tool" from mellea is a strong match | ||
| pass |
There was a problem hiding this comment.
Dead special-case (minor). No mellea tool is ever named "python_tool". The factory is the function python_tool(), but it sets name="python" by default (interpreter.py:1045). So this branch never fires for real mellea output, and the comment "from mellea" is inaccurate. The standalone "python" branch above already covers the live case. Suggest dropping this elif, or folding python_tool into the keyword set below if you want belt-and-suspenders coverage of custom names.
There was a problem hiding this comment.
"python_tool" is taken out from the condition. Related tests are upated.
| # Only attempt extraction from tools that explicitly handle Python | ||
| if not hasattr(tool_call, "name"): | ||
| logger.debug("Tool call missing 'name' attribute, skipping extraction") | ||
| return None |
There was a problem hiding this comment.
Unreachable guard (minor). ModelToolCall is a dataclass with required field name: str (mellea/core/base.py:1364), so hasattr(tool_call, "name") is always True — this branch is dead code that can't be covered (relevant since the 100% coverage box is unchecked). Same applies to the hasattr(tool_call, "args") and tool_call.args is None checks below (args: Mapping[str, Any], required and non-optional). Either drop them, or — if you want runtime resilience against duck-typed inputs — loosen the param type annotation to justify keeping them.
|
|
||
| # Step 2: Fallback to tool_calls if no text code found | ||
| if not all_blocks and last_output.tool_calls: | ||
| for tool_name, tool_call in last_output.tool_calls.items(): |
There was a problem hiding this comment.
Trivial: tool_name (the dict key) is unused here — _extract_code_from_tool_call reads tool_call.name instead. In practice the key always equals tool_call.name everywhere the dict is built (backends/utils.py:150, backends/ollama.py:687, etc.), so this is correct, but for tool_call in last_output.tool_calls.values(): would be cleaner.
There was a problem hiding this comment.
tool_name is taken out
| # Score tool_call code with +5 bonus (vs +10 for text python blocks). | ||
| # This keeps tool_calls below text-visible code in priority, while above | ||
| # generic text blocks. Text blocks are preferred since they're visible to users. | ||
| all_blocks.append((extracted, _score_code_block(extracted) + 5)) |
There was a problem hiding this comment.
Test gap (optional). The +5 comment makes a specific ordering claim — tool-call code sits below text-python (+10) and above generic-text (no bonus). test_text_code_takes_priority_over_tool_calls covers the text-vs-tool case, but nothing directly exercises the tool-vs-generic-text ordering. Consider a test that asserts the relative ranking, since the scoring constants are easy to drift.
There was a problem hiding this comment.
2 tests are added for this.
| content = last_output.value | ||
|
|
||
| # Look for code blocks with python specifier | ||
| import re |
There was a problem hiding this comment.
Pre-existing nit (carried over, not introduced here): import re inside the function. Since you're already restructuring this block, a module-level import would be slightly cleaner. Non-blocking.
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
|
All comments are addressed. Thanks! |
psschwei
left a comment
There was a problem hiding this comment.
small nit on the comment, otherwise lgtm
| # Score tool_call code with +5 bonus (vs +10 for text python blocks). | ||
| # This keeps tool_calls below text-visible code in priority, while above | ||
| # generic text blocks. Text blocks are preferred since they're visible to users. |
There was a problem hiding this comment.
The "while above generic text blocks" claim is incorrect. Tool_calls are only scored when text produces zero blocks (the if not all_blocks and last_output.tool_calls: guard on line 176). So a generic text block scoring +0 still wins over a tool_call scoring +5, because tool_calls are never scored when any text block exists. The +5 only arbitrates among tool_calls themselves; it never competes with text scores.
This makes the behavior two-tier (any text block > tool_calls), not the three-tier ordering the comment describes. test_generic_text_blocks_fallback_tier actually confirms the two-tier behavior (generic import os beats tool_call import sys), so it contradicts this comment.
Suggest rewording to match the real gating:
| # Score tool_call code with +5 bonus (vs +10 for text python blocks). | |
| # This keeps tool_calls below text-visible code in priority, while above | |
| # generic text blocks. Text blocks are preferred since they're visible to users. | |
| # Tool_calls are a fallback: only reached when text yields no code blocks | |
| # (see the `if not all_blocks` guard above). The +5 here only orders | |
| # multiple tool_calls relative to each other; it never competes with text | |
| # scores. Text is always preferred since it's visible to users. |
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
jakelorocco
left a comment
There was a problem hiding this comment.
Checking tools / tool inputs seems particularly fragile. I think we ought to reason out the approach Mellea should take to these situations. Should we provide an interface for requirement checking tool requests / tool request outputs?
This seems reasonable for python-based tools, but we should figure out a larger approach and take an opinionated stance on what Mellea does.
|
Jake. You're right that the heuristic field matching feels fragile as a long-term solution. Here's how we're thinking about the scope: Current implementation (this PR): We're scoping this to Python requirements only, using conservative heuristics:
This is good enough for the MVP because Python requirements are the only place we currently extract code from tool_calls, and the pattern matches the common tools we see (e.g., Claude Code's Python tool). Larger design question: You're right that we should establish an opinionated interface for how Mellea validates tool requests/outputs. This is worth its own issue because it affects:
Next steps: Let's create a follow-up issue to design a proper tool capability interface. In the meantime, this PR is Python-requirements-specific and documents the heuristic approach in the docstring. |
|
@jakelorocco I opened the follow up issue #1356 |
514bd9d
Pull Request
Issue
Fixes #1352
Description
Extends the PythonCodeExtraction validator to extract and analyze Python code from LLM tool_calls, not just response body text. When tool_use=True is enabled in model requests, Python code is often placed in tool arguments—this change ensures such code is captured and validated.
Changes
Testing
Tests added to the respective file if code was changed
New code has 100% coverage if code was added
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)
Comprehensive test coverage in test/stdlib/requirements/test_python_tools.py
Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.