fix(eval): improve rubric text normalization for judge-garbled output #6080
Open
tottenjordan wants to merge 3 commits into
Conversation
Replace _normalize_text's simple lower().strip() with NFKC unicode normalization, smart-quote/dash translation, and markdown artifact stripping. Add substring fallback with uniqueness guard to convert_auto_rater_response_to_score for cases where normalization alone isn't sufficient. Fixes google#6072
Author
@surajksharma07 PR is up per your suggestion in #6072. Includes the NFKC normalization, smart-char mapping, and uniqueness guard on the substring fallback. 46 tests pass (44 existing + 2 new).
Collaborator
/adk-pr-analyze
Collaborator
Hi @tottenjordan, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix formatting errors.
Summary
Fixes #6072
_normalize_text currently only does .lower().strip(), so judge-model garbling (markdown bullets, smart quotes, bold formatting, extra whitespace) causes exact rubric match failures. Rubric scores get silently dropped with only a warning log.

Changes:
- Replace _normalize_text's simple lower().strip() with NFKC unicode normalization, smart-quote/dash translation, and markdown artifact stripping
- Add a substring fallback with a uniqueness guard to convert_auto_rater_response_to_score: it accepts a match only when exactly one rubric candidate matches, preventing ambiguous cross-matching

Garbling patterns handled:

| Judge output | Normalized |
| --- | --- |
| `- The response correctly uses tools` | `the response correctly uses tools` |
| `* **The response correctly uses tools**` | `the response correctly uses tools` |
| `"The response correctly uses tools"` (smart quotes) | `the response correctly uses tools` |
| `— The response correctly uses tools` (em dash) | `the response correctly uses tools` |
| `– The response correctly uses tools` (en dash) | `the response correctly uses tools` |
| `• The response correctly uses tools` (unicode bullet) | `the response correctly uses tools` |
| `The response correctly uses tools` (double spaces) | `the response correctly uses tools` |
| `The response… uses tools` (ellipsis) | `the response... uses tools` |
| `réponse` (accented chars) | `réponse` (preserved) |

Per @surajksharma07's suggestion in #6072: uses NFKC normalization instead of ascii-ignore (preserves non-English rubrics), and adds a uniqueness guard on the substring fallback.
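For readers who want to see how the garbling patterns listed above could be handled, here is a minimal sketch. It is not the PR's actual `_normalize_text`: the function name `normalize_rubric_text`, the `_SMART_CHARS` table, and the exact order of steps are assumptions made for illustration, using only the standard-library `unicodedata` and `re` modules.

```python
import re
import unicodedata

# Hypothetical translation table; the PR's real mapping may differ.
_SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2013": "-", "\u2014": "-",  # en dash, em dash
    "\u2022": "-",                 # unicode bullet
})


def normalize_rubric_text(text: str) -> str:
    """Illustrative normalization for judge-garbled rubric text."""
    # NFKC folds compatibility characters (the ellipsis becomes three dots)
    # while keeping accented letters intact.
    text = unicodedata.normalize("NFKC", text)
    # Map smart quotes, dashes, and bullets to plain ASCII equivalents.
    text = text.translate(_SMART_CHARS)
    # Drop markdown emphasis markers and backticks.
    text = re.sub(r"[*`]+", "", text).strip()
    # Drop a single leading list marker such as "- " or "+ ".
    text = re.sub(r"^[-+]\s+", "", text)
    # Drop wrapping quotes, collapse whitespace runs, and lowercase.
    text = text.strip("\"'")
    return " ".join(text.split()).lower()
```

Because the rubric text and the judge output would both pass through the same function, aggressive steps such as stripping wrapping quotes affect both sides equally and do not by themselves cause mismatches.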
Validation
- 46 tests pass (44 existing + 2 new) in test_rubric_based_evaluator.py
- Run gepa-run-8fb68a8f52-20260611-115752 with 4 rubric-based criteria and a gemini-2.5-pro judge: zero "not found in rubrics" warnings across all generations

Test plan
- pytest tests/unittests/evaluation/test_rubric_based_evaluator.py -v: all 46 pass
- TestNormalizeText covers all garbling patterns from the issue
- TestSubstringFallbackUniquenessGuard verifies that a unique match is accepted and an ambiguous match is rejected
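The behavior the uniqueness-guard test checks (unique match accepted, ambiguous match rejected) can be illustrated with a small sketch. This is an assumption-laden illustration, not the code in convert_auto_rater_response_to_score: the helper name `match_rubric`, the simplified normalization, and the choice to test containment in both directions are all hypothetical.

```python
from typing import Optional


def match_rubric(judge_text: str, rubric_texts: list[str]) -> Optional[str]:
    """Return the rubric matching judge_text, or None if absent or ambiguous."""
    # Simplified stand-in for the real normalization step.
    needle = " ".join(judge_text.lower().split())
    if not needle:
        return None
    normalized = {r: " ".join(r.lower().split()) for r in rubric_texts}

    # An exact match always wins.
    for original, norm in normalized.items():
        if norm == needle:
            return original

    # Substring fallback (assumed bidirectional): collect every rubric that
    # contains the judge text or is contained in it.
    candidates = [
        original
        for original, norm in normalized.items()
        if norm and (needle in norm or norm in needle)
    ]
    # Uniqueness guard: accept only when exactly one candidate matches.
    return candidates[0] if len(candidates) == 1 else None
```

With rubrics "The response correctly uses tools" and "The response cites its sources", a judge line such as "Rubric: the response correctly uses tools." resolves to the first rubric, while the fragment "the response" matches both and is rejected rather than guessed.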