fix(eval): improve rubric text normalization for judge-garbled output#6080

Open
tottenjordan wants to merge 3 commits into
google:mainfrom
tottenjordan:fix/rubric-text-normalization

@tottenjordan


Summary

Fixes #6072

_normalize_text currently applies only .lower().strip(), so judge-model garbling (markdown bullets, smart quotes, bold formatting, extra whitespace) breaks exact rubric matching. The affected rubric scores are then dropped, with nothing but a warning in the log.

Changes:

  • Replace _normalize_text with NFKC unicode normalization, smart-quote/dash translation, and markdown artifact stripping
  • Add substring fallback with uniqueness guard to convert_auto_rater_response_to_score — accepts a match only when exactly one rubric candidate matches, preventing ambiguous cross-matching
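The uniqueness guard described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `find_rubric_id`, the `rubrics` mapping shape (normalized rubric text to rubric id), and the choice to test containment in both directions are all assumptions made for the example.

```python
from typing import Dict, Optional


def find_rubric_id(judge_text: str, rubrics: Dict[str, str]) -> Optional[str]:
    """Hypothetical helper: map judge output to a rubric id.

    Both `judge_text` and the keys of `rubrics` are assumed to be
    already normalized. `rubrics` maps normalized text -> rubric id.
    """
    if not judge_text:
        # An empty string is a substring of everything; never match it.
        return None

    # Step 1: exact match on normalized text.
    if judge_text in rubrics:
        return rubrics[judge_text]

    # Step 2: substring fallback (checked in both directions, an assumption).
    candidates = [
        rubric_id
        for text, rubric_id in rubrics.items()
        if text in judge_text or judge_text in text
    ]

    # Uniqueness guard: accept only when exactly one rubric qualifies.
    if len(candidates) == 1:
        return candidates[0]
    return None
```

With two rubrics that both start with "the response", a judge string of just "the response" is a substring of both and is rejected, while a string that wraps one full rubric in extra text resolves to that rubric.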

Garbling patterns handled:

| Input | Normalized Match |
| --- | --- |
| - The response correctly uses tools | the response correctly uses tools |
| * **The response correctly uses tools** | the response correctly uses tools |
| "The response correctly uses tools" (smart quotes) | the response correctly uses tools |
| — The response correctly uses tools (em dash) | the response correctly uses tools |
| – The response correctly uses tools (en dash) | the response correctly uses tools |
| • The response correctly uses tools (unicode bullet) | the response correctly uses tools |
| The response correctly uses tools (double spaces) | the response correctly uses tools |
| The response… uses tools (ellipsis) | the response... uses tools |
| réponse (accented chars) | réponse (preserved) |
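One way to get the behavior in the table above is sketched below. It is an illustration under assumptions, not the PR's implementation: the name `normalize_text`, the exact character map, the regexes, and the decision to strip wrapping quotes are all hypothetical choices that happen to reproduce the listed rows.

```python
import re
import unicodedata

# Hypothetical smart-character map; the PR's actual table may differ.
_SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})

# One or more leading list markers (-, *, +, or U+2022), each followed by space.
_LEADING_BULLET = re.compile(r"^\s*(?:[-*+\u2022]\s+)+")
# Markdown bold markers.
_EMPHASIS = re.compile(r"\*\*|__")
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Sketch of a normalizer for judge-garbled rubric text."""
    # NFKC folds compatibility forms (e.g. U+2026 becomes "...")
    # while keeping accented letters such as U+00E9 intact.
    text = unicodedata.normalize("NFKC", text)
    # Translate dashes before bullet stripping so a leading dash counts as a bullet.
    text = text.translate(_SMART_CHARS)
    text = _LEADING_BULLET.sub("", text)
    text = _EMPHASIS.sub("", text)
    text = _WHITESPACE.sub(" ", text)
    # Drop wrapping quotes and spaces (an assumption), then lowercase.
    return text.strip(" \"'").lower()
```

The order matters in this sketch: dash translation runs before bullet stripping, so an em dash or en dash at the start of a line is removed by the same rule as a plain hyphen bullet.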

Per @surajksharma07's suggestion in #6072, this uses NFKC normalization instead of ascii-ignore, which preserves non-English rubrics, and adds a uniqueness guard on the substring fallback.
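The difference between the two approaches can be seen in a small comparison. This sketch assumes "ascii-ignore" means `str.encode("ascii", "ignore")`; the helper names `ascii_ignore` and `nfkc` are hypothetical and exist only for this example.

```python
import unicodedata


def ascii_ignore(text: str) -> str:
    """Assumed meaning of "ascii-ignore": drop every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")


def nfkc(text: str) -> str:
    """Fold compatibility characters but keep real letters."""
    return unicodedata.normalize("NFKC", text)


french = "r\u00e9ponse"       # "réponse" with a precomposed e-acute
japanese = "\u5fdc\u7b54"     # a two-character Japanese word

print(repr(ascii_ignore(french)))    # the accented letter is lost
print(repr(nfkc(french)))            # the accented letter survives
print(repr(ascii_ignore(japanese)))  # the whole string is lost
print(repr(nfkc("\u2026")))          # ellipsis folds to three ASCII dots
```

Under that assumption, ascii-ignore would turn a non-Latin rubric into an empty string, so every such rubric would collapse to the same key, while NFKC only rewrites compatibility characters.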

Validation

  • Unit tests: 46 tests pass (44 existing + 2 new) in test_rubric_based_evaluator.py
  • E2E pipeline: Ran full GEPA optimization pipeline (gepa-run-8fb68a8f52-20260611-115752) with 4 rubric-based criteria, gemini-2.5-pro judge — zero "not found in rubrics" warnings across all generations

Test plan

  • pytest tests/unittests/evaluation/test_rubric_based_evaluator.py -v — all 46 pass
  • Parametrized TestNormalizeText covers all garbling patterns from issue
  • TestSubstringFallbackUniquenessGuard verifies unique match accepted, ambiguous match rejected
  • All existing tests unchanged and passing

Replace _normalize_text's simple lower().strip() with NFKC unicode
normalization, smart-quote/dash translation, and markdown artifact
stripping. Add substring fallback with uniqueness guard to
convert_auto_rater_response_to_score for cases where normalization
alone isn't sufficient.

Fixes google#6072
@tottenjordan

Author

@surajksharma07 PR is up per your suggestion in #6072. Includes the NFKC normalization, smart-char mapping, and uniqueness guard on the substring fallback. 46 tests pass (44 existing + 2 new).

@rohityan rohityan added the eval [Component] This issue is related to evaluation label Jun 11, 2026
@rohityan

Collaborator

/adk-pr-analyze

@rohityan

Collaborator

Hi @tottenjordan, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors.

@rohityan rohityan added the request clarification [Status] The maintainer need clarification or more information from the author label Jun 11, 2026

Development

Successfully merging this pull request may close these issues.

RubricBasedEvaluator _normalize_text too basic — fails on judge model markdown output
